-
Publication No.: US08380681B2
Publication date: 2013-02-19
Application No.: US12970839
Filing date: 2010-12-16
IPC classification: G06F17/00
CPC classification: G06F17/30091, G06F17/3007
Abstract: The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.
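The idea of swapping chunking and compression modules by data type can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `CHUNKERS`, `COMPRESSORS`, `fixed_chunker`, `line_chunker` and `deduplicate`, the two data types, and the use of SHA-256 and zlib are all assumptions made for the example.

```python
import hashlib
import zlib


def fixed_chunker(data, size=8):
    """Hypothetical chunking module: fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def line_chunker(data):
    """Hypothetical chunking module: one chunk per line."""
    return data.splitlines(keepends=True)


# Module registries keyed by data type (assumed types, for illustration only).
CHUNKERS = {"text": line_chunker, "binary": fixed_chunker}
COMPRESSORS = {"text": zlib.compress, "binary": lambda chunk: chunk}


def deduplicate(data, data_type, store):
    """Run the chunk -> hash -> compress phases with modules picked by type.

    `store` maps chunk hash -> stored (possibly compressed) chunk.
    Returns the list of chunk hashes that reconstructs `data`.
    """
    chunker = CHUNKERS[data_type]
    compress = COMPRESSORS[data_type]
    refs = []
    for chunk in chunker(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:          # only new chunks are compressed and stored
            store[key] = compress(chunk)
        refs.append(key)
    return refs
```

Replacing an entry in either registry changes the behaviour of one phase without touching the others, which is the kind of modularity the abstract describes.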
-
Publication No.: US20120158672A1
Publication date: 2012-06-21
Application No.: US12970839
Filing date: 2010-12-16
IPC classification: G06F17/30
CPC classification: G06F17/30091, G06F17/3007
Abstract: The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.
-
Publication No.: US10394757B2
Publication date: 2019-08-27
Application No.: US12949391
Filing date: 2010-11-18
Applicant: Chun Ho (Ian) Cheung, Paul Adrian Oltean, Ran Kalach, Abhishek Gupta, James Robert Benton, Ronakkumar Desai
Inventor: Chun Ho (Ian) Cheung, Paul Adrian Oltean, Ran Kalach, Abhishek Gupta, James Robert Benton, Ronakkumar Desai
IPC classification: G06F16/11, G06F16/174
Abstract: Data streams may be stored in a chunk store in the form of stream maps and data chunks. Data chunks corresponding to a data stream may be stored in a chunk container, and a stream map corresponding to the data stream may point to the data chunks in the chunk container. Multiple stream maps may be stored in a stream container, and may point to the data chunks in the chunk container in a manner that duplicate data chunks are not present. Techniques are provided herein for localizing the storage of related data chunks in such chunk containers, for locating data chunks stored in chunk containers, for storing data streams in chunk stores in localized manners that enhance locality and decrease defragmentation, and for reorganizing stored data streams in chunks stores.
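The relationship between stream maps, the stream container and the chunk container can be illustrated with a small sketch. It assumes fixed-size chunking and SHA-256 chunk identifiers, neither of which is stated in the abstract, and the class and method names (`ChunkStore`, `add_stream`, `read_stream`) are hypothetical.

```python
import hashlib


class ChunkStore:
    """Toy chunk store: one chunk container, one stream container."""

    def __init__(self):
        self.chunk_container = []    # unique chunks, append-only
        self.index = {}              # chunk hash -> position in chunk_container
        self.stream_container = {}   # stream id -> stream map (list of positions)

    def add_stream(self, stream_id, data, chunk_size=4):
        """Store a data stream as a stream map that points at chunks."""
        stream_map = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            key = hashlib.sha256(chunk).digest()
            if key not in self.index:            # store each distinct chunk once
                self.index[key] = len(self.chunk_container)
                self.chunk_container.append(chunk)
            stream_map.append(self.index[key])
        self.stream_container[stream_id] = stream_map

    def read_stream(self, stream_id):
        """Rebuild a stream by following its stream map."""
        stream_map = self.stream_container[stream_id]
        return b"".join(self.chunk_container[pos] for pos in stream_map)
```

Because chunks of one stream are appended consecutively in this sketch, a stream's new chunks end up adjacent in the container, which loosely mirrors the locality goal mentioned in the abstract.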
-
Publication No.: US08990171B2
Publication date: 2015-03-24
Application No.: US13223484
Filing date: 2011-09-01
Applicant: Ran Kalach, Kashif Hasan, Paul Adrian Oltean, James Robert Benton, Chun Ho Cheung, Abhishek Gupta
Inventor: Ran Kalach, Kashif Hasan, Paul Adrian Oltean, James Robert Benton, Chun Ho Cheung, Abhishek Gupta
IPC classification: G06F17/30
CPC classification: G06F17/30159
Abstract: The subject disclosure is directed towards transforming a file having at least one undeduplicated portion into a fully deduplicated file. For each of the at least one undeduplicated portion, a deduplication mechanism defines at least one chunk between file offsets associated with the at least one undeduplicated portion. Chunk boundaries associated with the at least one chunk are stored within deduplication metadata. The deduplication mechanism aligns the at least one chunk with chunk boundaries of at least one deduplicated portion of the file. Then, the at least one chunk is committed to a chunk store.
-
Publication No.: US20130060739A1
Publication date: 2013-03-07
Application No.: US13223484
Filing date: 2011-09-01
Applicant: Ran Kalach, Kashif Hasan, Paul Adrian Oltean, James Robert Benton, Chun Ho Cheung, Abhishek Gupta
Inventor: Ran Kalach, Kashif Hasan, Paul Adrian Oltean, James Robert Benton, Chun Ho Cheung, Abhishek Gupta
IPC classification: G06F17/30
CPC classification: G06F17/30159
Abstract: The subject disclosure is directed towards transforming a file having at least one undeduplicated portion into a fully deduplicated file. For each of the at least one undeduplicated portion, a deduplication mechanism defines at least one chunk between file offsets associated with the at least one undeduplicated portion. Chunk boundaries associated with the at least one chunk are stored within deduplication metadata. The deduplication mechanism aligns the at least one chunk with chunk boundaries of at least one deduplicated portion of the file. Then, the at least one chunk is committed to a chunk store.
-
6.
Publication No.: US20120166401A1
Publication date: 2012-06-28
Application No.: US12979748
Filing date: 2010-12-28
Applicant: Jin Li, Sudipta Sengupta, Ran Kalach, Ronakkumar N. Desai, Paul Adrian Oltean, James Robert Benton
Inventor: Jin Li, Sudipta Sengupta, Ran Kalach, Ronakkumar N. Desai, Paul Adrian Oltean, James Robert Benton
IPC classification: G06F17/30
CPC classification: G06F17/30371, G06F17/30156, G06F17/30303, G06F17/30327, G06F17/3033, G06F17/30489
Abstract: The subject disclosure is directed towards a data deduplication technology in which a hash index service's index is partitioned into subspace indexes, with less than the entire hash index service's index cached to save memory. The subspace index is accessed to determine whether a data chunk already exists or needs to be indexed and stored. The index may be divided into subspaces based on criteria associated with the data to index, such as file type, data type, time of last usage, and so on. Also described is subspace reconciliation, in which duplicate entries in subspaces are detected so as to remove entries and chunks from the deduplication system. Subspace reconciliation may be performed at off-peak time, when more system resources are available, and may be interrupted if resources are needed. Subspaces to reconcile may be based on similarity, including via similarity of signatures that each compactly represents the subspace's hashes.
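A rough sketch of a partitioned hash index with a bounded cache and a reconciliation step is shown below. Everything in it is an assumption made for illustration: the `SubspaceIndex` class, the LRU eviction policy, the in-memory dictionary standing in for persistent storage, and the use of SHA-256 are not taken from the patent.

```python
import hashlib
from collections import OrderedDict


class SubspaceIndex:
    """Toy hash index split into subspaces, with only a few kept cached."""

    def __init__(self, max_cached=2):
        self.on_disk = {}           # subspace -> set of chunk hashes (stand-in for disk)
        self.cache = OrderedDict()  # cached subspaces in least-recently-used order
        self.max_cached = max_cached

    def _load(self, subspace):
        """Bring one subspace index into the cache, evicting the oldest if full."""
        if subspace in self.cache:
            self.cache.move_to_end(subspace)
        else:
            self.cache[subspace] = self.on_disk.setdefault(subspace, set())
            if len(self.cache) > self.max_cached:
                self.cache.popitem(last=False)
        return self.cache[subspace]

    def add_chunk(self, subspace, chunk):
        """Return True if the chunk is new to this subspace and was indexed."""
        digest = hashlib.sha256(chunk).hexdigest()
        hashes = self._load(subspace)
        if digest in hashes:
            return False
        hashes.add(digest)
        return True

    def reconcile(self, keep, prune):
        """Drop from `prune` every hash that `keep` already holds."""
        duplicates = self.on_disk.get(keep, set()) & self.on_disk.get(prune, set())
        self.on_disk.get(prune, set()).difference_update(duplicates)
        return duplicates
```

In this sketch a chunk that appears in two subspaces is stored twice until `reconcile` runs, which reflects the trade-off in the abstract: lookups touch only one small subspace, and cross-subspace duplicates are cleaned up later.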
-
7.
Publication No.: US09110936B2
Publication date: 2015-08-18
Application No.: US12979748
Filing date: 2010-12-28
Applicant: Jin Li, Sudipta Sengupta, Ran Kalach, Ronakkumar N. Desai, Paul Adrian Oltean, James Robert Benton
Inventor: Jin Li, Sudipta Sengupta, Ran Kalach, Ronakkumar N. Desai, Paul Adrian Oltean, James Robert Benton
IPC classification: G06F17/30
CPC classification: G06F17/30371, G06F17/30156, G06F17/30303, G06F17/30327, G06F17/3033, G06F17/30489
Abstract: The subject disclosure is directed towards a data deduplication technology in which a hash index service's index is partitioned into subspace indexes, with less than the entire hash index service's index cached to save memory. The subspace index is accessed to determine whether a data chunk already exists or needs to be indexed and stored. The index may be divided into subspaces based on criteria associated with the data to index, such as file type, data type, time of last usage, and so on. Also described is subspace reconciliation, in which duplicate entries in subspaces are detected so as to remove entries and chunks from the deduplication system. Subspace reconciliation may be performed at off-peak time, when more system resources are available, and may be interrupted if resources are needed. Subspaces to reconcile may be based on similarity, including via similarity of signatures that each compactly represents the subspace's hashes.
-
Publication No.: US20120131025A1
Publication date: 2012-05-24
Application No.: US12949391
Filing date: 2010-11-18
Applicant: Chun Ho (Ian) Cheung, Paul Adrian Oltean, Ran Kalach, Abhishek Gupta, James Robert Benton, Ronakkumar Desai
Inventor: Chun Ho (Ian) Cheung, Paul Adrian Oltean, Ran Kalach, Abhishek Gupta, James Robert Benton, Ronakkumar Desai
CPC classification: G06F16/122, G06F16/1752
Abstract: Data streams may be stored in a chunk store in the form of stream maps and data chunks. Data chunks corresponding to a data stream may be stored in a chunk container, and a stream map corresponding to the data stream may point to the data chunks in the chunk container. Multiple stream maps may be stored in a stream container, and may point to the data chunks in the chunk container in a manner that duplicate data chunks are not present. Techniques are provided herein for localizing the storage of related data chunks in such chunk containers, for locating data chunks stored in chunk containers, for storing data streams in chunk stores in localized manners that enhance locality and decrease defragmentation, and for reorganizing stored data streams in chunks stores.
-
9.
Publication No.: US20100274750A1
Publication date: 2010-10-28
Application No.: US12427755
Filing date: 2009-04-22
Applicant: Paul Adrian Oltean, Clyde Law, Judd Hardy, Nir Ben-Zvi, Ran Kalach
Inventor: Paul Adrian Oltean, Clyde Law, Judd Hardy, Nir Ben-Zvi, Ran Kalach
IPC classification: G06N5/02
CPC classification: G06F16/16, G06F16/122
Abstract: Described is a technology in which data items (e.g., files) are processed through an extensible data processing pipeline, including a classification pipeline, to facilitate management of the data items based upon their classifications. A discovery module locates data items to process. An independent classification pipeline obtains metadata (properties) associated with each discovered data item, and one or more classifiers classify the data item based on the metadata. An independent policy module applies policy to each data item based upon its classification. Multiple classifiers may be invoked, based upon various criteria. Predefined ordering of the classifiers, authoritative classifiers and/or an aggregation mechanism handle any classification conflicts. Different types of classifiers may be provided, and each classifier may correspond to automatic classification rules; the classifier may directly change a property, (e.g., set the classification) or return a result to a corresponding rule mechanism for changing a property.
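One possible reading of the conflict handling described above (predefined classifier order plus authoritative classifiers) is sketched here. The `Classifier` dataclass, the functions `classify_item` and `run_pipeline`, and the "first non-authoritative result wins" rule are assumptions chosen for the example rather than details from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Classifier:
    name: str
    classify: Callable[[dict], Optional[str]]  # metadata -> label, or None
    authoritative: bool = False


def classify_item(metadata, classifiers):
    """Apply classifiers in their predefined order and resolve conflicts.

    An authoritative classifier's label is final; otherwise the first
    classifier in the order that produced a label wins.
    """
    result = None
    for classifier in classifiers:
        label = classifier.classify(metadata)
        if label is None:
            continue
        if classifier.authoritative:
            return label
        if result is None:
            result = label
    return result


def run_pipeline(items, classifiers, policy):
    """Discovery output -> classification -> policy, kept as separate steps.

    `items` is an iterable of (item name, metadata dict) pairs.
    """
    return {name: policy(classify_item(meta, classifiers)) for name, meta in items}
```

Keeping `policy` as a separate callable follows the abstract's point that the policy module is independent of the classification pipeline.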
-
Publication No.: US07401089B2
Publication date: 2008-07-15
Application No.: US11206425
Filing date: 2005-08-17
IPC classification: G06F17/30
CPC classification: G06F17/30067, Y10S707/99942
Abstract: Described is a storage reports scanner that works to generate reports of storage usage in computer systems in an efficient manner. The scanner receives a set of namespaces for a file system volume from a storage reports engine. The scanner scans file system metadata to construct a directory table of entries corresponding to a directory tree of nodes representative of the hierarchy of directories of the file system volume. Each node corresponding to a namespace in the namespace set is marked as included. A second scan of the file system metadata determines, for each file, whether that file is in or under an included directory by accessing the directory table. For each file that is in or is under an included directory, file information is returned to the engine. The engine may request the scanner to provide full path information, which the scanner determines via the directory table.
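The two-scan approach in this abstract can be sketched as below. The record layouts (`(dir_id, parent_id, name)` for directories and `(name, parent_dir_id, size)` for files), the `DirEntry` class and all function names are hypothetical; a real scanner would read these records from file system metadata rather than from Python lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirEntry:
    name: str
    parent: Optional[int]   # id of the parent directory, None for the root
    included: bool = False


def full_path(table, dir_id):
    """Resolve a directory's full path by walking up the directory table."""
    parts = []
    while dir_id is not None:
        parts.append(table[dir_id].name)
        dir_id = table[dir_id].parent
    return "/" + "/".join(p for p in reversed(parts) if p)


def build_directory_table(dirs, namespaces):
    """First scan: build the table and mark directories named in `namespaces`."""
    table = {dir_id: DirEntry(name, parent) for dir_id, parent, name in dirs}
    for dir_id, entry in table.items():
        entry.included = full_path(table, dir_id) in namespaces
    return table


def in_or_under_included(table, dir_id):
    """True if the directory or any of its ancestors is marked as included."""
    while dir_id is not None:
        if table[dir_id].included:
            return True
        dir_id = table[dir_id].parent
    return False


def scan_files(table, files):
    """Second scan: yield (full path, size) for files in or under included dirs."""
    for name, parent, size in files:
        if in_or_under_included(table, parent):
            yield full_path(table, parent).rstrip("/") + "/" + name, size
```

For instance, with directories `/data`, `/data/logs` and `/tmp` and the namespace set `{"/data"}`, only files whose parent is `/data` or `/data/logs` are reported.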