Extensible pipeline for data deduplication
    1.
    Granted patent
    Status: In force

    Publication number: US08380681B2

    Publication date: 2013-02-19

    Application number: US12970839

    Filing date: 2010-12-16

    IPC classification: G06F17/00

    CPC classification: G06F17/30091 G06F17/3007

    Abstract: The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.
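The idea of swapping chunking and compression modules per data type can be illustrated with a minimal sketch. Everything named here (`DedupPipeline`, `fixed_chunker`, the `"default"` key, SHA-256 as the chunk hash, zlib as the compressor) is an assumption made for illustration and is not taken from the patent.

```python
import hashlib
import zlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Chunker = Callable[[bytes], List[bytes]]
Compressor = Callable[[bytes], bytes]


def fixed_chunker(size: int) -> Chunker:
    """Hypothetical chunking module: split data into fixed-size pieces."""
    def chunk(data: bytes) -> List[bytes]:
        return [data[i:i + size] for i in range(0, len(data), size)]
    return chunk


def no_compression(chunk: bytes) -> bytes:
    """Hypothetical module for data that is already compressed."""
    return chunk


@dataclass
class DedupPipeline:
    # Modules are looked up by file type; "default" is the fallback (assumed).
    chunkers: Dict[str, Chunker]
    compressors: Dict[str, Compressor]
    store: Dict[str, bytes] = field(default_factory=dict)

    def process(self, data: bytes, file_type: str) -> List[str]:
        chunker = self.chunkers.get(file_type, self.chunkers["default"])
        compressor = self.compressors.get(file_type, self.compressors["default"])
        recipe = []
        for chunk in chunker(data):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.store:      # store each distinct chunk once
                self.store[digest] = compressor(chunk)
            recipe.append(digest)             # the file becomes a list of hashes
        return recipe


pipeline = DedupPipeline(
    chunkers={"default": fixed_chunker(4), "text": fixed_chunker(8)},
    compressors={"default": zlib.compress, "jpeg": no_compression},
)
recipe = pipeline.process(b"abcdabcdabcd", "bin")
```

Replacing a phase amounts to registering a different callable under a file type, which is one simple way to read the "replaced, selected or extended" language of the abstract.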


    Extensible Pipeline for Data Deduplication
    2.
    Patent application
    Status: In force

    Publication number: US20120158672A1

    Publication date: 2012-06-21

    Application number: US12970839

    Filing date: 2010-12-16

    IPC classification: G06F17/30

    CPC classification: G06F17/30091 G06F17/3007

    Abstract: The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.


    Optimization of a partially deduplicated file
    4.
    Granted patent
    Status: In force

    Publication number: US08990171B2

    Publication date: 2015-03-24

    Application number: US13223484

    Filing date: 2011-09-01

    IPC classification: G06F17/30

    CPC classification: G06F17/30159

    Abstract: The subject disclosure is directed towards transforming a file having at least one undeduplicated portion into a fully deduplicated file. For each of the at least one undeduplicated portion, a deduplication mechanism defines at least one chunk between file offsets associated with the at least one undeduplicated portion. Chunk boundaries associated with the at least one chunk are stored within deduplication metadata. The deduplication mechanism aligns the at least one chunk with chunk boundaries of at least one deduplicated portion of the file. Then, the at least one chunk is committed to a chunk store.
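One possible reading of the alignment step is sketched below: each undeduplicated byte range is widened to the nearest existing chunk boundaries, split into new chunks, and committed. The function names, the fixed `max_chunk` split, and the use of SHA-256 are assumptions for illustration; the patent does not prescribe them.

```python
import bisect
import hashlib
from typing import Dict, List, Tuple


def align_range(start: int, end: int, boundaries: List[int]) -> Tuple[int, int]:
    """Widen [start, end) to the enclosing existing chunk boundaries.

    `boundaries` is assumed sorted and to include 0 and the file length.
    """
    lo = boundaries[bisect.bisect_right(boundaries, start) - 1]
    hi = boundaries[bisect.bisect_left(boundaries, end)]
    return lo, hi


def optimize(
    data: bytes,
    dirty_ranges: List[Tuple[int, int]],
    boundaries: List[int],
    store: Dict[str, bytes],
    max_chunk: int = 4,
) -> List[int]:
    """Chunk every undeduplicated range and return the updated boundary list."""
    new_boundaries = set(boundaries)
    for start, end in dirty_ranges:
        lo, hi = align_range(start, end, boundaries)
        for s in range(lo, hi, max_chunk):
            e = min(s + max_chunk, hi)
            piece = data[s:e]
            # Commit the chunk to the (in-memory, hypothetical) chunk store.
            store.setdefault(hashlib.sha256(piece).hexdigest(), piece)
            # Record the chunk boundaries as deduplication metadata.
            new_boundaries.update((s, e))
    return sorted(new_boundaries)


store: Dict[str, bytes] = {}
updated = optimize(bytes(range(24)), [(10, 13)], [0, 8, 16, 24], store)
```

In this example the dirty range 10..13 is widened to 8..16, so the new chunks begin and end where the already deduplicated chunks do.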


    SCALABLE CHUNK STORE FOR DATA DEDUPLICATION
    5.
    Patent application
    Status: Under examination (published)

    Publication number: US20120131025A1

    Publication date: 2012-05-24

    Application number: US12949391

    Filing date: 2010-11-18

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F16/122 G06F16/1752

    Abstract: Data streams may be stored in a chunk store in the form of stream maps and data chunks. Data chunks corresponding to a data stream may be stored in a chunk container, and a stream map corresponding to the data stream may point to the data chunks in the chunk container. Multiple stream maps may be stored in a stream container, and may point to the data chunks in the chunk container in a manner that duplicate data chunks are not present. Techniques are provided herein for localizing the storage of related data chunks in such chunk containers, for locating data chunks stored in chunk containers, for storing data streams in chunk stores in localized manners that enhance locality and decrease fragmentation, and for reorganizing stored data streams in chunk stores.
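The relationship between stream maps and a chunk container can be sketched as follows. The class and attribute names are hypothetical, the containers are plain in-memory Python objects, and the locality and reorganization techniques mentioned in the abstract are not modeled.

```python
import hashlib
from typing import Dict, List


class ChunkStore:
    """Toy chunk store: one chunk container plus one stream container."""

    def __init__(self) -> None:
        self.chunk_container: List[bytes] = []          # unique data chunks
        self.index: Dict[str, int] = {}                 # chunk hash -> position
        self.stream_container: Dict[str, List[int]] = {}  # name -> stream map

    def add_stream(self, name: str, chunks: List[bytes]) -> None:
        stream_map = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:
                # First occurrence: append the chunk to the container.
                self.index[digest] = len(self.chunk_container)
                self.chunk_container.append(chunk)
            # The stream map only points at chunks; it never copies them.
            stream_map.append(self.index[digest])
        self.stream_container[name] = stream_map

    def read_stream(self, name: str) -> bytes:
        return b"".join(self.chunk_container[i] for i in self.stream_container[name])


chunk_store = ChunkStore()
chunk_store.add_stream("a", [b"x", b"y", b"x"])
chunk_store.add_stream("b", [b"y", b"z"])
```

Both streams share the chunk `b"y"`, so the container holds three chunks for five chunk references.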


    Using index partitioning and reconciliation for data deduplication
    6.
    Granted patent
    Status: In force

    Publication number: US09110936B2

    Publication date: 2015-08-18

    Application number: US12979748

    Filing date: 2010-12-28

    IPC classification: G06F17/30

    Abstract: The subject disclosure is directed towards a data deduplication technology in which a hash index service's index is partitioned into subspace indexes, with less than the entire hash index service's index cached to save memory. The subspace index is accessed to determine whether a data chunk already exists or needs to be indexed and stored. The index may be divided into subspaces based on criteria associated with the data to index, such as file type, data type, time of last usage, and so on. Also described is subspace reconciliation, in which duplicate entries in subspaces are detected so as to remove entries and chunks from the deduplication system. Subspace reconciliation may be performed at off-peak time, when more system resources are available, and may be interrupted if resources are needed. Subspaces to reconcile may be based on similarity, including via similarity of signatures that each compactly represent the subspace's hashes.
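A minimal sketch of subspace indexes and reconciliation is shown below. The names (`SubspaceIndex`, `signature`, `reconcile`) and the "k smallest hashes" signature are assumptions chosen for illustration; in a real system the removed entry would be redirected to the surviving chunk, which this sketch omits.

```python
import hashlib
from typing import Dict, Set


class SubspaceIndex:
    """Hash index split into subspaces, e.g. one per file type (assumed)."""

    def __init__(self) -> None:
        self.subspaces: Dict[str, Dict[str, bytes]] = {}

    def put(self, subspace: str, chunk: bytes) -> bool:
        """Index a chunk; return True if it was new within this subspace."""
        index = self.subspaces.setdefault(subspace, {})
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in index:
            return False
        index[digest] = chunk
        return True


def signature(index: Dict[str, bytes], k: int = 4) -> Set[str]:
    """Compact stand-in for a subspace: its k smallest hashes (assumption)."""
    return set(sorted(index)[:k])


def reconcile(keep: Dict[str, bytes], other: Dict[str, bytes]) -> int:
    """Drop entries from `other` that duplicate `keep`; return how many."""
    duplicates = keep.keys() & other.keys()
    for digest in duplicates:
        del other[digest]
    return len(duplicates)


idx = SubspaceIndex()
idx.put("docs", b"a")
idx.put("mail", b"a")   # duplicate across subspaces, invisible to "docs"
idx.put("mail", b"b")
similar = bool(signature(idx.subspaces["docs"]) & signature(idx.subspaces["mail"]))
removed = reconcile(idx.subspaces["docs"], idx.subspaces["mail"])
```

Because each lookup touches only one subspace, cross-subspace duplicates slip through at write time; comparing signatures is one cheap way to decide which pairs are worth reconciling later.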


    Optimization of a Partially Deduplicated File
    7.
    Patent application
    Status: In force

    Publication number: US20130060739A1

    Publication date: 2013-03-07

    Application number: US13223484

    Filing date: 2011-09-01

    IPC classification: G06F17/30

    CPC classification: G06F17/30159

    Abstract: The subject disclosure is directed towards transforming a file having at least one undeduplicated portion into a fully deduplicated file. For each of the at least one undeduplicated portion, a deduplication mechanism defines at least one chunk between file offsets associated with the at least one undeduplicated portion. Chunk boundaries associated with the at least one chunk are stored within deduplication metadata. The deduplication mechanism aligns the at least one chunk with chunk boundaries of at least one deduplicated portion of the file. Then, the at least one chunk is committed to a chunk store.


    Using Index Partitioning and Reconciliation for Data Deduplication
    8.
    Patent application
    Status: In force

    Publication number: US20120166401A1

    Publication date: 2012-06-28

    Application number: US12979748

    Filing date: 2010-12-28

    IPC classification: G06F17/30

    Abstract: The subject disclosure is directed towards a data deduplication technology in which a hash index service's index is partitioned into subspace indexes, with less than the entire hash index service's index cached to save memory. The subspace index is accessed to determine whether a data chunk already exists or needs to be indexed and stored. The index may be divided into subspaces based on criteria associated with the data to index, such as file type, data type, time of last usage, and so on. Also described is subspace reconciliation, in which duplicate entries in subspaces are detected so as to remove entries and chunks from the deduplication system. Subspace reconciliation may be performed at off-peak time, when more system resources are available, and may be interrupted if resources are needed. Subspaces to reconcile may be based on similarity, including via similarity of signatures that each compactly represent the subspace's hashes.


    GARBAGE COLLECTION AND HOTSPOTS RELIEF FOR A DATA DEDUPLICATION CHUNK STORE
    9.
    Patent application
    Status: Under examination (published)

    Publication number: US20120159098A1

    Publication date: 2012-06-21

    Application number: US12971694

    Filing date: 2010-12-17

    IPC classification: G06F12/02 G06F12/16

    CPC classification: G06F12/0261

    Abstract: Techniques for garbage collecting unused data chunks in storage are provided. According to one implementation, data chunks stored in a chunk container that are unused are identified based on an analysis of one or more stream map chunks indicated as deleted. The identified data chunks are indicated as deleted. The storage space in the chunk container filled by the data chunks indicated as deleted may then be reclaimed. Techniques for selectively backing up data chunks are also provided. According to one implementation, a data chunk is received for storing in a chunk container. A backup copy of the received data chunk is stored in a backup container if the received data chunk is in a predetermined top percentage of most referenced data chunks in the chunk container and has a number of references greater than a predetermined reference threshold.
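Both rules in the abstract can be sketched in a few lines. The function names, the dictionary-based stream maps, and the way the "top percentage" cutoff is rounded are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Set


def find_unused_chunks(stream_maps: Dict[str, List[str]], deleted: Set[str]) -> Set[str]:
    """Chunks referenced by deleted stream maps and by no live stream map."""
    live = {
        chunk
        for name, chunks in stream_maps.items()
        if name not in deleted
        for chunk in chunks
    }
    candidates = {chunk for name in deleted for chunk in stream_maps[name]}
    return candidates - live


def needs_backup(chunk_id: str, ref_counts: Counter, top_fraction: float, min_refs: int) -> bool:
    """Back up a chunk only if it is among the most referenced chunks
    AND its reference count exceeds the threshold."""
    ranked = [chunk for chunk, _ in ref_counts.most_common()]
    cutoff = max(1, int(len(ranked) * top_fraction))  # rounding is an assumption
    return chunk_id in ranked[:cutoff] and ref_counts[chunk_id] > min_refs


stream_maps = {"s1": ["a", "b"], "s2": ["b", "c"]}
unused = find_unused_chunks(stream_maps, deleted={"s1"})

ref_counts = Counter({"a": 10, "b": 2, "c": 1, "d": 1})
hot = needs_backup("a", ref_counts, top_fraction=0.25, min_refs=5)
```

Chunk `b` survives collection because the live stream map `s2` still points to it, while `a` is referenced only by the deleted map and can be reclaimed.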


    Partial recall of deduplicated files
    10.
    Granted patent
    Status: In force

    Publication number: US08645335B2

    Publication date: 2014-02-04

    Application number: US12970848

    Filing date: 2010-12-16

    IPC classification: G06F7/00 G06F17/00

    CPC classification: G06F17/30156

    Abstract: The subject disclosure is directed towards changing a file from a fully deduplicated state to a partially deduplicated state in which some of the file data is deduplicated in a chunk store, and some is recalled into the file, that is, in the file's storage volume. A partial recall mechanism such as in a file system filter tracks (e.g., via a bitmap in a file reparse point) whether file data is maintained in the chunk store or has been recalled to the file. Data is recalled from the chunk store as needed, and committed (e.g., flushed) to the file. Also described is efficiently returning the file to a fully deduplicated state by using the tracking information to determine which parts of the file are already deduplicated into the chunk store so as to avoid their further deduplication processing.
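The bitmap-based tracking can be sketched as below. The class name, the one-chunk-per-block layout, and the in-memory lists standing in for the reparse point and the storage volume are all simplifying assumptions, not details from the patent.

```python
from typing import List, Optional


class PartiallyRecalledFile:
    """Toy model: one bit per block records whether the block's data lives
    in the chunk store (False) or has been recalled into the file (True)."""

    def __init__(self, chunks: List[bytes]) -> None:
        self.chunk_store = list(chunks)                  # deduplicated data
        self.local: List[Optional[bytearray]] = [None] * len(chunks)
        self.recalled = [False] * len(chunks)            # the tracking bitmap

    def write(self, block: int, offset: int, data: bytes) -> None:
        if not self.recalled[block]:
            # Recall only the block being modified, on demand.
            self.local[block] = bytearray(self.chunk_store[block])
            self.recalled[block] = True
        self.local[block][offset:offset + len(data)] = data

    def read(self, block: int) -> bytes:
        if self.recalled[block]:
            return bytes(self.local[block])
        return self.chunk_store[block]

    def blocks_to_rededuplicate(self) -> List[int]:
        # Blocks still in the chunk store need no further processing.
        return [i for i, bit in enumerate(self.recalled) if bit]


f = PartiallyRecalledFile([b"aaaa", b"bbbb", b"cccc"])
f.write(1, 1, b"XY")
```

Returning the file to a fully deduplicated state then only requires rechunking the blocks whose bit is set, which is the saving the abstract describes.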
