Backup and restore strategies for data deduplication

    Publication No.: US09823981B2

    Publication date: 2017-11-21

    Application No.: US13045692

    Filing date: 2011-03-11

    IPC classification: G06F11/14 G06F3/06

    Abstract: Techniques for backup and restore of optimized data streams are described. A chunk store includes each optimized data stream as a plurality of chunks including at least one data chunk and corresponding optimized stream metadata. The chunk store includes data chunks in a deduplicated manner. Optimized data streams stored in the chunk store are identified for backup. At least a portion of the chunk store is stored in backup storage according to an optimized backup technique, an un-optimized backup technique, an item level backup technique, or a data chunk identifier backup technique. Optimized data streams stored in the backup storage may be restored. A file reconstructor includes a callback module that generates calls to a restore application to request optimized stream metadata and any referenced data chunks from the backup storage. The file reconstructor reconstructs the data streams from the referenced data chunks.
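The chunk store and callback-driven restore described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ChunkStore` class, fixed-size chunking, SHA-256 chunk identifiers, and the two callbacks that stand in for calls to a restore application are hypothetical and are not taken from the patent.

```python
import hashlib
from typing import Callable, Dict, List


def chunk_id(data: bytes) -> str:
    """Identify a chunk by its SHA-256 digest (an assumed identifier scheme)."""
    return hashlib.sha256(data).hexdigest()


class ChunkStore:
    """Hypothetical chunk store: unique data chunks plus per-stream metadata."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}               # chunk id -> chunk data
        self.stream_metadata: Dict[str, List[str]] = {}  # stream -> ordered chunk ids

    def add_stream(self, name: str, data: bytes, chunk_size: int = 4) -> None:
        ids = []
        for i in range(0, len(data), chunk_size):
            piece = data[i:i + chunk_size]
            cid = chunk_id(piece)
            self.chunks.setdefault(cid, piece)  # each distinct chunk is stored once
            ids.append(cid)
        self.stream_metadata[name] = ids


def reconstruct(
    name: str,
    fetch_metadata: Callable[[str], List[str]],
    fetch_chunk: Callable[[str], bytes],
) -> bytes:
    """Rebuild a stream by requesting its metadata, then each referenced chunk."""
    return b"".join(fetch_chunk(cid) for cid in fetch_metadata(name))


store = ChunkStore()
store.add_stream("a.txt", b"abcdabcdxyz")
store.add_stream("b.txt", b"abcd1234")

# "Optimized" backup in this sketch: copy the chunk store as-is, still deduplicated.
backup_chunks = dict(store.chunks)
backup_metadata = dict(store.stream_metadata)

restored = reconstruct("a.txt", backup_metadata.__getitem__, backup_chunks.__getitem__)
```

Because the reconstructor only sees the two callbacks, the same function would work whether the backup holds the whole chunk store or only the chunks a given stream references.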

    2. Alternate data stream cache for file classification
    Type: granted patent
    Legal status: in force

    Publication No.: US08805837B2

    Publication date: 2014-08-12

    Application No.: US12605451

    Filing date: 2009-10-26

    IPC classification: G06F17/30

    CPC classification: G06F17/30115 G06F17/30598

    Abstract: Described is caching classification-related metadata for a file in an alternate data stream of that file. When a file is classified (e.g., for data management), the classification properties are cached in association with the file, along with classification-related metadata that indicates the state of the file at the time of caching. The classification-related metadata in the alternate data stream is then useable in determining whether the classification properties are valid and up-to-date when next accessed, or whether the file needs to be reclassified. If the properties are valid and up-to-date, they may be used without requiring the computationally costly steps of reclassification. Also described is using more than one alternate data stream for the cache, and extending the classification-related metadata through a defined extension mechanism.
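The validity check described in this abstract can be sketched as follows. This is a rough illustration under stated assumptions: an in-memory dictionary named `ads` stands in for the file's alternate data stream, the "state of the file" is taken to be its size and modification time, and `get_classification` and `file_state` are hypothetical names.

```python
import json
import os
from typing import Callable, Dict, Tuple

# Stand-in for per-file alternate data streams (assumption): path -> cached JSON blob.
ads: Dict[str, str] = {}


def file_state(path: str) -> Dict[str, int]:
    """Capture the file state recorded at caching time (size and mtime, assumed)."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}


def get_classification(
    path: str, classify: Callable[[str], dict]
) -> Tuple[dict, bool]:
    """Return (properties, cache_hit); reclassify only if the cache is stale."""
    cached = ads.get(path)
    if cached is not None:
        entry = json.loads(cached)
        if entry["state"] == file_state(path):
            return entry["properties"], True  # still valid, skip reclassification
    properties = classify(path)               # the computationally costly step
    ads[path] = json.dumps({"state": file_state(path), "properties": properties})
    return properties, False
```

Storing the state alongside the properties is what lets a later reader decide on its own whether the cached classification can be trusted.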

    3. Extensible Pipeline for Data Deduplication
    Type: patent application
    Legal status: in force

    Publication No.: US20120158672A1

    Publication date: 2012-06-21

    Application No.: US12970839

    Filing date: 2010-12-16

    IPC classification: G06F17/30

    CPC classification: G06F17/30091 G06F17/3007

    Abstract: The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.
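A pipeline with replaceable chunking and compression modules might look like the sketch below. The `Pipeline` class, the two chunkers, and the rule in `pipeline_for` that picks modules by file type are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, List

Chunker = Callable[[bytes], List[bytes]]
Compressor = Callable[[bytes], bytes]


def fixed_chunker(size: int) -> Chunker:
    """Build a chunker that splits data into fixed-size pieces."""
    def chunk(data: bytes) -> List[bytes]:
        return [data[i:i + size] for i in range(0, len(data), size)]
    return chunk


def line_chunker(data: bytes) -> List[bytes]:
    """Split text-like data on line boundaries."""
    return data.splitlines(keepends=True)


@dataclass
class Pipeline:
    chunker: Chunker        # chunking phase, replaceable
    compressor: Compressor  # compression phase, replaceable

    def run(self, data: bytes, store: Dict[str, bytes]) -> List[str]:
        """Chunk, deduplicate by hash, compress new chunks; return chunk ids."""
        ids = []
        for chunk in self.chunker(data):
            cid = hashlib.sha256(chunk).hexdigest()
            if cid not in store:
                store[cid] = self.compressor(chunk)
            ids.append(cid)
        return ids


def pipeline_for(file_type: str) -> Pipeline:
    """Select modules by data type (an assumed selection rule)."""
    if file_type == "text":
        return Pipeline(line_chunker, zlib.compress)
    return Pipeline(fixed_chunker(4), lambda b: b)  # no compression for other data
```

For example, `pipeline_for("text").run(b"a\nb\na\n", store)` returns three chunk ids while adding only two entries to `store`, since the repeated line is stored once.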

    4. ALTERNATE DATA STREAM CACHE FOR FILE CLASSIFICATION
    Type: patent application
    Legal status: in force

    Publication No.: US20110099152A1

    Publication date: 2011-04-28

    Application No.: US12605451

    Filing date: 2009-10-26

    IPC classification: G06F17/00 G06F12/00 G06F12/08

    CPC classification: G06F17/30115 G06F17/30598

    Abstract: Described is caching classification-related metadata for a file in an alternate data stream of that file. When a file is classified (e.g., for data management), the classification properties are cached in association with the file, along with classification-related metadata that indicates the state of the file at the time of caching. The classification-related metadata in the alternate data stream is then useable in determining whether the classification properties are valid and up-to-date when next accessed, or whether the file needs to be reclassified. If the properties are valid and up-to-date, they may be used without requiring the computationally costly steps of reclassification. Also described is using more than one alternate data stream for the cache, and extending the classification-related metadata through a defined extension mechanism.

    5. Method and system for efficient generation of storage reports
    Type: granted patent
    Legal status: lapsed

    Publication No.: US07552115B2

    Publication date: 2009-06-23

    Application No.: US11107977

    Filing date: 2005-04-15

    IPC classification: G06F17/30 G06F7/08 G06F7/16

    Abstract: Described is a method and system by which reports of storage usage in computer systems are generated in an efficient manner by consolidating multiple requests for reports into a minimal number of volume scans, including by intelligently selecting a scanning method (e.g., of file system metadata versus find-first/find-next) and by performing parallel scans on different volumes. Namespace consolidation scans namespaces together, so as to generate multiple reports from the same set of files, reducing the number of volume scans required to collect the data. Each volume scan may be a find-first/find-next directory-based scan, or a volume metadata database scan. Time consolidation groups independent storage report generations together, such as storage report requests received within an administrator-specified interval. Parallel scans of different volumes may be performed, subject to I/O and processing resource limitations, and so that volumes partitioned on the same spindle are not scanned in parallel.
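Time and namespace consolidation can be sketched roughly as follows. The `ReportRequest` type, the fixed time windows, and the idea of a report as a byte total under a path prefix are assumptions made for illustration, not details from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ReportRequest:
    name: str           # report name
    volume: str         # volume to scan
    namespace: str      # path prefix the report covers
    received_at: float  # arrival time in seconds


def consolidate(
    requests: Iterable[ReportRequest], interval: float
) -> Dict[Tuple[int, str], List[ReportRequest]]:
    """Group requests that share a time window and volume into one scan batch."""
    batches: Dict[Tuple[int, str], List[ReportRequest]] = defaultdict(list)
    for r in requests:
        window = int(r.received_at // interval)  # fixed windows are an assumption
        batches[(window, r.volume)].append(r)
    return dict(batches)


def scan_once(
    files: Iterable[Tuple[str, int]], batch: List[ReportRequest]
) -> Dict[str, int]:
    """Serve every report in the batch from a single pass over the volume's files."""
    totals = {r.name: 0 for r in batch}
    for path, size in files:
        for r in batch:
            if path.startswith(r.namespace):
                totals[r.name] += size
    return totals
```

Each key returned by `consolidate` corresponds to one volume scan, so two requests for the same volume that arrive within the same interval cost one scan instead of two.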

    6. System and method for data migration
    Type: granted patent
    Legal status: in force

    Publication No.: US07284015B2

    Publication date: 2007-10-16

    Application No.: US10935789

    Filing date: 2004-09-08

    IPC classification: G06F17/30 G06F12/00

    Abstract: A method for concurrent data migration includes classifying files to be migrated into plural jobs, selecting media to which to migrate each job, and using plural drives concurrently to write the jobs to the media. The selection of a medium is performed in a way that prevents the number of writeable media from exceeding the number of available drives, unless no allocated medium has sufficient space to store any files in a migration job. A medium is preferentially selected that has already been allocated for writing, has space to store at least one file in the job, is not in use for another job, and can be robotically mounted on a drive. If such a medium does not exist, then the set of available media is canvassed to locate an alternative medium.
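The preference order for choosing a medium can be sketched as below. The `Medium` fields, the first-match tie-breaking, and the fallback rule (any unallocated, idle, mountable medium with enough space) are assumptions made for illustration; the abstract leaves these details open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Medium:
    label: str
    free: int        # free space, in arbitrary units
    allocated: bool  # already allocated for writing
    in_use: bool     # currently used by another job
    mountable: bool  # can be robotically mounted on a drive


def select_medium(media: List[Medium], job_file_sizes: List[int]) -> Optional[Medium]:
    """Prefer an allocated, idle, mountable medium that fits at least one file."""
    smallest = min(job_file_sizes)

    def usable(m: Medium) -> bool:
        return not m.in_use and m.mountable and m.free >= smallest

    for m in media:
        if m.allocated and usable(m):
            return m  # reusing allocated media keeps the writeable set small
    for m in media:
        if not m.allocated and usable(m):
            return m  # fallback: canvass the remaining available media
    return None
```

Trying allocated media first is what keeps the count of writeable media from growing past the number of drives unless nothing already allocated can hold a file from the job.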

    7. Storage reports file system scanner
    Type: patent application
    Legal status: in force

    Publication No.: US20070043747A1

    Publication date: 2007-02-22

    Application No.: US11206425

    Filing date: 2005-08-17

    IPC classification: G06F7/00

    Abstract: Described is a storage reports scanner that works to generate reports of storage usage in computer systems in an efficient manner. The scanner receives a set of namespaces for a file system volume from a storage reports engine. The scanner scans file system metadata to construct a directory table of entries corresponding to a directory tree of nodes representative of the hierarchy of directories of the file system volume. Each node corresponding to a namespace in the namespace set is marked as included. A second scan of the file system metadata determines, for each file, whether that file is in or under an included directory by accessing the directory table. For each file that is in or is under an included directory, file information is returned to the engine. The engine may request the scanner to provide full path information, which the scanner determines via the directory table.
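The two-scan approach in this abstract can be illustrated with a small sketch. The flat `Record` layout (an id, a parent id, a name, and a directory flag) is a hypothetical stand-in for file system metadata, and `scan` and `full_path` are illustrative names.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Record(NamedTuple):
    rec_id: int
    parent: Optional[int]  # None for the volume root
    name: str
    is_dir: bool


def full_path(table: Dict[int, dict], dir_id: int) -> str:
    """Resolve a directory's full path by walking parent links in the table."""
    parts = []
    cur: Optional[int] = dir_id
    while cur is not None:
        parts.append(table[cur]["name"])
        cur = table[cur]["parent"]
    return "/".join(reversed(parts)) or "/"


def scan(records: List[Record], namespaces: Set[str]) -> List[str]:
    # First scan: build the directory table and mark the requested namespaces.
    table = {
        r.rec_id: {"parent": r.parent, "name": r.name, "included": False}
        for r in records if r.is_dir
    }
    for dir_id, entry in table.items():
        entry["included"] = full_path(table, dir_id) in namespaces

    # Second scan: report each file that is in or under an included directory.
    results = []
    for r in records:
        if r.is_dir:
            continue
        cur = r.parent
        while cur is not None and not table[cur]["included"]:
            cur = table[cur]["parent"]
        if cur is not None:
            results.append(full_path(table, r.parent).rstrip("/") + "/" + r.name)
    return results
```

The directory table is the only structure kept in memory, so membership tests and path reconstruction for each file reduce to a short walk up its ancestor chain.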

    8. Generating storage reports using volume snapshots
    Type: patent application
    Legal status: lapsed

    Publication No.: US20060235892A1

    Publication date: 2006-10-19

    Application No.: US11107119

    Filing date: 2005-04-15

    IPC classification: G06F17/30

    Abstract: Described is a method and system by which storage reports are generated from a volume snapshot set rather than the live volume or volumes, wherein a volume snapshot set comprises a representation or copy of one or more volumes at a single point-in-time. By scanning the snapshot, a consistent file system image is obtained. Scanning may take place by enumerating a volume's directories of files, or, when available, by accessing a file system metadata of file information (e.g., a master file table) separately maintained on the volume. With some (e.g., hardware-based) snapshot technologies, the snapshot can be transported to another computing system for scanning by that other computing system, thereby avoiding burdening a live system's resources when scanning. Accurate and consistent storage reports are thus obtained at a single point in time, independent of the number of volumes being scanned.

    System and method for data migration

    Publication No.: US20050033932A1

    Publication date: 2005-02-10

    Application No.: US10935789

    Filing date: 2004-09-08

    IPC classification: G06F17/30 G06F12/00

    Abstract: A method for concurrent data migration includes classifying files to be migrated into plural jobs, selecting media to which to migrate each job, and using plural drives concurrently to write the jobs to the media. The selection of a medium is performed in a way that prevents the number of writeable media from exceeding the number of available drives, unless no allocated medium has sufficient space to store any files in a migration job. A medium is preferentially selected that has already been allocated for writing, has space to store at least one file in the job, is not in use for another job, and can be robotically mounted on a drive. If such a medium does not exist, then the set of available media is canvassed to locate an alternative medium. The attributes of each medium are evaluated to determine which medium can be selected most consistently with the goals of (1) preventing the number of media from exceeding the number of drives, and (2) providing sufficient media to allow plural drives to be used concurrently. The technique can be embodied in a file management environment that transparently migrates files meeting certain criteria and stores the location of the migrated file in a reparse point provided by the file system.

    10. Using index partitioning and reconciliation for data deduplication
    Type: granted patent
    Legal status: in force

    Publication No.: US09110936B2

    Publication date: 2015-08-18

    Application No.: US12979748

    Filing date: 2010-12-28

    IPC classification: G06F17/30

    Abstract: The subject disclosure is directed towards a data deduplication technology in which a hash index service's index is partitioned into subspace indexes, with less than the entire hash index service's index cached to save memory. The subspace index is accessed to determine whether a data chunk already exists or needs to be indexed and stored. The index may be divided into subspaces based on criteria associated with the data to index, such as file type, data type, time of last usage, and so on. Also described is subspace reconciliation, in which duplicate entries in subspaces are detected so as to remove entries and chunks from the deduplication system. Subspace reconciliation may be performed at off-peak time, when more system resources are available, and may be interrupted if resources are needed. Subspaces to reconcile may be based on similarity, including via similarity of signatures that each compactly represent the subspace's hashes.
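Subspace partitioning and reconciliation can be sketched in a few lines. The `SubspaceIndex` class, the use of a subspace key such as a file type, and the location strings are hypothetical; a real system would also redirect references from a removed chunk to the surviving copy, which this sketch omits.

```python
import hashlib
from collections import defaultdict
from typing import Dict


class SubspaceIndex:
    """Hypothetical hash index split into per-subspace dictionaries."""

    def __init__(self) -> None:
        # subspace key (e.g., file type) -> {chunk hash: chunk location}
        self.subspaces: Dict[str, Dict[str, str]] = defaultdict(dict)

    def add_chunk(self, subspace: str, data: bytes) -> bool:
        """Index a chunk, consulting only its own subspace; True if newly stored."""
        h = hashlib.sha256(data).hexdigest()
        index = self.subspaces[subspace]  # only this subspace needs to be in memory
        if h in index:
            return False
        index[h] = f"{subspace}/{h[:8]}"
        return True

    def reconcile(self, keep: str, other: str) -> int:
        """Drop entries in `other` that duplicate `keep`; return how many were removed."""
        duplicates = self.subspaces[keep].keys() & self.subspaces[other].keys()
        for h in duplicates:
            del self.subspaces[other][h]
        return len(duplicates)
```

The trade-off shown here is that a lookup confined to one subspace can miss a duplicate held in another, which is why a later reconciliation pass is needed to recover that space.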
