摘要:
A system and method for partitioning a data stream into a plurality of segments of varying sizes. A data stream manager partitions a received data stream into segments which are then conveyed to a deduplication engine for processing. The data stream received by the data stream manager includes metadata corresponding to the data stream. Based upon the metadata, which may include an indication as to a type of data included in the data stream, the data stream is partitioned into segments for further processing. A size of a segment used for partitioning given data is based at least in part on a type of data being partitioned. The variable segment sizes may be chosen to balance between maximizing the deduplication ratio and minimizing both the segment count and the size of the fingerprint index.
摘要:
Containers that store data objects that were written to those containers during a particular backup are accessed. Then, a subset of the containers is identified; the containers in the subset have less than a threshold number of data objects associated with the particular backup. Data objects that are in containers in that subset and that are associated with the backup are copied to one or more other containers. Those other containers are subsequently used to restore data objects associated with the backup.
摘要:
The present disclosure provides for efficiently creating a full backup image of a client device by efficiently communicating backup data to a backup server using a change tracking log, or track log. A present full backup image can be created using a track log that is associated with a previous full backup image. The client device can determine whether files, which were included in the previous full backup image, have or have not changed using the track log. The client device can transmit changed file data to the backup server for inclusion in the present full backup image. The client device can also transmit metadata identifying unchanged file data to the backup server. The backup server can use the metadata to extract a copy of the unchanged file data from the previous full backup image for inclusion in the present full backup image.
摘要:
A computer-implemented method for providing increased scalability in deduplication storage systems may include (1) identifying a database that stores a plurality of reference objects, (2) determining that at least one size-related characteristic of the database has reached a predetermined threshold, (3) partitioning the database into a plurality of sub-databases capable of being updated independent of one another, (4) identifying a request to perform an update operation that updates one or more reference objects stored within at least one sub-database, and then (5) performing the update operation on less than all of the sub-databases to avoid processing costs associated with performing the update operation on all of the sub-databases. Various other systems, methods, and computer-readable media are also disclosed.
摘要:
The present disclosure provides for defragmenting deduplicated data, such as one or more backup image files, stored in a deduplicated data store. A defragmentation module can be implemented on a deduplication server to reduce fragmentation of backup images and improve processing time for restoring a backup image. A defragmentation module can be configured to defragment a backup image file by migrating portions of data of the backup image file that are stored in various containers at non-contiguous locations throughout deduplicated data store. A defragmentation module can contiguously write the portions to one or more containers, which are stored at one or more new locations in the deduplicated data store. A defragmentation module can be configured to evaluate whether portions of a backup image file meet criteria for defragmentation. A defragmentation module can also be configured to update location information about the portions that are migrated to the new container(s).
摘要:
A system and method for creating an inverted index is disclosed. The inverted index is created from indexing information received by a deduplication server. This indexing information is collected by a deduplication client during a backup operation and includes a list of keywords and a plurality of values. Once the indexing information is received, the index is constructed and includes a list of keywords. Each of the keywords is mapped to a value, each value represents a section of a document, and each section of the document includes at least a portion of a keyword.
摘要:
A system and method for caching fingerprints in a client cache is provided. A data object that comprises a set of data segments and describes a backup process is identified. Thereafter, a request referencing the data object is made to a deduplication server to request that a task identifier be added to the data object. If the deduplication server is able to successfully add the task identifier to the data object, then an active identifier is added to each data segment from the set of data segments in a cache that is within a client system.
摘要:
A method for restoring deduplicated data may include receiving a request to restore a set of deduplicated data segments to a client system, where each data segment in the set of deduplicated data segments is referred to by one or more deduplication references. The method may also include procuring reference data that indicates, for each data segment in the set of deduplicated data segments, the number of deduplication references that point to the data segment. The method may further include using the reference data to select one or more data segments from the set of deduplicated data segments for client-side caching, caching the one or more data segments in a cache on the client system, and restoring the one or more data segments from the cache on the client system. Various other methods, systems, and computer-readable media are also disclosed.
摘要:
Systems and methods for providing efficient storage and retrieval of data are disclosed. A two-level segment labeling mechanism may be employed to ensure that unique data segments from particular backup data sets are stored together in a storage container. The two-level segment labeling may facilitate preservation of the relative positions of segments within the backup stream during compaction operations. Also, backup data restoration performance may be improved by use of multiple read threads that are localized to particular storage containers.
摘要:
A system and method for improving performance within a storage system employing deduplication techniques using address manipulation are disclosed. A data segment within a storage object is identified from among a number of data segments within a storage object. The data segment represents data stored in a storage device. Some or all of the data represented by the data segment is stored in a data block that is associated with the data segment. The storage object is then compacted. Compaction includes reordering data segments, including the identified data segment, by performing address manipulation on a data block address of the data block (e.g., an address of the data block within the storage device). The reordering of the data segments changes the order of the data segments within the storage object.