Abstract:
A computer-implemented method for compressing a data set, the method comprising: receiving a first data block of the data set; automatically selecting, by a compression management module, a compression module from a plurality of compression modules to apply to the first data block based on projected compression efficacy or resource utilization; and compressing the first data block with the selected compression module to generate a first compressed data block.
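A minimal sketch of the selection step in Python, assuming the plurality of compression modules are standard-library codecs and that compression efficacy is projected by trial-compressing a small sample of the block; both assumptions are illustrative, not the claimed method.

```python
import zlib
import lzma

# Illustrative "plurality of compression modules" (assumed: stdlib codecs).
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 6),
    "lzma": lambda data: lzma.compress(data),
}

def select_compressor(block: bytes, sample_size: int = 4096) -> str:
    """Project compression efficacy by compressing a small sample of the block."""
    sample = block[:sample_size]
    ratios = {
        name: len(compress(sample)) / max(len(sample), 1)
        for name, compress in COMPRESSORS.items()
    }
    return min(ratios, key=ratios.get)  # lowest projected ratio wins

def compress_block(block: bytes) -> tuple[str, bytes]:
    """Compress the block with the automatically selected module."""
    name = select_compressor(block)
    return name, COMPRESSORS[name](block)
```

A resource-aware variant could additionally weigh each codec's CPU cost against its projected ratio before choosing.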
Abstract:
Techniques for evaluating deduplication effectiveness of data chunks in a storage system are described herein. In one embodiment, metadata of first data chunks associated with a deduplicated storage system is examined, where the first data chunks have been partitioned according to a first chunk size. A second chunk size is calculated based on the examination of the metadata of the first data chunks. Metadata of the first data chunks is merged according to the second chunk size to represent second data chunks into which the first data chunks would have been merged. A deduplication rate of the second data chunks is determined based on the merged metadata.
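A hedged sketch of the metadata-only merge, assuming each first-level chunk is described by a (fingerprint, size) pair and that a second-level chunk is represented by hashing the concatenated fingerprints of the first-level chunks it absorbs; both the layout and the merge rule are illustrative.

```python
import hashlib

def merge_metadata(chunk_meta, second_chunk_size):
    """Merge (fingerprint, size) pairs into synthetic second-level chunks."""
    merged, group, group_size = [], [], 0
    for fingerprint, size in chunk_meta:
        group.append(fingerprint)
        group_size += size
        if group_size >= second_chunk_size:
            # The hash of the concatenated fingerprints stands in for the
            # second data chunk the first chunks would have merged into.
            merged.append(hashlib.sha1(b"".join(group)).digest())
            group, group_size = [], 0
    if group:
        merged.append(hashlib.sha1(b"".join(group)).digest())
    return merged

def dedup_rate(fingerprints):
    """Fraction of chunks that duplicate an already-seen chunk."""
    if not fingerprints:
        return 0.0
    return 1.0 - len(set(fingerprints)) / len(fingerprints)
```

The point of the technique is that no data is re-read: the second-level deduplication rate is estimated entirely from existing fingerprint metadata.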
Abstract:
A computer-implemented method and system for deduplicating sub-chunks in a data storage system selects a data chunk to deduplicate and generates a sketch for the selected data chunk. A similar data chunk is searched for using the sketch. A set of fingerprints corresponding to sub-chunks of the similar data chunk is loaded. The set of fingerprints for the similar data chunk is compared to a set of fingerprints of the selected data chunk, and the selected data chunk is encoded as a set of references to identical sub-chunks of the similar data chunk and at least one unmatched sub-chunk.
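A sketch of the encoding step under stated assumptions: a fixed sub-chunk size and a toy minimum-CRC resemblance sketch stand in for the variable-size sub-chunks and super-feature sketches a production system would use.

```python
import hashlib
import zlib

SUB = 1024  # assumed fixed sub-chunk size for illustration

def sub_fingerprints(chunk: bytes):
    return [hashlib.sha1(chunk[i:i + SUB]).digest() for i in range(0, len(chunk), SUB)]

def sketch(chunk: bytes) -> int:
    # Toy resemblance sketch: minimum CRC over fixed shingles; real systems
    # derive "super-features" from rolling hashes instead.
    return min(zlib.crc32(chunk[i:i + 48]) for i in range(0, max(len(chunk) - 48, 1), 16))

def encode_chunk(selected: bytes, sketch_index: dict):
    """sketch_index maps sketch -> (base_chunk_id, base sub-chunk fingerprints)."""
    hit = sketch_index.get(sketch(selected))
    pieces = [selected[i:i + SUB] for i in range(0, len(selected), SUB)]
    if hit is None:
        return [("literal", p) for p in pieces]
    base_id, base_fps = hit
    position = {fp: i for i, fp in enumerate(base_fps)}
    encoded = []
    for piece in pieces:
        fp = hashlib.sha1(piece).digest()
        if fp in position:
            encoded.append(("ref", base_id, position[fp]))  # identical sub-chunk
        else:
            encoded.append(("literal", piece))              # unmatched sub-chunk
    return encoded
```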
Abstract:
A computer-implemented method and system for improving efficiency in a delta compression process in a data storage system selects a data chunk to delta compress and selects a set of candidate data chunks using a first selection mechanism. Throughput or resource utilization is monitored. In response to determining high resource availability or a high throughput level, the process changes to a second selection mechanism that increases the similarity of the candidate set to the selected data chunk, improving compression. In response to determining low resource availability or low throughput, the process changes to a third selection mechanism that increases the throughput of the delta compression process.
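A minimal sketch of the feedback decision; the thresholds, mechanism names, and the shape of the monitoring inputs are all illustrative assumptions.

```python
HIGH_AVAILABILITY = 0.7   # assumed fraction of free CPU/IO capacity
LOW_AVAILABILITY = 0.3
TARGET_MBPS = 200.0       # assumed throughput goal

def choose_selection_mechanism(resource_availability: float,
                               throughput_mbps: float) -> str:
    """Pick a candidate-selection mechanism from monitored conditions."""
    if resource_availability >= HIGH_AVAILABILITY or throughput_mbps >= TARGET_MBPS:
        # Spare capacity: widen the candidate search to find more similar
        # chunks and improve compression.
        return "high-similarity"
    if resource_availability <= LOW_AVAILABILITY or throughput_mbps < TARGET_MBPS:
        # Constrained: use cheaper candidate selection to protect throughput.
        return "high-throughput"
    return "baseline"
```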
Abstract:
A system and method for generating synthetic data to simulate backing up data between a primary storage system and a protection storage system is presented. In one embodiment, a first track in a set of tracks is selected at random, and at least a first block in the first track is modified. Subsequently, it is determined, based on a track run probability, whether to modify a second track that is consecutive to the first track or a third track that is selected randomly. Depending on the determination, at least one block is modified in either the second or third track. Other embodiments are also described herein.
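A sketch of the track-run model, assuming a fixed track/block geometry and one block modified per track visit; only the track_run_probability branch is taken from the abstract, the rest is illustrative scaffolding.

```python
import random

def synthesize_modifications(num_tracks: int, blocks_per_track: int,
                             num_changes: int, track_run_probability: float,
                             seed: int = 0):
    """Return (track, block) pairs to modify, following a track-run model."""
    rng = random.Random(seed)
    changes = []
    track = rng.randrange(num_tracks)          # first track chosen at random
    for _ in range(num_changes):
        changes.append((track, rng.randrange(blocks_per_track)))
        if rng.random() < track_run_probability:
            track = (track + 1) % num_tracks   # consecutive second track
        else:
            track = rng.randrange(num_tracks)  # randomly selected third track
    return changes
```

A higher track_run_probability yields longer runs of consecutive modified tracks, mimicking sequential change patterns in real backup workloads.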
Abstract:
A computer-implemented method and system for performing garbage collection in a delta compressed data storage system selects a file recipe to traverse to identify live data chunks and selects a chunk identifier from the file recipe. The chunk identifier is added to a set of live data chunks. Delta references in the file metadata corresponding to the chunk identifier are added to the set of live data chunks. Data chunks in a data storage system not identified by the set of live data chunks are then discarded.
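A hedged mark-and-sweep sketch, assuming each recipe entry pairs a chunk identifier with the delta references carried in the file metadata; that layout is an assumption for illustration.

```python
def mark_live_chunks(file_recipes):
    """Mark phase: file_recipes is an iterable of lists of
    (chunk_id, delta_refs) pairs (assumed recipe layout)."""
    live = set()
    for recipe in file_recipes:
        for chunk_id, delta_refs in recipe:
            live.add(chunk_id)
            # A delta-encoded chunk is only readable while its base chunks
            # survive, so every delta reference is marked live as well.
            live.update(delta_refs)
    return live

def sweep(stored_chunk_ids, live):
    """Sweep phase: return the chunk ids that can be discarded."""
    return [cid for cid in stored_chunk_ids if cid not in live]
```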
Abstract:
A method for storing data in a data storage system partitions the data into a plurality of data chunks and generates representative data for each of the plurality of chunks by applying a predetermined algorithm to each chunk. Subsequently, the representative data is sorted and compared. Representative data for base data chunks, and representative data for other data chunks that can be stored relative to the base data chunks, are identified by evaluating the sorted set of representative data. Finally, each of the other data chunks identified as one that can be stored relative to a base data chunk is stored in the data storage system as the difference between that data chunk and the base data chunk.
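A sketch of the sort-and-compare pass; the toy minimum-CRC representative stands in for whatever predetermined algorithm the system uses, and the equal-representative matching rule is an illustrative assumption.

```python
import zlib

def representative(chunk: bytes) -> int:
    # Assumed "predetermined algorithm": a toy minimum-CRC over shingles.
    return min(zlib.crc32(chunk[i:i + 32]) for i in range(0, max(len(chunk) - 32, 1), 8))

def plan_storage(chunks):
    """Sort chunks by representative data; matching neighbours become deltas."""
    order = sorted((representative(c), i) for i, c in enumerate(chunks))
    plan = {}
    prev_rep = prev_idx = None
    for rep, idx in order:
        if rep == prev_rep:
            plan[idx] = ("delta", prev_idx)  # store as difference from the base
        else:
            plan[idx] = ("base", None)       # stored in full
            prev_rep, prev_idx = rep, idx
    return plan
```

Sorting brings chunks with matching representative data adjacent to one another, so candidates for relative storage are found in a single linear scan.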
Abstract:
Embodiments of this invention are directed to a system and method for characterizing and modeling a virtual synthetic file system workload. In one embodiment, a virtual synthetic system is adapted to select a first location in a prior generation dataset for a first cluster and generate a first offset using a distance distribution function. Thereafter, the virtual synthetic system selects a second location in the prior generation dataset for a second cluster, wherein the second location is offset from the first cluster by the first offset. Finally, the virtual synthetic system modifies each selected cluster in the prior generation dataset, thereby creating a next generation dataset. This process is repeated to generate multiple generations of the dataset. Other embodiments are also described herein.
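A sketch of one generation step under stated assumptions: an exponential distance distribution and a byte-level rewrite of each cluster, both illustrative placeholders for the measured distributions a real characterization would supply.

```python
import random

def next_generation(prior: bytes, num_clusters: int, cluster_size: int,
                    seed: int = 0) -> bytes:
    """Derive the next generation by modifying clusters spaced by sampled offsets."""
    rng = random.Random(seed)
    data = bytearray(prior)
    location = rng.randrange(len(data))            # first location, first cluster
    for _ in range(num_clusters):
        for i in range(cluster_size):              # modify the selected cluster
            data[(location + i) % len(data)] = rng.randrange(256)
        offset = int(rng.expovariate(1.0 / 4096))  # assumed distance distribution
        location = (location + cluster_size + offset) % len(data)
    return bytes(data)
```

Feeding each output back in as the next prior generation yields the multiple dataset generations the abstract describes.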
Abstract:
A computer-implemented method and system for performing garbage collection in a delta compressed data storage system selects a file recipe to traverse to identify live data chunks and selects a chunk identifier from the file recipe. The chunk identifier is added to a set of live data chunks. Delta references in an entry of an index corresponding to the chunk identifier are added to the set of live data chunks. Data chunks in a data storage system not identified by the set of live data chunks are then discarded.
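A sketch of this variant, which differs from the previous one in where the delta references live: here they are looked up in an index entry keyed by chunk identifier, whose layout is assumed for illustration.

```python
def mark_live_chunks(file_recipes, chunk_index):
    """Mark phase: file_recipes yields lists of chunk ids; chunk_index maps
    chunk_id -> index entry holding that chunk's delta references
    (assumed entry layout)."""
    live = set()
    for recipe in file_recipes:
        for chunk_id in recipe:
            live.add(chunk_id)
            entry = chunk_index.get(chunk_id)
            if entry is not None:
                # Delta references found in the index entry are live too.
                live.update(entry.get("delta_refs", ()))
    return live
```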
Abstract:
A computer-implemented method and system for improving efficiency in a delta compression process selects a data chunk to delta compress and generates a sketch for the selected data chunk. A set of candidate data chunks with a matching sketch is searched for. The candidate data chunks having at least a minimum degree of similarity are ranked by location status data, with ties broken by the degree of sketch similarity between each candidate and the selected data chunk. The selected data chunk is then delta compressed with a selected candidate data chunk.
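A sketch of the ranking step, assuming sketches are feature sets, similarity is their overlap fraction, and location status is a numeric score (e.g., higher for a chunk already in memory than one on disk); the scoring scheme and threshold are illustrative.

```python
def rank_candidates(selected_features, candidates, min_similarity=0.5):
    """Rank sketch-matched candidates by location status, then similarity.

    candidates: iterable of (chunk_id, feature_set, location_score) tuples
    (assumed shapes for illustration).
    """
    selected = set(selected_features)

    def similarity(features):
        return len(selected & set(features)) / max(len(selected), 1)

    scored = [(loc, similarity(feats), cid)
              for cid, feats, loc in candidates
              if similarity(feats) >= min_similarity]
    # Primary key: location status; ties broken by degree of sketch similarity.
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [cid for _, _, cid in scored]
```

Preferring well-located candidates keeps delta compression fast, while the similarity tie-break preserves compression quality among equally cheap choices.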