Abstract:
An extent map (EMAP) database may include one or more extent map entries configured to map extent IDs to physical volume block numbers (PVBNs). Each extent ID may be apportioned into a most significant bit (MSB) portion, i.e., checksum bits, and a least significant bit (LSB) portion, i.e., duplicate bits. A hash may be applied to the data of the extent to calculate the checksum bits, which illustratively represent a fingerprint of the data. The duplicate bits may be configured to denote any recurrence of the checksum bits in the EMAP database, i.e., whether there is an existing extent with potentially identical data in a volume of the aggregate. Each extent map entry may be inserted into a node having one or more key/value pairs, wherein the key is the extent ID and the value is the PVBN. The EMAP database may be scanned and utilized to perform data deduplication.
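The extent ID layout described above can be sketched as follows. This is a minimal illustration only: the bit widths (`CHECKSUM_BITS`, `DUP_BITS`), the use of CRC-32 as the hash, the in-memory dictionary standing in for the EMAP database, and the function names are all assumptions made for the example, not details taken from the abstract.

```python
import zlib

CHECKSUM_BITS = 32  # assumed width of the MSB (checksum) portion
DUP_BITS = 16       # assumed width of the LSB (duplicate) portion


def checksum_of(data: bytes) -> int:
    """Fingerprint the extent data (CRC-32 is only a stand-in hash)."""
    return zlib.crc32(data) & ((1 << CHECKSUM_BITS) - 1)


def split_extent_id(extent_id: int) -> tuple[int, int]:
    """Return (checksum bits, duplicate bits) of an extent ID."""
    return extent_id >> DUP_BITS, extent_id & ((1 << DUP_BITS) - 1)


def insert_extent(emap: dict[int, int], data: bytes, pvbn: int) -> int:
    """Insert a key/value pair (extent ID -> PVBN) and return the extent ID.

    The duplicate bits are bumped until the ID is unused, so a non-zero
    value signals that the same checksum already occurs in the map.
    """
    checksum = checksum_of(data)
    dup = 0
    while ((checksum << DUP_BITS) | dup) in emap:
        dup += 1
        if dup >= (1 << DUP_BITS):
            raise OverflowError("duplicate bits exhausted for this checksum")
    extent_id = (checksum << DUP_BITS) | dup
    emap[extent_id] = pvbn
    return extent_id
```

Inserting two extents with identical data yields two IDs that share the checksum portion and differ only in the duplicate portion, which is what a later deduplication scan would look for.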
Abstract:
Multiple key-value stores may be employed to smooth out random updates (based on the extent ID) to the EMAP database. The updates to the EMAP database occur in two stages: (i) an append-only log store for the first stage and (ii) an on-disk hash store for the second stage. The append-only log store is used to convert the random updates into sequential write operations on the EMAP database. Once the log store is full, its contents are sorted and moved to the on-disk hash store, which holds the updates for a transient period of time to enable batching of the updates. Once a sufficient batch of extent map entries has accumulated, those entries are sorted and moved to the EMAP database. Thereafter, the EMAP database can be scanned to find extent map entries having identical checksum bits to perform data deduplication.
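The two-stage update path can be sketched in memory as shown below. The class name `StagedEmap`, the thresholds `log_capacity` and `batch_threshold`, the 16-bit `DUP_BITS` width, and the use of plain Python lists and dictionaries in place of on-disk stores are all assumptions for illustration.

```python
from collections import defaultdict

DUP_BITS = 16  # assumed width of the duplicate-bit portion of an extent ID


class StagedEmap:
    """Toy model: append-only log -> hash store -> EMAP database."""

    def __init__(self, log_capacity: int = 4, batch_threshold: int = 8):
        self.log = []         # stage 1: append-only log store
        self.hash_store = {}  # stage 2: stand-in for the on-disk hash store
        self.emap = {}        # final EMAP database (extent ID -> PVBN)
        self.log_capacity = log_capacity
        self.batch_threshold = batch_threshold

    def update(self, extent_id: int, pvbn: int) -> None:
        # Random updates become sequential appends.
        self.log.append((extent_id, pvbn))
        if len(self.log) >= self.log_capacity:
            self._flush_log()

    def _flush_log(self) -> None:
        # Stable sort by extent ID keeps log order for repeated keys,
        # so the most recent update to a key wins.
        for extent_id, pvbn in sorted(self.log, key=lambda e: e[0]):
            self.hash_store[extent_id] = pvbn
        self.log.clear()
        if len(self.hash_store) >= self.batch_threshold:
            self._flush_hash_store()

    def _flush_hash_store(self) -> None:
        # Move the accumulated batch into the EMAP database in sorted order.
        for extent_id in sorted(self.hash_store):
            self.emap[extent_id] = self.hash_store[extent_id]
        self.hash_store.clear()


def find_duplicate_candidates(emap: dict) -> dict:
    """Group extent IDs that share checksum bits (deduplication candidates)."""
    groups = defaultdict(list)
    for extent_id in sorted(emap):
        groups[extent_id >> DUP_BITS].append(extent_id)
    return {c: ids for c, ids in groups.items() if len(ids) > 1}
```

In this sketch the hash store is only checked when the log is flushed, so entries reach the EMAP database in batches of at least `batch_threshold` distinct extent IDs.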
Abstract:
A storage server is coupled to a storage device that stores blocks of data, and generates a fingerprint for each data block stored on the storage device. The storage server creates a fingerprint datastore that is divided into a primary datastore and a secondary datastore. The primary datastore comprises a single entry for each unique fingerprint, and the secondary datastore comprises any entry whose fingerprint is identical to that of an entry in the primary datastore. The storage server merges entries in a changelog with the entries in the primary datastore to identify duplicate data blocks in the storage device and frees the identified duplicate data blocks. The storage server stores the entries that correspond to the freed data blocks in a third datastore and overwrites the primary datastore with the entries from the merged data that correspond to the unique fingerprints, creating an updated primary datastore.
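The changelog merge can be sketched as below. The names `fingerprint` and `merge_changelog`, the use of SHA-256, and the representation of datastore entries as `(fingerprint, block number)` pairs are assumptions for illustration; the secondary datastore and the actual freeing of blocks are left out to keep the sketch short.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Fingerprint a data block (SHA-256 is only a stand-in)."""
    return hashlib.sha256(block).hexdigest()


def merge_changelog(primary: dict, changelog: list) -> tuple:
    """Merge changelog entries with the primary datastore.

    primary:   fingerprint -> block number (one entry per unique fingerprint)
    changelog: list of (fingerprint, block number) for newly written blocks

    Returns (updated_primary, freed_entries), where freed_entries are the
    entries for duplicate blocks and would go to the third datastore.
    """
    # Primary entries come first and the sort is stable, so an existing
    # primary entry always wins over a changelog entry with the same key.
    merged = sorted(list(primary.items()) + list(changelog), key=lambda e: e[0])

    updated_primary = {}
    freed_entries = []
    for fp, block_no in merged:
        if fp in updated_primary:
            freed_entries.append((fp, block_no))  # duplicate: block is freed
        else:
            updated_primary[fp] = block_no        # first (unique) occurrence
    return updated_primary, freed_entries
```

The caller would then overwrite the primary datastore with `updated_primary` and record `freed_entries` separately.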
Abstract:
Systems, methods, and non-transitory machine-readable media for determining block characteristics include one or more processors, a memory for storing instructions for the one or more processors, persistent storage, and a file system implemented in the persistent storage and storing data in the persistent storage using a plurality of blocks. When the stored instructions are executed by the one or more processors, the one or more processors are configured to traverse the plurality of blocks, read contents of a first block selected from the plurality of blocks, determine one or more characteristics of the first block from metadata within the first block, and selectively perform or not perform a storage operation with respect to the first block in response to determining the one or more characteristics. In some embodiments, the storage operation is a replication operation or a deduplication operation.
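The traverse, inspect, and decide flow can be sketched as follows. The block layout (a single flags byte at the start of each block), the flag values, and the function names are hypothetical; the abstract does not say how the metadata is encoded or which characteristics are examined.

```python
# Hypothetical flag bits stored in the first byte of every block.
FLAG_ALREADY_DEDUPED = 0x01
FLAG_TEMPORARY = 0x02


def block_characteristics(block: bytes) -> set:
    """Derive characteristics from metadata held inside the block itself."""
    flags = block[0]
    traits = set()
    if flags & FLAG_ALREADY_DEDUPED:
        traits.add("deduplicated")
    if flags & FLAG_TEMPORARY:
        traits.add("temporary")
    return traits


def select_blocks_for_dedup(blocks: list) -> list:
    """Traverse all blocks and return the indices worth deduplicating."""
    selected = []
    for index, block in enumerate(blocks):
        traits = block_characteristics(block)
        # Selectively skip the storage operation based on the characteristics.
        if traits & {"deduplicated", "temporary"}:
            continue
        selected.append(index)
    return selected
```

The same pattern would apply to a replication operation by changing which characteristics cause a block to be skipped.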