Abstract:
A system can efficiently removes ranges of entries from a flat sorted data structure that represent stale fingerprints As part of fingerprint verification during deduplication, the system performs an attributes intersect range calculation (AIRC) procedure on the stale fingerprint data structure to compute a set of non-overlapping and latest consistency point (CP) ranges. During the AIRC procedure, an inode associated with a data container is selected and the FBN tuple of each deleted data block in the file is sorted in a predefined FBN order. The AIRC procedure then identifies the most recent fingerprint associated with a deleted data block. The set of non-overlapping and latest CP ranges is then used to remove stale fingerprints associated with that deleted block from the fingerprint database. A single pass through the fingerprint database identifies the set of non-overlapping and latest CP ranges, thereby improving efficiency of the storage system.
Abstract:
A system and method efficiently removes ranges of entries from a flat sorted data structure, such as a fingerprint database, of a storage system. The ranges of entries represent fingerprints that have become stale, i.e., are not representative of current states of corresponding blocks in the file system, due to various file system operations such as, e.g., deletion of a data block without overwriting its contents. A deduplication module of a file system executing on the storage system performs a fingerprint verification procedure to remove the stale fingerprints from the fingerprint database. As part of the fingerprint verification procedure, the deduplication module performs an attributes intersect range calculation (AIRC) procedure on the stale fingerprint data structure to compute a set of non-overlapping and latest consistency point (CP) ranges. During the AIRC procedure, an inode associated with a data container, e.g., a file, is selected and the FBN tuple of each deleted data block in the file is sorted in a predefined, e.g., increasing, FBN order. The AIRC procedure then identifies the most recent fingerprint associated with a deleted data block. The output from the AIRC procedure, i.e., the set of non-overlapping and latest CP ranges, is then used to remove stale fingerprints associated with that deleted block (as well as each other deleted data block) from the fingerprint database. Notably, only a single pass through the fingerprint database is required to identify the set of non-overlapping and latest CP ranges, thereby improving efficiency of the storage system.