Abstract:
In one embodiment, a clustered storage system is configured to reduce parity overhead of Redundant Array of Independent Disks (RAID) groups, as well as to facilitate distribution and servicing of storage containers among the storage systems (nodes) of the cluster. The storage containers may be stored on one or more storage arrays of storage devices, such as solid state drives (SSDs), connected to the nodes of the cluster. The RAID groups may be formed from slices (i.e., portions) of the storage spaces of the SSDs instead of the entire storage spaces of the SSDs. That is, each RAID group may be formed “horizontally” across a set of SSDs as slices (i.e., one slice of storage space from each SSD in the set). Accordingly, a plurality of RAID groups may co-exist (i.e., be stacked) on the same set of SSDs.
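As a rough illustration of the slice-based layout, the following Python sketch (hypothetical names such as build_stacked_raid_groups; the slice count and capacity are illustrative assumptions, not values from the embodiment) forms one RAID group per slice index so that several groups stack horizontally across the same set of SSDs:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Slice:
    ssd_id: int          # which SSD the slice lives on
    slice_index: int     # position of the slice within that SSD's storage space
    size_bytes: int

@dataclass
class RaidGroup:
    slices: List[Slice]  # exactly one slice from each SSD in the set

def build_stacked_raid_groups(ssd_ids, ssd_capacity_bytes, slices_per_ssd):
    """Divide each SSD's storage space into equal slices and form one RAID
    group per slice index, i.e., group k takes slice k from every SSD, so
    multiple RAID groups are stacked on the same set of SSDs."""
    slice_size = ssd_capacity_bytes // slices_per_ssd
    return [RaidGroup([Slice(ssd, k, slice_size) for ssd in ssd_ids])
            for k in range(slices_per_ssd)]

# Example: 8 SSDs, 4 RAID groups stacked horizontally across the same SSDs.
groups = build_stacked_raid_groups(list(range(8)), 10**12, 4)
assert len(groups) == 4 and len(groups[0].slices) == 8
```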
Abstract:
A flash-optimized, log-structured layer of a file system of a storage input/output (I/O) stack executes on one or more nodes of a cluster. The log-structured layer of the file system provides sequential storage of data and metadata (i.e., a log-structured layout) on solid state drives (SSDs) of storage arrays in the cluster to reduce write amplification, while leveraging variable compression and variable length data features of the storage I/O stack. The data may be organized as an arbitrary number of variable-length extents of one or more host-visible logical units (LUNs) served by the nodes. The metadata may include mappings from host-visible logical block address ranges (i.e., offset ranges) of a LUN to extent keys, as well as mappings of the extent keys to SSD storage locations of the extents. The storage location of an extent on SSD is effectively “virtualized” by its mapped extent key (i.e., extent store layer mappings) such that relocation of the extent on SSD does not require an update to volume layer metadata (i.e., the extent key sufficiently identifies the extent).
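The two-level mapping can be sketched in Python as follows (the dictionaries volume_layer and extent_store are hypothetical stand-ins for the actual volume layer and extent store layer metadata): the volume layer maps an offset range to an extent key and the extent store layer maps that key to an SSD location, so relocating an extent updates only the second mapping.

```python
# Volume layer metadata: (LUN, offset range) -> extent key
volume_layer = {("lun0", (0, 4096)): "key_A"}

# Extent store layer metadata: extent key -> location of the extent on SSD
extent_store = {"key_A": {"ssd": 3, "offset": 0x10000, "length": 4096}}

def read(lun, offset_range):
    key = volume_layer[(lun, offset_range)]   # first mapping: offset range -> key
    return extent_store[key]                  # second mapping: key -> SSD location

def relocate_extent(key, new_ssd, new_offset):
    """E.g., during segment cleaning: only the extent store mapping changes;
    the volume layer metadata (offset range -> key) is left untouched."""
    length = extent_store[key]["length"]
    extent_store[key] = {"ssd": new_ssd, "offset": new_offset, "length": length}

relocate_extent("key_A", new_ssd=5, new_offset=0x8000)
assert read("lun0", (0, 4096))["ssd"] == 5    # same extent key, new location
```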
Abstract:
A technique reduces an amount of metadata stored in a memory of a node in a cluster. An extent store layer of a storage input/output (I/O) stack executing on the node stores key-value pairs in a plurality of data structures, e.g., cuckoo hash tables, resident in the memory. The cuckoo hash table embodies metadata that describes an extent and, as such, may be organized to associate a key for the extent with a value that identifies a location of the extent on disk. The value may be embodied as a locator that includes a reference count used to support deduplication functionality of the extent store layer with respect to the extent. The reference count is divided into two portions: a delta count portion stored in memory for each slot of the hash table and an overflow count portion stored on disk in a header of each extent. One bit of the delta count portion is reserved as an overflow bit that indicates whether the in-memory reference count has overflowed. Another bit of the delta count portion is reserved as a sign bit that indicates whether the value of the remaining delta count portion, which stores the “delta” of the reference count, is positive or negative. Overflow updates to the overflow count portion on disk are postponed until all of the bits of the delta count portion are consumed as negative/positive transitions.
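A minimal sketch of the split reference count, assuming an illustrative 6-bit delta magnitude (the abstract does not specify field widths) and a hypothetical helper that persists the folded delta into the on-disk overflow count:

```python
DELTA_BITS = 6                       # assumed width of the delta magnitude
DELTA_MAX = (1 << DELTA_BITS) - 1    # largest +/- delta held in memory

class SplitRefcount:
    """Reference count split into an in-memory delta and an on-disk overflow count."""
    def __init__(self, write_overflow_to_disk):
        self.delta = 0               # signed delta; its sign plays the role of the sign bit
        self.overflow = False        # overflow bit: set once a delta has spilled to disk
        self._flush = write_overflow_to_disk   # hypothetical helper that updates the extent header

    def incref(self):
        self._apply(+1)

    def decref(self):
        self._apply(-1)

    def _apply(self, change):
        new_delta = self.delta + change
        if abs(new_delta) > DELTA_MAX:
            # All delta bits are consumed: fold the accumulated delta into the
            # on-disk overflow count (in the extent header) and start over.
            self._flush(self.delta)
            self.overflow = True
            new_delta = change
        self.delta = new_delta

# Stand-in for the extent header persisted on SSD.
on_disk = {"overflow_count": 100}
rc = SplitRefcount(lambda d: on_disk.__setitem__("overflow_count",
                                                 on_disk["overflow_count"] + d))
for _ in range(70):                  # 70 increments overflow the 6-bit delta once
    rc.incref()
assert on_disk["overflow_count"] == 163 and rc.delta == 7 and rc.overflow
```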
Abstract:
A technique restores a file system of a storage input/output (I/O) stack to a deterministic point-in-time state in the event of failure (loss) of non-volatile random access memory (NVRAM) of a node. The technique enables restoration of the file system to a safepoint stored on storage devices, such as solid state drives (SSDs), of the node with minimum data and metadata loss. The safepoint is a point-in-time during execution of I/O requests (e.g., write operations) at which data and related metadata of the write operations prior to the point-in-time are safely persisted on SSD such that the metadata relating to an image of the file system on SSD (on-disk) is consistent and complete. Upon reboot after NVRAM loss, the technique identifies (i) the most recent safepoint, as well as (ii) the inflight writes that were persistently stored on disk after the most recent safepoint. The data and metadata of those inflight writes are then deleted to return the on-disk file system to its state at the most recent safepoint.
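A simplified sketch of the restore step, assuming hypothetical per-write sequence numbers and a list of safepoint sequence numbers (the actual on-disk records are not described in the abstract):

```python
def restore_to_safepoint(persisted_writes, safepoints):
    """persisted_writes: (sequence_number, write_record) pairs already on SSD.
    safepoints: sequence numbers at which the on-disk image was consistent."""
    latest = max(safepoints)
    # Inflight writes persisted after the most recent safepoint are deleted so
    # the on-disk file system returns to its state at that safepoint.
    surviving = [(seq, w) for seq, w in persisted_writes if seq <= latest]
    deleted = [(seq, w) for seq, w in persisted_writes if seq > latest]
    return latest, surviving, deleted

writes = [(1, "w1"), (2, "w2"), (3, "w3"), (4, "w4")]
safepoint, kept, dropped = restore_to_safepoint(writes, safepoints=[0, 2])
assert safepoint == 2 and [seq for seq, _ in dropped] == [3, 4]
```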
Abstract:
Data consistency and availability can be provided at the granularity of logical storage objects in storage solutions that use storage virtualization in clustered storage environments. To ensure consistency of data across different storage elements, synchronization is performed across the different storage elements. Changes to data are synchronized across storage elements in different clusters by propagating the changes from a primary logical storage object to a secondary logical storage object. To satisfy the strictest recovery point objectives (RPOs) while maintaining performance, change requests are intercepted prior to being sent to a filesystem that hosts the primary logical storage object and propagated to a different managing storage element associated with the secondary logical storage object.
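One way the interception might look, sketched in Python with hypothetical PrimaryFS and SecondaryElement stand-ins (the real protocol between managing storage elements is not specified here): the change request is propagated to the secondary's managing storage element in parallel with the primary filesystem and acknowledged only after both complete.

```python
import concurrent.futures

class PrimaryFS:                      # hypothetical stand-in for the hosting filesystem
    def apply(self, request):
        return f"primary applied {request}"

class SecondaryElement:               # hypothetical managing storage element of the secondary
    def propagate(self, request):
        return f"secondary applied {request}"

def handle_change_request(request, primary_fs, secondary):
    """Intercept the change request before it reaches the filesystem hosting the
    primary logical storage object and propagate it to the managing storage
    element associated with the secondary logical storage object."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        local = pool.submit(primary_fs.apply, request)       # primary write path
        remote = pool.submit(secondary.propagate, request)   # propagated change
        local.result()
        remote.result()   # strict RPO: acknowledge only after both have completed
    return "acknowledged"

print(handle_change_request("write(lba=7, data=...)", PrimaryFS(), SecondaryElement()))
```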
Abstract:
In one embodiment, one or more storage arrays of solid state drives (SSDs) that include a plurality of segments are organized as one or more Redundant Array of Independent Disks (RAID) groups, where the RAID groups provide data redundancy for the segments. A node executing a layered file system of a storage input/output (I/O) stack performs segment cleaning to clean the segments. It further initiates rebuild of a RAID configuration of the SSDs on a segment-by-segment basis in response to the segment cleaning. In such a configuration, each segment includes one or more RAID stripes that provide a level of data redundancy as well as RAID organization for the segment.
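A rough sketch of cleaning driving the rebuild, using hypothetical structures (clean_and_rebuild_segment, write_new_segment) rather than the actual storage I/O stack interfaces: live extents of a cleaned segment are rewritten as a new segment striped across the currently healthy SSDs.

```python
def clean_and_rebuild_segment(segment, healthy_ssds, write_new_segment):
    """Segment cleaning as the driver of RAID rebuild: the live extents of a
    segment are read (reconstructed from parity if an SSD has failed) and
    written out as a new segment whose stripes span only the healthy SSDs."""
    live_extents = [e for e in segment["extents"] if e["live"]]
    new_segment = write_new_segment(live_extents, healthy_ssds)
    segment["state"] = "cleaned"        # the old segment may now be reused
    return new_segment

def write_new_segment(extents, ssds):
    # Stand-in for the file system write path: round-robin the extents across
    # the healthy SSDs, forming fresh RAID stripes with the new geometry.
    return {"ssds": list(ssds),
            "placement": [(e["key"], ssds[i % len(ssds)]) for i, e in enumerate(extents)]}

old_segment = {"state": "in_use",
               "extents": [{"key": "A", "live": True},
                           {"key": "B", "live": False},
                           {"key": "C", "live": True}]}
rebuilt = clean_and_rebuild_segment(old_segment, healthy_ssds=[0, 1, 3, 4],
                                    write_new_segment=write_new_segment)
assert old_segment["state"] == "cleaned" and len(rebuilt["placement"]) == 2
```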
Abstract:
A technique perturbs an extent key to compute a candidate extent key in the event of a collision with metadata (i.e., two extents having different data that yield identical hash values) stored in a memory of a node in a cluster. The perturbing technique may be used to compute a candidate extent key that is not already stored in an extent store instance. The candidate extent key may be computed from a hash value of an extent using a perturbing algorithm, i.e., a hash collision computation, which illustratively adds a perturb value to the hash value. The perturb value is illustratively sufficient to ensure that the candidate extent key resolves to the same hash bucket and node (extent store instance) as the original extent key. In essence, the technique ensures that the original extent key is perturbed in a deterministic manner to generate the candidate extent key, so that the original extent key and candidate extent key “decode” to the same hash bucket and extent store instance.
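A hedged sketch of such a perturbing algorithm, assuming for illustration that the low-order bits of the key select the hash bucket and extent store instance (BUCKET_BITS and the key width are assumptions, not values from the technique):

```python
BUCKET_BITS = 16                       # assumed: low-order bits select bucket / extent store instance
BUCKET_MASK = (1 << BUCKET_BITS) - 1
KEY_MASK = (1 << 48) - 1               # assumed 48-bit extent key

def perturb_key(extent_key, attempt=1):
    """Deterministically compute a candidate extent key after a hash collision.
    The perturb value is added only above the bucket-selecting bits, so the
    candidate resolves to the same hash bucket and extent store instance."""
    perturb = attempt << BUCKET_BITS
    candidate = (extent_key + perturb) & KEY_MASK
    # Preserve the original bucket bits even if the addition wrapped around.
    return (candidate & ~BUCKET_MASK) | (extent_key & BUCKET_MASK)

key = 0x0000_1234_ABCD
cand = perturb_key(key)
assert cand != key
assert cand & BUCKET_MASK == key & BUCKET_MASK     # same bucket, same instance
```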
Abstract:
In one embodiment, a file system driven RAID rebuild technique is provided. A layered file system may organize storage of data as segments spanning one or more sets of storage devices, such as solid state drives (SSDs), of a storage array, wherein each set of SSDs may form a RAID group configured to provide data redundancy for a segment. The file system may then drive (i.e., initiate) rebuild of a RAID configuration of the SSDs on a segment-by-segment basis in response to cleaning of the segment (i.e., segment cleaning). Each segment may include one or more RAID stripes that provide a level of data redundancy (e.g., single parity RAID 5 or double parity RAID 6) as well as RAID organization (i.e., distribution of data and parity) for the segment. Notably, the level of data redundancy and RAID organization may differ among the segments of the array.
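The per-segment choice of RAID organization might be sketched as follows (SegmentGeometry and choose_geometry are hypothetical names; the abstract does not prescribe how the geometry is selected):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SegmentGeometry:
    ssd_ids: List[int]   # set of SSDs the segment's RAID stripes span
    parity: int          # 1 = single parity (RAID 5-like), 2 = double parity (RAID 6-like)

    @property
    def data_slots_per_stripe(self):
        return len(self.ssd_ids) - self.parity

def choose_geometry(healthy_ssds, double_parity=False):
    """Chosen per segment at write time (e.g., during segment cleaning), so the
    level of data redundancy and the SSD set may differ among segments."""
    return SegmentGeometry(list(healthy_ssds), parity=2 if double_parity else 1)

seg_a = choose_geometry(range(8))                       # 7 data + 1 parity per stripe
seg_b = choose_geometry(range(6), double_parity=True)   # 4 data + 2 parity per stripe
assert seg_a.data_slots_per_stripe == 7 and seg_b.data_slots_per_stripe == 4
```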
Abstract:
An offset range striping technique increases concurrency of operation execution directed to metadata managed by a volume layer of a storage input/output (I/O) stack, while reducing contention among resources of one or more nodes of a cluster. A logical unit (LUN) may be apportioned into multiple volumes, each of which may be partitioned into multiple regions of contiguous offset ranges, wherein each region is represented by a dense tree. The technique increases concurrency of operation execution (e.g., modifications to the metadata at the offset ranges), while reducing contention among the resources (e.g., CPUs and non-volatile logs (NVLogs)) by distributing the offset range operations among the regions and mapping the regions to services and NVLogs. Such increased concurrency and reduction of contention may be achieved by implementation of the technique to (i) apportion each region into disjoint chunks (i.e., stripes) of contiguous offset ranges; (ii) organize a plurality of regions into one or more zones and populate a first zone before allocating a second zone; and (iii) stagger the mapping of services to starting regions of the volumes.
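A small sketch of how such striping and staggering might be computed, with illustrative sizes and counts (REGION_SIZE, STRIPE_SIZE, and NUM_SERVICES are assumptions; zone allocation is omitted):

```python
REGION_SIZE = 16 * 2**30      # assumed size of a region (16 GiB of LUN offset space)
STRIPE_SIZE = 1 * 2**30       # assumed chunk (stripe) of contiguous offsets (1 GiB)
NUM_SERVICES = 4              # assumed number of volume layer services / NVLogs

def route_offset(volume_index, offset):
    """Map an offset range operation to (region, stripe, service).  Each region is
    apportioned into disjoint stripes of contiguous offsets, and the mapping of
    services to the starting region is staggered per volume to spread load."""
    region = offset // REGION_SIZE
    stripe = (offset % REGION_SIZE) // STRIPE_SIZE
    service = (region + volume_index) % NUM_SERVICES   # staggered starting region
    return region, stripe, service

# Two volumes writing at the same offset land on different services (and NVLogs).
assert route_offset(0, 5 * 2**30)[2] != route_offset(1, 5 * 2**30)[2]
```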