Abstract:
A technique paces and balances a flow of messages related to processing of input/output (I/O) requests between subsystems, such as layers of a storage I/O stack, of one or more nodes of a cluster. The I/O requests may be directed to externally-generated user data, e.g., write requests generated by a host coupled to the cluster, and internally-generated metadata, e.g., write and delete requests generated by a volume layer of the storage I/O stack. The user data (and metadata) may be organized as an arbitrary number of variable-length extents of one or more host-visible logical units (LUNs) served by the nodes. The metadata may include mappings from host-visible logical block address ranges (i.e., offset ranges) of a LUN to extent keys, which reference locations of the extents stored on storage devices, such as solid state drives (SSDs), of a storage array coupled to the nodes. The I/O requests are received at a pacer of the volume layer configured to control delivery of the requests to an extent store layer of the storage I/O stack in a policy-dictated manner to enable processing and sequential storage of the user data and metadata on the SSDs of the storage array.
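As an illustration of policy-dictated delivery, the following is a minimal sketch of a pacer that queues user-data and metadata requests separately and releases them in weighted rounds. The class name `Pacer`, the weight-based policy, and the request labels are assumptions made for this example; the abstract does not specify the policy.

```python
from collections import deque


class Pacer:
    """Hypothetical pacer: one queue per request class, released by weight."""

    def __init__(self, weights):
        # weights is an assumed policy, e.g. {"user": 2, "metadata": 1}
        self.weights = dict(weights)
        self.queues = {cls: deque() for cls in self.weights}

    def submit(self, cls, request):
        # Hold the request until the policy allows delivery.
        self.queues[cls].append(request)

    def next_batch(self):
        # Release up to `weight` requests from each class, in policy order.
        batch = []
        for cls, weight in self.weights.items():
            queue = self.queues[cls]
            for _ in range(min(weight, len(queue))):
                batch.append(queue.popleft())
        return batch
```

With weights of 2 for user data and 1 for metadata, each round delivers at most two host writes for every internally generated metadata request, so neither class can starve the other.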
Abstract:
A technique efficiently creates a snapshot for a logical unit (LUN) served by a storage input/output (I/O) stack executing on a node of a cluster that organizes data as extents referenced by keys. In addition, the technique efficiently creates one or more snapshots for a group of LUNs organized as a consistency group (CG) and served by storage I/O stacks executing on a plurality of nodes of the cluster. To that end, the technique involves a plurality of indivisible operations (i.e., transactions) of a snapshot creation workflow administered by a Storage Area Network (SAN) administration layer (SAL) of the storage I/O stack in response to a snapshot create request issued by a host. The SAL administers the snapshot creation workflow by initiating a set of transactions that includes, inter alia, (i) installation of barriers for LUNs (volumes) across all nodes in the cluster that participate in snapshot creation, (ii) creation of point-in-time (PIT) markers to record those I/O requests that are included in the snapshot, and (iii) updating of records (entries) in snapshot and volume tables of a cluster database (CDB).
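The three transactions listed above can be sketched as follows. This is an illustrative model only: the function `create_snapshot`, the dictionary layout of `cdb` and `nodes`, the use of a per-node sequence number as the PIT marker, and the lifting of barriers at the end are all assumptions, not details given in the abstract.

```python
def create_snapshot(cdb, nodes, lun, snap_name):
    """Hypothetical workflow; `cdb` and `nodes` are plain dicts for illustration."""
    # (i) Install a barrier for the LUN on every participating node.
    for node in nodes:
        node["barriers"].add(lun)

    # (ii) Create a PIT marker per node. Assumed convention: requests
    # sequenced before the marker are included in the snapshot.
    markers = {node["name"]: node["next_seq"] for node in nodes}

    # (iii) Update the snapshot and volume tables of the CDB.
    cdb["snapshots"][snap_name] = {"lun": lun, "pit": markers}
    cdb["volumes"][lun].append(snap_name)

    # Assumption: barriers are lifted once the tables are updated.
    for node in nodes:
        node["barriers"].discard(lun)
    return markers
```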
Abstract:
A technique enables recovery of storage space trapped in an extent store due to overlapping write requests associated with metadata managed by a volume layer of a storage input/output (I/O) stack executing on one or more nodes of a cluster. The metadata is organized as a multi-level dense tree metadata structure, wherein each level of the dense tree includes volume metadata entries for storing the metadata. When a level of the dense tree is full, the volume metadata entries of the level are merged with a next lower level of the dense tree in accordance with a dense tree merge operation. The technique may be invoked during the merge operation to process the volume metadata entries associated with the overlapping write requests at each level of the dense tree involved in the merge operation. Processing of the overlapping write requests during the merge operation may manifest as partial overwrites of one or more existing extents which, in turn, may result in logical storage space being trapped in the extent store. The technique may perform read-modify-write (RMW) operations on the partially overwritten extents to recapture that trapped space. The storage space trapped by the partially overwritten extents may be recovered by reading and re-writing one or more valid portions of each extent that has storage space locked up, with the RMW operations processed “out-of-band”, i.e., independent of the merge.
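To make the notion of trapped space concrete, the following sketch computes which portions of a partially overwritten extent remain valid and how much space is trapped. The half-open `(start, end)` offset ranges and the function names are assumptions chosen for illustration; an RMW operation would read the valid portions and re-write them so that the old extent can be freed.

```python
def valid_portions(extent, overwrites):
    """Return the sub-ranges of `extent` not covered by any overwrite.

    Ranges are half-open (start, end) offset pairs (an assumed representation).
    """
    start, end = extent
    pieces = []
    cursor = start
    for o_start, o_end in sorted(overwrites):
        # Clamp the overwrite to the extent's boundaries.
        o_start, o_end = max(o_start, start), min(o_end, end)
        if o_start >= o_end:
            continue  # overwrite does not touch this extent
        if o_start > cursor:
            pieces.append((cursor, o_start))
        cursor = max(cursor, o_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


def trapped_space(extent, overwrites):
    """Space held by the extent that no longer maps to valid data."""
    valid = sum(e - s for s, e in valid_portions(extent, overwrites))
    return (extent[1] - extent[0]) - valid
```

For an extent covering offsets 0 to 100 with overlapping overwrites at 20 to 40 and 30 to 60, the valid portions are 0 to 20 and 60 to 100, leaving 40 units trapped until the valid portions are re-written.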
Abstract:
A snap restore technique efficiently restores snapshots of storage containers served by a storage input/output (I/O) stack executing on one or more nodes of a cluster. A Small Computer Systems Interface administration layer interacts with a volume layer of the storage I/O stack to manage and implement a snap restore procedure to restore one or more snapshots of a storage container. The storage container may be a logical unit (LUN) embodied as a parent volume (active volume) and the snapshot may be represented as an independent volume embodied as a read-only copy of the active volume. The snap restore procedure may be configured to allow restoration to a single snapshot of a LUN or restoration of a plurality of LUNs organized as a consistency group from a group of snapshots. Restoration of the LUN from a snapshot involves (i) creation of another independent volume embodied as a read-write copy (clone) of the snapshot, (ii) replacement of the (old) active volume with the clone, (iii) deletion of the old active volume, and (iv) mapping of the LUN to the clone (i.e., a new active volume).
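The four restoration steps can be sketched with in-memory tables. The function `snap_restore`, the `volumes` and `lun_map` dictionaries, and the clone naming scheme are hypothetical; a real clone would share extents with the snapshot rather than copy data.

```python
def snap_restore(volumes, lun_map, lun, snapshot):
    """Hypothetical restore of `lun` from `snapshot` using plain dicts."""
    old_active = lun_map[lun]

    # (i) Create a read-write copy (clone) of the read-only snapshot.
    clone = f"{snapshot}-clone"  # assumed naming scheme
    volumes[clone] = {"data": dict(volumes[snapshot]["data"]), "read_only": False}

    # (ii) and (iv) Replace the old active volume by mapping the LUN to the clone.
    lun_map[lun] = clone

    # (iii) Delete the old active volume; the snapshot itself is untouched.
    del volumes[old_active]
    return clone
```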
Abstract:
A layout of a transaction log enables efficient logging of metadata into entries of the log, as well as efficient reclamation and recovery of the log entries by a volume layer of a storage input/output (I/O) stack executing on one or more nodes of a cluster. The transaction log is illustratively a two-stage, append-only logging structure, wherein the first stage is non-volatile random access memory (NVRAM) embodied as an NVLog and the second stage is disk, e.g., solid state drive (SSD). During crash recovery, the log entries are examined for consistency and scanned to identify those entries that have completed and those that are active, which require replay. The log entries are walked from oldest to newest (using sequence numbers) searching for the highest sequence number. Partially complete log entries (e.g., log entries in-progress when a crash occurs) may be discarded for failing a checksum (e.g., a CRC error). Old value/new value logs may be used to implement roll-forward or roll-back semantics to replay the log entries and fix any on-disk data structures, first from NVRAM and then from on-disk logs.
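The recovery scan can be sketched as follows, under simplifying assumptions: each log entry is a dictionary with a sequence number, a payload, a completion flag, and a CRC-32 of the payload computed with Python's `zlib.crc32`. The entry layout and the helper names `make_entry` and `recover` are hypothetical, and the sketch covers a single log stage only.

```python
import zlib


def make_entry(seq, payload, done=False):
    """Build a hypothetical log entry whose checksum covers the payload."""
    return {"seq": seq, "payload": payload, "done": done,
            "crc": zlib.crc32(payload)}


def recover(entries):
    """Scan a log and return (highest sequence number, entries to replay)."""
    # Discard partially written entries that fail the checksum.
    valid = [e for e in entries if zlib.crc32(e["payload"]) == e["crc"]]
    # Walk from oldest to newest using sequence numbers.
    valid.sort(key=lambda e: e["seq"])
    highest = valid[-1]["seq"] if valid else None
    # Active (not completed) entries are the ones that require replay.
    replay = [e for e in valid if not e["done"]]
    return highest, replay
```

A two-stage recovery would apply the same scan first to the NVRAM log and then to the on-disk log, in the order the abstract describes.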
Abstract:
Embodiments herein are directed to efficient crash recovery of persistent metadata managed by a volume layer of a storage input/output (I/O) stack executing on one or more nodes of a cluster. Volume metadata managed by the volume layer is organized as a multi-level dense tree, wherein each level of the dense tree includes volume metadata entries for storing the volume metadata. When a level of the dense tree is full, the volume metadata entries of the level are merged with the next lower level of the dense tree. During a merge operation, two sets of generation IDs may be used in accordance with a double buffer arrangement: a first generation ID for the append buffer that is full (i.e., a merge staging buffer) and a second, incremented generation ID for the append buffer that accepts new volume metadata entries. Upon completion of the merge operation, the lower level (e.g., level 1) to which the merge is directed is assigned the generation ID of the merge staging buffer.
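The double buffer arrangement can be illustrated with a small model. The class `DenseTreeTop`, its attribute names, and the treatment of the merge as a simple list extension are assumptions for this sketch; only the generation ID handling follows the description above.

```python
class DenseTreeTop:
    """Hypothetical model of the top-level append buffers and level 1."""

    def __init__(self):
        self.active = {"gen": 1, "entries": []}  # accepts new entries
        self.staging = None                      # full buffer being merged
        self.level1 = []
        self.level1_gen = 0

    def start_merge(self):
        # The full buffer becomes the merge staging buffer and keeps its
        # generation ID; the new append buffer gets an incremented ID.
        self.staging = self.active
        self.active = {"gen": self.staging["gen"] + 1, "entries": []}

    def finish_merge(self):
        # Simplified merge: append the staged entries to level 1, then
        # assign level 1 the generation ID of the merge staging buffer.
        self.level1.extend(self.staging["entries"])
        self.level1_gen = self.staging["gen"]
        self.staging = None
```

One plausible use during crash recovery, stated here as an assumption, is to compare the generation ID of level 1 with that of the staging buffer: if they match, the merge had completed before the crash.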
Abstract:
An offset range striping technique increases concurrency of operation execution directed to metadata managed by a volume layer of a storage input/output (I/O) stack, while reducing contention among resources of one or more nodes of a cluster. A logical unit (LUN) may be apportioned into multiple volumes, each of which may be partitioned into multiple regions, wherein each region is represented by a dense tree. The technique increases concurrency of operation execution (e.g., modifications to the metadata at the offset ranges), while reducing contention among the resources (e.g., CPUs and NVLogs) by distributing the offset range operations among the regions and mapping the regions to services and NVLogs. Such increased concurrency and reduction of contention may be achieved by implementation of the technique to (i) apportion each region into disjoint chunks (i.e., stripes) of contiguous offset ranges; (ii) organize a plurality of regions into one or more zones and populate a first zone before allocating a second zone; and (iii) stagger the mapping of services to starting regions of the volumes.
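A minimal sketch of the striping and staggering ideas follows. The round-robin assignment of stripes to regions, the modular staggering of services, and the function and parameter names are all assumptions; the abstract does not give the actual mapping functions, and zones are not modeled here.

```python
def region_for_offset(offset, stripe_size, num_regions):
    """Map a LUN offset to a region by striping fixed-size offset ranges.

    Assumption: consecutive stripes are assigned to regions round-robin.
    """
    stripe_index = offset // stripe_size
    return stripe_index % num_regions


def service_for_region(volume_index, region, num_services):
    """Stagger services so each volume's starting region uses a different one.

    Assumption: a simple modular shift by the volume index.
    """
    return (volume_index + region) % num_services
```

Under these assumptions, sequential offsets spread across regions stripe by stripe, and region 0 of successive volumes lands on different services, which reduces the chance that one service or NVLog becomes a hot spot.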