Abstract:
The embodiments described herein are directed to an organization of metadata managed by a volume layer of a storage input/output (I/O) stack executing on one or more nodes of a cluster. The metadata managed by the volume layer, i.e., the volume metadata, is illustratively embodied as mappings from addresses, i.e., logical block addresses (LBAs), of a logical unit (LUN) accessible by a host to durable extent keys maintained by an extent store layer of the storage I/O stack. In an embodiment, the volume layer organizes the volume metadata as a mapping data structure, i.e., a dense tree metadata structure, which represents successive points in time to enable efficient access to the metadata.
Abstract:
In one embodiment, a node coupled to one or more storage devices executes a storage input/output (I/O) stack having a volume layer. The volume layer manages volume metadata embodied as mappings from offsets of a logical unit (LUN) to extent keys associated with storage locations for extents on the one or more storage devices. Volume metadata is maintained as a dense tree metadata structure representing successive points in time. The dense tree metadata structure has multiple levels, wherein a top level of the dense tree metadata structure represents newer volume metadata changes and descending levels of the dense tree metadata structure represent older volume metadata changes. The node accesses a latest version of changes to the volume metadata by searching from the top level to the descending levels in the dense tree metadata structure.
Abstract:
In one embodiment, a node coupled to a plurality of storage devices executes a storage input/output (I/O) stack having a plurality of layers including a persistence layer. A portion of non-volatile random access memory (NVRAM) is configured as one or more logs. The persistence layer cooperates with the NVRAM to employ the log to record write requests received from a host and to acknowledge successful receipt of the write requests to the host. The log has a set of entries, each entry including (i) write data of a write request and (ii) a previous offset referencing a previous entry of the log. After a power loss, the acknowledged write requests are recovered by replay of the log in reverse sequential order using the previous record offset in each entry to traverse the log.
Abstract:
In one embodiment, a node coupled to a plurality of solid state drives (SSDs) executes a storage input/output (I/O) stack having a plurality of layers. Write data associated with one or more write requests to the SSDs is stored in a volatile log. The write data is organized into one or more extents that are copied to the SSDs. The volatile log has a front-end and a set of records with metadata. The metadata includes a head offset referencing an initial record and a tail offset referencing a final record. A portion of the one or more write requests including the write data is copied to a non-volatile log maintained in a non-volatile random access memory (NVRAM). The front-end and the set of records from the volatile log are copied, but the head offset and the tail offset are not, to reduce an amount of metadata copied to the NVRAM.
Abstract:
In one embodiment, a node coupled to a plurality of storage devices executes a storage input/output (I/O) stack having a plurality of layers including a persistence layer. A portion of non-volatile random access memory (NVRAM) is configured as one or more logs. The persistence layer cooperates with the NVRAM to employ the log to record write requests received from a host and to acknowledge successful receipt of the write requests to the host. The log has a set of entries, each entry including (i) write data of a write request and (ii) a previous offset referencing a previous entry of the log. After a power loss, the acknowledged write requests are recovered by replay of the log in reverse sequential order using the previous record offset in each entry to traverse the log.
Abstract:
In one embodiment, a node coupled to one or more storage devices executes a storage input/output (I/O) stack having a volume layer, a persistence layer and an administration layer that interact to create a copy of a parent volume associated with a storage container on the one or more storage devices. A copy create start message is received at the persistence layer from the administration layer. The persistence layer ensures that dirty data for the parent volume is incorporated into the copy of the parent volume. New data for the parent volume received at the persistence layer during creation of the copy of the parent volume is prevented from incorporation into the copy of the parent volume. A reply to the copy create start message is sent from the persistence layer to the administration layer to initiate the creation of the copy of the parent volume at the volume layer.
Abstract:
In one embodiment, a node coupled to one or more storage devices executes a storage input/output (I/O) stack having a volume layer that manages volume metadata. The volume metadata is organized as one or more dense tree metadata structures having a top level residing in memory and lower levels residing on the one or more storage devices. The dense tree metadata structures include a first dense tree metadata structure associated with a parent volume and a second dense tree metadata structure associated with a copy of the parent volume. The top level of the first dense tree metadata structure may be copied to the second dense tree metadata structure. The lower levels of the first dense tree metadata structure are initially shared with the second dense tree metadata structure. The shared lower levels may eventually be split as the parent volume diverges from the copy of the parent volume.
Abstract:
In one embodiment, snapshots and/or clones of storage objects are created and managed by a volume layer of a storage input/output (I/O) stack executing on one or more nodes of a cluster. Illustratively, the snapshots and clones may be represented as independent volumes, and embodied as respective read-only copies (snapshots) and read-write copies (clones) of a parent volume. Volume metadata is illustratively organized as one or more multi-level dense tree metadata structures, wherein each level of the dense tree metadata structure (dense tree) includes volume metadata entries for storing the metadata. Each snapshot/clone may be derived from a dense tree of the parent volume (parent dense tree). Portions of the parent dense tree may be shared with the snapshot/clone.
Abstract:
In one embodiment, a node coupled to one or more storage devices executes a storage input/output (I/O) stack having a volume layer. The volume layer manages volume metadata embodied as mappings from offsets of a logical unit (LUN) to extent keys associated with storage locations for extents on the one or more storage devices. Volume metadata is maintained as a dense tree metadata structure representing successive points in time. The dense tree metadata structure has multiple levels, wherein a top level of the dense tree metadata structure represents newer volume metadata changes and descending levels of the dense tree metadata structure represent older volume metadata changes. The node accesses a latest version of changes to the volume metadata by searching from the top level to the descending levels in the dense tree metadata structure.
Abstract:
A high availability (HA) failover manager maintains data availability of one or more input/output (I/O) resources in a cluster by ensuring that each I/O resource is available (e.g., mounted) on a hosting node of the cluster and that each I/O resource may be available on one or more partner nodes of the cluster if a node (i.e., a local node) were to fail. The HA failover manager (HA manager) processes inputs from various sources of the cluster to determine whether failover is enabled for a local node and each partner node in an HA group, and for triggering failover of the I/O resources to the partner node as necessary. For each I/O resource, the HA manager may track state information including (i) a state of the I/O resource (e.g., mounted or un-mounted); (ii) the partner node(s) ability to service the I/O resource; and (iii) whether a non-volatile log recording I/O requests is synchronized to the partner node(s). The HA manager interacts with various layers of a storage I/O stack to mount and un-mount the I/O resources on one or more nodes of the cluster through the use of well-defined interfaces, e.g., application programming interfaces.