Abstract:
In one embodiment, a node coupled to a plurality of storage devices executes a storage input/output (I/O) stack having a plurality of layers including a persistence layer. A portion of non-volatile random access memory (NVRAM) is configured as one or more logs. The persistence layer cooperates with the NVRAM to employ a log to record write requests received from a host and to acknowledge successful receipt of the write requests to the host. The log has a set of entries, each entry including (i) write data of a write request and (ii) a previous offset referencing a previous entry of the log. After a power loss, the acknowledged write requests are recovered by replay of the log in reverse sequential order using the previous offset in each entry to traverse the log.
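The reverse traversal described in this abstract can be pictured with a minimal Python sketch. The entry layout (a 4-byte previous offset and a 4-byte length ahead of the write data), the `NO_PREV` sentinel, and the names `append_entry` and `replay_reverse` are assumptions made for illustration; the abstract does not specify any of them.

```python
import struct

HEADER = struct.Struct("<II")   # (previous offset, data length); hypothetical layout
NO_PREV = 0xFFFFFFFF            # hypothetical sentinel meaning "no earlier entry"

def append_entry(log: bytearray, last_offset: int, data: bytes) -> int:
    """Append one entry and return its offset, which becomes the new last offset."""
    offset = len(log)
    log += HEADER.pack(last_offset, len(data)) + data
    return offset

def replay_reverse(log, last_offset: int):
    """Yield write data from newest to oldest by following previous offsets."""
    offset = last_offset
    while offset != NO_PREV:
        prev, length = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        yield bytes(log[start:start + length])
        offset = prev

log = bytearray()
last = NO_PREV
for payload in (b"w1", b"w2", b"w3"):
    last = append_entry(log, last, payload)
print(list(replay_reverse(log, last)))  # [b'w3', b'w2', b'w1']
```

Only the offset of the most recent entry has to be known at recovery time; every earlier entry is reached through the chain of previous offsets.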
Abstract:
In one embodiment, a node coupled to a plurality of solid state drives (SSDs) executes a storage input/output (I/O) stack having a plurality of layers. Write data associated with one or more write requests to the SSDs is stored in a volatile log. The write data is organized into one or more extents that are copied to the SSDs. The volatile log has a front-end and a set of records with metadata. The metadata includes a head offset referencing an initial record and a tail offset referencing a final record. A portion of the one or more write requests including the write data is copied to a non-volatile log maintained in a non-volatile random access memory (NVRAM). The front-end and the set of records from the volatile log are copied, but the head offset and the tail offset are not, to reduce an amount of metadata copied to the NVRAM.
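The selective copy in this abstract can be sketched as follows. This is a simplified model, not the patented layout: the class name `VolatileLog`, the function `copy_to_nvlog`, the use of list indices as offsets, and the dictionary standing in for the non-volatile log are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class VolatileLog:
    front_end: bytes                       # hypothetical: opaque front-end bytes
    records: list = field(default_factory=list)
    head_offset: int = 0                   # references the initial record
    tail_offset: int = 0                   # references the final record

    def append(self, record: bytes) -> None:
        self.records.append(record)
        self.tail_offset = len(self.records) - 1

def copy_to_nvlog(vlog: VolatileLog) -> dict:
    """Copy the front-end and the records only; head and tail offsets stay behind."""
    return {"front_end": vlog.front_end, "records": list(vlog.records)}

vlog = VolatileLog(front_end=b"fe")
vlog.append(b"r1")
vlog.append(b"r2")
nvlog = copy_to_nvlog(vlog)
print(sorted(nvlog))  # ['front_end', 'records']
```

Leaving the two offsets out means they are never rewritten in NVRAM as the volatile log grows, which is the metadata saving the abstract refers to.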
Abstract:
A novel RDMA connection failover technique minimizes disruption to upper subsystem modules (executed on a computer node) that create requests for data transfer. A new failover virtual layer performs failover of an RDMA connection in error so that the upper subsystem that created a request does not have knowledge of the error (which is recoverable in software and hardware) or of a failure on the RDMA connection due to the error. Since the upper subsystem does not have knowledge of a failure on the RDMA connection or of a performed failover of the RDMA connection, the upper subsystem continues providing requests to the failover virtual layer without interruption, thereby minimizing downtime of the data transfer activity.
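The idea of hiding failover behind a virtual layer can be illustrated with a small Python sketch. It uses no real RDMA API: `RecoverableError`, `FakeConnection`, and `FailoverLayer` are hypothetical names, and retrying on the next connection is one assumed policy among several possible ones.

```python
class RecoverableError(Exception):
    """Hypothetical stand-in for a recoverable RDMA connection error."""

class FakeConnection:
    """Hypothetical connection; a real one would post RDMA work requests."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def send(self, request):
        if not self.healthy:
            raise RecoverableError(self.name)
        return f"{request} via {self.name}"

class FailoverLayer:
    """Sits between the upper subsystem and the connections."""
    def __init__(self, connections):
        self.connections = list(connections)
        self.active = 0

    def submit(self, request):
        # Try each connection at most once, starting with the active one.
        for _ in range(len(self.connections)):
            try:
                return self.connections[self.active].send(request)
            except RecoverableError:
                # Fail over silently; the caller never sees this error.
                self.active = (self.active + 1) % len(self.connections)
        raise RuntimeError("no usable connection")  # unrecoverable case

layer = FailoverLayer([FakeConnection("conn-a", healthy=False),
                       FakeConnection("conn-b")])
print(layer.submit("req-1"))  # req-1 via conn-b
```

The caller of `submit` receives a normal result for `req-1` even though the first connection failed, which mirrors the abstract's point that the upper subsystem keeps issuing requests without interruption.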
Abstract:
A technique synchronizes de-registration of registered memory and incoming input/output (I/O) data received from an I/O device for storage in a memory of a computer system. Registration and de-registration of the memory with an I/O memory management unit (IOMMU) are illustratively performed by an I/O device driver of the computer system in anticipation of (or in response to) an I/O request to store the incoming I/O data in buffers of the memory. The synchronization technique ensures that storage of the I/O data in the buffers and de-registration of the buffers occur in a coordinated, reliable manner to obviate data corruption or other error conditions that may manifest in response to a race condition between such data storage and memory de-registration. Notably, I/O data which may be in-flight (i.e., inbound) from a sender to the I/O device may be received without error even when active buffers are deregistered. That is, the technique avoids handshaking with the sender before de-registering the active buffers.
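The abstract does not describe the mechanism, so the sketch below only illustrates the race it refers to, using a plainly different and much simpler technique: a lock that serializes buffer de-registration against delivery of inbound data. `BufferRegistry` and its methods are hypothetical, no IOMMU or driver API is involved, and dropping late data is an assumption rather than the patented behavior.

```python
import threading

class BufferRegistry:
    """Hypothetical stand-in for registration state kept by an I/O device driver."""
    def __init__(self):
        self._lock = threading.Lock()
        self._buffers = {}                 # buffer id -> bytearray

    def register(self, buf_id, size):
        with self._lock:
            self._buffers[buf_id] = bytearray(size)

    def deregister(self, buf_id):
        """Remove the buffer; returns it, or None if it was not registered."""
        with self._lock:
            return self._buffers.pop(buf_id, None)

    def deliver(self, buf_id, data):
        """Store inbound data; return False if the buffer is no longer registered."""
        with self._lock:
            buf = self._buffers.get(buf_id)
            if buf is None:
                return False               # late arrival is dropped, not an error
            n = min(len(data), len(buf))   # never write past the buffer
            buf[:n] = data[:n]
            return True

registry = BufferRegistry()
registry.register(1, 4)
print(registry.deliver(1, b"ab"))   # True
released = registry.deregister(1)
print(registry.deliver(1, b"zz"))   # False
```

Because both paths take the same lock, a delivery either completes before the buffer is released or observes that it is gone; it can never write into memory that has already been de-registered.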
Abstract:
In one embodiment, a parallel (e.g., tiered) logging technique is provided to deliver low latency acknowledgements of input/output (I/O) requests, such as write requests, while avoiding loss of data. Write data may be stored (copied) as a log in a portion of a dynamic random access memory and a non-volatile random access memory (NVRAM). The NVRAM may be configured as, e.g., a persistent write-back cache of the node, while parameters of the request may be stored in another portion of the NVRAM configured as the log (NVLog). The write data may be organized into separate variable length blocks or extents and “written back” out-of-order from the write-back cache to storage devices, such as SSDs, e.g., organized into a data container (intended destination of the write request). The write data may be preserved in the NVLog until each extent is safely stored on SSD.
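A rough Python model of the retention rule in this abstract is given below. It is a simplification under stated assumptions: dictionaries stand in for the DRAM log, the NVLog, and the SSDs, the write data and request parameters are kept together in one NVLog entry, and the names `TieredLog`, `write`, and `write_back` are hypothetical.

```python
class TieredLog:
    def __init__(self):
        self.dram_log = {}   # extent id -> data (volatile copy)
        self.nvlog = {}      # extent id -> (params, data) (persistent copy)
        self.ssd = {}        # extent id -> data (stand-in for the storage devices)

    def write(self, extent_id, params, data):
        """Log the write in both tiers, then acknowledge."""
        self.dram_log[extent_id] = data
        self.nvlog[extent_id] = (params, data)
        return "ack"

    def write_back(self, extent_id):
        """Flush one extent; extents may be flushed in any order."""
        self.ssd[extent_id] = self.dram_log.pop(extent_id)
        # The NVLog entry is released only after the extent is on SSD.
        del self.nvlog[extent_id]

log = TieredLog()
log.write("e1", {"lba": 0}, b"A")
log.write("e2", {"lba": 8}, b"B")
log.write_back("e2")                 # out of order: e2 before e1
print(sorted(log.nvlog))             # ['e1']
```

After the out-of-order flush, `e1` is still held in the NVLog, so an acknowledged write that has not yet reached SSD would survive a power loss in this model.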