Abstract:
A data storage system includes multiple head nodes and data storage sleds. Volume data is replicated between a primary and one or more secondary head nodes for a volume partition and is further flushed to a set of mass storage devices of the data storage sleds. Volume metadata is maintained in a primary and one or more secondary head nodes for a volume partition and is updated in response to volume data being flushed to the data storage sleds. Also, the primary and secondary head nodes store checkpoints of volume metadata to the data storage sleds, wherein in response to a failure of a primary or secondary head node for a volume partition, a replacement secondary head node for the volume partition recreates a secondary replica for the volume partition based, at least in part, on a stored volume metadata checkpoint.
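The recovery path can be illustrated with a short sketch. The Python below is a minimal illustration, not the claimed implementation; the Sled and HeadNode classes and their fields are hypothetical stand-ins for the sleds, head nodes, and volume metadata checkpoints described above.

```python
# Minimal sketch of checkpoint-based secondary recovery. All names are
# hypothetical illustrations, not the patented implementation.
import copy

class Sled:
    """Stands in for a data storage sled holding flushed data and checkpoints."""
    def __init__(self):
        self.flushed_data = {}        # block -> bytes flushed from head nodes
        self.metadata_checkpoint = None

    def store_checkpoint(self, volume_metadata):
        self.metadata_checkpoint = copy.deepcopy(volume_metadata)

class HeadNode:
    def __init__(self, sled):
        self.sled = sled
        self.volume_metadata = {}     # block -> location ("log" or "sled")
        self.log = {}                 # replicated-but-unflushed volume data

    def write(self, block, data):
        self.log[block] = data
        self.volume_metadata[block] = "log"

    def flush(self, block):
        # Move data to the sled, update metadata, and checkpoint the metadata.
        self.sled.flushed_data[block] = self.log.pop(block)
        self.volume_metadata[block] = "sled"
        self.sled.store_checkpoint(self.volume_metadata)

def recreate_secondary(sled):
    """A replacement secondary rebuilds its replica from the stored checkpoint."""
    replacement = HeadNode(sled)
    replacement.volume_metadata = copy.deepcopy(sled.metadata_checkpoint or {})
    return replacement

primary = HeadNode(Sled())
primary.write("b0", b"hello")
primary.flush("b0")
secondary = recreate_secondary(primary.sled)
assert secondary.volume_metadata == {"b0": "sled"}
```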
Abstract:
Generally described, one or more aspects of the present application correspond to techniques for automatic recovery from dual isolation in which both the primary and secondary replicas of a volume are stored on isolating servers. The disclosed techniques use handshakes between the client and the replicas to determine which replica has the better health score. The replica with the better health score becomes the primary replica and confirms that it and the secondary replica are both in an isolating state. In response, the primary replica seeks a solo blessing, undoes the isolating state at the volume level (the server host will still be in the isolating state), and continues handling I/O and peer replication until a healthy peer replica is complete. These techniques can avoid availability drops when the servers hosting the primary and secondary replicas of a volume enter the isolating state at around the same time.
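As a rough illustration of that flow, the sketch below models the handshake, the solo blessing, and the volume-level (but not host-level) exit from the isolating state. Every name in it (Replica, health_score, seek_solo_blessing) is hypothetical.

```python
# Minimal sketch, under assumed semantics, of the dual-isolation recovery flow.

class Replica:
    def __init__(self, name, health_score, isolating=True):
        self.name = name
        self.health_score = health_score
        self.isolating = isolating          # server-level isolating state
        self.volume_isolating = isolating   # volume-level view of that state
        self.is_primary = False

def seek_solo_blessing(replica):
    # Placeholder for asking the control plane for authority to serve alone.
    replica.solo_blessed = True

def handshake(a, b):
    """Pick the replica with the better health score as primary."""
    primary, peer = (a, b) if a.health_score >= b.health_score else (b, a)
    primary.is_primary = True
    # Confirm both replicas are isolating before proceeding (dual isolation).
    if primary.isolating and peer.isolating:
        seek_solo_blessing(primary)
        # Undo isolation at the volume level only; the host stays isolating.
        primary.volume_isolating = False
    return primary, peer

primary, peer = handshake(Replica("r1", 0.9), Replica("r2", 0.4))
assert primary.name == "r1" and not primary.volume_isolating
assert primary.isolating   # the server host remains in the isolating state
```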
Abstract:
Persistent storage for a master copy is provided using operation numbers. A master copy can include a persistent key-value store such as a B-tree with references to corresponding data. When provisioning a slave copy, the master copy sends a point-in-time copy of the B-tree to the slave copy, which stores a copy of the B-tree, allocates the necessary space, and updates the references of the B-tree to point to local storage before the data is transferred. When writing the data to persistent storage, a snapshot created on the master copy is an operation that is replicated to the slave copy. The snapshot is generated using a volume view that includes changes to chunks of data of the master copy since a previous snapshot, as determined using the operation number for the previous snapshot. Data (and metadata) for the snapshot is written to persistent storage while new input/output operations are processed.
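A small sketch can make the operation-number bookkeeping concrete. The Python below assumes each chunk records the operation number of its last write, so a snapshot view is just the chunks whose number exceeds the previous snapshot's; the class and field names are illustrative, not from the source.

```python
# Minimal sketch of selecting changed chunks for a snapshot via operation numbers.

class MasterCopy:
    def __init__(self):
        self.op_number = 0
        self.chunks = {}           # chunk_id -> data
        self.chunk_ops = {}        # chunk_id -> op number of last write
        self.last_snapshot_op = 0

    def write(self, chunk_id, data):
        self.op_number += 1
        self.chunks[chunk_id] = data
        self.chunk_ops[chunk_id] = self.op_number

    def snapshot_view(self):
        """Chunks modified since the previous snapshot's operation number."""
        changed = {cid: self.chunks[cid]
                   for cid, op in self.chunk_ops.items()
                   if op > self.last_snapshot_op}
        self.last_snapshot_op = self.op_number
        return changed

m = MasterCopy()
m.write("c1", b"a")
m.write("c2", b"b")
assert set(m.snapshot_view()) == {"c1", "c2"}
m.write("c2", b"b2")
assert set(m.snapshot_view()) == {"c2"}    # only the changed chunk
```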
Abstract:
A slave storage is provisioned using metadata of a master B-tree and updates to references (e.g., offsets) pertaining to data operations of the master B-tree. Master-slave pairs can be used to provide data redundancy, and a master copy can include the master B-tree with references to corresponding data. When provisioning a slave copy, the master sends a B-tree copy to the slave, which stores the slave B-tree copy, allocates the necessary space on local storage, and updates respective offsets of the slave B-tree copy to point to the local storage. Data from the master can then be transferred to the slave and stored according to a note and commit process that preserves the operational sequence of the data. Operations received at the master during the process can be committed to the slave copy until the slave is consistent with the master and able to take over as master in the event of a failure.
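The offset-rewriting step lends itself to a brief sketch. In the Python below, the "B-tree" is simplified to a plain mapping from keys to (offset, length) pairs; provision_slave and the storage layout are hypothetical illustrations of the provisioning described above.

```python
# Minimal sketch: the slave allocates local space and rewrites each entry's
# offset to point into its own storage before the data is transferred.

def provision_slave(master_btree, master_storage):
    slave_storage = bytearray()
    slave_btree = {}
    # Rewrite offsets: each key gets a freshly allocated local extent.
    for key, (offset, length) in master_btree.items():
        local_offset = len(slave_storage)
        slave_storage.extend(b"\x00" * length)       # allocate local space
        slave_btree[key] = (local_offset, length)
    # Transfer the data afterwards, guided by both trees.
    for key, (m_off, length) in master_btree.items():
        s_off, _ = slave_btree[key]
        slave_storage[s_off:s_off + length] = master_storage[m_off:m_off + length]
    return slave_btree, bytes(slave_storage)

master_storage = b"....AAAA..BB"
master_btree = {"a": (4, 4), "b": (10, 2)}   # key -> (offset, length)
slave_btree, slave_storage = provision_slave(master_btree, master_storage)
a_off, a_len = slave_btree["a"]
assert slave_storage[a_off:a_off + a_len] == b"AAAA"
```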
Abstract:
Techniques for background task scheduling based on shared background bandwidth are described. A method for background task scheduling based on shared background bandwidth may include receiving a request to perform one or more background tasks on a storage server of a storage service in a provider network and determining a priority of each of the one or more background tasks, wherein each background task is associated with a size parameter and a temporal parameter and its priority is based at least on those parameters. The method may further include determining a task type associated with each background task, adding each background task to one of a plurality of task queues associated with different task types, wherein each task queue is associated with a bandwidth allocation, and scheduling the one or more background tasks to be performed based on their priority and the bandwidth allocation.
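One plausible reading of this scheme is sketched below: tasks are scored from their size and temporal parameters, queued per task type, and scheduled against per-queue bandwidth allocations. The scoring rule, queue names, and bandwidth figures are all assumptions for illustration.

```python
# Minimal sketch, with an assumed scoring rule and invented queue names, of
# queueing background tasks by type and scheduling them by priority within
# each queue's bandwidth allocation.
import heapq
from collections import defaultdict

def priority(task):
    # Assumed rule: larger tasks with nearer deadlines score higher.
    return task["size_bytes"] / max(task["seconds_until_deadline"], 1)

queues = defaultdict(list)                          # task type -> max-heap
bandwidth = {"snapshot": 40e6, "re-mirror": 60e6}   # bytes/sec per task type

def submit(task):
    heapq.heappush(queues[task["type"]],
                   (-priority(task), task["name"], task["size_bytes"]))

def schedule_round(seconds=1.0):
    """Pop tasks per queue, highest priority first, while budget remains."""
    scheduled = []
    for task_type, heap in queues.items():
        budget = bandwidth.get(task_type, 0) * seconds
        while heap and budget > 0:
            _, name, size = heapq.heappop(heap)
            scheduled.append((task_type, name))
            budget -= size
    return scheduled

submit({"type": "snapshot", "name": "s1",
        "size_bytes": 8e9, "seconds_until_deadline": 3600})
submit({"type": "re-mirror", "name": "m1",
        "size_bytes": 1e9, "seconds_until_deadline": 60})
print(schedule_round())   # -> [('snapshot', 's1'), ('re-mirror', 'm1')]
```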
Abstract:
A system that hosts computing resources may implement optimistic granting of permission to host computing resources. A request for permission to host a computing resource may be received by a control plane. If the control plane determines that the resource host is the first to request permission to host the resource, then the control plane may store an indication of permission that blocks other resource hosts from obtaining permission to host the computing resource and send an acknowledgement of permission to the resource host that requested permission.
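A compact sketch of this first-requester-wins grant might look like the following; the ControlPlane class and its method names are invented for illustration, and a real control plane would need an atomic store rather than this single-threaded dictionary.

```python
# Minimal sketch, assuming a single-threaded control plane, of optimistic
# permission granting: the first host to ask gets a stored grant that blocks
# later requesters.

class ControlPlane:
    def __init__(self):
        self.grants = {}   # resource_id -> host_id that holds permission

    def request_permission(self, host_id, resource_id):
        # First requester wins: record the grant and acknowledge it.
        if resource_id not in self.grants:
            self.grants[resource_id] = host_id
            return True    # acknowledgement of permission
        # A grant already exists, so other hosts are blocked.
        return self.grants[resource_id] == host_id

cp = ControlPlane()
assert cp.request_permission("host-a", "vol-1") is True
assert cp.request_permission("host-b", "vol-1") is False   # blocked
```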
Abstract:
Systems and methods for provisioning a slave copy for redundant data storage and for writing data to persistent storage in a block-based storage system using sequential operation numbers are provided. In one embodiment, the method includes maintaining a master copy and a slave copy of a data volume, the master copy including data generated by a plurality of operations having respective sequential operation numbers, receiving a write instruction for second data to be added to the master copy, and recording the second data as a note that is not readable. The method may further include sending a copy of the note from the master copy to the slave copy, committing the note to the master copy with a sequential operation number, and committing the copy of the note to the slave copy based in part on the sequential operation number. A B-tree may be created based at least in part on an offset for a write instruction associated with the second data, a length, and an operation number included in the note.
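The note/commit sequence can be sketched briefly. In the Python below, a write is first noted (unreadable) on both copies and then committed under one sequential operation number; the Copy class and the write helper are hypothetical.

```python
# Minimal sketch, with illustrative names, of the note/commit flow: a write is
# recorded as an unreadable note, mirrored to the slave, then committed on both
# sides under the same sequential operation number.

class Copy:
    def __init__(self):
        self.notes = {}       # op number -> pending (unreadable) data
        self.committed = {}   # op number -> readable data

    def note(self, op, data):
        self.notes[op] = data          # recorded, but not yet readable

    def commit(self, op):
        self.committed[op] = self.notes.pop(op)

master, slave = Copy(), Copy()
next_op = 1

def write(data):
    global next_op
    op = next_op
    next_op += 1
    master.note(op, data)
    slave.note(op, data)       # copy of the note sent to the slave
    master.commit(op)          # commit to master with the sequence number
    slave.commit(op)           # slave commits based on the same number
    return op

write(b"second data")
assert master.committed == slave.committed   # replicas stay in lockstep
```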
Abstract:
Write optimization for block-based storage performing snapshot operations may be implemented. Write requests for a particular data volume may be received for which a snapshot operation is in progress. A determination may be made as to whether a data chunk of the data volume to be modified as part of the write request has not yet been stored to a remote snapshot data store as part of the snapshot operation. For a data chunk that is to be modified and that has not yet been stored, the data chunk may be stored in a local in-memory volume snapshot buffer. Once the data chunk is stored in the in-memory volume snapshot buffer, the write request may be performed and acknowledged as complete. The data chunk may be sent to the remote snapshot data store asynchronously with regard to the acknowledgment of the write request.
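A minimal copy-on-write sketch of this optimization follows; the Volume class, its chunk granularity, and the drain_async helper are assumptions, and a real system would upload the buffer from a background thread rather than an explicit call.

```python
# Minimal sketch, assuming chunk-granular tracking, of snapshot-time write
# optimization: before overwriting a not-yet-uploaded chunk, its old contents
# go into an in-memory buffer and are shipped to the snapshot store later.

class Volume:
    def __init__(self, chunks):
        self.chunks = dict(chunks)
        self.pending = set(chunks)    # chunks not yet sent to the snapshot store
        self.snapshot_buffer = {}     # in-memory copies awaiting async upload

    def write(self, chunk_id, data):
        if chunk_id in self.pending:
            # Preserve the snapshot version locally, then acknowledge the write
            # immediately instead of waiting on the remote store.
            self.snapshot_buffer[chunk_id] = self.chunks[chunk_id]
            self.pending.discard(chunk_id)
        self.chunks[chunk_id] = data
        return "ack"

    def drain_async(self, remote_store):
        # Runs asynchronously with respect to write acknowledgements.
        remote_store.update(self.snapshot_buffer)
        self.snapshot_buffer.clear()

v = Volume({"c1": b"old"})
assert v.write("c1", b"new") == "ack"
remote = {}
v.drain_async(remote)
assert remote["c1"] == b"old"      # snapshot sees pre-write contents
```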
Abstract:
A data storage system includes multiple head nodes and data storage sleds. A control plane of the data storage system designates, for a volume partition, one of the head nodes to function as a primary head node storing a primary replica of the volume partition and designates two or more other head nodes to function as reserve head nodes storing reserve replicas of the volume partition. Additionally, the primary head node causes volume data for the volume partition to be erasure encoded and stored on multiple mass storage devices in different ones of the data storage sleds.
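Erasure encoding across sleds can be illustrated with single-parity XOR striping; the abstract does not specify the code used, so the scheme below is only a stand-in that shows how data split across devices in different sleds survives the loss of any one device.

```python
# Minimal sketch using single-parity XOR striping (not necessarily the code
# the system uses) to show erasure encoding across devices in different sleds.
from functools import reduce

def erasure_encode(data: bytes, n_data: int):
    """Split data into n_data stripes plus one XOR parity stripe."""
    stripe = -(-len(data) // n_data)              # ceiling division
    stripes = [data[i * stripe:(i + 1) * stripe].ljust(stripe, b"\x00")
               for i in range(n_data)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*stripes))
    return stripes, parity

def recover_stripe(stripes, parity, lost):
    """Rebuild one lost stripe by XOR-ing the survivors with the parity."""
    survivors = [s for i, s in enumerate(stripes) if i != lost] + [parity]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

stripes, parity = erasure_encode(b"volume-partition-data", 3)
# Each stripe and the parity would be placed on devices in different sleds,
# so losing any one device leaves the volume data recoverable.
assert recover_stripe(stripes, parity, 1) == stripes[1]
```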