Abstract:
A distributed key-value storage system may include a master node. The key-value store may be distributed among first and second nodes. The master node may receive a publish request to publish one or more key-value pairs. Each key-value pair may be stored in a retransmit buffer and sent to all the first nodes using a communication protocol of a first kind that does not include a retransmit protocol mechanism. Some of the key-value pairs may be sent to one or more second node using a communication protocol of a second kind that includes a retransmit protocol mechanism.
Abstract:
System and method for converting a storage object in a redo-log snapshot format to a single-container snapshot format in a distributed storage system uses a temporary snapshot object, which is created by taking a snapshot of the storage object, and an anchor object, which points to a root object of the storage object. For each object chain of the storage object, each selected object is processed for format conversion. For each selected object, difference data between the selected object and a parent object of the selected object is written to the anchor object, a child snapshot of the anchor object is created in the single-container snapshot format, and the anchor object is updated to point to the selected object. The data of the running point object of the storage object is then copied to the anchor object, and each processed object and the temporary snapshot object are removed.
Abstract:
At the time of deleting a snapshot, a storage system can allocate a buffer in volatile memory, scan a plurality of logical block address (LBA)-to-virtual block address (VBA) mappings included in a first tree metadata structure of the snapshot, and, for each scanned LBA-to-VBA mapping, identify in a second tree metadata structure a VBA-to-physical block address (PBA) mapping referenced by the LBA-to-VBA mapping. If the VBA-to-PBA mapping is exclusively owned by the snapshot, the storage system can add a record to the buffer that includes the VBA specified in the VBA-to-PBA mapping. The storage system can subsequently sort the records added to the buffer in VBA order and sequentially process the sorted records to remove their corresponding VBA-to-PBA mappings from the second tree metadata structure.
Abstract:
The disclosure herein describes tracking changes to a stale component using a synchronization bitmap. A first component of a plurality of mirrored components of the distributed data object becomes available from an unavailable state, and a stale log sequence number (LSN) and a last committed LSN are identified. A synchronization bitmap of the first component associated with a range of LSNs (e.g., from the stale LSN to the last committed LSN) is created and configured to track changes to data blocks of the first component. A second component is identified based on the second component including a tracking bitmap associated with an LSN that matches the stale LSN of the first component. The first component is synchronized with data from the second component based on, wherein the synchronizing includes updating the synchronization bitmap to track changes made to data blocks of the first component.
Abstract:
A method for verifying a consistency of snapshot metadata maintained in an ordered data structure for a plurality of snapshots in a snapshot hierarchy is provided. The method includes identifying a first plurality of nodes maintained in a first ordered data structure for a first snapshot that is a child of a second snapshot; for a first node of the first plurality of nodes, verifying the first node by checking for the first node in a second node map maintained in memory for the second snapshot, wherein the second node map includes a plurality of verified nodes in a second ordered data structure; and based on whether the first node is in the second node map: adding the first node to a first node map maintained in memory for the first snapshot, wherein the first node map includes verified nodes of the first plurality of nodes; or triggering an alarm.
Abstract:
System and method for managing copy-on-write (COW) B tree structures for metadata of storage objects stored in a storage system determine, when a request to modify a target storage object stored in the storage system that requires a modification of a target leaf node in a B tree structure for metadata of the target storage object is received, whether an operation sequence number of the target leaf node is greater than a snapshot sequence number of a parent snapshot of a running point of the B tree structure. When the operation sequence number is greater than the snapshot sequence number, the target leaf mode is modified in place without copying the target leaf node. When the operation sequence number is not greater than the snapshot sequence number, the target leaf node is copied as a new leaf node for the B tree structure and the new leaf node is modified.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for resynchronizing data in a storage system. One of the methods includes determining that a particular disk of a capacity object of a storage system is out-of-sync and that a primary disk is unavailable; and for each segment of one or more segments of the capacity object: generating a first version of the column of the segment corresponding to the unavailable primary disk; determining whether the data integrity token in the column summary of the generated first version is valid; and in response to determining that the data integrity token is valid, resynchronizing the column of the segment corresponding to the particular disk using i) the primary columns of the segment corresponding to each available primary disk and ii) the first version of the column of the segment corresponding to the unavailable primary disk.
Abstract:
Described herein are methods and systems for the efficient resyncing of stale components of a distributed-computing system. One method includes determining that a first base component at a remote site will go offline. After determining that the first base component at the remote site will go offline, a first delta component is created at the remote site. While the first base component at the remote site is offline, data corresponding to the offline component is collected at the first delta component at the remote site. After collecting data at the first delta component, the collected data is sent to a local site. The method includes determining that the first base component has come back online. In response to determining that the first base component has come back online, the collected data is sent from the first delta component to the first base component via an intra-site network.
Abstract:
Techniques for intelligently scheduling resynchronization jobs in a distributed object-based storage system are provided. In one set of embodiments, a storage node of the system can create a resynchronization job for a component of an object maintained by the system, where the resynchronization job defines one or more input/output (I/O) operations to be carried out with respect to the component. If a number of currently running resynchronization jobs on the storage node has reached a threshold, the storage node can further determine a priority level associated with the object; add the resynchronization job to an object queue for the object; and if the added resynchronization job is a first job in the object queue, add the object queue as a new queue entry to a global priority queue corresponding to the priority level associated with the object.
Abstract:
Computer system and method for managing storage requests in a distributed storage system uses congestion data related to processing of storage requests for local storage to adaptively adjust a bandwidth limit for a first class of storage requests to be processed. The bandwidth limit is enforced on the storage requests belonging to the first class of storage requests without enforcing any bandwidth limit on the storage requests belonging to a second class of storage requests.