Abstract:
The disclosed embodiments relate to a data storage system that facilitates efficiently recovering from storage device failures. Upon receiving a request to retrieve a data block from the data storage system, the system uses a hash that identifies the data block to look up a bucket and an associated cell containing the data block. Note that the bucket aggregates a large number of data blocks and is located in the associated cell that comprises a set of object storage devices (OSDs). Within the cell, the system uses the bucket to look up an OSD that contains the bucket in a local bucket database (BDB) for the cell. Within the OSD, the system uses the bucket and the hash to look up an offset and a length for the data block in a write-ahead log that stores data blocks for the bucket. Finally, the system returns the data block from the determined offset.
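A minimal sketch of the lookup path described above, using in-memory dictionaries as stand-ins for the hash index, the per-cell bucket database (BDB), and each OSD's write-ahead log; all names and data layouts here are illustrative assumptions, not the actual on-disk formats.

# Illustrative sketch of the hash -> bucket/cell -> OSD -> write-ahead-log lookup.
# Every data structure below is an in-memory stand-in for a component named in the abstract.

# Maps a block hash to the bucket that aggregates it and the cell that holds the bucket.
hash_index = {
    "h_abc": {"bucket": "bucket-7", "cell": "cell-1"},
}

# Per-cell bucket database (BDB): maps a bucket to the OSD that stores it.
bucket_db = {
    "cell-1": {"bucket-7": "osd-42"},
}

# Per-OSD write-ahead-log index: (bucket, hash) -> (offset, length), plus the raw log bytes.
osd_wal_index = {
    "osd-42": {("bucket-7", "h_abc"): (0, 11)},
}
osd_wal_data = {
    "osd-42": b"hello block...",
}


def get_block(block_hash: str) -> bytes:
    # 1. Use the hash to find the bucket and its associated cell.
    entry = hash_index[block_hash]
    bucket, cell = entry["bucket"], entry["cell"]
    # 2. Within the cell, use the local BDB to find the OSD holding the bucket.
    osd = bucket_db[cell][bucket]
    # 3. Within the OSD, look up the block's offset and length in the write-ahead log.
    offset, length = osd_wal_index[osd][(bucket, block_hash)]
    # 4. Return the block read from that offset.
    return osd_wal_data[osd][offset:offset + length]


print(get_block("h_abc"))  # b'hello block'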
Abstract:
A distributed computing system that executes a set of long-lived jobs is described. During operation, each worker process performs the following operations. First, the worker process identifies a set of jobs to be executed and a set of worker processes that can execute the set of jobs. Next, the worker process sorts the set of worker processes based on unique identifiers for the worker processes. Then, the worker process assigns jobs to each worker process in the set of worker processes, wherein approximately the same number of jobs is assigned to each worker process, and jobs are assigned to the worker processes in sorted order. While assigning jobs, the worker process uses an identifier for each worker process to seed a pseudorandom number generator, and then uses the pseudorandom number generator to select jobs for each worker process to execute.
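A minimal sketch of the deterministic assignment scheme described above; the even-split arithmetic and the use of Python's random module as the pseudorandom number generator are illustrative assumptions, not the actual implementation.

import random

def assign_jobs(jobs, workers):
    """Deterministically assign jobs to workers; every worker computes the same result."""
    workers = sorted(workers)          # sort workers by their unique identifiers
    remaining = sorted(jobs)           # canonical job order so all workers agree
    assignment = {}
    for i, worker in enumerate(workers):
        # Each worker gets approximately the same number of jobs, in sorted worker order.
        share = len(remaining) // (len(workers) - i)
        # Seed a PRNG with the worker's identifier so the selection is deterministic.
        rng = random.Random(worker)
        picked = rng.sample(remaining, share)
        assignment[worker] = picked
        remaining = [j for j in remaining if j not in picked]
    return assignment

jobs = [f"job-{n}" for n in range(10)]
workers = ["worker-b", "worker-a", "worker-c"]
print(assign_jobs(jobs, workers))

Because every worker sorts the same inputs and seeds the PRNG the same way, each worker independently derives the identical assignment without any coordination.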
Abstract:
An append-only data storage system is described that stores sets of data blocks in extents that are located in storage devices. During operation of the system, upon receiving a request to copy an extent from a source storage device to a destination storage device, the system creates a scratch extent on the destination storage device, and associates the scratch extent with a private identifier, whereby the scratch extent can only be accessed through the private identifier. The system uses the private identifier to perform a copying operation that copies the extent from the source storage device to the scratch extent on the destination storage device. After the copying operation is complete and the scratch extent is closed, the system associates the scratch extent with a public identifier, whereby the copy of the extent on the destination storage device becomes publicly accessible to other entities in the data storage system.
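A minimal sketch of the scratch-extent copy flow described above; the StorageDevice class and its methods are hypothetical stand-ins for the actual storage devices and extent operations.

import uuid

class StorageDevice:
    """Toy stand-in for a storage device holding extents keyed by identifier."""
    def __init__(self):
        self.extents = {}      # identifier -> list of data blocks
        self.closed = set()    # identifiers of closed (immutable) extents

    def create_scratch_extent(self):
        # The scratch extent is reachable only through this private identifier.
        private_id = f"scratch-{uuid.uuid4()}"
        self.extents[private_id] = []
        return private_id

    def append(self, extent_id, block):
        assert extent_id not in self.closed, "cannot append to a closed extent"
        self.extents[extent_id].append(block)

    def close(self, extent_id):
        self.closed.add(extent_id)

    def publish(self, private_id, public_id):
        # After the copy is closed, re-key the scratch extent under its public identifier,
        # making it accessible to other entities in the system.
        self.extents[public_id] = self.extents.pop(private_id)
        self.closed.discard(private_id)
        self.closed.add(public_id)


def copy_extent(public_id, source: StorageDevice, destination: StorageDevice):
    # 1. Create a scratch extent on the destination, addressed by a private identifier.
    private_id = destination.create_scratch_extent()
    # 2. Copy blocks from the source extent into the scratch extent.
    for block in source.extents[public_id]:
        destination.append(private_id, block)
    # 3. Close the scratch extent, then expose it under the public identifier.
    destination.close(private_id)
    destination.publish(private_id, public_id)


src, dst = StorageDevice(), StorageDevice()
src.extents["extent-1"] = [b"block-a", b"block-b"]
copy_extent("extent-1", src, dst)
print(dst.extents["extent-1"])  # [b'block-a', b'block-b']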
Abstract:
A system can serialize moves and mounts across namespaces based on Lamport clocks. In some examples, the system obtains a request to move a content item from a source namespace to a destination namespace. The system processes an incoming move at the destination and an outgoing move at the source. The system processes, for the content item, a delete at the source and an add at the destination. The system assigns a first clock to the incoming move and a second clock to the outgoing move, the first clock being lower than the second clock. The system assigns a third clock to the delete and a fourth clock to the add, the third clock being higher than the second clock and lower than the fourth clock. The system serializes the incoming and outgoing moves, the delete, and the add based on the first, second, third, and fourth clocks.
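A minimal sketch of the clock assignment described above, using a standard Lamport clock per namespace; the move() helper and its operation tuples are illustrative assumptions, not the actual implementation.

class LamportClock:
    """Minimal Lamport clock: local events tick; receives advance past the sender's clock."""
    def __init__(self):
        self.value = 0

    def tick(self) -> int:
        self.value += 1
        return self.value

    def receive(self, remote: int) -> int:
        self.value = max(self.value, remote) + 1
        return self.value


def move(content_item, source_clock, destination_clock):
    # Assign clocks so that incoming move < outgoing move < delete < add.
    incoming = destination_clock.tick()            # first clock, at the destination
    outgoing = source_clock.receive(incoming)      # second clock, higher than the first
    delete = source_clock.tick()                   # third clock, higher than the second
    add = destination_clock.receive(delete)        # fourth clock, higher than the third
    operations = [
        ("incoming_move", content_item, incoming),
        ("outgoing_move", content_item, outgoing),
        ("delete", content_item, delete),
        ("add", content_item, add),
    ]
    # Serialize the four operations by their Lamport clocks.
    return sorted(operations, key=lambda op: op[2])


source, destination = LamportClock(), LamportClock()
for op in move("notes.txt", source, destination):
    print(op)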
Abstract:
A data storage system stores sets of data blocks in extents located on storage devices. During operation, the system prepares for an erasure-coding operation by obtaining a set of source extents, wherein each source extent is stored on a different machine in the data storage system. The system also selects a set of destination machines for storing destination extents, wherein each destination extent is stored on a different destination machine. Next, the system performs the erasure-coding operation by retrieving data from the set of source extents, erasure-coding the retrieved data, and writing the erasure-coded data to the set of destination extents on the set of destination machines. Finally, after the erasure-coding operation is complete, the system commits the results of the operation so that the set of destination extents can be accessed in place of the set of source extents.
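A minimal sketch of the erasure-coding flow described above; a simple XOR parity stands in for whatever erasure code the system actually uses, and machines are modeled as plain dictionaries.

from itertools import zip_longest

def xor_parity(extents):
    """Toy erasure code: XOR the source extents byte-wise to produce one parity extent."""
    parity = bytearray()
    for column in zip_longest(*extents, fillvalue=0):
        byte = 0
        for b in column:
            byte ^= b
        parity.append(byte)
    return bytes(parity)


def erasure_code(source_machines, destination_machines):
    # 1. Retrieve data from the source extents (one extent per source machine).
    source_extents = [m["extent"] for m in source_machines]
    # 2. Run the erasure-coding operation on the retrieved data.
    parity = xor_parity(source_extents)
    coded = source_extents + [parity]
    # 3. Write each coded extent to a different destination machine.
    for machine, extent in zip(destination_machines, coded):
        machine["extent"] = extent
    # 4. Commit: from here on, readers use the destination extents instead of the sources.
    return {"committed": True, "destinations": destination_machines}


sources = [{"extent": b"abc"}, {"extent": b"def"}]
destinations = [{}, {}, {}]
print(erasure_code(sources, destinations))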
Abstract:
An append-only data storage system is described that stores sets of data blocks in extents that are located in storage devices. When an extent becomes full, the system changes the extent from an open state, wherein data can be appended to the extent, to a closed state, wherein data cannot be appended to the extent. This change involves performing a synchronization operation by: obtaining a list of data blocks in the extent from each storage device that has a copy of the extent; forming a union of the lists; looking up data blocks from the union in a database that maps data blocks to storage devices and extents to determine which data blocks belong in the extent; and, if a copy of the extent is missing data blocks that belong in the extent, performing a remedial action before changing the extent from the open state to the closed state.
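A minimal sketch of the synchronization performed before an extent is closed; the replica lists, the block database, and the repair callback are all hypothetical stand-ins for the components named in the abstract.

def close_extent(extent_id, replicas, block_db, repair):
    """Synchronize replicas of an extent, then move it from the open state to the closed state."""
    # 1. Collect the list of data blocks from every storage device holding a copy.
    per_replica_blocks = {device: set(blocks) for device, blocks in replicas.items()}
    # 2. Form the union of those lists.
    union = set().union(*per_replica_blocks.values())
    # 3. Keep only the blocks that the database says belong in this extent.
    belongs = {b for b in union if block_db.get(b) == extent_id}
    # 4. If any copy is missing blocks that belong here, take remedial action before closing.
    for device, blocks in per_replica_blocks.items():
        missing = belongs - blocks
        if missing:
            repair(device, extent_id, missing)
    return "closed"


# Hypothetical data: two replicas of extent-9, one of them missing block "b2".
replicas = {"osd-1": ["b1", "b2"], "osd-2": ["b1"]}
block_db = {"b1": "extent-9", "b2": "extent-9", "b3": "extent-3"}

def repair(device, extent_id, missing):
    print(f"re-replicating {sorted(missing)} to {device} before closing {extent_id}")

print(close_extent("extent-9", replicas, block_db, repair))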
Abstract:
One variation of a system for implementing a key-value data store includes one or more processors, storage media, and instructions stored in the storage media which, when executed by the system, cause the system to: receive a request to store a particular key-value item; request a first networked distributed data storage system to store the particular key-value item; and, based on a determination that a set of one or more offload criteria is satisfied: retrieve a first set of key-value items from the first networked distributed data storage system, and request a second networked distributed data storage system to store the first set of key-value items in a first set of one or more data objects. The first networked distributed data storage system can have a lower data write latency and a higher data storage cost than the second networked distributed data storage system.
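A minimal sketch of the offload flow described above; the item-count offload criterion and the dictionary-backed stores are illustrative assumptions, not the actual criteria or storage systems.

class KeyValueOffloader:
    """Write to a fast (but costly) store first; when offload criteria are met,
    batch items into data objects on a cheaper store."""

    def __init__(self, fast_store, cheap_store, max_fast_items=3):
        self.fast_store = fast_store      # lower write latency, higher storage cost
        self.cheap_store = cheap_store    # higher latency, lower storage cost
        self.max_fast_items = max_fast_items

    def put(self, key, value):
        # Store the item in the first (fast) system for low write latency.
        self.fast_store[key] = value
        if self._offload_criteria_met():
            self._offload()

    def _offload_criteria_met(self):
        # Illustrative criterion: offload once the fast store holds too many items.
        return len(self.fast_store) >= self.max_fast_items

    def _offload(self):
        # Retrieve a set of key-value items from the fast store and
        # pack them into a single data object on the cheaper store.
        items = dict(self.fast_store)
        object_name = f"object-{len(self.cheap_store)}"
        self.cheap_store[object_name] = items
        self.fast_store.clear()


fast, cheap = {}, {}
store = KeyValueOffloader(fast, cheap)
for i in range(4):
    store.put(f"k{i}", f"v{i}")
print(fast, cheap)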