Abstract:
Systems and methods for reliably using data storage media. Multiple processors are configured to access a persistent memory. For a given data block corresponding to a write access request from a first processor to the persistent memory, a cache controller prevents any read access of a copy of the given data block in an associated cache. The cache controller prevents any read access while an acknowledgment that the given data block is stored in the persistent memory has not yet been received. Until the acknowledgment is received, the cache controller allows write access of the copy of the given data block in the associated cache only for a thread in the first processor that originally sent the write access request. The cache controller invalidates any copy of the given data block in any cache levels below the associated cache.
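As an illustration of the access-control behavior described above, the following is a minimal sketch assuming a hypothetical CacheController class that records, for each block with an outstanding write, which thread issued the write and whether the persistence acknowledgment has arrived; all names are illustrative and not taken from the abstract.

    class CacheController:
        """Tracks blocks whose persistence acknowledgment is still outstanding."""

        def __init__(self, lower_caches):
            self.pending = {}               # block_id -> thread_id that issued the write
            self.lower_caches = lower_caches

        def on_write_request(self, block_id, thread_id):
            # Record that this block awaits a persistence acknowledgment and
            # invalidate any copies held in cache levels below this one.
            self.pending[block_id] = thread_id
            for cache in self.lower_caches:
                cache.invalidate(block_id)

        def may_read(self, block_id):
            # Reads of the cached copy are blocked until the ack arrives.
            return block_id not in self.pending

        def may_write(self, block_id, thread_id):
            # Only the thread that issued the original write may update the copy.
            return self.pending.get(block_id, thread_id) == thread_id

        def on_persist_ack(self, block_id):
            # Once persistent memory acknowledges the store, normal access resumes.
            self.pending.pop(block_id, None)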
Abstract:
Techniques are provided for assigning read requests to storage devices in a manner that reduces the likelihood that any storage device will become overloaded or underutilized. Specifically, a read-request handler assigns the read requests directed to each particular item among the storage devices that have copies of the item, based on how busy each of those storage devices is. Consequently, even though certain storage devices may have copies of the same item, there may be times during which one storage device is assigned a disproportionate number of the reads of the item because the other storage device is busy with read requests for other items, and there may be other times during which the other storage device is assigned a disproportionate number of the reads of the item because the one storage device is busy with read requests for other items. Various techniques for estimating the busyness of storage devices are provided, including fraction-based estimates, interval-based estimates, and response-time-based estimates. Techniques for smoothing those estimates, and for handicapping devices, are also provided.
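As a rough illustration, the following sketch assigns each read to the least-busy device that holds a copy of the requested item, assuming busyness is estimated from raw samples (such as recent response times or queue depth) and smoothed exponentially; the estimator and all names are assumptions for illustration only.

    import random

    class ReadRequestHandler:
        """Routes each read to the least-busy device holding a copy of the item."""

        def __init__(self, copies, alpha=0.3):
            self.copies = copies        # item_id -> list of device ids holding a copy
            self.busyness = {}          # device_id -> smoothed busyness estimate
            self.alpha = alpha          # smoothing factor for new observations

        def observe(self, device_id, sample):
            # Smooth raw busyness samples so one spike does not dominate the estimate.
            old = self.busyness.get(device_id, sample)
            self.busyness[device_id] = (1 - self.alpha) * old + self.alpha * sample

        def assign(self, item_id):
            devices = self.copies[item_id]
            least = min(self.busyness.get(d, 0.0) for d in devices)
            # Break ties randomly so equally idle devices share the load evenly.
            candidates = [d for d in devices if self.busyness.get(d, 0.0) == least]
            return random.choice(candidates)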
Abstract:
A shared-nothing database system is provided in which parallelism and workload balancing are increased by assigning the rows of each table to “slices”, and storing multiple copies (“duplicas”) of each slice across the persistent storage of multiple nodes of the shared-nothing database system. When the data for a table is distributed among the nodes of a shared-nothing system in this manner, requests to read data from a particular row of the table may be handled by any node that stores a duplica of the slice to which the row is assigned. For each slice, a single duplica of the slice is designated as the “primary duplica”. All DML operations (e.g. inserts, deletes, updates, etc.) that target a particular row of the table are performed by the node that has the primary duplica of the slice to which the particular row is assigned. The changes made by the DML operations are then propagated from the primary duplica to the other duplicas (“secondary duplicas”) of the same slice.
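As a rough illustration of the routing this implies, the following sketch assumes a hash-based row-to-slice mapping and a hypothetical catalog of duplica placements; all names are illustrative and not taken from the abstract.

    import random

    class SliceCatalog:
        """Maps rows to slices and slices to the nodes holding their duplicas."""

        def __init__(self, num_slices, placements, primaries):
            self.num_slices = num_slices
            self.placements = placements   # slice_id -> list of nodes with a duplica
            self.primaries = primaries     # slice_id -> node holding the primary duplica

        def slice_for(self, row_key):
            return hash(row_key) % self.num_slices

        def node_for_read(self, row_key):
            # A read may be served by any node that stores a duplica of the slice.
            return random.choice(self.placements[self.slice_for(row_key)])

        def node_for_dml(self, row_key):
            # Inserts, deletes, and updates go to the node with the primary duplica;
            # that node later propagates the changes to the secondary duplicas.
            return self.primaries[self.slice_for(row_key)]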
Abstract:
Techniques herein store database blocks (DBBs) in byte-addressable persistent memory (PMEM) and prevent tearing without deadlocking or waiting. In an embodiment, a computer hosts a DBMS. A reader process of the DBMS obtains, without locking and from metadata in PMEM, a first memory address for directly accessing a current version, which is a particular version, of a DBB in PMEM. Concurrently and without locking: a) the reader process reads the particular version of the DBB in PMEM, and b) a writer process of the DBMS replaces, in the metadata in PMEM, the first memory address with a second memory address for directly accessing a new version of the DBB in PMEM. In an embodiment, a computer performs without locking: a) storing, in PMEM, a DBB, b) copying into volatile memory, or reading, an image of the DBB, and c) detecting whether the image of the DBB is torn.
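As an illustration, the following sketch simulates the lock-free read/replace protocol with ordinary Python objects standing in for PMEM, and uses a per-version checksum for tear detection; the checksum scheme and all names are assumptions, since the abstract only states that tearing is detected.

    import itertools
    import zlib

    _next_address = itertools.count()

    class PMem:
        """Simulated byte-addressable persistent memory: address -> (image, checksum)."""
        def __init__(self):
            self.store = {}

        def allocate(self, image):
            address = next(_next_address)
            # Store the image together with a checksum used for tear detection.
            self.store[address] = (bytes(image), zlib.crc32(image))
            return address

    class BlockMetadata:
        """PMEM-resident metadata naming the current version of one database block."""
        def __init__(self, address):
            self.current_address = address

    def writer_replace(metadata, pmem, new_image):
        # Write the new version at a new address, then publish it by replacing the
        # address in the metadata; readers never observe a half-built version.
        metadata.current_address = pmem.allocate(new_image)

    def reader_copy(metadata, pmem):
        # Without locking: fetch the current address, copy the image into volatile
        # memory, and check the copy for tearing via the stored checksum.
        image, checksum = pmem.store[metadata.current_address]
        copy = bytes(image)
        torn = zlib.crc32(copy) != checksum
        return copy, torn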
Abstract:
Techniques are described herein for making a clean file snapshot of a target file. The techniques may be applied to a single target file, to a set of target files, or to an entire database. The techniques involve transitioning the target file through a series of states. During each state, particular actions are performed and/or prevented. In the final state of each approach, a clean file snapshot of the target file exists. Transitioning through the states, only one of which does not allow new changes to be made to the target file, allows the database to remain online and available to a greater extent than is possible with an approach that prevents database changes for the duration of the clean file snapshot creation operation.
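As a rough illustration, the following sketch models the target file as a small state machine in which only one state blocks new changes; the state names are purely illustrative, since the abstract does not enumerate the states.

    from enum import Enum, auto

    class FileState(Enum):
        # Illustrative state names only; the abstract does not name the states.
        NORMAL = auto()           # changes allowed, no snapshot in progress
        FLUSHING = auto()         # changes allowed, dirty buffers being written out
        QUIESCED = auto()         # the single state in which new changes are blocked
        SNAPSHOT_READY = auto()   # clean file snapshot exists, changes allowed again

    class TargetFile:
        def __init__(self):
            self.state = FileState.NORMAL

        def allows_changes(self):
            # Only one state in the sequence prevents new changes to the file.
            return self.state is not FileState.QUIESCED

        def transition(self):
            # Advance to the next state in the sequence, stopping at the final one.
            order = list(FileState)
            index = order.index(self.state)
            if index + 1 < len(order):
                self.state = order[index + 1]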
Abstract:
A method and apparatus for sending and receiving messages between nodes on a compute cluster is provided. Communication between nodes on a compute cluster, which do not share physical memory, is performed by passing messages over an I/O subsystem. Typically, each node includes a synchronization mechanism, a thread ready to receive connections, and other threads to process and reassemble messages. Frequently, a separate queue is maintained in memory for each node on the I/O subsystem sending messages to the receiving node. Such overhead increases latency and limits message throughput. Because a specialized coprocessor runs on each node, messages on the I/O subsystem are sent, received, authenticated, synchronized, and reassembled at a faster rate and with lower latency. Additionally, the memory structure used may reduce memory consumption by storing messages from multiple sources in the same memory structure, eliminating the need for per-source queues.
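As a rough illustration of the shared receive structure, the following sketch stores in-flight fragments from every sender in a single structure keyed by source and message id, reassembling a message once all of its fragments have arrived; the fragment layout and all names are assumptions for illustration only.

    class SharedReceiveBuffer:
        """One structure holds in-flight fragments from every sender,
        replacing a separate queue per source node."""

        def __init__(self):
            self.fragments = {}   # (source, msg_id) -> {fragment_no: payload}
            self.expected = {}    # (source, msg_id) -> total fragment count

        def receive(self, source, msg_id, fragment_no, total, payload):
            key = (source, msg_id)
            self.fragments.setdefault(key, {})[fragment_no] = payload
            self.expected[key] = total
            # Reassemble and release the message once every fragment is present.
            if len(self.fragments[key]) == total:
                parts = self.fragments.pop(key)
                self.expected.pop(key)
                return b"".join(parts[i] for i in range(total))
            return None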