Abstract:
One or more engine instances are executed on each host of a plurality of hosts to form an engine cluster. A plurality of control instances are executed on a first set of hosts to form a control cluster and comprise a control instance leader and one or more control instance followers. In response to a first host indicating a failure of a neighbor host, a pair-wise focused investigation is initiated to check peer-to-peer connections between the first host and the neighbor host. In response to one or more additional hosts indicating failures of neighbor hosts while the pair-wise focused investigation is being performed, a wide investigation is performed to check connections between the control cluster and the plurality of hosts. One or more hosts are added to an eviction list and an eviction protocol is performed to evict the one or more hosts from the engine cluster using the eviction list.
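A minimal sketch of the escalation flow described above, assuming a single control instance leader reacting to failure reports; the ControlLeader class, its probe callback, and the handle_reports()/run_eviction_protocol() names are hypothetical illustrations, not the actual protocol.

```python
# Sketch of the escalating investigation and eviction flow. ControlLeader,
# probe(), and handle_reports() are hypothetical names, not the actual design.

class ControlLeader:
    def __init__(self, hosts, probe):
        self.hosts = set(hosts)        # hosts running engine instances
        self.probe = probe             # probe(src, dst) -> True if dst reachable
        self.eviction_list = set()

    def handle_reports(self, reports):
        """reports: list of (reporting_host, neighbor_host) failure indications."""
        first, rest = reports[0], reports[1:]
        if rest:
            # Additional failure reports arrived while the pair-wise focused
            # investigation would still be running: widen the investigation.
            self._wide_investigation()
        else:
            self._pairwise_investigation(*first)

    def _pairwise_investigation(self, reporter, neighbor):
        # Check the peer-to-peer connection between the two hosts; blame the
        # host that the control cluster itself cannot reach.
        if not (self.probe(reporter, neighbor) and self.probe(neighbor, reporter)):
            for h in (reporter, neighbor):
                if not self.probe("control", h):
                    self.eviction_list.add(h)

    def _wide_investigation(self):
        # Check connections between the control cluster and every host.
        for h in self.hosts:
            if not self.probe("control", h):
                self.eviction_list.add(h)

    def run_eviction_protocol(self):
        evicted, self.eviction_list = self.eviction_list, set()
        self.hosts -= evicted          # evict from the engine cluster
        return evicted

# Toy usage: host3 is unreachable from every other host and from the control cluster.
reachable = lambda src, dst: dst != "host3"
leader = ControlLeader(["host1", "host2", "host3"], reachable)
leader.handle_reports([("host1", "host3"), ("host2", "host3")])
print(leader.run_eviction_protocol())   # {'host3'}
```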
Abstract:
In a computer, each of many statement plan trees represents a distinct database statement in a database workload. Each statement plan tree contains a distinct set of tree nodes. A first statement plan tree contains a first subtree and represents a first database statement. A second statement plan tree contains a second subtree, and a third statement plan tree contains a third subtree. By agglomeration, a first cluster subplan is generated that represents the first subtree of the first statement plan tree and the second subtree of the second statement plan tree. By subsequent agglomeration, a second cluster subplan is generated that represents the third subtree of the third statement plan tree and the first cluster subplan. Execution of the first database statement uses the second cluster subplan for acceleration. Agglomeration may be decided based on novel net benefit estimation, novel inter-cluster distance, and a novel and tunable compilation cost.
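A minimal sketch of agglomerating plan subtrees into cluster subplans, assuming each subtree can be summarized as a set of operator labels; the Jaccard-style inter-cluster distance and the net-benefit formula with a tunable compile_cost are placeholder stand-ins for the novel estimators the abstract refers to.

```python
# Sketch of agglomerative clustering of statement plan subtrees into cluster
# subplans. distance() and net_benefit() are simple placeholders.

from itertools import combinations

def distance(a, b):
    """Placeholder inter-cluster distance: Jaccard distance over operator sets."""
    return 1.0 - len(a & b) / len(a | b)

def net_benefit(a, b, compile_cost):
    """Placeholder net-benefit estimate: similarity gained by sharing a subplan
    minus a tunable compilation cost for building the shared cluster subplan."""
    return (1.0 - distance(a, b)) - compile_cost

def agglomerate(subtrees, compile_cost=0.25):
    # Each cluster subplan is represented here by the union of the operator
    # sets of the subtrees it covers.
    clusters = [frozenset(t) for t in subtrees]
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        # Pick the pair of clusters with the best estimated net benefit.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: net_benefit(clusters[ij[0]], clusters[ij[1]], compile_cost))
        if net_benefit(clusters[i], clusters[j], compile_cost) > 0:
            clusters.append(clusters[i] | clusters[j])      # new cluster subplan
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged = True
    return clusters

# Example: subtrees of three statement plan trees, summarized as operator sets.
t1 = {"scan:orders", "filter:status", "join:customers"}
t2 = {"scan:orders", "filter:status"}
t3 = {"scan:orders", "filter:status", "join:customers", "agg:sum"}
print(agglomerate([t1, t2, t3]))
```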
Abstract:
A shared-nothing database system is provided in which parallelism and workload balancing are increased by assigning the rows of each table to “slices”, and storing multiple copies (“duplicas”) of each slice across the persistent storage of multiple nodes of the shared-nothing database system. When the data for a table is distributed among the nodes of a shared-nothing system in this manner, requests to read data from a particular row of the table may be handled by any node that stores a duplica of the slice to which the row is assigned. For each slice, a single duplica of the slice is designated as the “primary duplica”. All DML operations (e.g. inserts, deletes, updates, etc.) that target a particular row of the table are performed by the node that has the primary duplica of the slice to which the particular row is assigned. The changes made by the DML operations are then propagated from the primary duplica to the other duplicas (“secondary duplicas”) of the same slice.
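A minimal sketch of how reads and DML could be routed under the slice/duplica layout described above, assuming a fixed placement table and a hash-based slice assignment; the node names, placement, and helper functions are illustrative, not the actual system.

```python
# Sketch of slice/duplica routing: any duplica can serve reads, the primary
# duplica performs DML and then propagates to the secondaries.

import random

NUM_SLICES = 4
# For each slice: the node holding the primary duplica, followed by the
# nodes holding the secondary duplicas (hypothetical placement).
PLACEMENT = {
    0: ["node1", "node2", "node3"],
    1: ["node2", "node3", "node1"],
    2: ["node3", "node1", "node2"],
    3: ["node1", "node3", "node2"],
}

def slice_of(row_key):
    # Rows are assigned to slices; hashing the key is one possible scheme.
    return hash(row_key) % NUM_SLICES

def node_for_read(row_key):
    # A read may be handled by any node that stores a duplica of the row's slice.
    return random.choice(PLACEMENT[slice_of(row_key)])

def nodes_for_dml(row_key):
    # DML is performed by the node with the primary duplica, and the change is
    # then propagated to the secondary duplicas of the same slice.
    primary, *secondaries = PLACEMENT[slice_of(row_key)]
    return primary, secondaries

print(node_for_read("order#42"))
print(nodes_for_dml("order#42"))
```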
Abstract:
Systems and methods for reducing latency of probing operations of remotely located linear hash tables are described herein. In an embodiment, a system receives a request to perform a probing operation on a remotely located linear hash table based on a key value. Prior to performing the probing operation, the system dynamically predicts a number of slots for a single read of the linear hash table to minimize total cost for an average probing operation. The system determines a hash value based on the key value and determines a slot of the linear hash table to which the hash value corresponds. After predicting the number of slots, the system issues an RDMA request to perform a read of the predicted number of slots from the linear hash table starting at the slot to which the hash value corresponds.
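A minimal sketch of the size-predicted probe, assuming a simple cost model (per-read overhead plus per-slot transfer time) and the classic expected probe length for linear probing; predict_slots(), rdma_read(), and the constants are hypothetical stand-ins, not the actual prediction or real RDMA verbs.

```python
# Sketch of a size-predicted probe of a remote linear hash table: decide the
# read size first, then issue one read starting at the slot the hash maps to.

import math

def predict_slots(load_factor, per_read_overhead_us=2.0, per_slot_us=0.05):
    """Pick a read size n minimizing expected cost of an average probe:
    (expected reads) * per-read overhead + n * per-slot transfer cost.
    Uses the classic expected probe length for linear probing."""
    expected_probe = 0.5 * (1 + 1 / (1 - load_factor))
    best_n, best_cost = 1, float("inf")
    for n in range(1, 65):
        reads = math.ceil(expected_probe / n)
        cost = reads * per_read_overhead_us + n * per_slot_us
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n

def rdma_read(table, start, count):
    # Hypothetical stand-in for a one-sided RDMA read of `count` slots,
    # wrapping around the table as linear hash tables do.
    return [table[(start + i) % len(table)] for i in range(count)]

def probe(table, key, load_factor):
    n = predict_slots(load_factor)        # predicted before the probe starts
    start = hash(key) % len(table)        # slot to which the hash corresponds
    for slot in rdma_read(table, start, n):
        if slot is not None and slot[0] == key:
            return slot[1]
    return None                           # may require a follow-up read

table = [None] * 16
table[hash("order#7") % 16] = ("order#7", "row-bytes")
print(probe(table, "order#7", load_factor=0.6))
```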
Abstract:
Techniques are described for offloading remote direct memory operations (RDMOs) to “execution candidates”. The execution candidates may be any hardware capable of performing the offloaded operation. Thus, the execution candidates may be network interface controllers, specialized co-processors, FPGAs, etc. The execution candidates may be on a machine that is remote from the processor that is offloading the operation, or may be on the same machine as that processor. Details for certain specific RDMOs, which are particularly useful in online transaction processing (OLTP) and hybrid transactional/analytical processing (HTAP) workloads, are provided.
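A minimal sketch of dispatching an offloaded operation to an execution candidate, assuming a registry of candidates that each advertise the operations they support; the ExecutionCandidate class, the choose() policy, and the toy conditional-write operation are illustrative assumptions only.

```python
# Sketch of routing an offloaded operation to whichever execution candidate
# (NIC, co-processor, FPGA, local CPU, ...) can perform it.

from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class ExecutionCandidate:
    name: str                                   # e.g. "nic", "fpga", "local-cpu"
    supported: Dict[str, Callable] = field(default_factory=dict)

    def execute(self, op, *args):
        return self.supported[op](*args)

def choose(candidates, op):
    # Prefer the first candidate that supports the operation; a real policy
    # could weigh locality, load, or latency instead.
    for c in candidates:
        if op in c.supported:
            return c
    raise RuntimeError(f"no execution candidate supports {op}")

# A toy "conditional write" operation offloaded to whichever candidate offers it.
memory = {}
nic = ExecutionCandidate("nic", {"cond_write": lambda k, v, expect: (
    memory.__setitem__(k, v) if memory.get(k) == expect else None)})
cpu = ExecutionCandidate("local-cpu", {})

target = choose([nic, cpu], "cond_write")
target.execute("cond_write", "row7", "new", None)
print(target.name, memory)
```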
Abstract:
Techniques are provided to allow more sophisticated operations to be performed remotely by machines that are not fully functional. Operations that can be performed reliably by a machine that has experienced a hardware and/or software error are referred to herein as Remote Direct Memory Operations or “RDMOs”. Unlike RDMAs, which typically involve trivially simple operations such as the retrieval of a single value from the memory of a remote machine, RDMOs may be arbitrarily complex. The techniques described herein can help applications run without interruption when there are software faults or glitches on a remote system with which they interact.
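A minimal sketch contrasting a pointer-chasing task done with two one-sided reads against the same task done as a single RDMO executed at the remote side; the remote layout and rdmo_follow_pointer() are hypothetical, and only the round-trip count is the point.

```python
# Sketch: a multi-step task costs several RDMA round trips from the requester,
# but a single round trip when shipped as one RDMO to the remote side.

remote = {"head": "slot3", "slot3": 42}      # a pointer plus the value it names

def via_rdma_reads(remote):
    round_trips = 0
    ptr = remote["head"]; round_trips += 1   # first one-sided read: the pointer
    val = remote[ptr];    round_trips += 1   # second read: the value it points to
    return val, round_trips

def rdmo_follow_pointer(remote, start):
    # The whole multi-step operation runs at the remote side as one RDMO,
    # so the requester pays a single round trip regardless of the steps.
    return remote[remote[start]]

def via_rdmo(remote):
    return rdmo_follow_pointer(remote, "head"), 1

print(via_rdma_reads(remote))   # (42, 2)
print(via_rdmo(remote))         # (42, 1)
```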
Abstract:
A hashing scheme includes a cache-friendly, latchless, non-blocking, dynamically resizable hash index with constant-time lookup operations that is also amenable to fast lookups via remote memory access. Specifically, the hashing scheme provides each of the following features: latchless reads, fine-grained lightweight locks for writers, non-blocking dynamic resizability, cache-friendly access, constant-time lookup operations, and amenability to remote memory access via the RDMA protocol through one-sided read operations as well as non-RDMA access.
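A minimal sketch of the reader/writer split listed above, assuming a seqlock-style per-bucket version counter for latchless reads and a per-bucket lock for writers; the version-validation scheme is an assumed illustration, and dynamic resizing is omitted.

```python
# Sketch: readers take no latch and instead validate a per-bucket version;
# writers serialize through a fine-grained lock on their bucket only.

import threading

NUM_BUCKETS = 8

class Bucket:
    def __init__(self):
        self.version = 0                  # even: stable, odd: write in progress
        self.entries = {}                 # key -> value
        self.lock = threading.Lock()      # fine-grained lightweight writer lock

buckets = [Bucket() for _ in range(NUM_BUCKETS)]

def put(key, value):
    b = buckets[hash(key) % NUM_BUCKETS]
    with b.lock:                          # writers serialize per bucket only
        b.version += 1                    # odd: readers will retry
        b.entries[key] = value
        b.version += 1                    # even again: bucket is stable

def get(key):
    b = buckets[hash(key) % NUM_BUCKETS]
    while True:                           # latchless read: no lock taken
        v1 = b.version
        value = b.entries.get(key)
        if v1 % 2 == 0 and b.version == v1:
            return value                  # the snapshot was consistent
        # otherwise a writer interleaved; retry the read

put("k", 1)
print(get("k"))
```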
Abstract:
A method and apparatus for reconfiguring hardware structures to pipeline the execution of multiple special purpose hardware implemented functions, without saving intermediate results to memory, is provided. Pipelining functions in a program is typically performed by a first function saving its results (the “intermediate results”) to memory, and a second function subsequently accessing the memory to use the intermediate results as input. Saving and accessing intermediate results stored in memory incurs a heavy performance penalty, requires more power, consumes more memory bandwidth, and increases the memory footprint. Because the input and output of the hardware structures can be redirected, intermediate results are passed directly from one special purpose hardware implemented function to another without being stored in memory. Consequently, a program that utilizes the method or apparatus reduces power consumption, consumes less memory bandwidth, and reduces the program's memory footprint.
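A minimal sketch of the contrast described above, with two ordinary functions standing in for special purpose hardware implemented functions: one path stages the intermediate result through a memory dictionary, the other redirects the first function's output straight into the second.

```python
# Sketch: staging an intermediate result in memory versus feeding one
# function's output directly into the next function in the pipeline.

def decompress(block):          # stand-in for one special purpose function
    return [b * 2 for b in block]

def filter_rows(rows):          # stand-in for a second function
    return [r for r in rows if r > 4]

def with_intermediate_memory(block, memory):
    memory["scratch"] = decompress(block)     # intermediate result written out
    return filter_rows(memory["scratch"])     # and then read back in

def pipelined(block):
    # Output of the first function is redirected straight into the second,
    # so no intermediate result is ever materialized in memory.
    return filter_rows(decompress(block))

memory = {}
print(with_intermediate_memory([1, 2, 3], memory), memory)
print(pipelined([1, 2, 3]))
```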
Abstract:
A method and apparatus for sending and receiving messages between nodes on a compute cluster is provided. Communication between nodes on a compute cluster, which do not share physical memory, is performed by passing messages over an I/O subsystem. Typically, each node includes a synchronization mechanism, a thread ready to receive connections, and other threads to process and reassemble messages. Frequently, a separate queue is maintained in memory for each node on the I/O subsystem sending messages to the receiving node. Such overhead increases latency and limits message throughput. Due to a specialized coprocessor running on each node, messages on an I/O subsystem are sent, received, authenticated, synchronized, and reassembled at a faster rate and with lower latency. Additionally, the memory structure used may reduce memory consumption by storing messages from multiple sources in a single structure, eliminating the need for per-source queues.
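A minimal sketch of storing fragments from multiple sources in one shared structure and reassembling them by source and sequence number, instead of keeping a queue per sender; the framing fields and the reassemble() helper are illustrative assumptions.

```python
# Sketch: one shared receive structure for all senders; fragments carry a
# source id and sequence number so messages can be reassembled per source.

from collections import defaultdict

shared_buffer = []        # single structure shared by all sources

def receive(source, seq, last, payload):
    # The receive path just appends; no per-source queue is maintained.
    shared_buffer.append((source, seq, last, payload))

def reassemble():
    # Group fragments by source and stitch them back together in order.
    by_source = defaultdict(list)
    for source, seq, last, payload in shared_buffer:
        by_source[source].append((seq, last, payload))
    messages = {}
    for source, frags in by_source.items():
        frags.sort()
        if frags and frags[-1][1]:                      # final fragment seen
            messages[source] = "".join(p for _, _, p in frags)
    return messages

receive("nodeA", 0, False, "hel")
receive("nodeB", 0, True,  "ping")
receive("nodeA", 1, True,  "lo")
print(reassemble())     # {'nodeA': 'hello', 'nodeB': 'ping'}
```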