Abstract:
A method and mechanism is described for reliably transmitting messages over an unreliable transport mechanism. A sender transmits a first message over an unreliable mechanism to a receiver, and a second message that identifies the first is transported over a reliable transport mechanism to the receiver. When the receiver receives the second message, the receiver determines whether it has received the first message. If not, the receiver requests retransmission of the first message.
Abstract:
Various techniques are described for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. Specifically, techniques are provided for changing the ownership of data in a shared-nothing database dynamically, based on factors such as which node would be the most efficient owner relative to the performance of a particular operation. Once determined, the ownership of the data may be changed permanently to the new owner, or temporarily for the duration of the particular operation.
Abstract:
Techniques are provided for handling distributed transactions in shared-nothing database systems where one or more of the nodes have access to a shared persistent storage. Rather than coordinate the distributed transaction using a two-phase commit protocol, the coordinator of the distributed transaction uses a one-phase commit protocol with those participants that have access to the transaction status information maintained by the coordinator. The transaction status information may reside, for example, in the redo log of the coordinator. In the case that the coordinator fails, those participants can determine the state of the distributed transaction based on information stored on the shared disk. In addition, the coordinator is able to determine whether it is possible to commit the distributed transaction based on information that is stored on the shared disk by the participants, without those participants entering a formal nullpreparednull state.
Abstract:
Various techniques are described for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. Specifically, techniques are provided for changing the ownership of data in a shared-nothing database without changing the location of the data on persistent storage. Because the persistent storage location for the data is not changed during a transfer of ownership of the data, ownership can be transferred more freely and with less of a performance penalty than would otherwise be incurred by a physical relocation of the data. Various techniques are also described for providing fast run-time reassignment of ownership. Because the reassignment can be performed during run-time, the shared-nothing system does not have to be taken offline to perform the reassignment. Further, the techniques describe how the reassignment can be performed with relatively fine granularity, avoiding the need to perform bulk reassignment of large amounts of data across all nodes merely to reassign ownership of a few data items on one of the nodes.
Abstract:
An improved method, mechanism, and system for implementing, generating, and maintaining records, such as redo records and redo logs in a database system, are disclosed. Multiple sets of records may be created and combined into a partially ordered (or non-ordered) group of records, which are later collectively ordered or sorted as needed to create an fully ordered set of records. With respect to a database system, redo generation bottleneck is minimized by providing multiple in-memory redo buffers that are available to hold redo records generated by multiple threads of execution. When the in-memory redo buffers are written to a persistent storage medium, no specific ordering needs to be specified with respect to the redo records from the different in-memory redo buffers. While the collective group of records may not be ordered, the written-out redo records may be partially ordered based upon the ordered redo records from within individual in-memory redo buffers. At recovery, ordering and/or merging of redo records may occur to satisfy database consistency requirements.
Abstract:
Various techniques are described for improving the performance of a multiple node system by allocating, in two or more nodes of the system, partitions of a shared cache. A mapping is established between the data items managed by the system, and the various partitions of the shared cache. When a node requires a data item, the node first determines which partition of the shared cache corresponds to the required data item. If the data item does not currently reside in the corresponding partition, the data item is loaded into the corresponding partition even if the partition does not reside on the same node that requires the data item. The node then reads the data item from the corresponding partition of the shared cache.
Abstract:
Various techniques are described for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. Specifically, techniques are provided for recovering the data owned by a failed node using multiple recovery nodes operating in parallel. The data owned by a failed node is reassigned to recovery nodes that have access to the shared disk on which the data resides. The recovery logs of the failed node are read by the recovery nodes, or by a coordinator process that distributes the recovery tasks to the recovery nodes.