Abstract:
Techniques are described for executing an analytical query with a top-N clause. In an embodiment, each of the processing units receives a stream of tuples from a data source identified in the query. The processing unit uses a portion of a received tuple to identify the partition that the tuple is assigned to. For each partition, the processing unit maintains a top-N data store that stores the N received tuples that match the query's criteria for the top N tuples. Each received tuple is compared to the N stored tuples to determine whether to store the received tuple and discard an already stored tuple, or to discard the received tuple. After all the tuples have been similarly processed by the processing units, the top-N data stores for each partition are merged, yielding the top N tuples for each partition to return as a result of the query.
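The per-partition top-N store and the final merge can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `local_top_n` and `merge_top_n`, the use of a min-heap as the top-N data store, and the assumption that "top" means largest by an ordering key are all choices made here for illustration.

```python
import heapq
from collections import defaultdict


def local_top_n(tuples, n, partition_key, order_key):
    """One processing unit: keep at most n tuples per partition.

    Assumption: the top-N data store is a min-heap, so the weakest
    stored tuple sits at heap[0] and is the one to evict.
    """
    stores = defaultdict(list)
    for seq, t in enumerate(tuples):
        heap = stores[partition_key(t)]
        # seq is unique per unit, so the tuples themselves are never compared
        entry = (order_key(t), seq, t)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # store the received tuple and discard an already stored one
            heapq.heapreplace(heap, entry)
        # otherwise the received tuple is discarded
    return stores


def merge_top_n(all_stores, n):
    """Merge the per-unit stores and keep the top n tuples per partition."""
    merged = defaultdict(list)
    for stores in all_stores:
        for part, heap in stores.items():
            merged[part].extend(heap)
    return {
        part: [t for _, _, t in heapq.nlargest(n, entries, key=lambda e: e[0])]
        for part, entries in merged.items()
    }


# Two hypothetical processing units, each seeing part of the stream.
unit1 = local_top_n([("a", 5), ("a", 9), ("b", 1), ("a", 7)], 2,
                    lambda t: t[0], lambda t: t[1])
unit2 = local_top_n([("a", 8), ("b", 4), ("b", 2)], 2,
                    lambda t: t[0], lambda t: t[1])
result = merge_top_n([unit1, unit2], 2)
# result == {"a": [("a", 9), ("a", 8)], "b": [("b", 4), ("b", 2)]}
```

Because each unit holds at most N tuples per partition, the merge step handles at most N tuples per partition per unit, regardless of the size of the input stream.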
Abstract:
Herein is described a data placement scheme for a distributed query processing system that achieves load balance amongst the nodes of the system. To identify a node on which to place particular data, a supervisor node performs a placement algorithm over the particular data's identifier, where the placement algorithm utilizes two or more hash functions. The supervisor node runs the placement algorithm until a destination node is identified that is available to store the data, or until the supervisor node has run the placement algorithm an established number of times. If no available node is identified using the placement algorithm, then an available destination node is identified for the particular data, and information identifying the data and the selected destination node is included in an exception map. Most data may be located by any node in the system based on the node performing the placement algorithm for the required data.
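A rough Python sketch of such a placement loop follows. Everything concrete in it is an assumption for illustration: the abstract does not say how the hash functions are combined (double hashing is used here), how availability is decided (a simple per-node capacity), or how the fallback node is chosen (the least-loaded node). The names `candidate_node`, `place`, and `locate` are hypothetical.

```python
import hashlib


def _hash(identifier, salt):
    """Derive an independent hash function from SHA-256 by salting."""
    digest = hashlib.sha256(f"{salt}:{identifier}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def candidate_node(identifier, attempt, num_nodes):
    """Candidate for a given attempt, combining two hash functions
    (double hashing; an assumption, not taken from the abstract)."""
    h1 = _hash(identifier, "h1")
    h2 = _hash(identifier, "h2")
    step = 1 + h2 % (num_nodes - 1) if num_nodes > 1 else 0
    return (h1 + attempt * step) % num_nodes


def place(identifier, load, capacity, max_attempts, exception_map):
    """Supervisor side: pick a destination node for the identifier."""
    num_nodes = len(load)
    for attempt in range(max_attempts):
        node = candidate_node(identifier, attempt, num_nodes)
        if load[node] < capacity:  # node is available
            load[node] += 1
            return node
    # No candidate was available: fall back to the least-loaded node
    # and record the decision in the exception map.
    node = min(range(num_nodes), key=load.__getitem__)
    if load[node] >= capacity:
        raise RuntimeError("no node has free capacity")
    load[node] += 1
    exception_map[identifier] = node
    return node


def locate(identifier, num_nodes, max_attempts, exception_map):
    """Any node: list the nodes that may hold the identifier."""
    if identifier in exception_map:
        return [exception_map[identifier]]
    return [candidate_node(identifier, a, num_nodes) for a in range(max_attempts)]
```

In this sketch, only identifiers that needed the fallback consume space in the exception map; all others are found by recomputing the same candidates that the supervisor tried.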
Abstract:
Techniques are described for executing a query with a top-N clause that selects the first N rows of a data source arranged according to at least a first key and a second key of the data source, using a first sort order that the query specifies for the first key and a second sort order that the query specifies for the second key. The data source may include one or more tiles that include at least a portion of the first key and the second key. To execute the query, in an embodiment, a DBMS determines, in a first vector of first key values that are in a first tile, row identifiers identifying entries of the first vector that contain values equal to a tail value that follows a particular top number of the first key values. The DBMS may select, from a second vector of values of the second key in the first tile, second key values identified based on the determined row identifiers of the first vector. In an embodiment, the DBMS generates a result set of the query that includes at least a value from the second key values selected from the second vector based on the determined row identifiers.
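One way to read the tail-value idea is that the second key only needs to be consulted for rows whose first key ties at the cut-off. The Python sketch below illustrates that reading for a single tile; it is an interpretation rather than the described DBMS logic. It assumes both sort orders are ascending, treats the tail value as the N-th smallest first-key value, and uses a hypothetical function name `top_n_two_keys`.

```python
def top_n_two_keys(key1, key2, n):
    """Return row ids of the first n rows ordered by key1, then key2.

    key1 and key2 are equal-length column vectors for one tile.
    Assumption: both sort orders are ascending.
    """
    n = min(n, len(key1))
    if n <= 0:
        return []
    # Tail value: the first-key value at the cut-off position.
    tail = sorted(key1)[n - 1]
    # Rows strictly before the tail value always qualify.
    sure = [r for r, v in enumerate(key1) if v < tail]
    # Rows equal to the tail value compete for the remaining slots.
    tied = [r for r, v in enumerate(key1) if v == tail]
    # Only the tied rows need the second vector to decide membership.
    tied.sort(key=lambda r: key2[r])
    sure.sort(key=lambda r: (key1[r], key2[r]))
    return sure + tied[: n - len(sure)]


rows = top_n_two_keys([3, 1, 2, 2, 5, 2], [9, 4, 7, 1, 0, 3], 3)
# rows == [1, 3, 5]
```

In the example, row 1 qualifies on the first key alone, while rows 2, 3, and 5 tie at the tail value 2, so their second-key values (7, 1, 3) decide that rows 3 and 5 fill the remaining two slots.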
Abstract:
Techniques are provided for scheduling data operations for a given query based upon a query-cost model that analyzes the cost of scheduling data operations based upon their operation cost and the type of resources needed for the operation. In an embodiment, a database server receives a set of operations for a query. The database server determines a set of leaf operation nodes from the set of data operations, where the set of leaf operation nodes includes operation nodes that do not depend on the execution of other nodes within the set of data operations. The database server compares operation costs between the leaf operation nodes to determine which leaf operation node to insert into a scheduled order set. The database server inserts the leaf operation node into the scheduled order set. Then the database server iteratively determines new leaf operation nodes and performs cost analysis on remaining leaf operation nodes to generate a set of scheduled data operations.
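The iterative leaf selection can be sketched in Python as follows. The abstract does not state which leaf wins the cost comparison or how resource types enter the cost model, so this sketch assumes the lowest-cost leaf is scheduled first and ignores resource types. The function name `schedule` and the operation names in the example are hypothetical.

```python
def schedule(costs, deps):
    """Order operations so that dependencies run first.

    costs: operation name -> operation cost
    deps:  operation name -> set of operations it depends on
    Assumption: among the current leaves, the cheapest is scheduled next.
    """
    # Track unresolved dependencies, ignoring any outside the operation set.
    remaining = {
        op: {d for d in deps.get(op, ()) if d in costs} for op in costs
    }
    order = []
    while remaining:
        # Leaf operation nodes: no unscheduled dependencies left.
        leaves = [op for op, pending in remaining.items() if not pending]
        if not leaves:
            raise ValueError("dependency cycle among operations")
        # Compare operation costs; break ties by name for determinism.
        chosen = min(leaves, key=lambda op: (costs[op], op))
        order.append(chosen)
        del remaining[chosen]
        # Scheduling an operation may turn its dependents into leaves.
        for pending in remaining.values():
            pending.discard(chosen)
    return order


plan = schedule(
    {"scan_a": 4, "scan_b": 2, "join": 5, "sort": 1},
    {"join": {"scan_a", "scan_b"}, "sort": {"join"}},
)
# plan == ["scan_b", "scan_a", "join", "sort"]
```

Although `sort` has the lowest cost overall, it is scheduled last because it only becomes a leaf after `join` has been placed in the scheduled order.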