Abstract:
A method of efficiently providing estimated answers to workloads of aggregate, multi-join SQL-like queries over a number of input data-streams. The method examines each data element only once and uses a limited amount of computer memory. The method uses join graphs and atomic sketches, which are essentially pseudo-random summaries formed using random binary variables. The estimated answer is the product of the atomic sketches for all the vertices in the query join graph. A query workload is processed efficiently by identifying and sharing atomic sketches common to distinct queries, while ensuring that the join graphs remain well formed. The method may automatically minimize either the average query error or the maximum query error over the workload.
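As a minimal sketch of the atomic-sketch idea for the simplest case, a single equi-join of two streams, the Python below keeps one counter per random seed and multiplies matching counters. The names `xi`, `SketchArray`, and `estimate_join_size` are hypothetical, the hash-based +1/-1 variables stand in for whatever random binary variables the method actually uses, and the median-of-means averaging is a common variance-reduction choice assumed here, not something the abstract states.

```python
import hashlib
from statistics import mean, median


def xi(seed: int, value) -> int:
    """Pseudo-random +1/-1 variable for (seed, value); hash-based stand-in (assumption)."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


class SketchArray:
    """Several independent atomic sketches of one stream, updated in a single pass."""

    def __init__(self, num_copies: int):
        self.counters = [0] * num_copies

    def update(self, value, count: int = 1) -> None:
        # Each stream element is examined once and folded into every counter.
        for seed in range(len(self.counters)):
            self.counters[seed] += count * xi(seed, value)


def estimate_join_size(a: SketchArray, b: SketchArray, groups: int = 5):
    """Median of group means of the per-seed sketch products."""
    products = [x * y for x, y in zip(a.counters, b.counters)]
    size = len(products) // groups
    means = [mean(products[i * size:(i + 1) * size]) for i in range(groups)]
    return median(means)


r, s = SketchArray(50), SketchArray(50)
for value in ["a", "a", "b", "c"]:
    r.update(value)
for value in ["a", "b", "b", "d"]:
    s.update(value)
print(estimate_join_size(r, s))  # approximates the true join size of 4
```

Both streams must use the same seeds for the join attribute, since the product is only meaningful when the random variables cancel on matching values. Sharing sketches across queries and the error-minimizing space allocation described above are outside the scope of this illustration.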
Abstract:
A distinct-count estimate is obtained in a guaranteed small footprint using a two-level hash distinct-count sketch. A first hash fills the first-level hash buckets with an exponentially decreasing number of data-elements. These are then uniformly hashed to an array of second-level hash tables, and have an associated total-element counter and bit-location counters. These counters are used to identify singletons and so provide a distinct-sample and a distinct-count. An estimate of the total distinct-count is obtained by dividing the distinct-count by the probability of mapping a data-element to that bucket. An estimate of the total distinct-source frequencies of destination addresses can be found in a similar fashion. By further associating the distinct-count sketch with a list of singletons, a total singleton count, and a heap containing the destination addresses ordered by their distinct-source frequencies, a tracking distinct-count sketch may be formed that has considerably improved query time.
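The counter-based singleton test can be illustrated with a simplified Python sketch. It assumes non-negative integer elements that fit in `bits` bits and net element counts that never go negative. `Bucket`, `DistinctCountSketch`, and `_hash` are hypothetical names, and a single second-level table per level is used for brevity. The estimator shown is one plausible reading: it scales the distinct sample collected from the top levels by the inverse of the probability of landing in those levels.

```python
import hashlib


def _hash(tag: str, value: int) -> int:
    """64-bit hash of (tag, value); stand-in for the method's hash functions."""
    data = f"{tag}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class Bucket:
    """Total-element counter plus one counter per bit location."""

    def __init__(self, bits: int):
        self.total = 0
        self.bit_counts = [0] * bits

    def update(self, value: int, delta: int) -> None:
        self.total += delta
        for i in range(len(self.bit_counts)):
            if (value >> i) & 1:
                self.bit_counts[i] += delta

    def singleton(self):
        """Return the single distinct value in the bucket, or None."""
        if self.total <= 0:
            return None
        value = 0
        for i, count in enumerate(self.bit_counts):
            if count == self.total:
                value |= 1 << i          # every element has this bit set
            elif count != 0:
                return None              # elements disagree on this bit
        return value


class DistinctCountSketch:
    def __init__(self, levels: int = 16, buckets: int = 64, bits: int = 32):
        self.table = [[Bucket(bits) for _ in range(buckets)] for _ in range(levels)]

    def update(self, value: int, delta: int = 1) -> None:
        h = _hash("level", value)
        # Level j is chosen with probability about 2^-(j+1) (trailing zero bits).
        level = (h & -h).bit_length() - 1 if h else len(self.table) - 1
        level = min(level, len(self.table) - 1)
        row = self.table[level]
        row[_hash("slot", value) % len(row)].update(value, delta)

    def estimate(self) -> int:
        sample = set()
        for level in range(len(self.table) - 1, -1, -1):
            found = set()
            for bucket in self.table[level]:
                if bucket.total == 0:
                    continue
                value = bucket.singleton()
                if value is None:
                    # Collision: scale the sample from the levels above,
                    # which an element reaches with probability 2^-(level+1).
                    return len(sample) * 2 ** (level + 1)
                found.add(value)
            sample |= found
        return len(sample)               # no collisions: the sample is complete
```

Because updates take a signed `delta`, deletions are handled by the same counters. The singleton list and heap of the tracking variant are not modeled here.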
Abstract:
A method of estimating set-expression cardinalities over data streams with guaranteed small maintenance time per data-element update. The method only examines each data element once and uses a limited amount of memory. The time-efficient stream synopsis extends 2-level hash-sketches by randomly, but uniformly, pre-hashing data-elements prior to logarithmically hashing them to a first-level hash-table. This generates a set of independent 2-level hash-sketches. The set-union cardinality can be estimated by determining the smallest hash-bucket index j at which only a predetermined fraction of the b hash-buckets has a non-empty union |A∪B|. Once a set-union cardinality is estimated, general set-expression cardinalities may be estimated by counting witness elements for the set-expression, i.e., those first-level hash-buckets that are both a singleton for the set-expression and a set-union singleton. The set-expression cardinality is the set-union cardinality times the number of witness elements divided by the number of hash-buckets.
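The witness-counting step can be illustrated for the set difference A − B with a small Python sketch. For readability, each bucket is represented by explicit sets of the distinct elements that hashed to it, whereas a real synopsis would infer singletons from counters. The function name `estimate_difference` is hypothetical, and the ratio is normalized by the number of union-singleton buckets, which is an assumption about which buckets the abstract's "number of hash-buckets" refers to.

```python
def estimate_difference(union_estimate: float, buckets) -> float:
    """Estimate |A - B| from a set-union estimate and per-sketch buckets.

    buckets: list of (elems_a, elems_b) pairs, one per independent sketch,
    holding the distinct elements of A and B that fell into the chosen
    first-level bucket (explicit sets are used only for illustration).
    """
    singletons = 0
    witnesses = 0
    for elems_a, elems_b in buckets:
        if len(elems_a | elems_b) != 1:
            continue                     # not a set-union singleton
        singletons += 1
        if elems_a and not elems_b:
            witnesses += 1               # the lone element is in A but not B
    if singletons == 0:
        return 0.0
    return union_estimate * witnesses / singletons


observed = [({1}, set()), ({2}, {2}), (set(), {3}),
            ({4}, set()), ({5, 6}, set()), (set(), set())]
print(estimate_difference(100, observed))  # 50.0
```

Other set expressions only change the witness condition, for example `elems_a and elems_b` for the intersection.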
Abstract:
A method of estimating an aggregate of a join over data-streams in real-time using skimmed sketches, that only examines each data element once and has a worst case space requirement of O(n²/J), where J is the size of the join and n is the number of data elements. The skimmed sketch is an atomic sketch, formed as the inner product of the data-stream frequency vector and a random binary variable, from which the frequency values that exceed a predetermined threshold have been skimmed off and placed in a dense frequency vector. The join size is estimated as the sum of the sub-joins of skimmed sketches and dense frequency vectors. The atomic sketches may be arranged in a hash structure so that processing a data element only requires updating a single sketch per hash table. This keeps the per-element overhead logarithmic in the domain and stream sizes.
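The decomposition of the join size into sub-joins can be checked on exact frequency vectors with a short Python sketch. This is an illustration of the arithmetic only: it computes every sub-join exactly from full frequency counts, whereas the method would estimate the terms involving the skimmed (sparse) part from sketches. The names `skim`, `inner`, and `join_size_by_subjoins` are hypothetical.

```python
from collections import Counter


def skim(freq, threshold):
    """Split a frequency vector into dense (> threshold) and residual parts."""
    dense = {v: c for v, c in freq.items() if c > threshold}
    sparse = {v: c for v, c in freq.items() if c <= threshold}
    return dense, sparse


def inner(f, g):
    """Inner product of two sparse frequency vectors stored as dicts."""
    return sum(c * g.get(v, 0) for v, c in f.items())


def join_size_by_subjoins(stream_f, stream_g, threshold):
    fd, fs = skim(Counter(stream_f), threshold)
    gd, gs = skim(Counter(stream_g), threshold)
    # dense-dense + dense-sparse + sparse-dense + sparse-sparse
    return inner(fd, gd) + inner(fd, gs) + inner(fs, gd) + inner(fs, gs)


f = [1] * 5 + [2] + [3] * 2
g = [1] * 3 + [2] * 4 + [4]
print(join_size_by_subjoins(f, g, threshold=2))  # 19, same as the direct join size
```

Skimming the large frequencies out of the sketch leaves only small residual values, which is what keeps the variance of the sketched sub-joins low.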
Abstract:
A method and system for answering set-expression cardinality queries while lowering data communication costs. A coordinator site provides global knowledge of the distribution of certain frequently occurring stream elements, which significantly reduces the transmission of element state information to the central site. Optionally, the semantics of the input set expression are captured in a Boolean logic formula, and models of the formula are used to determine whether an element state change at a remote site can affect the set expression result.
Abstract:
A method for performing information-preserving DTD schema embeddings when matching a source schema and a target schema. The preservation is realized by a matching process between the two schemas that finds a first string marking of the target schema, evaluates the legality of the first string marking, determines an estimated minimal cost of the first string marking, and subsequently adjusts the estimated minimal cost based upon a one-to-one mapping of source schema and target schema subcomponents.
Abstract:
The invention comprises a method and apparatus for determining a rank of a query value. Specifically, the method comprises receiving a rank query request, determining, for each of the at least one remote monitor, a predicted lower-bound rank value and upper-bound rank value, wherein the predicted lower-bound rank value and upper-bound rank value are determined according to at least one respective prediction model used by each of the at least one remote monitor to compute the at least one local quantile summary, computing a predicted average rank value for each of the at least one remote monitor using the at least one predicted lower-bound rank value and the at least one predicted upper-bound rank value associated with the respective at least one remote monitor, and computing the rank of the query value using the at least one predicted average rank value associated with the respective at least one remote monitor.
Abstract:
The invention provides methods and systems for summarizing multiple continuous update streams using corresponding multiple (parallel) JD Sketch data structures such that, for example, an approximate answer to a query requiring a join operation followed by a duplicate elimination step may be rapidly provided.