摘要:
Methods and apparatus for representing probabilistic data using a probabilistic histogram are disclosed. An example method comprises partitioning a plurality of ordered data items into a plurality of buckets, each of the data items capable of having a data value from a plurality of possible data values with a probability characterized by a respective individual probability distribution function (PDF), each bucket associated with a respective subset of the ordered data items bounded by a respective beginning data item and a respective ending data item, and determining a first representative PDF for a first bucket associated with a first subset of the ordered data items by partitioning the plurality of possible data values into a first plurality of representative data ranges and respective representative probabilities based on an error between the first representative PDF and a first plurality of individual PDFs characterizing the first subset of the ordered data items.
摘要:
A method of distributed approximate query tracking relies on tracking general-purpose randomized sketch summaries of local streams at remote sites along with concise prediction models of local site behavior in order to produce highly communication-efficient and space/time-efficient solutions. A powerful approximate query tracking framework readily incorporates several complex analysis queries, including distributed join and multi-join aggregates and approximate wavelet representations, thus giving the first known low-overhead tracking solution for such queries in the distributed-streams model.
摘要:
The first fast solution to the problem of tracking wavelet representations of one-dimensional and multi-dimensional data streams based on a stream synopsis, the Group-Count Sketch (GCS) is provided. By imposing a hierarchical structure of groups over the data and applying the GCS, our algorithms can quickly recover the most important wavelet coefficients with guaranteed accuracy. A tradeoff between query time and update time is established, by varying the hierarchical structure of groups, allowing the right balance to be found for specific data streams. Experimental analysis confirmed this tradeoff, and showed that all the methods significantly outperformed previously known methods in terms of both update time and query time, while maintaining a high level of accuracy.
摘要:
Example methods and apparatus to construct histogram and wavelet synopses for probabilistic data are disclosed. A disclosed example method involves receiving probabilistic data associated with probability measures and generating a plurality of histograms based on the probabilistic data. Each histogram is generated based on items represented by the probabilistic data. In addition, each histogram is generated using a different quantity of buckets containing different ones of the items. An error measure associated with each of the plurality of histograms is determined and one of the plurality of histograms is selected based on its associated error measure. The method also involves displaying parameter information associated with the one of the plurality of histograms to represent the data.
摘要:
Example methods and apparatus to construct histogram and wavelet synopses for probabilistic data are disclosed. A disclosed example method involves receiving probabilistic data associated with probability measures and generating a plurality of histograms based on the probabilistic data. Each histogram is generated based on items represented by the probabilistic data. In addition, each histogram is generated using a different quantity of buckets containing different ones of the items. An error measure associated with each of the plurality of histograms is determined and one of the plurality of histograms is selected based on its associated error measure. The method also involves displaying parameter information associated with the one of the plurality of histograms to represent the data.
摘要:
The invention comprises a method and apparatus for determining a rank of a query value. Specifically, the method comprises receiving a rank query request, determining, for each of the at least one remote monitor, a predicted lower-bound rank value and upper-bound rank value, wherein the predicted lower-bound rank value and upper-bound rank value are determined according to at least one respective prediction model used by each of the at least one remote monitor to compute the at least one local quantile summary, computing a predicted average rank value for each of the at least one remote monitor using the at least one predicted lower-bound rank value and the at least one predicted upper-bound rank value associated with the respective at least one remote monitor, and computing the rank of the query value using the at least one predicted average rank value associated with the respective at least one remote monitor.
摘要:
Methods and apparatus for representing probabilistic data using a probabilistic histogram are disclosed. An example method comprises partitioning a plurality of ordered data items into a plurality of buckets, each of the data items capable of having a data value from a plurality of possible data values with a probability characterized by a respective individual probability distribution function (PDF), each bucket associated with a respective subset of the ordered data items bounded by a respective beginning data item and a respective ending data item, and determining a first representative PDF for a first bucket associated with a first subset of the ordered data items by partitioning the plurality of possible data values into a first plurality of representative data ranges and respective representative probabilities based on an error between the first representative PDF and a first plurality of individual PDFs characterizing the first subset of the ordered data items.
摘要:
A distinct-count estimate is obtained in a guaranteed small footprint using a two level hash, distinct count sketch. A first hash fills the first-level hash buckets with an exponentially decreasing number of data-elements. These are then uniformly hashed to an array of second-level-hash tables, and have an associated total-element counter and bit-location counters. These counters are used to identify singletons and so provide a distinct-sample and a distinct-count. An estimate of the total distinct-count is obtained by dividing by the distinct-count by the probability of mapping a data-element to that bucket. An estimate of the total distinct-source frequencies of destination address can be found in a similar fashion. By further associating the distinct-count sketch with a list of singletons, a total singleton count and a heap containing the destination addresses ordered by their distinct-source frequencies, a tracking distinct-count sketch may be formed that has considerably improved query time.
摘要:
A method of estimating set-expression cardinalities over data streams with guaranteed small maintenance time per data-element update. The method only examines each data element once and uses a limited amount of memory. The time-efficient stream synopsis extends 2-level hash-sketches by randomly, but uniformly, pre-hashing data-elements prior to logarithmically hashing them to a first-level hash-table. This generates a set of independent 2-level hash-sketches. The set-union cardinality can be estimated by determining the smallest hash-bucket index j at which only a predetermined fraction of the b hash-buckets has a non-empty union |A∪B|. Once a set-union cardinality is estimated, general set-expression cardinalities may be estimated by counting witness elements for the set-expression, i.e., those first-level hash-buckets that are both a singleton for the set-expression and a set-union singleton. The set-expression cardinality is the set-union cardinality times the number of witness elements divided by the number of hash-buckets.