摘要:
Methods and apparatus for representing probabilistic data using a probabilistic histogram are disclosed. An example method comprises partitioning a plurality of ordered data items into a plurality of buckets, each of the data items capable of having a data value from a plurality of possible data values with a probability characterized by a respective individual probability distribution function (PDF), each bucket associated with a respective subset of the ordered data items bounded by a respective beginning data item and a respective ending data item, and determining a first representative PDF for a first bucket associated with a first subset of the ordered data items by partitioning the plurality of possible data values into a first plurality of representative data ranges and respective representative probabilities based on an error between the first representative PDF and a first plurality of individual PDFs characterizing the first subset of the ordered data items.
摘要:
Methods and apparatus for representing probabilistic data using a probabilistic histogram are disclosed. An example method comprises partitioning a plurality of ordered data items into a plurality of buckets, each of the data items capable of having a data value from a plurality of possible data values with a probability characterized by a respective individual probability distribution function (PDF), each bucket associated with a respective subset of the ordered data items bounded by a respective beginning data item and a respective ending data item, and determining a first representative PDF for a first bucket associated with a first subset of the ordered data items by partitioning the plurality of possible data values into a first plurality of representative data ranges and respective representative probabilities based on an error between the first representative PDF and a first plurality of individual PDFs characterizing the first subset of the ordered data items.
摘要:
A technique for building probabilistic wavelet synopses for multi-measure data sets is provided. In the presence of multiple measures, it is demonstrated that the problem of exact probabilistic coefficient thresholding becomes significantly more complex. An algorithmic formulation for probabilistic multi-measure wavelet thresholding based on the idea of partial-order dynamic programming (PODP) is provided. A fast, greedy approximation algorithm for probabilistic multi-measure thresholding based on the idea of marginal error gains is provided. An empirical study with both synthetic and real-life data sets validated the approach, demonstrating that the algorithms outperform naive approaches based on optimizing individual measures independently and the greedy thresholding scheme provides near-optimal and, at the same time, fast and scalable solutions to the probabilistic wavelet synopsis construction problem.
摘要:
The invention relates to data warehouses and the ability to create and maintain data cubes of multi-dimensional data. More specifically, the invention pertains to data cube architectures that permit significant reduction of storage, exhibit very efficient retrieval and provide a very efficient incremental update of the data cubes.