Abstract:
Results of a database query are estimated by performing a sampling of weighted tuples in a database based on a probability of usage of tuples required in executing a workload. A probability is associated with each sampled tuple, and an aggregate is computed over the values in each sampled tuple while multiplying by the inverse of the probability associated with that tuple.
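As a rough illustration of this estimator, the Python sketch below samples tuples in proportion to a hypothetical `usage_prob` score and scales each kept value by its inverse inclusion probability; the helper, the `scale` knob, and SUM as the aggregate are assumptions for illustration, not the claimed method.

```python
import random

def weighted_sum_estimate(tuples, usage_prob, scale=0.1):
    """Estimate SUM(value) over `tuples` by sampling each tuple with a
    probability derived from its workload usage, then weighting every
    kept value by the inverse of that probability so the estimate is
    unbiased in expectation (a Horvitz-Thompson-style estimator)."""
    estimate = 0.0
    for t in tuples:
        p = min(1.0, usage_prob(t) * scale)  # inclusion probability for t
        if p > 0 and random.random() < p:    # keep t with probability p
            estimate += t["value"] / p       # multiply by the inverse probability
    return estimate
```

Because each kept value is divided by its own inclusion probability, tuples the workload rarely touches are sampled less often but count for more when they do appear, keeping the expected estimate equal to the true aggregate.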
Abstract:
Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data.
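A minimal Python sketch of this flow, assuming a numeric column, SUM as the aggregate, and illustrative outlier and sampling fractions (the abstract's index structure is omitted):

```python
import random
import statistics

def estimate_sum(values, outlier_frac=0.01, sample_frac=0.05):
    """Estimate SUM(values): aggregate outliers exactly, then sample
    and extrapolate the remaining data. Assumes len(values) is large
    enough that both the window and the sample are non-empty."""
    n = len(values)
    k = max(1, int(n * outlier_frac))          # how many outliers to prune
    window = n - k                             # size of the retained window
    data = sorted(values)
    # Slide a window of `window` consecutive sorted values and keep the
    # position with the lowest variance; values outside it are outliers.
    best_start, best_var = 0, float("inf")
    for start in range(k + 1):
        var = statistics.pvariance(data[start:start + window])
        if var < best_var:
            best_start, best_var = start, var
    outliers = data[:best_start] + data[best_start + window:]
    inliers = data[best_start:best_start + window]

    outlier_sum = sum(outliers)                # exact aggregate of outliers
    sample = random.sample(inliers, max(1, int(len(inliers) * sample_frac)))
    extrapolated = sum(sample) * len(inliers) / len(sample)
    return outlier_sum + extrapolated          # combined estimate
```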
Abstract:
A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records, such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.
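These are textbook single-pass techniques consistent with two of the semantics named above (WoR via reservoir sampling, CF via independent coin flips); they sketch the idea and are not presented as the server's internal algorithm:

```python
import random

def reservoir_sample_wor(stream, k):
    """Single-pass without-replacement (WoR) sample of k records from a
    stream, e.g. non-materialized tuples produced by a query pipeline
    (classic reservoir sampling)."""
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)   # fill the reservoir first
        else:
            j = random.randint(0, i)   # replace a slot with probability k/(i+1)
            if j < k:
                reservoir[j] = record
    return reservoir

def coin_flip_sample(stream, p):
    """Single-pass independent coin-flip (CF) sample: each record is
    kept independently with probability p."""
    return [r for r in stream if random.random() < p]
```

Both functions consume the stream exactly once and never materialize it, which is what makes them usable over records produced on the fly by a query pipeline.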
Abstract:
An outlier index for a database and a given workload is generated by identifying sub-relations of tuples in the database induced by the selection and group-by conditions in the workload's queries. A variance is then computed for the values in each sub-relation. Sub-relations having higher variances are selected, and outliers are generated from those sub-relations.
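A minimal sketch of this selection logic, assuming the workload-induced sub-relations have already been extracted as named lists of numeric values; the variance threshold and outlier fraction are hypothetical tuning parameters:

```python
import statistics

def build_outlier_index(sub_relations, variance_threshold, outlier_frac=0.01):
    """Index outliers only for high-variance sub-relations.

    `sub_relations` maps a name (derived from a query's selection and
    GROUP BY conditions) to the list of aggregate-column values it
    induces."""
    index = {}
    for name, values in sub_relations.items():
        if len(values) < 2:
            continue
        mu = statistics.mean(values)
        if statistics.pvariance(values, mu) <= variance_threshold:
            continue                           # low variance: sampling suffices
        k = max(1, int(len(values) * outlier_frac))
        ranked = sorted(values, key=lambda v: abs(v - mu), reverse=True)
        index[name] = ranked[:k]               # keep the k most extreme values
    return index
```

Restricting the index to high-variance sub-relations concentrates the outlier budget where uniform sampling alone would produce the largest estimation error.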
Abstract:
Using adaptive random sampling with cross-validation helps determine when enough data of a database has been sampled to construct histograms on one or more columns of one or more tables of the database within a desired or predetermined degree of accuracy. An adaptive random sampling histogram construction tool constructs an approximate equi-height k-histogram using an initial sample of data values from the database and iteratively updates the histogram using an additional sample of data values from the database until the histogram is within the desired degree of accuracy. The accuracy of the histogram is cross-validated against the additional sample at each iteration, and the additional sample is used to update the histogram to help improve its accuracy. The accuracy of the histogram may be measured by an error in distribution of the additional sample over the histogram as compared to a threshold error using a suitable error metric. By attempting to sample only the number of data values necessary to construct the histogram within the desired degree of accuracy, the adaptive random sampling histogram construction tool attempts to avoid any cost increases in time and memory from sampling too many data values.
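The loop below sketches this iterate-and-cross-validate structure for a numeric column; the specific error metric (worst relative bucket deviation), sample sizes, and stopping rule are illustrative assumptions, not the tool's actual measures:

```python
import bisect
import random

def adaptive_equiheight_histogram(data, k, init_size=1000, step_size=1000,
                                  max_error=0.1):
    """Build approximate equi-height k-bucket boundaries from a growing
    random sample, cross-validating against each fresh sample and
    stopping once the validation error falls below `max_error`."""
    sample = random.sample(data, min(init_size, len(data)))
    while True:
        sample.sort()
        # Equi-height boundaries: each bucket should hold ~len(sample)/k values.
        bounds = [sample[(i * len(sample)) // k] for i in range(1, k)]
        fresh = random.sample(data, min(step_size, len(data)))
        # Cross-validate: distribute the fresh sample over the buckets and
        # measure the worst relative deviation from the ideal equal share.
        counts = [0] * k
        for v in fresh:
            counts[bisect.bisect_right(bounds, v)] += 1
        ideal = len(fresh) / k
        error = max(abs(c - ideal) for c in counts) / ideal
        if error <= max_error or len(sample) >= len(data):
            return bounds
        sample.extend(fresh)  # reuse the validation sample to refine the histogram
```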
Abstract:
To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple when it fails to match exactly with any of the tuples in the reference relation. A disclosed similarity function that utilizes token substrings, referred to as q-grams, overcomes limitations of prior-art similarity functions while efficiently performing the fuzzy match process.
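A toy illustration of token q-grams and a similarity built from them; the padding character, q = 3, Jaccard scoring, and match threshold are assumptions for the sketch, not the disclosed similarity function:

```python
def string_qgrams(s, q=3):
    """Collect the q-grams (length-q substrings) of every token in s,
    padding each token so its prefix and suffix get their own grams."""
    grams = set()
    for token in s.lower().split():
        padded = "#" * (q - 1) + token + "#" * (q - 1)
        grams.update(padded[i:i + q] for i in range(len(padded) - q + 1))
    return grams

def qgram_similarity(a, b, q=3):
    """Jaccard overlap of the two strings' q-gram sets. A single-character
    edit disturbs at most q grams, so small spelling errors leave most
    grams intact and the score degrades gracefully."""
    ga, gb = string_qgrams(a, q), string_qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

def fuzzy_match(incoming, reference, threshold=0.7):
    """Return the reference string closest to `incoming`, or None when
    nothing clears the threshold (i.e., the tuple cannot be cleaned)."""
    best = max(reference, key=lambda r: qgram_similarity(incoming, r))
    return best if qgram_similarity(incoming, best) >= threshold else None
```

For example, `fuzzy_match("vacum cleaner", ["vacuum cleaner", "rice cooker"])` still resolves to "vacuum cleaner" despite the misspelling, since most of its q-grams survive the dropped character.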