-
Publication No.: US07293037B2
Publication date: 2007-11-06
Application No.: US11246354
Filing date: 2005-10-07
Applicant: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
Inventor: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
IPC: G06F17/30
CPC classification number: G06F17/30489, G06F17/30536, G06F2216/03, Y10S707/957, Y10S707/99932, Y10S707/99933, Y10S707/99935, Y10S707/99942, Y10S707/99943
Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data.
-
Publication No.: US07287020B2
Publication date: 2007-10-23
Application No.: US09759804
Filing date: 2001-01-12
Applicant: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
Inventor: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
IPC: G06F17/30
CPC classification number: G06F17/30536, G06F17/30489, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99942
Abstract: This disclosure describes leveraging workload information associated with executed database queries for estimating the result of a current database query. The workload information is analyzed to determine the usage of tuples in a database during query execution, such as how often a tuple is accessed and the number of different queries that accessed the tuple. A tuple is assigned a weight value that is based on the analyzed workload information. The particular tuples sampled for estimating a result for the current query are chosen based on each tuple's weight value. The workload information may also be leveraged to generate an outlier index that identifies outlier tuples associated with the executed queries or that identifies outlier tuples associated with particular queries that are executed more frequently than other queries. The result for the current query can also be estimated using the sampled values along with the outlier tuples from the outlier index.
-
Publication No.: US20060053129A1
Publication date: 2006-03-09
Application No.: US10929514
Filing date: 2004-08-30
Applicant: Rajeev Motwani, Surajit Chaudhuri, Venkatesh Ganti
Inventor: Rajeev Motwani, Surajit Chaudhuri, Venkatesh Ganti
IPC: G06F7/00
CPC classification number: G06F17/30303, Y10S707/99932, Y10S707/99933, Y10S707/99937, Y10S707/99942, Y10S707/99943, Y10S707/99945
Abstract: At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.
-
Publication No.: US06842753B2
Publication date: 2005-01-11
Application No.: US09759799
Filing date: 2001-01-12
Applicant: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
Inventor: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
IPC: G06F17/30
CPC classification number: G06F17/30489, G06F17/30536, G06F2216/03, Y10S707/957, Y10S707/99932, Y10S707/99933, Y10S707/99935, Y10S707/99942, Y10S707/99943
Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled in one of many known ways to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data. Further methods involve the use of weighted sampling and weighted selection of outlier values for low selectivity queries, or queries having group by.
-
Publication No.: US06532458B1
Publication date: 2003-03-11
Application No.: US09268590
Filing date: 1999-03-15
Applicant: Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya
Inventor: Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya
IPC: G06F17/30
CPC classification number: G06F17/30536, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99934, Y10S707/99935
Abstract: A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records, such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.
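The three sampling semantics named in this abstract can be illustrated with a minimal Python sketch. It covers only the unweighted, single-relation case and uses textbook techniques (Bernoulli sampling and reservoir sampling), which are assumptions chosen for illustration and not necessarily the patented method. The function names `sample_cf`, `sample_wor`, and `sample_wr` are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_cf(stream: Iterable[T], p: float, rng: random.Random) -> List[T]:
    """Coin flips (CF): keep each record independently with probability p."""
    return [rec for rec in stream if rng.random() < p]


def sample_wor(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Without replacement (WoR): reservoir sampling of k records in one pass."""
    reservoir: List[T] = []
    for i, rec in enumerate(stream):
        if i < k:
            reservoir.append(rec)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = rec     # replace a slot with probability k/(i+1)
    return reservoir


def sample_wr(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """With replacement (WR): k independent size-1 reservoirs in one pass."""
    slots: List[T] = []
    for i, rec in enumerate(stream):
        if i == 0:
            slots = [rec] * k          # the first record fills every slot
            continue
        for s in range(k):
            if rng.randrange(i + 1) == 0:   # probability 1/(i+1) per slot
                slots[s] = rec
    return slots
```

All three functions consume the input once and never index into it, so they also work on a generator, which stands in here for a non-materialized stream of records produced by a query pipeline.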
-
Publication No.: US07567949B2
Publication date: 2009-07-28
Application No.: US10238175
Filing date: 2002-09-10
Applicant: Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya
Inventor: Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya
CPC classification number: G06F17/30536, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99934, Y10S707/99935
Abstract: A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records, such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.
-
Publication No.: US07516149B2
Publication date: 2009-04-07
Application No.: US10929514
Filing date: 2004-08-30
Applicant: Rajeev Motwani, Surajit Chaudhuri, Venkatesh Ganti
Inventor: Rajeev Motwani, Surajit Chaudhuri, Venkatesh Ganti
CPC classification number: G06F17/30303, Y10S707/99932, Y10S707/99933, Y10S707/99937, Y10S707/99942, Y10S707/99943, Y10S707/99945
Abstract: At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.
-
Publication No.: US07493316B2
Publication date: 2009-02-17
Application No.: US11296036
Filing date: 2005-12-07
Applicant: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
Inventor: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
IPC: G06F17/30
CPC classification number: G06F17/30536, G06F17/30489, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99942
Abstract: In a method of estimating the results of a database query, the results are estimated by sampling weighted tuples in a database based on the probability that each tuple is used in executing a workload. A probability is associated with each sampled tuple. An aggregate is computed over the values in the sampled tuples, with each value multiplied by the inverse of the probability associated with its tuple.
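The inverse-probability scaling described in this abstract can be sketched as follows. This is a minimal illustration, assuming independent per-tuple inclusion (Poisson sampling) and a SUM aggregate; the function name `estimate_sum` and the parameter `expected_sample_size` are hypothetical, and the workload-derived weights are simply passed in as numbers.

```python
import random
from typing import Sequence


def estimate_sum(values: Sequence[float],
                 weights: Sequence[float],
                 expected_sample_size: float,
                 rng: random.Random) -> float:
    """Estimate sum(values) from a weighted sample.

    Each tuple i is included independently with probability
    p_i = min(1, expected_sample_size * w_i / sum(w)), and every sampled
    value is divided by p_i so that the estimate is unbiased.
    """
    total_weight = sum(weights)
    estimate = 0.0
    for value, weight in zip(values, weights):
        p = min(1.0, expected_sample_size * weight / total_weight)
        if rng.random() < p:
            estimate += value / p   # multiply by the inverse probability
    return estimate
```

Tuples with higher weights are sampled more often but contribute less per occurrence, which keeps the expected value of the estimate equal to the true sum. When every inclusion probability reaches 1, the estimate equals the exact sum.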
-
Publication No.: US20060085463A1
Publication date: 2006-04-20
Application No.: US11296034
Filing date: 2005-12-07
Applicant: Surajit Chaudhuri, Vivek Narasayya, Rajeev Motwani, Mayur Datar
Inventor: Surajit Chaudhuri, Vivek Narasayya, Rajeev Motwani, Mayur Datar
IPC: G06F7/00
CPC classification number: G06F17/30536, G06F17/30489, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99942
Abstract: An outlier index for a database and a given workload is generated by identifying sub-relations of tuples in the database induced by selection and group by conditions in queries in the workload. A variance is then generated for values in each sub-relation. Sub-relations having higher variances are selected, and outliers from such sub-relations having higher variances are generated.
-
Publication No.: US20060036600A1
Publication date: 2006-02-16
Application No.: US11246355
Filing date: 2005-10-07
Applicant: Surajit Chaudhuri, Vivek Narasayya, Rajeev Motwani, Mayur Datar
Inventor: Surajit Chaudhuri, Vivek Narasayya, Rajeev Motwani, Mayur Datar
IPC: G06F7/00
CPC classification number: G06F17/30489, G06F17/30536, G06F2216/03, Y10S707/957, Y10S707/99932, Y10S707/99933, Y10S707/99935, Y10S707/99942, Y10S707/99943
Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data.
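The outlier-indexing procedure in this abstract can be sketched for a SUM query as follows. This is an illustrative reading, not the patented implementation: it assumes the values fit in memory, that the number of outliers is fixed in advance, and that the remaining data is sampled uniformly without replacement. The function names `split_outliers` and `estimate_sum_with_outliers` are hypothetical.

```python
import random
from statistics import pvariance
from typing import List, Sequence, Tuple


def split_outliers(values: Sequence[float],
                   num_outliers: int) -> Tuple[List[float], List[float]]:
    """Slide a window of len(values) - num_outliers over the sorted values
    and keep the window with the lowest variance; everything outside that
    window is treated as an outlier."""
    if not 0 <= num_outliers < len(values):
        raise ValueError("num_outliers must be in [0, len(values))")
    ordered = sorted(values)
    window = len(ordered) - num_outliers
    best_start = min(range(num_outliers + 1),
                     key=lambda s: pvariance(ordered[s:s + window]))
    outliers = ordered[:best_start] + ordered[best_start + window:]
    rest = ordered[best_start:best_start + window]
    return outliers, rest


def estimate_sum_with_outliers(values: Sequence[float],
                               num_outliers: int,
                               sample_size: int,
                               rng: random.Random) -> float:
    """Aggregate the outliers exactly, sample the rest, extrapolate the
    sample to the size of the rest, and add the two parts."""
    outliers, rest = split_outliers(values, num_outliers)
    sample = rng.sample(rest, min(sample_size, len(rest)))
    scaled = sum(sample) * len(rest) / len(sample) if sample else 0.0
    return sum(outliers) + scaled
```

Recomputing the variance for every window position keeps the sketch short; a running-sum update would avoid the repeated work on large inputs. Removing the extreme values before sampling lowers the variance of the sampled part, which is what makes the extrapolated estimate more stable.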