-
Publication number: US07296011B2
Publication date: 2007-11-13
Application number: US10600083
Filing date: 2003-06-20
CPC classification: G06F17/30542, G06F17/30303, Y10S707/99933
Abstract: To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
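The q-gram idea in this abstract can be illustrated with a minimal sketch. The abstract does not define the disclosed similarity function, so the code below substitutes plain Jaccard overlap of q-gram sets; the names `qgrams`, `qgram_similarity`, and `best_match`, and the default `q=3`, are assumptions made for illustration only.

```python
def qgrams(text, q=3):
    """Return the set of length-q substrings of the lowercased string."""
    s = text.lower()
    if len(s) < q:
        # Strings shorter than q contribute themselves as a single gram.
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a, b, q=3):
    """Jaccard overlap of the two q-gram sets (a stand-in, not the patented function)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


def best_match(input_value, reference, q=3):
    """Return the reference value most similar to the (possibly dirty) input."""
    return max(reference, key=lambda r: qgram_similarity(input_value, r, q))
```

For example, `best_match("microsft corp", ["microsoft corp", "oracle corp"])` selects the reference entry that shares the most q-grams with the misspelled input. A production system would also need an index over reference q-grams to avoid scanning every reference tuple.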
-
Publication number: US07516149B2
Publication date: 2009-04-07
Application number: US10929514
Filing date: 2004-08-30
CPC classification: G06F17/30303, Y10S707/99932, Y10S707/99933, Y10S707/99937, Y10S707/99942, Y10S707/99943, Y10S707/99945
Abstract: At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.
-
Publication number: US20060053129A1
Publication date: 2006-03-09
Application number: US10929514
Filing date: 2004-08-30
IPC classification: G06F7/00
CPC classification: G06F17/30303, Y10S707/99932, Y10S707/99933, Y10S707/99937, Y10S707/99942, Y10S707/99943, Y10S707/99945
Abstract: At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.
-
Publication number: US07363301B2
Publication date: 2008-04-22
Application number: US11246355
Filing date: 2005-10-07
IPC classification: G06F17/30
CPC classification: G06F17/30489, G06F17/30536, G06F2216/03, Y10S707/957, Y10S707/99932, Y10S707/99933, Y10S707/99935, Y10S707/99942, Y10S707/99943
Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data.
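The outlier-plus-sample estimate described in this abstract can be sketched as follows. This is an illustrative reading of the abstract for a SUM query, not the claimed implementation: the function names, the fixed outlier budget `num_outliers`, and the use of uniform sampling without replacement are all assumptions.

```python
import random
import statistics


def split_outliers(values, num_outliers):
    """Sort the values, slide a window of len(values) - num_outliers over them,
    keep the window with the lowest variance, and treat the rest as outliers."""
    if not 0 <= num_outliers < len(values):
        raise ValueError("num_outliers must be in [0, len(values))")
    ordered = sorted(values)
    window = len(ordered) - num_outliers
    best_start = min(
        range(num_outliers + 1),
        key=lambda s: statistics.pvariance(ordered[s:s + window]),
    )
    outliers = ordered[:best_start] + ordered[best_start + window:]
    remaining = ordered[best_start:best_start + window]
    return outliers, remaining


def estimate_sum(values, num_outliers, sample_size, rng):
    """Aggregate the outliers exactly, then add an extrapolated sample of the rest."""
    outliers, remaining = split_outliers(values, num_outliers)
    sample = rng.sample(remaining, min(sample_size, len(remaining)))
    scale = len(remaining) / len(sample)  # extrapolation factor
    return sum(outliers) + scale * sum(sample)
```

Calling `estimate_sum(values, 2, 10, random.Random(0))` on data with two extreme values aggregates those two exactly and samples only the low-variance remainder, which is why the estimate is far less sensitive to skew than a plain uniform sample.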
-
Publication number: US20060085410A1
Publication date: 2006-04-20
Application number: US11296036
Filing date: 2005-12-07
IPC classification: G06F17/30
CPC classification: G06F17/30536, G06F17/30489, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99942
Abstract: The results of a database query are estimated by performing a sampling of weighted tuples in a database based on a probability of usage of tuples required in executing a workload. A probability is associated with each tuple sampled. An aggregate is computed over values in each sampled tuple while multiplying by the inverses of the probabilities associated with each tuple sampled.
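The inverse-probability aggregation in this abstract can be sketched as below. The sketch assumes sampling with replacement and a SUM aggregate, and it takes the per-tuple weights as given; how the weights are derived from workload usage is not stated in the abstract. The name `weighted_sample_sum` is hypothetical.

```python
import random


def weighted_sample_sum(values, weights, sample_size, rng):
    """Estimate sum(values) from a weighted sample drawn with replacement.

    Each sampled value is multiplied by the inverse of its selection
    probability, and the results are averaged over the sample.
    """
    total = sum(weights)
    probs = [w / total for w in weights]  # selection probability per tuple
    picks = rng.choices(range(len(values)), weights=probs, k=sample_size)
    return sum(values[i] / probs[i] for i in picks) / sample_size
```

When the weights happen to be proportional to the values, every term `values[i] / probs[i]` equals the true sum, so the estimate has no sampling error; weights that track how much a tuple contributes to the workload approximate that ideal.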
-
Publication number: US20060053103A1
Publication date: 2006-03-09
Application number: US11246354
Filing date: 2005-10-07
IPC classification: G06F17/30
CPC classification: G06F17/30489, G06F17/30536, G06F2216/03, Y10S707/957, Y10S707/99932, Y10S707/99933, Y10S707/99935, Y10S707/99942, Y10S707/99943
Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data.
-
Publication number: US07567949B2
Publication date: 2009-07-28
Application number: US10238175
Filing date: 2002-09-10
CPC classification: G06F17/30536, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99934, Y10S707/99935
Abstract: A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics, such as with replacement (WR), without replacement (WoR), or independent coin flips (CF). The database server may perform such sampling sequentially not only to sample non-materialized records, such as those produced as a stream by a pipeline in a query tree, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.
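The three sampling semantics named in this abstract can each be realized in a single pass over a stream. The sketch below covers the unweighted case only and uses textbook reservoir sampling rather than the algorithms disclosed in the patent; the function names are assumptions.

```python
import random


def sample_cf(stream, fraction, rng):
    """Coin flips (CF): include each record independently with probability `fraction`."""
    return [record for record in stream if rng.random() < fraction]


def sample_wor(stream, k, rng):
    """Without replacement (WoR): classic reservoir sampling of k distinct records."""
    reservoir = []
    for n, record in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(record)
        else:
            j = rng.randrange(n)  # uniform in [0, n)
            if j < k:
                reservoir[j] = record
    return reservoir


def sample_wr(stream, k, rng):
    """With replacement (WR): k independent size-1 reservoirs over the same pass."""
    slots = [None] * k
    for n, record in enumerate(stream, start=1):
        for i in range(k):
            # Each slot independently adopts the n-th record with probability 1/n.
            if rng.random() < 1.0 / n:
                slots[i] = record
    return slots
```

None of the three functions needs to know the stream length in advance, which is what makes them usable on non-materialized records. On an empty stream, `sample_wr` returns `k` unfilled (`None`) slots.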
-
Publication number: US07493316B2
Publication date: 2009-02-17
Application number: US11296036
Filing date: 2005-12-07
IPC classification: G06F17/30
CPC classification: G06F17/30536, G06F17/30489, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99942
Abstract: In a method of estimating results of a database query, the results are estimated by performing a sampling of weighted tuples in a database based on a probability of usage of tuples required in executing a workload. A probability is associated with each tuple sampled. An aggregate is computed over values in each sampled tuple while multiplying by the inverses of the probabilities associated with each tuple sampled.
-
Publication number: US20060085463A1
Publication date: 2006-04-20
Application number: US11296034
Filing date: 2005-12-07
IPC classification: G06F7/00
CPC classification: G06F17/30536, G06F17/30489, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99942
Abstract: An outlier index for a database and a given workload is generated by identifying sub-relations of tuples in the database induced by selection and group-by conditions in queries in the workload. A variance is then generated for values in each sub-relation. Sub-relations having higher variances are selected, and outliers from such sub-relations having higher variances are generated.
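The variance-ranking step in this abstract can be sketched as follows. The sketch assumes rows are dictionaries, that a single group-by column induces the sub-relations, and that the `top_n` highest-variance groups are the ones selected; the name `high_variance_groups` and all three of those choices are illustrative assumptions, since the abstract does not specify them.

```python
import statistics
from collections import defaultdict


def high_variance_groups(rows, group_key, value_key, top_n):
    """Group rows by `group_key`, compute the variance of `value_key` in each
    group, and return the `top_n` group labels with the highest variance."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    variances = {g: statistics.pvariance(v) for g, v in groups.items()}
    return sorted(variances, key=variances.get, reverse=True)[:top_n]
```

The selected groups are where an outlier index pays off most: the outliers within them would then be extracted and stored separately.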
-
Publication number: US20060036600A1
Publication date: 2006-02-16
Application number: US11246355
Filing date: 2005-10-07
IPC classification: G06F7/00
CPC classification: G06F17/30489, G06F17/30536, G06F2216/03, Y10S707/957, Y10S707/99932, Y10S707/99933, Y10S707/99935, Y10S707/99942, Y10S707/99943
Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data.