-
Publication No.: US08606771B2
Publication date: 2013-12-10
Application No.: US12973909
Filing date: 2010-12-21
Applicants: Arvind Arasu, Parag Agrawal, Kaushik Shriraghav
Inventors: Arvind Arasu, Parag Agrawal, Kaushik Shriraghav
CPC classification: G06F17/30336
Abstract: The claimed subject matter provides a method and a system for the efficient indexing of error-tolerant set containment. An exemplary method comprises obtaining a frequency threshold and a query set. All tokens or token sets within the query set are determined, and then all minimal infrequent tokens or all minimal infrequent token sets of data records are found and used to build an index. The minimal infrequent tokens or minimal infrequent token sets are processed in a fixed order, and then a collection of signatures for each minimal infrequent token or token set is determined.
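As a rough illustration of the indexing idea in the abstract above, the sketch below finds the minimal infrequent token sets of each record and uses them as index keys. It is not the patented method. The function names, the `max_size` cap, and the reading of "infrequent" as "contained in at most `threshold` records" are all assumptions made for the example.

```python
from itertools import combinations


def support(token_set, records):
    """Count the records that contain every token in token_set."""
    return sum(1 for r in records if token_set <= r)


def minimal_infrequent_sets(record, records, threshold, max_size=3):
    """Return the minimal infrequent token sets of one record.

    Assumption: a token set is "infrequent" when at most `threshold`
    records contain it, and "minimal" when no proper subset is infrequent.
    """
    found = []
    tokens = sorted(record)  # fixed processing order, deterministic output
    for size in range(1, max_size + 1):
        for combo in combinations(tokens, size):
            candidate = frozenset(combo)
            # A superset of an already-found infrequent set is not minimal.
            if any(m <= candidate for m in found):
                continue
            if support(candidate, records) <= threshold:
                found.append(candidate)
    return found


def build_index(records, threshold, max_size=3):
    """Map each minimal infrequent token set to the ids of records having it."""
    index = {}
    for rid, record in enumerate(records):
        for s in minimal_infrequent_sets(record, records, threshold, max_size):
            index.setdefault(s, []).append(rid)
    return index
```

With `records = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]` and `threshold = 1`, the only index key is `{"b", "c"}`, pointing at record 2. Every single token and every other pair occurs in at least two records.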
-
Publication No.: US20120158696A1
Publication date: 2012-06-21
Application No.: US12973909
Filing date: 2010-12-21
Applicants: Arvind Arasu, Parag Agrawal, Kaushik Shriraghav
Inventors: Arvind Arasu, Parag Agrawal, Kaushik Shriraghav
IPC classification: G06F17/30
CPC classification: G06F17/30336
Abstract: The claimed subject matter provides a method and a system for the efficient indexing of error-tolerant set containment. An exemplary method comprises obtaining a frequency threshold and a query set. All tokens or token sets within the query set are determined, and then all minimal infrequent tokens or all minimal infrequent token sets of data records are found and used to build an index. The minimal infrequent tokens or minimal infrequent token sets are processed in a fixed order, and then a collection of signatures for each minimal infrequent token or token set is determined.
-
Publication No.: US07865505B2
Publication date: 2011-01-04
Application No.: US11668870
Filing date: 2007-01-30
CPC classification: G06F17/30498, G06F17/30533
Abstract: A machine-implemented system and method that efficiently facilitates and effectuates exact similarity joins between collections of sets. The system and method obtain a collection of sets and a threshold value from an interface and, based at least in part on an identifiable similarity, such as an overlap or intersection, between the collections of sets, an analysis component generates and outputs a candidate pair whose similarity equals or exceeds the threshold value.
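To make the notion of an exact overlap-based similarity join concrete, here is a minimal sketch that uses a generic inverted index to count shared elements. It does not reproduce the technique claimed in the patent. The function name `overlap_join` is hypothetical, and the sketch assumes a threshold of at least 1.

```python
from collections import defaultdict


def overlap_join(sets, threshold):
    """Return all pairs (i, j), i < j, with |sets[i] & sets[j]| >= threshold.

    Assumes threshold >= 1: pairs that share no element are never counted.
    """
    inverted = defaultdict(list)  # element -> ids of sets already processed
    result = []
    for j, current in enumerate(sets):
        counts = defaultdict(int)  # earlier set id -> number of shared elements
        for element in current:
            for i in inverted[element]:
                counts[i] += 1
        result.extend((i, j) for i, c in sorted(counts.items()) if c >= threshold)
        # Register the current set only after probing, so i < j always holds.
        for element in current:
            inverted[element].append(j)
    return result
```

For `[{1, 2, 3}, {2, 3, 4}, {5, 6}]` with a threshold of 2, the only qualifying pair is `(0, 1)`, which shares the elements 2 and 3.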
-
Publication No.: US20120330880A1
Publication date: 2012-12-27
Application No.: US13166831
Filing date: 2011-06-23
Applicants: Arvind Arasu, Kaushik Shriraghav, Jian Li
Inventors: Arvind Arasu, Kaushik Shriraghav, Jian Li
CPC classification: G06F16/24544
Abstract: The claimed subject matter provides a method for data generation. The method includes identifying a generative probability distribution based on one or more cardinality constraints for populating a database table. The method also includes selecting one or more values for a corresponding one or more attributes in the database table based on the generative probability distribution and the cardinality constraints. Additionally, the method includes generating a tuple for the database table. The tuple comprises the one or more values.
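A heavily simplified sketch of constraint-driven data generation follows. It assumes every cardinality constraint has the form "exactly `count` rows have `attribute = value`", that attributes are independent, and that the counts for each attribute sum to the table size. None of these assumptions comes from the abstract, and the function names are hypothetical.

```python
import random


def distribution_from_constraints(table_size, constraints):
    """Turn per-value row counts into per-attribute probabilities.

    constraints: {attribute: {value: required_row_count}}
    returns:     {attribute: {value: probability}}
    """
    return {
        attr: {value: count / table_size for value, count in counts.items()}
        for attr, counts in constraints.items()
    }


def generate_tuple(dist, rng):
    """Sample one value per attribute, independently, from the distribution."""
    row = {}
    for attr, probs in dist.items():
        values = list(probs)
        weights = [probs[v] for v in values]
        row[attr] = rng.choices(values, weights=weights)[0]
    return row
```

Sampling tuples this way meets the constraints only in expectation. A generator that must satisfy them exactly would need an additional correction step.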
-
Publication No.: US20070294221A1
Publication date: 2007-12-20
Application No.: US11424191
Filing date: 2006-06-14
IPC classification: G06F17/30
CPC classification: G06F17/30489, Y10S707/99933, Y10S707/99934
Abstract: The subject disclosure pertains to a powerful and flexible framework for record matching. The framework facilitates design of a record matching query or package composed of a set of well-defined primitive operators (e.g., relational, data cleaning, ...), which can ultimately be executed to match records. To assist design of such packages, a learning technique based on examples is provided. More specifically, a set of matching and non-matching record pairs can be input and employed to facilitate automatic package generation. A generated package can subsequently be transformed manually and/or automatically into a semantically equivalent form optimized for execution.
-
Publication No.: US07707207B2
Publication date: 2010-04-27
Application No.: US11357665
Filing date: 2006-02-17
CPC classification: G06F17/30469, G06Q30/0283
Abstract: The claimed subject matter relates to incorporating a skyline operator within a relational database engine, and more particularly to a database engine that utilizes novel techniques to determine the lowest cost of generating the skyline produced by the skyline operator. The database engine receives queries and associated preferences and, based on a cardinality estimate and a cost estimate, an appropriate skyline generating technique is utilized to produce a skyline representative of the received queries and their associated preferences.
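For readers unfamiliar with the skyline operator, the sketch below computes a skyline with a simple windowed nested-loop scan. It illustrates only what a skyline is, not the cost-based selection among techniques that the abstract describes. The sketch assumes that lower values are preferred in every dimension, and the function names are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q in every dimension and better in at least one.

    Assumption: lower values are preferred in all dimensions.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the points that no other point dominates."""
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated, so it cannot be in the skyline
        # p survives: evict every window point that p dominates.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window
```

Given hotels as `(price, distance)` pairs `[(50, 8), (60, 3), (80, 1), (70, 5), (90, 2)]`, the skyline is `[(50, 8), (60, 3), (80, 1)]`. The point `(70, 5)` is dominated by `(60, 3)`, and `(90, 2)` is dominated by `(80, 1)`.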
-
Publication No.: US07634464B2
Publication date: 2009-12-15
Application No.: US11424191
Filing date: 2006-06-14
CPC classification: G06F17/30489, Y10S707/99933, Y10S707/99934
Abstract: The subject disclosure pertains to a powerful and flexible framework for record matching. The framework facilitates design of a record matching query or package composed of a set of well-defined primitive operators (e.g., relational, data cleaning, ...), which can ultimately be executed to match records. To assist design of such packages, a learning technique based on examples is provided. More specifically, a set of matching and non-matching record pairs can be input and employed to facilitate automatic package generation. A generated package can subsequently be transformed manually and/or automatically into a semantically equivalent form optimized for execution.
-
Publication No.: US20070192297A1
Publication date: 2007-08-16
Application No.: US11558029
Filing date: 2006-11-09
Applicants: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
IPC classification: G06F17/30
CPC classification: G06F16/24535, Y10S707/99932, Y10S707/99933, Y10S707/99934
Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
-
Publication No.: US07454407B2
Publication date: 2008-11-18
Application No.: US11149968
Filing date: 2005-06-10
IPC classification: G06F7/00
CPC classification: G06F17/30522, G06F17/30474, Y10S707/99932, Y10S707/99933, Y10S707/99934, Y10S707/99935
Abstract: Techniques for estimating the progress of database queries are described herein. In a first implementation, a respective lower-bound parameter is associated with each node in an operator tree that represents a given database query, and the progress of the database query at a given point is estimated based upon the lower-bound parameters. In a second implementation, the progress of the query is estimated by associating respective lower-bound and upper-bound parameters with each node in the operator tree. The progress of the query at the given point is then estimated based on the lower-bound and upper-bound parameters.
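One way to picture the second implementation is to bracket the progress of a query between two ratios. The sketch below assumes a work model that the abstract does not state: work is measured in tuples produced, each operator node carries lower and upper bounds on its final tuple count, and the upper bounds are positive. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    processed: int  # tuples this operator has produced so far
    lower: int      # lower bound on its final tuple count
    upper: int      # upper bound on its final tuple count (assumed > 0)
    children: list = field(default_factory=list)


def totals(node):
    """Sum processed counts and both bounds over the whole operator tree."""
    done = node.processed
    lo = max(node.lower, node.processed)  # a node never finishes below its current count
    hi = node.upper
    for child in node.children:
        d, l, h = totals(child)
        done, lo, hi = done + d, lo + l, hi + h
    return done, lo, hi


def progress_bounds(root):
    """Return (pessimistic, optimistic) progress estimates in [0, 1]."""
    done, lo, hi = totals(root)
    return done / hi, done / lo
```

For a scan that has produced 500 of exactly 1000 tuples, feeding a filter that has emitted 100 tuples and will emit between 100 and 500, the progress lies between 600/1500 and 600/1100.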
-
Publication No.: US20070192342A1
Publication date: 2007-08-16
Application No.: US11352141
Filing date: 2006-02-10
IPC classification: G06F7/00
CPC classification: G06F17/30442, Y10S707/99942, Y10S707/99943
Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
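The observation that set overlap can support other similarity functions can be illustrated with Jaccard similarity. For sets A and B with overlap o, J = o / (|A| + |B| - o), so J ≥ t holds exactly when o ≥ t(|A| + |B|) / (1 + t). The sketch below applies that reduction to q-gram sets built from strings. The q-gram construction, the padding character, and the function names are assumptions for illustration, not details taken from the patent.

```python
def qgrams(text, q=3):
    """Build the set of q-grams of a string, padded with '#' at both ends."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard_at_least(a, b, threshold, q=3):
    """Test Jaccard(qgrams(a), qgrams(b)) >= threshold using only set overlap."""
    sa, sb = qgrams(a, q), qgrams(b, q)
    overlap = len(sa & sb)
    # J >= t  <=>  overlap >= t * (|A| + |B|) / (1 + t)
    required = threshold * (len(sa) + len(sb)) / (1 + threshold)
    return overlap >= required
```

For "washington" and "woshington", each string yields 12 trigrams and 9 are shared, so the Jaccard similarity is 9/15 = 0.6. The predicate therefore holds at a threshold of 0.5 and fails at 0.7.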