Efficient indexing of error tolerant set containment
    1.
    Granted patent
    Efficient indexing of error tolerant set containment (in force)

    Publication No.: US08606771B2

    Publication date: 2013-12-10

    Application No.: US12973909

    Filing date: 2010-12-21

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30336

    Abstract: The claimed subject matter provides a method and a system for the efficient indexing of error tolerant set containment. An exemplary method comprises obtaining a frequency threshold and a query set. All tokens or token sets within the query set are determined, and then all minimal infrequent tokens or all minimal infrequent token sets of data records are found and used to build an index. The minimal infrequent tokens or minimal infrequent token sets are processed in a fixed order, and then a collection of signatures for each minimal infrequent token or token set is determined.
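
The index-building step in this abstract can be illustrated with a minimal sketch. It is not the patented procedure: the function name `minimal_infrequent_sets`, the `max_size` cap, and the definition of "infrequent" (at most `threshold` records contain the token set) are all assumptions made for illustration. A token set is treated as minimal when no proper subset is itself infrequent.

```python
from itertools import combinations

def minimal_infrequent_sets(records, threshold, max_size=2):
    """Map each minimal infrequent token set to the ids of records containing it.

    Assumed definitions: a token set is infrequent when at most `threshold`
    records contain it, and minimal when no proper subset is infrequent.
    """
    def support(tokens):
        return [i for i, rec in enumerate(records) if tokens <= rec]

    index = {}
    vocabulary = sorted(set().union(*records))  # fixed processing order
    for size in range(1, max_size + 1):         # smaller sets first
        for combo in combinations(vocabulary, size):
            candidate = frozenset(combo)
            # A candidate containing an indexed infrequent set is not minimal.
            if any(found < candidate for found in index):
                continue
            ids = support(candidate)
            if ids and len(ids) <= threshold:
                index[candidate] = ids
    return index

records = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "d"}]
index = minimal_infrequent_sets(records, threshold=1)
```

With these four records and a threshold of 1, the token `d` and the pair `{b, c}` are each contained in a single record, so they are the only entries in the index. Enumerating all combinations is exponential in `max_size`; the sketch favors clarity over efficiency.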

    EFFICIENT INDEXING OF ERROR TOLERANT SET CONTAINMENT
    2.
    Patent application
    EFFICIENT INDEXING OF ERROR TOLERANT SET CONTAINMENT (in force)

    Publication No.: US20120158696A1

    Publication date: 2012-06-21

    Application No.: US12973909

    Filing date: 2010-12-21

    IPC classification: G06F17/30

    CPC classification: G06F17/30336

    Abstract: The claimed subject matter provides a method and a system for the efficient indexing of error tolerant set containment. An exemplary method comprises obtaining a frequency threshold and a query set. All tokens or token sets within the query set are determined, and then all minimal infrequent tokens or all minimal infrequent token sets of data records are found and used to build an index. The minimal infrequent tokens or minimal infrequent token sets are processed in a fixed order, and then a collection of signatures for each minimal infrequent token or token set is determined.

    Efficient exact set similarity joins
    3.
    Granted patent
    Efficient exact set similarity joins (in force)

    Publication No.: US07865505B2

    Publication date: 2011-01-04

    Application No.: US11668870

    Filing date: 2007-01-30

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30498 G06F17/30533

    Abstract: A machine-implemented system and method that efficiently facilitates and effectuates exact similarity joins between collections of sets. The system and method obtains a collection of sets and a threshold value from an interface. Based at least in part on an identifiable similarity, such as an overlap or intersection, between the sets in the collection, an analysis component generates and outputs a candidate pair whose similarity equals or exceeds the threshold value.
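
As a point of reference for what an exact overlap-based join computes, here is a minimal sketch that uses a plain inverted index. This is a textbook baseline swapped in for illustration, not the method claimed in the patent; the name `overlap_join` and the assumption that `threshold` is at least 1 are both mine.

```python
from collections import defaultdict
from itertools import combinations

def overlap_join(sets, threshold):
    """Return sorted (i, j, overlap) with i < j and |sets[i] & sets[j]| >= threshold.

    Assumes threshold >= 1: pairs that share no token are never generated.
    """
    inverted = defaultdict(list)  # token -> ids of the sets containing it
    for sid, tokens in enumerate(sets):
        for token in tokens:
            inverted[token].append(sid)

    overlap = defaultdict(int)
    for ids in inverted.values():
        # ids were appended in ascending order, so every pair has i < j.
        for i, j in combinations(ids, 2):
            overlap[(i, j)] += 1

    return sorted((i, j, n) for (i, j), n in overlap.items() if n >= threshold)

pairs = overlap_join([{1, 2, 3}, {2, 3, 4}, {5, 6}, {3, 4, 5}], threshold=2)
```

Every pair that shares at least one token is counted, so the result is exact, but frequent tokens make the pair enumeration quadratic. That cost is what more refined join algorithms aim to avoid.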

    SYNTHETIC DATA GENERATION
    4.
    Patent application
    SYNTHETIC DATA GENERATION (under examination, published)

    Publication No.: US20120330880A1

    Publication date: 2012-12-27

    Application No.: US13166831

    Filing date: 2011-06-23

    IPC classification: G06F17/30 G06N5/02

    CPC classification: G06F16/24544

    Abstract: The claimed subject matter provides a method for data generation. The method includes identifying a generative probability distribution based on one or more cardinality constraints for populating a database table. The method also includes selecting one or more values for a corresponding one or more attributes in the database table based on the generative probability distribution and the cardinality constraints. Additionally, the method includes generating a tuple for the database table. The tuple comprises the one or more values.
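
The three steps named in this abstract (derive a distribution from cardinality constraints, select attribute values, emit tuples) can be sketched very roughly as follows. Everything here is a simplifying assumption rather than the patent's method: constraints are given per attribute as value counts that sum to the table size, attributes are sampled independently, and the function names are hypothetical.

```python
import random

def distribution_from_cardinalities(total_rows, value_counts):
    """Turn 'value v appears in c of total_rows rows' into probabilities."""
    return {value: count / total_rows for value, count in value_counts.items()}

def generate_tuples(total_rows, constraints, seed=0):
    """Sample total_rows tuples, one value per constrained attribute."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    columns = {}
    for attr, value_counts in constraints.items():
        dist = distribution_from_cardinalities(total_rows, value_counts)
        values, weights = zip(*dist.items())
        columns[attr] = rng.choices(values, weights=weights, k=total_rows)
    attrs = list(constraints)
    return [tuple(columns[a][i] for a in attrs) for i in range(total_rows)]

constraints = {
    "country": {"US": 70, "CA": 30},
    "tier": {"free": 90, "paid": 10},
}
rows = generate_tuples(100, constraints)
```

Because values are drawn at random, the generated table meets the cardinality constraints only in expectation, not exactly.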

    DESIGNING RECORD MATCHING QUERIES UTILIZING EXAMPLES
    5.
    Patent application
    DESIGNING RECORD MATCHING QUERIES UTILIZING EXAMPLES (in force)

    Publication No.: US20070294221A1

    Publication date: 2007-12-20

    Application No.: US11424191

    Filing date: 2006-06-14

    IPC classification: G06F17/30

    Abstract: The subject disclosure pertains to a powerful and flexible framework for record matching. The framework facilitates design of a record matching query or package composed of a set of well-defined primitive operators (e.g., relational, data cleaning, ...), which can ultimately be executed to match records. To assist design of such packages, a learning technique based on examples is provided. More specifically, a set of matching and non-matching record pairs can be input and employed to facilitate automatic package generation. A generated package can subsequently be transformed manually and/or automatically into a semantically equivalent form optimized for execution.

    Robust cardinality and cost estimation for skyline operator
    6.
    Granted patent
    Robust cardinality and cost estimation for skyline operator (in force)

    Publication No.: US07707207B2

    Publication date: 2010-04-27

    Application No.: US11357665

    Filing date: 2006-02-17

    IPC classification: G06F17/30 G06F15/16

    CPC classification: G06F17/30469 G06Q30/0283

    Abstract: The claimed subject matter relates to incorporating a skyline operator within a relational database engine, and more particularly to a database engine that utilizes novel techniques to determine the lowest cost of generating the skyline produced by the skyline operator. The database engine receives queries and associated preferences and, based on a cardinality estimate and a cost estimate, an appropriate skyline generating technique is utilized to produce a skyline representative of the received queries and their associated preferences.
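
For readers unfamiliar with the skyline operator itself, a naive quadratic computation is sketched below. It shows only what the operator returns, not the cardinality or cost estimation the patent covers. The convention that smaller values are preferred in every dimension is an assumption, and the function names are illustrative.

```python
def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one.

    Assumes smaller values are preferred in all dimensions.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Return the points not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (price, distance) pairs: lower is better for both.
hotels = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
best = skyline(hotels)
```

Here `(3, 3)` and `(4, 4)` are both dominated by `(2, 2)`, so the skyline keeps the other three points. The size of that result is the cardinality an optimizer would need to estimate before choosing among skyline algorithms.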

    Designing record matching queries utilizing examples
    7.
    Granted patent
    Designing record matching queries utilizing examples (in force)

    Publication No.: US07634464B2

    Publication date: 2009-12-15

    Application No.: US11424191

    Filing date: 2006-06-14

    IPC classification: G06F17/30 G06F7/00

    Abstract: The subject disclosure pertains to a powerful and flexible framework for record matching. The framework facilitates design of a record matching query or package composed of a set of well-defined primitive operators (e.g., relational, data cleaning, ...), which can ultimately be executed to match records. To assist design of such packages, a learning technique based on examples is provided. More specifically, a set of matching and non-matching record pairs can be input and employed to facilitate automatic package generation. A generated package can subsequently be transformed manually and/or automatically into a semantically equivalent form optimized for execution.

    MINIMAL DIFFERENCE QUERY AND VIEW MATCHING
    8.
    Patent application
    MINIMAL DIFFERENCE QUERY AND VIEW MATCHING (under examination, published)

    Publication No.: US20070192297A1

    Publication date: 2007-08-16

    Application No.: US11558029

    Filing date: 2006-11-09

    IPC classification: G06F17/30

    Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching to cases where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.

    Techniques for estimating progress of database queries
    9.
    Granted patent
    Techniques for estimating progress of database queries (in force)

    Publication No.: US07454407B2

    Publication date: 2008-11-18

    Application No.: US11149968

    Filing date: 2005-06-10

    IPC classification: G06F7/00

    Abstract: Techniques for estimating the progress of database queries are described herein. In a first implementation, a respective lower-bound parameter is associated with each node in an operator tree that represents a given database query, and the progress of the database query at a given point is estimated based upon the lower-bound parameters. In a second implementation, the progress of the query is estimated by associating respective lower-bound and upper-bound parameters with each node in the operator tree. The progress of the query at the given point is then estimated based on the lower-bound and upper-bound parameters.
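
One way to picture per-node bounds on an operator tree is sketched below. The abstract does not give the estimation formula, so this is an assumed model: each node tracks tuples produced so far plus lower and upper bounds on its eventual total, and progress is bracketed by dividing the work done by the summed bounds. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class OpNode:
    name: str
    done: int    # tuples produced so far
    lower: int   # lower bound on total tuples this node will produce
    upper: int   # upper bound on total tuples this node will produce
    children: list = field(default_factory=list)

def totals(node):
    """Sum (done, lower, upper) over the subtree rooted at node."""
    done, lo, hi = node.done, node.lower, node.upper
    for child in node.children:
        c_done, c_lo, c_hi = totals(child)
        done, lo, hi = done + c_done, lo + c_lo, hi + c_hi
    return done, lo, hi

def progress_bounds(root):
    """Return (pessimistic, optimistic) progress fractions.

    Assumes the summed bounds are positive and done <= lower <= upper per node.
    """
    done, lo, hi = totals(root)
    return done / hi, min(1.0, done / lo)

scan = OpNode("scan", done=500, lower=1000, upper=1000)
join = OpNode("join", done=100, lower=500, upper=3000, children=[scan])
low, high = progress_bounds(join)
```

In this example 600 tuples have been produced against a total between 1500 and 4000, so progress lies between 15% and 40%. Tightening the bounds as execution proceeds would narrow that interval.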

    Primitive operator for similarity joins in data cleaning
    10.
    Patent application
    Primitive operator for similarity joins in data cleaning (in force)

    Publication No.: US20070192342A1

    Publication date: 2007-08-16

    Application No.: US11352141

    Filing date: 2006-02-10

    IPC classification: G06F7/00

    Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
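
The observation that set overlap can support other similarity functions can be made concrete for Jaccard similarity. Since J(A, B) = |A ∩ B| / (|A| + |B| − |A ∩ B|), the condition J(A, B) ≥ t is equivalent to |A ∩ B| · (1 + t) ≥ t · (|A| + |B|). The sketch below applies that identity. The q-gram tokenizer and the function names are illustrative assumptions, not the SSJoin operator's actual interface.

```python
def qgrams(text, q=3):
    """Set of lowercase character q-grams of text (one way to build the 'sets')."""
    text = text.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard_at_least(a, b, t):
    """True when Jaccard(a, b) >= t, decided from |a|, |b| and |a & b| only."""
    overlap = len(a & b)
    return overlap * (1 + t) >= t * (len(a) + len(b))

left = qgrams("microsoft corp")
right = qgrams("microsoft corporation")
close_enough = jaccard_at_least(left, right, 0.6)
too_strict = jaccard_at_least(left, right, 0.7)
```

The two strings share 12 of 19 distinct trigrams, a Jaccard similarity of about 0.63, so the predicate holds at 0.6 and fails at 0.7. Expressing the threshold as an overlap bound is what lets a single overlap-based operator serve several similarity functions.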