EXAMPLE-DRIVEN DESIGN OF EFFICIENT RECORD MATCHING QUERIES
    1.
    Patent application
    Status: In force

    Publication No.: US20080306945A1

    Publication date: 2008-12-11

    Application No.: US11758202

    Filing date: 2007-06-05

    IPC classification: G06F17/30

    CPC classification: G06F17/30533 G06F17/30495

    Abstract: Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs be executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that, any pair of matching records have a higher similarity value than non-matching record pairs on at least one similarity function.
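
    The monotonicity idea in this abstract can be illustrated with a minimal sketch. It is not the patented operator-tree design: it assumes a matching rule that is a plain conjunction of per-function thresholds, and it sets each threshold to the lowest score seen on the positive examples. The similarity functions (`jaccard_tokens`, `prefix_sim`) and the example pairs are hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]
SimFn = Callable[[str, str], float]


def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def prefix_sim(a: str, b: str) -> float:
    """Length of the common prefix divided by the longer string's length."""
    a, b = a.lower(), b.lower()
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    longest = max(len(a), len(b))
    return common / longest if longest else 1.0


def learn_thresholds(positives: Sequence[Pair], sims: Sequence[SimFn]) -> List[float]:
    """For each similarity function, take the lowest score among the positive examples."""
    return [min(f(a, b) for a, b in positives) for f in sims]


def matches(pair: Pair, sims: Sequence[SimFn], thresholds: Sequence[float]) -> bool:
    """A pair matches when every similarity function reaches its threshold."""
    a, b = pair
    return all(f(a, b) >= t for f, t in zip(sims, thresholds))


# Hypothetical labelled examples.
positives = [("Microsoft Corp", "Microsoft Corporation"), ("IBM Corp", "IBM Corp")]
negatives = [("Microsoft Corp", "Oracle Corp")]

sims = [jaccard_tokens, prefix_sim]
thresholds = learn_thresholds(positives, sims)
```

    In this toy data the negative pair scores as high as a positive pair on token overlap but lower on prefix similarity, so the conjunction rejects it. That is the sense in which a non-matching pair falls below matching pairs on at least one similarity function.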

    Leveraging constraints for deduplication
    2.
    Granted patent
    Status: In force

    Publication No.: US08204866B2

    Publication date: 2012-06-19

    Application No.: US11804400

    Filing date: 2007-05-18

    IPC classification: G06F17/30

    CPC classification: G06F17/30489

    Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication is accomplished using only as many of these constraints that are satisfied rather than be imposed inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication all SQL (structured query language) aggregates are allowed, including summation.
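
    The threshold-raising search described in this abstract can be sketched roughly as follows. This is a simplification, not the patented algorithm: it compares whole partitions produced at a few fixed thresholds instead of choosing among splits group by group, and it scores a partition by counting the groups that satisfy a summation constraint. The records, the similarity table `SIM`, and the set `EXPECTED_TOTALS` are invented for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def components(n: int, sim: Callable[[int, int], float], threshold: float) -> List[List[int]]:
    """Group record indices linked by similarity >= threshold (union-find)."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if sim(i, j) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def best_partition(
    n: int,
    sim: Callable[[int, int], float],
    thresholds: Sequence[float],
    satisfied: Callable[[List[int]], bool],
) -> Tuple[List[List[int]], int]:
    """Raise the threshold step by step and keep the partition with most satisfied groups."""
    best, best_score = [], -1
    for t in sorted(thresholds):
        partition = components(n, sim, t)
        score = sum(1 for group in partition if satisfied(group))
        if score > best_score:
            best, best_score = partition, score
    return best, best_score


# Hypothetical data: (customer name, billed amount).
records = [("Ann Lee", 60), ("Anne Lee", 40), ("Ann Li", 100), ("Bob Ray", 50)]
SIM = {(0, 1): 0.9, (0, 2): 0.6, (1, 2): 0.5}  # assumed pairwise textual similarity
EXPECTED_TOTALS = {100, 50}                    # assumed totals from a trusted source


def sim(i: int, j: int) -> float:
    return SIM.get((min(i, j), max(i, j)), 0.1)


def total_is_expected(group: List[int]) -> bool:
    # Aggregate (SUM) constraint: the group's billed total must be an expected total.
    return sum(records[i][1] for i in group) in EXPECTED_TOTALS


partition, score = best_partition(len(records), sim, [0.4, 0.7, 0.95], total_is_expected)
```

    With these numbers the coarse partition at 0.4 merges three records into one group whose total violates the constraint, the partition at 0.7 satisfies it for all three groups, and the all-singleton partition at 0.95 satisfies it for only two.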

    Example-driven design of efficient record matching queries
    3.
    Granted patent
    Status: In force

    Publication No.: US08046339B2

    Publication date: 2011-10-25

    Application No.: US11758202

    Filing date: 2007-06-05

    IPC classification: G06F17/30

    CPC classification: G06F17/30533 G06F17/30495

    Abstract: Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs be executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that, any pair of matching records have a higher similarity value than non-matching record pairs on at least one similarity function.

    Leveraging constraints for deduplication
    4.
    Patent application
    Status: In force

    Publication No.: US20080288482A1

    Publication date: 2008-11-20

    Application No.: US11804400

    Filing date: 2007-05-18

    IPC classification: G06F17/30

    CPC classification: G06F17/30489

    Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication is accomplished using only as many of these constraints that are satisfied rather than be imposed inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication all SQL (structured query language) aggregates are allowed, including summation.

    LEARNING STRING TRANSFORMATIONS FROM EXAMPLES
    5.
    Patent application
    Status: In force

    Publication No.: US20110038531A1

    Publication date: 2011-02-17

    Application No.: US12492311

    Filing date: 2009-08-14

    IPC classification: G06K9/62

    CPC classification: G06F17/2765

    Abstract: Techniques are described to leverage a set of sample or example matched pairs of strings to learn string transformation rules, which may be used to match data records that are semantically equivalent. In one embodiment, matched pairs of input strings are accessed. For a set of matched pairs, a set of one or more string transformation rules are learned. A transformation rule may include two strings determined to be semantically equivalent. The transformation rules are used to determine whether a first and second string match each other.
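
    One simple way to picture rule learning from matched pairs is sketched below. It is an assumption-laden illustration, not the method claimed in the patent: each pair is tokenized, the shared tokens are discarded, and whatever remains on each side becomes a candidate rule that is kept only if enough pairs support it. The function names, the `min_support` parameter, and the sample pairs are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Set, Tuple

Rule = Tuple[str, str]


def candidate_rule(a: str, b: str) -> Optional[Rule]:
    """Drop tokens the two strings share; the leftovers form a candidate rule."""
    ta, tb = a.lower().split(), b.lower().split()
    left = [t for t in ta if t not in tb]
    right = [t for t in tb if t not in ta]
    if left and right:
        return (" ".join(left), " ".join(right))
    return None


def learn_rules(pairs: Iterable[Tuple[str, str]], min_support: int = 2) -> Set[Rule]:
    """Keep candidate rules that are supported by at least `min_support` matched pairs."""
    counts: Counter = Counter()
    for a, b in pairs:
        rule = candidate_rule(a, b)
        if rule is not None:
            counts[rule] += 1
    return {rule for rule, n in counts.items() if n >= min_support}


# Hypothetical matched pairs.
matched_pairs = [
    ("Robert Smith", "Bob Smith"),
    ("Robert Jones", "Bob Jones"),
    ("Intl Business Machines", "International Business Machines"),
    ("Acme Intl", "Acme International"),
    ("Foo Ltd", "Foo Limited"),
]

rules = learn_rules(matched_pairs, min_support=2)
```

    The support threshold is what separates a recurring equivalence such as "robert" and "bob" from a one-off difference; here the "ltd" and "limited" candidate appears only once and is dropped.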

    TRANSFORMATION-BASED FRAMEWORK FOR RECORD MATCHING
    6.
    Patent application
    Status: In force

    Publication No.: US20090210418A1

    Publication date: 2009-08-20

    Application No.: US12031715

    Filing date: 2008-02-15

    IPC classification: G06F17/30

    Abstract: A transformation-based record matching technique. The technique provides a flexible way to account for synonyms and more general forms of string equivalences when performing record matching by taking as explicit input user-defined transformation rules (such as, for example, the fact that “Robert” and “Bob” that are synonymous). The input string and user-defined transformation rules are used to generate a larger set of strings which are used when performing record matching. Both the input string and data elements in a database can be transformed using the user-defined transformation rules in order to generate a larger set of potential record matches. These potential record matches can then be subjected to a threshold test in order to determine one or more best matches. Additionally, signature-based similarity functions are used to improve the computational efficiency of the technique.
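
    The expand-then-threshold flow in this abstract can be sketched as below. The sketch assumes single-token rewrite rules and token-level Jaccard similarity, and it compares every variant of one string with every variant of the other; the signature-based speed-up mentioned in the abstract is not modelled. All names and the 0.8 threshold are illustrative assumptions.

```python
from itertools import product
from typing import Iterable, Set, Tuple

Rule = Tuple[str, str]  # (token, equivalent token)


def expand(s: str, rules: Iterable[Rule]) -> Set[str]:
    """Generate every variant of `s` obtained by optionally rewriting each token."""
    rules = list(rules)
    alternatives = []
    for token in s.lower().split():
        options = {token}
        for lhs, rhs in rules:
            if token == lhs:
                options.add(rhs)
        alternatives.append(sorted(options))
    return {" ".join(choice) for choice in product(*alternatives)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def best_similarity(a: str, b: str, rules: Iterable[Rule]) -> float:
    """Highest similarity over all pairs of variants of the two strings."""
    rules = list(rules)
    return max(jaccard(x, y) for x in expand(a, rules) for y in expand(b, rules))


def is_match(a: str, b: str, rules: Iterable[Rule], threshold: float = 0.8) -> bool:
    """Threshold test on the best similarity found after transformation."""
    return best_similarity(a, b, rules) >= threshold


user_rules = [("bob", "robert")]  # hypothetical user-defined rule
```

    For example, `is_match("Bob Smith", "Robert Smith", user_rules)` holds because one variant of the first string equals the second string, whereas the same call with an empty rule list does not pass the threshold.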

    STOP-AND-RESTART STYLE EXECUTION FOR LONG RUNNING DECISION SUPPORT QUERIES
    7.
    Patent application
    Status: Under examination (published)

    Publication No.: US20090083238A1

    Publication date: 2009-03-26

    Application No.: US11859046

    Filing date: 2007-09-21

    IPC classification: G06F17/30

    CPC classification: G06F16/24561

    Abstract: Stop-and-restart query execution that partially leverages the work already performed during the initial execution of the query to reduce the execution time during a restart. The technique selectively saves information from a previous execution of the query so that the overhead associated with restarting the query execution can be bounded. Despite saving only limited information, the disclosed technique substantially reduces the running time of the restarted query. The stop-and-restart query execution technique is constrained to save and reuse only a bounded number of records (intermediate records or output records) thereby releasing all other resources, rather than some of the resources. The technique chooses a subset of the records to save that were found during normal execution and then skipping the corresponding records when performing a scan during restart to prevent the duplication of execution. A skip-scan operator is employed to facilitate the disclosed restart technique.
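
    A toy version of the save-and-skip idea might look like the following. It assumes the query is a single filter over an in-memory list and that only a contiguous prefix of the scan can be skipped, which is far simpler than a real query plan. `RestartState`, `execute`, and the `budget` parameter are hypothetical names, not the patent's skip-scan operator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class RestartState:
    skip_upto: int = 0                              # rows [0, skip_upto) need no rescan
    saved: List[int] = field(default_factory=list)  # bounded set of saved output records


def execute(
    rows: Sequence[int],
    predicate: Callable[[int], bool],
    state: Optional[RestartState] = None,
    stop_after: Optional[int] = None,
    budget: int = 0,
) -> Tuple[List[int], RestartState, int, bool]:
    """Run a filter scan; returns (outputs, restart state, rows evaluated, finished)."""
    state = state or RestartState()
    outputs = list(state.saved)  # reuse records saved by the earlier run
    new_state = RestartState(state.skip_upto, list(state.saved))
    evaluated = 0
    for pos in range(state.skip_upto, len(rows)):  # scan starts past the skipped prefix
        if stop_after is not None and evaluated >= stop_after:
            return outputs, new_state, evaluated, False  # stopped early
        evaluated += 1
        keep = predicate(rows[pos])
        if keep:
            outputs.append(rows[pos])
        # Extend the skippable prefix only while it stays contiguous and
        # the records it produced still fit in the save budget.
        if new_state.skip_upto == pos:
            if not keep:
                new_state.skip_upto = pos + 1
            elif len(new_state.saved) < budget:
                new_state.saved.append(rows[pos])
                new_state.skip_upto = pos + 1
    return outputs, new_state, evaluated, True


rows = list(range(1, 11))


def divisible_by_three(x: int) -> bool:
    return x % 3 == 0


# First run is stopped after 7 rows and may save at most one record.
partial, saved_state, first_cost, finished = execute(
    rows, divisible_by_three, stop_after=7, budget=1
)
# The restart reuses the saved record and skips the rows it covers.
final, _, restart_cost, done = execute(rows, divisible_by_three, state=saved_state)
```

    The budget bounds what is kept between runs: with room for one record, the restart here can skip the first five rows and rescans only the remaining five instead of all ten.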

    Learning string transformations from examples
    8.
    Granted patent
    Status: In force

    Publication No.: US08249336B2

    Publication date: 2012-08-21

    Application No.: US12492311

    Filing date: 2009-08-14

    IPC classification: G06K9/00

    CPC classification: G06F17/2765

    Abstract: Techniques are described to leverage a set of sample or example matched pairs of strings to learn string transformation rules, which may be used to match data records that are semantically equivalent. In one embodiment, matched pairs of input strings are accessed. For a set of matched pairs, a set of one or more string transformation rules are learned. A transformation rule may include two strings determined to be semantically equivalent. The transformation rules are used to determine whether a first and second string match each other.

    ERROR TOLERANT AUTOCOMPLETION
    9.
    Patent application
    Status: Under examination (published)

    Publication No.: US20100325136A1

    Publication date: 2010-12-23

    Application No.: US12490288

    Filing date: 2009-06-23

    IPC classification: G06F17/30 G06F3/048

    CPC classification: G06F17/276

    Abstract: Techniques for error-tolerant autocompletion are described. While displaying characters of an input string as they are inputted by a user, when a character is added to the input string by the user, matching strings may be selected from among a set of candidate strings by determining which of the candidate strings have a prefix whose characters match the characters of the input string within a given edit distance of the input string.
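
    The matching condition in this abstract, a candidate prefix within a given edit distance of the input, can be checked with a standard dynamic program. The sketch below recomputes the distance for every candidate on each call, which is a brute-force assumption; the abstract does not say how candidates are indexed, and the function names and sample candidates are hypothetical.

```python
from typing import Iterable, List


def prefix_edit_distance(query: str, candidate: str) -> int:
    """Smallest Levenshtein distance between `query` and any prefix of `candidate`."""
    # prev[j] = distance between the query prefix read so far and candidate[:j]
    prev = list(range(len(candidate) + 1))
    for i, qc in enumerate(query, 1):
        cur = [i]
        for j, cc in enumerate(candidate, 1):
            cur.append(
                min(
                    prev[j] + 1,               # drop a query character
                    cur[j - 1] + 1,            # insert a candidate character
                    prev[j - 1] + (qc != cc),  # match or substitute
                )
            )
        prev = cur
    # The last row holds the distance to every candidate prefix; take the best one.
    return min(prev)


def complete(query: str, candidates: Iterable[str], max_dist: int = 1) -> List[str]:
    """Return the candidates that have a prefix within `max_dist` edits of `query`."""
    q = query.lower()
    return [c for c in candidates if prefix_edit_distance(q, c.lower()) <= max_dist]


cities = ["seattle", "search", "saturn", "boston"]  # hypothetical candidate strings
```

    Taking the minimum over the final row is what makes this a prefix match: the user has typed only part of the string, so the remainder of the candidate must not count as errors.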

    Keyword Searching On Database Views
    10.
    Patent application
    Status: Under examination (published)

    Publication No.: US20100299367A1

    Publication date: 2010-11-25

    Application No.: US12469399

    Filing date: 2009-05-20

    IPC classification: G06F17/30

    Abstract: A keyword search is executed on a view of a database based on a Boolean keyword query. The view includes multiple text columns, and the keyword search is executed on each of the multiple text columns in the view. The output results from the keyword search on each of the text columns include tuple identifiers of one or more relevant tuples and a relevancy score for ranking the results of the keyword query.
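
    As a rough illustration of per-column search followed by a Boolean test, the sketch below treats the view as a list of dictionaries, restricts the Boolean query to required (`all_of`) and excluded (`none_of`) terms, and uses summed term counts as the relevancy score. These are all simplifying assumptions; the abstract does not specify the scoring function or the query syntax.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def search_column(rows: Sequence[dict], column: str, terms: Sequence[str]) -> Dict[int, Dict[str, int]]:
    """Search one text column; map tuple id -> counts of the query terms found there."""
    hits: Dict[int, Dict[str, int]] = {}
    for row in rows:
        counts = Counter(row[column].lower().split())
        found = {t: counts[t] for t in terms if counts[t]}
        if found:
            hits[row["id"]] = found
    return hits


def keyword_search(
    rows: Sequence[dict],
    text_columns: Sequence[str],
    all_of: Sequence[str],
    none_of: Sequence[str] = (),
) -> List[Tuple[int, int]]:
    """Return (tuple id, score) pairs that satisfy the Boolean query, best first."""
    required = [t.lower() for t in all_of]
    excluded = [t.lower() for t in none_of]
    merged = defaultdict(Counter)
    for column in text_columns:  # the search runs once per text column
        for tid, found in search_column(rows, column, required + excluded).items():
            merged[tid].update(found)
    results = []
    for tid, counts in merged.items():
        if all(counts[t] for t in required) and not any(counts[t] for t in excluded):
            results.append((tid, sum(counts[t] for t in required)))
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical view with two text columns.
view = [
    {"id": 1, "title": "database systems", "body": "query processing in database engines"},
    {"id": 2, "title": "query optimization", "body": "cost models"},
    {"id": 3, "title": "database views", "body": "materialized views and query rewriting legacy"},
]

ranked = keyword_search(view, ["title", "body"], all_of=["database", "query"], none_of=["legacy"])
```

    Merging the per-column hits before the Boolean test lets a tuple qualify when its required terms are spread across different columns, as with tuple 3 when no term is excluded.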
