Efficient fuzzy match for evaluating data records
    1.
    Granted invention patent
    Efficient fuzzy match for evaluating data records (in force)

    Publication number: US07296011B2

    Publication date: 2007-11-13

    Application number: US10600083

    Filing date: 2003-06-20

    IPC classification: G06F7/00 G06F17/30

    Abstract: To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
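To make the idea of matching an incoming tuple against a reference relation with q-grams concrete, here is a minimal sketch. It is not the similarity function disclosed in the patent, which the abstract does not spell out; it substitutes plain Jaccard similarity over the q-grams of each token. The function names, the default `q=3`, and the threshold of 0.5 are all illustrative assumptions.

```python
def qgrams(text, q=3):
    """Return the set of q-grams taken from each whitespace-separated token.

    Tokens no longer than q are kept whole (assumption: no padding characters).
    """
    grams = set()
    for token in text.lower().split():
        if len(token) <= q:
            grams.add(token)
        else:
            grams.update(token[i:i + q] for i in range(len(token) - q + 1))
    return grams


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_match(input_tuple, reference, threshold=0.5, q=3):
    """Return (closest reference tuple, score), or (None, score) below threshold.

    This is a linear scan for clarity; it makes no attempt at the efficiency
    the abstract claims for the disclosed system.
    """
    query = qgrams(" ".join(input_tuple), q)
    best, best_score = None, 0.0
    for ref in reference:
        score = jaccard(query, qgrams(" ".join(ref), q))
        if score > best_score:
            best, best_score = ref, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Hypothetical reference relation and a misspelled incoming tuple.
reference = [
    ("Boeing Company", "Seattle"),
    ("Bon Corporation", "Seattle"),
    ("Companions", "Seattle"),
]
match, score = fuzzy_match(("Beoing Company", "Seattle"), reference)
```

Because the q-grams are taken inside each token, a transposition such as "Beoing" only disturbs the few q-grams that span the typo, so the correct reference tuple still shares most of its q-grams with the input.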

    INTEGRATED FUZZY JOINS IN DATABASE MANAGEMENT SYSTEMS
    3.
    Invention application
    INTEGRATED FUZZY JOINS IN DATABASE MANAGEMENT SYSTEMS (in force)

    Publication number: US20130091120A1

    Publication date: 2013-04-11

    Application number: US13253315

    Filing date: 2011-10-05

    IPC classification: G06F17/30

    CPC classification: G06F17/30303 G06F17/30533

    Abstract: A fuzzy joins system that is integrated in a database system generates fuzzy joins between records from two datasets. The fuzzy joins system includes a tokenizer to generate tokens for data records and a transformer to find transforms for the tokens. The fuzzy joins system invokes a signature generator, running within a runtime layer of the database system, to generate signatures for data records based on the tokens and their transforms. Subsequently, an equi-join operation joins the records from the two datasets with at least one equal signature. A similarity calculator, running within a runtime layer of the database system, computes a similarity measure using the token information of the joined records. If the similarity measure for any two records is above a threshold, the fuzzy joins system generates a fuzzy join between such two records.
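The pipeline in this abstract (tokenize, transform, generate signatures, equi-join on signatures, score, threshold) can be sketched in a few lines. This is an illustrative approximation only: the abstract does not specify the signature scheme or the similarity measure, so the sketch assumes the simplest choices, using every transformed token as a signature and Jaccard similarity over token sets. The `TRANSFORMS` table, the function names, and the 0.6 threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical transform table mapping abbreviations to canonical tokens.
TRANSFORMS = {"corp": "corporation", "intl": "international"}


def tokenize(record):
    """Lowercase the record, drop simple punctuation, split on whitespace."""
    return record.lower().replace(",", " ").replace(".", " ").split()


def transform(tokens):
    """Apply the transform table and return the resulting token set."""
    return {TRANSFORMS.get(t, t) for t in tokens}


def signatures(tokens):
    """Assumed signature scheme: each transformed token is one signature."""
    return set(tokens)


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def fuzzy_join(left, right, threshold=0.6):
    """Return (left_record, right_record, score) for pairs above threshold."""
    left_tokens = [transform(tokenize(r)) for r in left]
    right_tokens = [transform(tokenize(r)) for r in right]

    # Index the right side by signature so the candidate step is an equi-join.
    index = defaultdict(set)
    for j, toks in enumerate(right_tokens):
        for sig in signatures(toks):
            index[sig].add(j)

    results = []
    for i, toks in enumerate(left_tokens):
        # Candidates are right records sharing at least one signature.
        candidates = set()
        for sig in signatures(toks):
            candidates |= index.get(sig, set())
        for j in sorted(candidates):
            score = jaccard(toks, right_tokens[j])
            if score > threshold:
                results.append((left[i], right[j], score))
    return results


# Hypothetical datasets.
left = ["Acme Corp.", "Globex Intl"]
right = ["ACME Corporation", "Globex International Ltd", "Initech"]
pairs = fuzzy_join(left, right)
```

The point of the signature step is that only records sharing a signature are ever scored, so the similarity calculation runs on the output of an ordinary equi-join rather than on every pair of records.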

    Data Services for Enterprises Leveraging Search System Data Assets
    4.
    Invention application
    Data Services for Enterprises Leveraging Search System Data Assets (under examination, published)

    Publication number: US20130346464A1

    Publication date: 2013-12-26

    Application number: US13527601

    Filing date: 2012-06-20

    IPC classification: G06F15/16

    CPC classification: G06Q10/10

    Abstract: A data service system is described herein which processes raw data assets from at least one network-accessible system (such as a search system) to produce processed data assets. Enterprise applications can then leverage the processed data assets to perform various environment-specific tasks. In one implementation, the data service system can generate any of: synonym resources for use by an enterprise application in providing synonyms for specified terms associated with entities; augmentation resources for use by an enterprise application in providing supplemental information for specified seed information; and spelling-correction resources for use by an enterprise application in providing spelling information for specified terms, and so on.

    Integrated fuzzy joins in database management systems
    5.
    Granted invention patent
    Integrated fuzzy joins in database management systems (in force)

    Publication number: US09317544B2

    Publication date: 2016-04-19

    Application number: US13253315

    Filing date: 2011-10-05

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30303 G06F17/30533

    Abstract: A fuzzy joins system that is integrated in a database system generates fuzzy joins between records from two datasets. The fuzzy joins system includes a tokenizer to generate tokens for data records and a transformer to find transforms for the tokens. The fuzzy joins system invokes a signature generator, running within a runtime layer of the database system, to generate signatures for data records based on the tokens and their transforms. Subsequently, an equi-join operation joins the records from the two datasets with at least one equal signature. A similarity calculator, running within a runtime layer of the database system, computes a similarity measure using the token information of the joined records. If the similarity measure for any two records is above a threshold, the fuzzy joins system generates a fuzzy join between such two records.
