Disk-based probabilistic set-similarity indexes
    1.
    Type: granted patent
    Status: active

    Publication (announcement) number: US07610283B2

    Publication (announcement) date: 2009-10-27

    Application number: US11761425

    Filing date: 2007-06-12

    Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents the similarity of the two sets. Threshold-based lookups are addressed, in which two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of the frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
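
The lookup semantics named in this abstract can be illustrated with a brute-force sketch. This is not the patented index structure (which exists precisely to avoid the full scan done here); it only shows what a threshold lookup and a distance-k lookup are expected to return. The choice of Jaccard similarity and all function names are assumptions made for illustration.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List


def jaccard(a: FrozenSet, b: FrozenSet) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def hamming_distance(a: FrozenSet, b: FrozenSet) -> int:
    """Hamming distance between sets: size of the symmetric difference."""
    return len(a ^ b)


def lookup_by_similarity(inputs: Iterable[FrozenSet], query: FrozenSet,
                         threshold: float) -> List[FrozenSet]:
    """Full scan: keep input sets whose similarity to the query is above the threshold."""
    return [s for s in inputs if jaccard(s, query) > threshold]


def lookup_within_distance(inputs: Iterable[FrozenSet], query: FrozenSet,
                           k: int) -> List[FrozenSet]:
    """Full scan: keep input sets within Hamming distance k of the query."""
    return [s for s in inputs if hamming_distance(s, query) <= k]


def element_frequencies(inputs: Iterable[FrozenSet]) -> Counter:
    """For each element, the number of input sets in which it occurs."""
    counts: Counter = Counter()
    for s in inputs:
        counts.update(s)  # s is a set, so each element counts once per input set
    return counts
```

A real index would return the same answers as these scans while reading only a small part of the input; the frequency table is the kind of side information the abstract says can be used to tune such an index.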

    Disk-Based Probabilistic Set-Similarity Indexes
    2.
    Type: patent application
    Status: active

    Publication (announcement) number: US20080313128A1

    Publication (announcement) date: 2008-12-18

    Application number: US11761425

    Filing date: 2007-06-12

    IPC classification: G06F7/06 G06F17/30

    Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents the similarity of the two sets. Threshold-based lookups are addressed, in which two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of the frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.

    Active learning of record matching packages
    3.
    Type: granted patent
    Status: active

    Publication (announcement) number: US09081817B2

    Publication (announcement) date: 2015-07-14

    Application number: US13084527

    Filing date: 2011-04-11

    IPC classification: G06F17/30 G06N99/00

    CPC classification: G06F17/30507 G06N99/005

    Abstract: An active learning record matching system and method for producing a record matching package that is used to identify pairs of duplicate records. Embodiments of the system and method allow a precision threshold to be specified and then generate a learned record matching package having precision greater than this threshold and a recall close to the best possible recall. Embodiments of the system and method use a blocking technique to restrict the space of record matching packages considered and to scale to large inputs. The learning method considers several record matching packages, estimates the precision and recall of the packages, and identifies the package with maximum recall having precision greater than or equal to the given precision threshold. A human domain expert labels a sample of record pairs in the output of the package as matches or non-matches, and this labeling is used to estimate the precision of the package.
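
The selection step in this abstract (estimate precision from expert-labeled samples, then pick the package with maximum recall among those meeting the precision threshold) can be sketched as follows. This is a minimal illustration, not the patented learning method: the `CandidatePackage` class, its fields, and the assumption that a recall estimate is already available for each package are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidatePackage:
    """Hypothetical container for one candidate record matching package."""
    name: str
    sample_labels: List[bool]  # expert labels on sampled output pairs: True = match
    estimated_recall: float    # assumed to be estimated elsewhere


def estimate_precision(sample_labels: List[bool]) -> float:
    """Fraction of sampled output pairs that the expert labeled as matches."""
    if not sample_labels:
        return 0.0
    return sum(sample_labels) / len(sample_labels)


def select_package(candidates: List[CandidatePackage],
                   precision_threshold: float) -> Optional[CandidatePackage]:
    """Return the max-recall candidate whose estimated precision meets the threshold."""
    feasible = [c for c in candidates
                if estimate_precision(c.sample_labels) >= precision_threshold]
    return max(feasible, key=lambda c: c.estimated_recall, default=None)
```

Returning `None` when no candidate reaches the threshold is a design choice of this sketch; the abstract does not say how that case is handled.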

    ACTIVE LEARNING OF RECORD MATCHING PACKAGES
    4.
    Type: patent application
    Status: active

    Publication (announcement) number: US20120259802A1

    Publication (announcement) date: 2012-10-11

    Application number: US13084527

    Filing date: 2011-04-11

    IPC classification: G06F15/18

    CPC classification: G06F17/30507 G06N99/005

    Abstract: An active learning record matching system and method for producing a record matching package that is used to identify pairs of duplicate records. Embodiments of the system and method allow a precision threshold to be specified and then generate a learned record matching package having precision greater than this threshold and a recall close to the best possible recall. Embodiments of the system and method use a blocking technique to restrict the space of record matching packages considered and to scale to large inputs. The learning method considers several record matching packages, estimates the precision and recall of the packages, and identifies the package with maximum recall having precision greater than or equal to the given precision threshold. A human domain expert labels a sample of record pairs in the output of the package as matches or non-matches, and this labeling is used to estimate the precision of the package.

    TRANSFORMATION-BASED FRAMEWORK FOR RECORD MATCHING
    5.
    Type: patent application
    Status: active

    Publication (announcement) number: US20090210418A1

    Publication (announcement) date: 2009-08-20

    Application number: US12031715

    Filing date: 2008-02-15

    IPC classification: G06F17/30

    Abstract: A transformation-based record matching technique. The technique provides a flexible way to account for synonyms and more general forms of string equivalences when performing record matching by taking as explicit input user-defined transformation rules (such as, for example, the fact that “Robert” and “Bob” are synonymous). The input string and user-defined transformation rules are used to generate a larger set of strings that are used when performing record matching. Both the input string and data elements in a database can be transformed using the user-defined transformation rules in order to generate a larger set of potential record matches. These potential record matches can then be subjected to a threshold test in order to determine one or more best matches. Additionally, signature-based similarity functions are used to improve the computational efficiency of the technique.
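
The idea of expanding a string through user-defined transformation rules and then applying a threshold test can be sketched as below. This is an illustrative simplification under stated assumptions: rules are token-level synonyms held in a plain dictionary, similarity is token-set Jaccard, and every function name is hypothetical. The signature-based similarity functions that the abstract mentions for efficiency are not shown, and this exhaustive expansion grows exponentially with the number of rewritable tokens.

```python
from itertools import product
from typing import Dict, Set


def expand(text: str, rules: Dict[str, Set[str]]) -> Set[str]:
    """All strings obtained by keeping or rewriting each token per the rules."""
    tokens = text.lower().split()
    options = [[t] + sorted(rules.get(t, ())) for t in tokens]
    return {" ".join(choice) for choice in product(*options)}


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the token sets of two strings."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def best_similarity(a: str, b: str, rules: Dict[str, Set[str]]) -> float:
    """Highest similarity over all pairs of transformed variants of a and b."""
    return max(token_jaccard(x, y)
               for x in expand(a, rules)
               for y in expand(b, rules))


def is_match(a: str, b: str, rules: Dict[str, Set[str]], threshold: float) -> bool:
    """Threshold test on the best similarity between the two variant sets."""
    return best_similarity(a, b, rules) >= threshold
```

With the rule set `{"bob": {"robert"}, "robert": {"bob"}}`, the strings "Bob Smith" and "Robert Smith" share a common variant, so they pass any threshold up to 1.0, while "Bob Smith" and "Alice Smith" do not benefit from the rules.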

    Learning string transformations from examples
    6.
    Type: granted patent
    Status: active

    Publication (announcement) number: US08249336B2

    Publication (announcement) date: 2012-08-21

    Application number: US12492311

    Filing date: 2009-08-14

    IPC classification: G06K9/00

    CPC classification: G06F17/2765

    Abstract: Techniques are described to leverage a set of sample or example matched pairs of strings to learn string transformation rules, which may be used to match data records that are semantically equivalent. In one embodiment, matched pairs of input strings are accessed. For a set of matched pairs, a set of one or more string transformation rules is learned. A transformation rule may include two strings determined to be semantically equivalent. The transformation rules are used to determine whether a first string and a second string match each other.

    LEARNING STRING TRANSFORMATIONS FROM EXAMPLES
    7.
    Type: patent application
    Status: active

    Publication (announcement) number: US20110038531A1

    Publication (announcement) date: 2011-02-17

    Application number: US12492311

    Filing date: 2009-08-14

    IPC classification: G06K9/62

    CPC classification: G06F17/2765

    Abstract: Techniques are described to leverage a set of sample or example matched pairs of strings to learn string transformation rules, which may be used to match data records that are semantically equivalent. In one embodiment, matched pairs of input strings are accessed. For a set of matched pairs, a set of one or more string transformation rules is learned. A transformation rule may include two strings determined to be semantically equivalent. The transformation rules are used to determine whether a first string and a second string match each other.

    Efficient exact set similarity joins
    8.
    Type: granted patent
    Status: active

    Publication (announcement) number: US07865505B2

    Publication (announcement) date: 2011-01-04

    Application number: US11668870

    Filing date: 2007-01-30

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30498 G06F17/30533

    Abstract: A machine-implemented system and method that efficiently facilitates and effectuates exact similarity joins between collections of sets. The system and method obtain a collection of sets and a threshold value from an interface and, based at least in part on an identifiable similarity (such as an overlap or intersection) between the sets in the collection, an analysis component generates and outputs a candidate pair whose similarity equals or exceeds the threshold value.
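
An exact overlap-based similarity join of the kind this abstract refers to can be illustrated with a simple inverted-index sketch. This is a textbook-style baseline, not the patented method; the function name and the choice of raw intersection size as the similarity are assumptions. It requires a threshold of at least 1, since pairs that share no element are never counted.

```python
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Set, Tuple


def overlap_join(left: Sequence[Set[Hashable]], right: Sequence[Set[Hashable]],
                 threshold: int) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with |left[i] & right[j]| >= threshold (threshold >= 1)."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")

    # Inverted index: element -> indices of the right-hand sets containing it.
    index = defaultdict(list)
    for j, s in enumerate(right):
        for element in s:
            index[element].append(j)

    result: List[Tuple[int, int]] = []
    for i, s in enumerate(left):
        # Each shared element contributes exactly one count to the pair (i, j).
        overlap: Counter = Counter()
        for element in s:
            for j in index.get(element, ()):
                overlap[j] += 1
        result.extend((i, j) for j, count in sorted(overlap.items())
                      if count >= threshold)
    return result
```

The output is exact because every pair with a non-empty intersection is counted in full before the threshold is applied; the cost is that frequent elements produce long index lists, which is the kind of inefficiency more refined join methods aim to reduce.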

    EFFICIENT EXACT SET SIMILARITY JOINS
    9.
    Type: patent application
    Status: active

    Publication (announcement) number: US20080183693A1

    Publication (announcement) date: 2008-07-31

    Application number: US11668870

    Filing date: 2007-01-30

    IPC classification: G06F17/30

    CPC classification: G06F17/30498 G06F17/30533

    Abstract: A machine-implemented system and method that efficiently facilitates and effectuates exact similarity joins between collections of sets. The system and method obtain a collection of sets and a threshold value from an interface and, based at least in part on an identifiable similarity (such as an overlap or intersection) between the sets in the collection, an analysis component generates and outputs a candidate pair whose similarity equals or exceeds the threshold value.

    High precision set expansion for large concepts
    10.
    Type: granted patent
    Status: active

    Publication (announcement) number: US09547718B2

    Publication (announcement) date: 2017-01-17

    Application number: US13325072

    Filing date: 2011-12-14

    IPC classification: G06F17/30

    CPC classification: G06F17/30867 G06Q30/0201

    Abstract: A set expansion system is described herein that improves precision, recall, and performance of prior set expansion methods for large sets of data. The system maintains high precision and recall by 1) identifying the quality of particular lists and applying that quality through a weight, 2) allowing for the specification of negative examples in a set of seeds to reduce the introduction of bad entities into the set, and 3) applying a cutoff to eliminate lists that include a low number of positive matches. The system may perform multiple passes to first generate a good candidate result set and then refine the set to find a set with the highest quality. The system may also apply MapReduce or other distributed processing techniques to allow calculation in parallel. Thus, the system efficiently expands large concept sets from a potentially small set of initial seeds using readily available web data.
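
The three ingredients listed in this abstract (a per-list quality weight, negative seeds, and a cutoff on positive matches) can be combined in a single-pass sketch like the one below. The specific weight formula, the default cutoff, and all names are assumptions chosen for illustration; the abstract does not specify them, and the multi-pass refinement and distributed execution it mentions are not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def expand_set(lists: Iterable[Iterable[str]], positive_seeds: Set[str],
               negative_seeds: Set[str], min_positive_hits: int = 2) -> List[str]:
    """Rank non-seed entities by the summed weight of the lists they appear in."""
    scores: Dict[str, float] = defaultdict(float)
    for items in lists:
        members = set(items)
        positives = len(members & positive_seeds)
        negatives = len(members & negative_seeds)
        # Cutoff: skip lists with too few positive seed matches.
        if positives < min_positive_hits:
            continue
        # Assumed quality weight: positive hits, penalized by negative hits,
        # normalized by list size so long noisy lists do not dominate.
        weight = max(positives - negatives, 0) / len(members)
        for entity in members - positive_seeds - negative_seeds:
            scores[entity] += weight
    ranked = [e for e, score in scores.items() if score > 0]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(ranked, key=lambda e: (-scores[e], e))
```

For example, with positive seeds `{"paris", "london"}` and negative seed `{"banana"}`, a list containing both cities contributes its other members as candidates, while a list that mentions neither city is dropped by the cutoff regardless of its contents.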
