Duplicate data elimination system
    1.
    Granted invention patent
    Status: in force

    Publication No.: US07287019B2

    Publication date: 2007-10-23

    Application No.: US10453992

    Filing date: 2003-06-04

    IPC classification: G06F17/30

    Abstract: A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
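
    The grouping described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: Jaccard similarity over field-tagged tokens, connected components as the groups, and "highest total similarity to the rest of the group" as the canonical-record rule are all assumptions, and the function names are hypothetical.

```python
from itertools import combinations

def tokens(record):
    """Tokens of a record, each tagged with its attribute field."""
    return {(field, tok) for field, value in record.items()
            for tok in value.lower().split()}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_duplicates(records, threshold):
    """Return a list of (canonical_index, member_indices), one per group."""
    toks = [tokens(r) for r in records]
    # Edges of the similarity graph: record pairs scoring above the threshold.
    score = {}
    for i, j in combinations(range(len(records)), 2):
        s = jaccard(toks[i], toks[j])
        if s > threshold:
            score[(i, j)] = score[(j, i)] = s
    neighbours = {i: set() for i in range(len(records))}
    for i, j in score:
        neighbours[i].add(j)
    # Assumption: a group is a connected component of the graph.
    seen, groups = set(), []
    for start in range(len(records)):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        members.sort()
        # Assumption: the canonical record is the one most similar to the rest.
        canonical = max(
            members,
            key=lambda m: sum(score.get((m, o), 0.0) for o in members),
        )
        groups.append((canonical, members))
    return groups
```

    Calling `group_duplicates(records, 0.45)` on a list of dictionaries that map attribute names to string values yields one `(canonical_index, member_indices)` pair per group; a record with no sufficiently similar partner forms a group of its own.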

    Keyword Searching On Database Views
    2.
    Invention application
    Status: under examination (published)

    Publication No.: US20100299367A1

    Publication date: 2010-11-25

    Application No.: US12469399

    Filing date: 2009-05-20

    IPC classification: G06F17/30

    Abstract: A keyword search is executed on a view of a database based on a Boolean keyword query. The view includes multiple text columns, and the keyword search is executed on each of the multiple text columns in the view. The output results from the keyword search on each of the text columns include tuple identifiers of one or more relevant tuples and a relevancy score for ranking the results of the keyword query.
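
    As a rough illustration of the per-column search in this abstract, the sketch below runs a simple Boolean query (required terms plus optional alternatives) against each text column and returns tuple identifiers with a score. The view representation, the function names, and the use of raw term counts as the relevancy score are assumptions; the abstract does not say how scores are computed or how per-column results are merged, so no merge step is shown.

```python
def search_column(rows, column, all_of, any_of=()):
    """Return [(tuple_id, score)] for rows whose `column` satisfies the query.

    `rows` maps tuple_id -> {column_name: text}. A row matches when it
    contains every term in `all_of` and, if `any_of` is given, at least
    one of its terms. Score = number of query-term occurrences (assumed).
    """
    hits = []
    for tuple_id, row in rows.items():
        words = row[column].lower().split()
        has_all = all(t in words for t in all_of)
        has_any = not any_of or any(t in words for t in any_of)
        if has_all and has_any:
            score = sum(words.count(t) for t in (*all_of, *any_of))
            hits.append((tuple_id, score))
    # Rank by descending score, breaking ties by tuple identifier.
    return sorted(hits, key=lambda h: (-h[1], h[0]))

def search_view(rows, text_columns, all_of, any_of=()):
    """Run the same query on each text column of the view."""
    return {c: search_column(rows, c, all_of, any_of) for c in text_columns}
```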

    Scalable lookup-driven entity extraction from indexed document collections
    4.
    Invention application
    Status: in force

    Publication No.: US20090319500A1

    Publication date: 2009-12-24

    Application No.: US12144675

    Filing date: 2008-06-24

    IPC classification: G06F17/30 G06F7/06 G06F17/27

    CPC classification: G06F17/30011 G06F17/278

    Abstract: A set of documents is filtered for entity extraction. A list of entity strings is received. A set of token sets that covers the entity strings in the list is determined. An inverted index generated on a first set of documents is queried using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set. A second set of documents identified by the set of document identifiers is retrieved from the first set of documents. The second set of documents is filtered to include one or more documents of the second set that each includes a match with at least one entity string of the list of entity strings. Entity recognition may be performed on the filtered second set of documents.
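
    The filtering pipeline in this abstract can be sketched as below. It is a simplified illustration under stated assumptions: whitespace tokenization, one token set per entity string as the "cover" (the abstract does not say how the cover is chosen), and case-insensitive substring matching as the final check. All function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in text.lower().split():
            index[tok].add(doc_id)
    return index

def candidate_doc_ids(index, token_sets):
    """Ids of documents containing every token of at least one token set."""
    ids = set()
    for token_set in token_sets:
        postings = [index.get(tok, set()) for tok in token_set]
        if postings:
            ids |= set.intersection(*postings)
    return ids

def filter_for_entities(docs, entities):
    """Return the documents that contain at least one entity string."""
    index = build_inverted_index(docs)
    # Assumption: the cover is simply one token set per entity string.
    token_sets = [frozenset(e.lower().split()) for e in entities]
    # Second set: documents retrieved by id from the first set.
    second_set = {d: docs[d] for d in candidate_doc_ids(index, token_sets)}
    # Final filter: keep documents with an actual string match.
    return {d: text for d, text in second_set.items()
            if any(e.lower() in text.lower() for e in entities)}
```

    The index lookup is only a cheap pre-filter: a document can contain all tokens of an entity without containing the entity string itself, which is why the last step re-checks each retrieved document.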

    EXAMPLE-DRIVEN DESIGN OF EFFICIENT RECORD MATCHING QUERIES
    5.
    Invention application
    Status: in force

    Publication No.: US20080306945A1

    Publication date: 2008-12-11

    Application No.: US11758202

    Filing date: 2007-06-05

    IPC classification: G06F17/30

    CPC classification: G06F17/30533 G06F17/30495

    Abstract: Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs are executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that any pair of matching records has a higher similarity value than non-matching record pairs on at least one similarity function.

    Primitive operator for similarity joins in data cleaning
    6.
    Granted invention patent
    Status: in force

    Publication No.: US07406479B2

    Publication date: 2008-07-29

    Application No.: US11352141

    Filing date: 2006-02-10

    IPC classification: G06F17/00

    Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
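
    To make the "set overlap as a building block" idea concrete, here is a minimal sketch of an overlap-based join with Jaccard similarity layered on top. It assumes q-gram sets as the sets constructed for each value and a plain count threshold as the overlap predicate; `qgrams`, `ssjoin`, and `jaccard_join` are hypothetical names, not the operator's actual interface.

```python
from collections import defaultdict

def qgrams(value, q=3):
    """Set of q-grams of a string (the whole string if shorter than q)."""
    s = value.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)} or {s}

def ssjoin(left, right, min_overlap, q=3):
    """Triples (l, r, overlap) whose q-gram sets share >= min_overlap items."""
    index = defaultdict(set)  # q-gram -> positions in `right`
    for j, r in enumerate(right):
        for g in qgrams(r, q):
            index[g].add(j)
    result = []
    for l in left:
        counts = defaultdict(int)  # position in `right` -> shared q-grams
        for g in qgrams(l, q):
            for j in index.get(g, ()):
                counts[j] += 1
        result.extend((l, right[j], c)
                      for j, c in sorted(counts.items()) if c >= min_overlap)
    return result

def jaccard_join(left, right, threshold, q=3):
    """Jaccard similarity join expressed on top of the overlap join."""
    out = []
    for l, r, overlap in ssjoin(left, right, 1, q):
        union = len(qgrams(l, q)) + len(qgrams(r, q)) - overlap
        if overlap / union >= threshold:
            out.append((l, r))
    return out
```

    The point of the sketch is the layering: `jaccard_join` never compares strings directly, it only post-processes the overlap counts produced by `ssjoin`, which is the sense in which one overlap primitive can serve several similarity functions.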

    String predicate selectivity estimation
    8.
    Granted invention patent
    Status: lapsed

    Publication No.: US07149735B2

    Publication date: 2006-12-12

    Application No.: US10603035

    Filing date: 2003-06-24

    IPC classification: G06F17/30

    Abstract: A method of estimating selectivity of a given string predicate in a database query. In the method, selectivities of substrings of various substring lengths are estimated. For example, the selectivity of substrings between length l (or some constant q) to the length of the given string predicate may be estimated. The method then selects a candidate substring for each substring length based on estimated selectivities of the substrings. The estimated selectivities of the candidate substrings are combined. The combined estimated selectivity of the candidate substrings is returned as the estimated selectivity of the given string predicate.
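
    The per-length candidate selection in this abstract might look roughly like the sketch below. Several pieces are assumptions: the per-substring estimator is passed in as a callable (`estimate`, hypothetical), the candidate for each length is taken to be the rarest substring, and the candidates are combined with `min`. The abstract does not specify the combination function; `min` is used here only because a value containing the full predicate must also contain each of its substrings.

```python
def estimate_selectivity(predicate, estimate, q=1, combine=min):
    """Estimate the selectivity of a substring predicate.

    predicate -- the string being searched for
    estimate  -- callable: substring -> estimated selectivity in [0, 1]
    q         -- shortest substring length considered
    combine   -- how candidate selectivities are merged (assumed: min)
    """
    candidates = []
    for length in range(q, len(predicate) + 1):
        subs = {predicate[i:i + length]
                for i in range(len(predicate) - length + 1)}
        # Candidate for this length: the most selective (rarest) substring.
        candidates.append(min(estimate(s) for s in subs))
    if not candidates:  # predicate shorter than q
        return estimate(predicate)
    return combine(candidates)
```

    In this sketch the estimator could be backed by a sample of the column, for example `lambda s: sum(s in v for v in sample) / len(sample)`; a real optimizer would more likely consult precomputed statistics.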

    Scalable lookup-driven entity extraction from indexed document collections
    9.
    Granted invention patent
    Status: in force

    Publication No.: US08782061B2

    Publication date: 2014-07-15

    Application No.: US12144675

    Filing date: 2008-06-24

    IPC classification: G06F17/30 G06F7/00

    CPC classification: G06F17/30011 G06F17/278

    Abstract: A set of documents is filtered for entity extraction. A list of entity strings is received. A set of token sets that covers the entity strings in the list is determined. An inverted index generated on a first set of documents is queried using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set. A second set of documents identified by the set of document identifiers is retrieved from the first set of documents. The second set of documents is filtered to include one or more documents of the second set that each includes a match with at least one entity string of the list of entity strings. Entity recognition may be performed on the filtered second set of documents.

    Identifying synonyms of entities using a document collection
    10.
    Granted invention patent
    Status: in force

    Publication No.: US08533203B2

    Publication date: 2013-09-10

    Application No.: US12478120

    Filing date: 2009-06-04

    IPC classification: G06F17/30 G06F7/00

    CPC classification: G06F17/2795 G06F17/278

    Abstract: Identifying synonyms of entities using a collection of documents is disclosed herein. In some aspects, a document from a collection of documents may be analyzed to identify hit sequences that include one or more tokens (e.g., words, numbers, etc.). The hit sequences may then be used to generate discriminating token sets (DTSs) that are subsets of both the hit sequences and the entity names. The DTSs are matched with corresponding entity names, and then used to create DTS phrases by selecting adjacent text in the document that is proximate to the DTS. The DTS phrases may be analyzed to determine whether the corresponding DTS is a synonym of the entity name. In various aspects, the tokens of an associated entity name that are present in the DTS phrases are used to generate a score for the DTS. When the score at least reaches a threshold, the DTS may be designated as a synonym. A list of synonyms may be generated for each entity name.
