EFFICIENT EXACT SET SIMILARITY JOINS
    1.
    Invention application
    EFFICIENT EXACT SET SIMILARITY JOINS (status: in force)

    Publication No.: US20080183693A1

    Publication date: 2008-07-31

    Application No.: US11668870

    Filing date: 2007-01-30

    IPC classification: G06F17/30

    CPC classification: G06F17/30498 G06F17/30533

    Abstract: A machine-implemented system and method that efficiently facilitates and effectuates exact similarity joins between collections of sets. The system and method obtains a collection of sets and a threshold value from an interface and, based at least in part on an identifiable similarity (such as an overlap or intersection) between the sets in the collection, an analysis component generates and outputs a candidate pair whose similarity equals or exceeds the threshold value.
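
The overlap-threshold join described in this abstract can be illustrated with a small sketch. It is not the patented method: it assumes a plain inverted index and an absolute overlap count as the similarity, and the names `overlap_join`, `sets`, and `threshold` are hypothetical. It only reports pairs that share at least one element, so `threshold` is assumed to be 1 or more.

```python
from collections import defaultdict
from itertools import combinations


def overlap_join(sets, threshold):
    """Return all pairs (i, j), i < j, with |sets[i] & sets[j]| >= threshold."""
    # Inverted index: element -> ids of the sets that contain it.
    index = defaultdict(list)
    for set_id, elements in enumerate(sets):
        for element in set(elements):  # set() guards against repeated elements
            index[element].append(set_id)

    # Each posting list is in increasing id order, so combinations() yields i < j.
    overlap = defaultdict(int)
    for posting in index.values():
        for i, j in combinations(posting, 2):
            overlap[(i, j)] += 1

    # The result is exact: every pair meeting the threshold is reported.
    return sorted(pair for pair, count in overlap.items() if count >= threshold)


example_sets = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}, {"x", "y"}]
print(overlap_join(example_sets, 2))
```

Pairs that share no element never appear in any posting list, so they are never counted; that is the only pruning this sketch performs.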

    Efficient exact set similarity joins
    2.
    Invention grant
    Efficient exact set similarity joins (status: in force)

    Publication No.: US07865505B2

    Publication date: 2011-01-04

    Application No.: US11668870

    Filing date: 2007-01-30

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30498 G06F17/30533

    Abstract: A machine-implemented system and method that efficiently facilitates and effectuates exact similarity joins between collections of sets. The system and method obtains a collection of sets and a threshold value from an interface and, based at least in part on an identifiable similarity (such as an overlap or intersection) between the sets in the collection, an analysis component generates and outputs a candidate pair whose similarity equals or exceeds the threshold value.

    Disk-based probabilistic set-similarity indexes
    3.
    Invention grant
    Disk-based probabilistic set-similarity indexes (status: in force)

    Publication No.: US07610283B2

    Publication date: 2009-10-27

    Application No.: US11761425

    Filing date: 2007-06-12

    Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents the similarity of the two sets. Threshold-based lookups are addressed, in which two sets are considered similar if their numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of the frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
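
The quantities named in this abstract can be made concrete with a short sketch. It assumes Jaccard similarity as one possible similarity function and takes the Hamming distance between two sets to be the size of their symmetric difference. The lookup shown is a brute-force full scan, that is, the baseline the described index is meant to avoid, not the index structure itself. All function names are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Similarity function: maps two sets to a number in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hamming(a, b):
    """Hamming distance between sets: elements present in exactly one of them."""
    return len(a ^ b)


def lookup_within_distance(query, input_sets, k):
    """Full-scan baseline: ids of all input sets within distance k of the query."""
    return [i for i, s in enumerate(input_sets) if hamming(query, s) <= k]


def element_frequencies(input_sets):
    """Frequency of an element = number of input sets in which it occurs."""
    return Counter(element for s in input_sets for element in s)


input_sets = [{1, 2, 3}, {1, 2, 4}, {5, 6}]
print(lookup_within_distance({1, 2, 3}, input_sets, 2))
print(jaccard({1, 2, 3}, {1, 2, 4}))
```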

    Disk-Based Probabilistic Set-Similarity Indexes
    4.
    Invention application
    Disk-Based Probabilistic Set-Similarity Indexes (status: in force)

    Publication No.: US20080313128A1

    Publication date: 2008-12-18

    Application No.: US11761425

    Filing date: 2007-06-12

    IPC classification: G06F7/06 G06F17/30

    Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents the similarity of the two sets. Threshold-based lookups are addressed, in which two sets are considered similar if their numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of the frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.

    Scalable lookup-driven entity extraction from indexed document collections
    5.
    Invention grant
    Scalable lookup-driven entity extraction from indexed document collections (status: in force)

    Publication No.: US08782061B2

    Publication date: 2014-07-15

    Application No.: US12144675

    Filing date: 2008-06-24

    IPC classification: G06F17/30 G06F7/00

    CPC classification: G06F17/30011 G06F17/278

    Abstract: A set of documents is filtered for entity extraction. A list of entity strings is received. A set of token sets that covers the entity strings in the list is determined. An inverted index generated on a first set of documents is queried using the set of token sets to determine a set of document identifiers for a subset of the documents in the first set. A second set of documents identified by the set of document identifiers is retrieved from the first set of documents. The second set of documents is filtered to include one or more documents of the second set that each includes a match with at least one entity string of the list of entity strings. Entity recognition may be performed on the filtered second set of documents.
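
The filtering steps in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: tokens are lowercase whitespace-separated words, one token set is used per entity string (the abstract only requires that the token sets cover the list), and the final match is a plain substring test. The names `build_inverted_index`, `candidate_doc_ids`, and `filter_documents` are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    return text.lower().split()


def build_inverted_index(documents):
    """Map each token to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def candidate_doc_ids(index, token_sets):
    """Ids of documents that contain every token of at least one token set."""
    candidates = set()
    for token_set in token_sets:
        postings = [index.get(token, set()) for token in token_set]
        if postings:
            candidates |= set.intersection(*postings)
    return candidates


def filter_documents(documents, entities):
    """Return {doc_id: matched entity strings} for documents with a real match."""
    token_sets = [frozenset(tokenize(entity)) for entity in entities]
    index = build_inverted_index(documents)
    matched = {}
    for doc_id in sorted(candidate_doc_ids(index, token_sets)):
        text = documents[doc_id].lower()
        # Substring test stands in for proper entity matching (assumption).
        hits = [entity for entity in entities if entity.lower() in text]
        if hits:
            matched[doc_id] = hits
    return matched


docs = {
    1: "Microsoft Research is in Redmond",
    2: "research at microsoft labs",
    3: "nothing relevant here",
}
print(filter_documents(docs, ["Microsoft Research"]))
```

Document 2 passes the index lookup because it contains both tokens, but it is dropped by the second filter because the tokens do not form the entity string. Only the surviving documents would be handed to entity recognition.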

    Identifying synonyms of entities using a document collection
    6.
    Invention grant
    Identifying synonyms of entities using a document collection (status: in force)

    Publication No.: US08533203B2

    Publication date: 2013-09-10

    Application No.: US12478120

    Filing date: 2009-06-04

    IPC classification: G06F17/30 G06F7/00

    CPC classification: G06F17/2795 G06F17/278

    Abstract: Identifying synonyms of entities using a collection of documents is disclosed herein. In some aspects, a document from a collection of documents may be analyzed to identify hit sequences that include one or more tokens (e.g., words, numbers, etc.). The hit sequences may then be used to generate discriminating token sets (DTSs) that are subsets of both the hit sequences and the entity names. The DTSs are matched with corresponding entity names, and then used to create DTS phrases by selecting adjacent text in the document that is proximate to the DTS. The DTS phrases may be analyzed to determine whether the corresponding DTS is a synonym of the entity name. In various aspects, the tokens of an associated entity name that are present in the DTS phrases are used to generate a score for the DTS. When the score at least reaches a threshold, the DTS may be designated as a synonym. A list of synonyms may be generated for each entity name.
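
The abstract does not give the scoring formula, so the sketch below assumes one purely for illustration: for each DTS phrase, take the fraction of entity-name tokens that appear in the phrase, then average over phrases. The names `dts_score` and `is_synonym` and the whitespace tokenization are hypothetical.

```python
def dts_score(entity_name, dts_phrases):
    """Average fraction of entity-name tokens found in each DTS phrase (assumed formula)."""
    entity_tokens = set(entity_name.lower().split())
    if not entity_tokens or not dts_phrases:
        return 0.0
    fractions = []
    for phrase in dts_phrases:
        phrase_tokens = set(phrase.lower().split())
        fractions.append(len(entity_tokens & phrase_tokens) / len(entity_tokens))
    return sum(fractions) / len(fractions)


def is_synonym(entity_name, dts_phrases, threshold):
    """Designate the DTS a synonym when the score at least reaches the threshold."""
    return dts_score(entity_name, dts_phrases) >= threshold


phrases = ["the new windows vista release", "vista from microsoft"]
print(dts_score("Microsoft Windows Vista", phrases))
print(is_synonym("Microsoft Windows Vista", phrases, 0.5))
```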

    Finding related entity results for search queries
    7.
    Invention grant
    Finding related entity results for search queries (status: in force)

    Publication No.: US08195655B2

    Publication date: 2012-06-05

    Application No.: US11758024

    Filing date: 2007-06-05

    IPC classification: G06F17/30

    CPC classification: G06F17/278 G06F17/30864

    Abstract: Architecture for finding related entities for web search queries. An extraction component takes a document as input and outputs all the mentions (or occurrences) of named entities such as names of people, organizations, locations, and products in the document, as well as entity metadata. An indexing component takes a document identifier (docID) and the set of mentions of named entities, and stores and indexes the information for retrieval. A document-based search component takes a keyword query and returns the docIDs of the top documents matching the query. A retrieval component takes a docID as input, accesses the information stored by the indexing component, and returns the set of mentions of named entities in the document. This information is then passed to an entity scoring and thresholding component that computes an aggregate score for each entity and selects the entities to return to the user.
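
The last three components in this abstract (document search, mention retrieval, scoring and thresholding) can be wired together in a few lines. The sketch assumes the aggregate score is simply the number of mentions of an entity across the top documents, which the abstract does not specify. `related_entities`, `search`, and `mentions_by_doc` are hypothetical names, and `search` is a stub standing in for a real document search component.

```python
from collections import defaultdict


def related_entities(query, search, mentions_by_doc, threshold):
    """Aggregate entity mentions over the top documents and keep high scorers."""
    scores = defaultdict(float)
    for doc_id in search(query):                          # document-based search
        for entity in mentions_by_doc.get(doc_id, []):    # retrieval component
            scores[entity] += 1.0                         # assumed: one point per mention
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(entity, score) for entity, score in ranked if score >= threshold]


# Output of the extraction and indexing components, keyed by docID.
mentions = {
    "d1": ["Seattle", "Microsoft"],
    "d2": ["Microsoft", "Redmond"],
    "d3": ["Microsoft", "Seattle"],
}


def stub_search(query):
    return ["d1", "d2", "d3"]


print(related_entities("software companies", stub_search, mentions, 2))
```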

    Pushing Search Query Constraints Into Information Retrieval Processing
    8.
    Invention application
    Pushing Search Query Constraints Into Information Retrieval Processing (status: under examination, published)

    Publication No.: US20110320446A1

    Publication date: 2011-12-29

    Application No.: US12823124

    Filing date: 2010-06-25

    IPC classification: G06F17/30

    CPC classification: G06F16/90335

    Abstract: This patent application relates to interval-based information retrieval (IR) search techniques for efficiently and correctly answering keyword search queries. In some embodiments, a range of information-containing blocks for a search query can be identified. Each of these blocks, and thus the range, can include document identifiers that identify individual corresponding documents that contain a term found in the search query. From the range, one or more subranges having a smaller number of blocks than the range can be selected. This can be accomplished without decompressing the blocks, by partitioning the range into intervals and evaluating the intervals. The smaller number of blocks in the subrange(s) can then be decompressed and processed to identify the doc ID(s), and thus the document(s), that satisfy the query.
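
The idea of selecting blocks without decompressing them can be sketched for a two-term AND query. The sketch assumes each compressed block carries the smallest and largest doc ID it contains, so that overlap between intervals can be tested on this metadata alone. The block layout, the use of `zlib` with JSON, and all function names are assumptions for illustration, not the application's encoding.

```python
import json
import zlib


def make_blocks(doc_ids, block_size):
    """Split a sorted doc-ID list into compressed blocks tagged with (min, max)."""
    blocks = []
    for start in range(0, len(doc_ids), block_size):
        chunk = doc_ids[start:start + block_size]
        payload = zlib.compress(json.dumps(chunk).encode("utf-8"))
        blocks.append((chunk[0], chunk[-1], payload))
    return blocks


def overlapping_blocks(blocks_a, blocks_b):
    """Blocks of A whose doc-ID interval overlaps some block of B (no decompression)."""
    return [a for a in blocks_a
            if any(a[0] <= b[1] and b[0] <= a[1] for b in blocks_b)]


def decompress(block):
    return json.loads(zlib.decompress(block[2]).decode("utf-8"))


def intersect(blocks_a, blocks_b):
    """Answer 'term A AND term B', decompressing only the selected blocks."""
    keep_a = overlapping_blocks(blocks_a, blocks_b)
    keep_b = overlapping_blocks(blocks_b, blocks_a)
    ids_a = {doc_id for block in keep_a for doc_id in decompress(block)}
    ids_b = {doc_id for block in keep_b for doc_id in decompress(block)}
    return sorted(ids_a & ids_b), len(keep_a) + len(keep_b)


term_a = make_blocks([1, 2, 3, 50, 51, 52, 100, 101, 102], 3)
term_b = make_blocks([51, 52, 53, 200, 201, 202], 3)
print(intersect(term_a, term_b))
```

In this example only two of the five blocks are decompressed; the remaining three are ruled out from their intervals alone.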

    Efficient evaluation of object finder queries
    9.
    Invention grant
    Efficient evaluation of object finder queries (status: lapsed)

    Publication No.: US07730060B2

    Publication date: 2010-06-01

    Application No.: US11423303

    Filing date: 2006-06-09

    IPC classification: G06F17/30

    CPC classification: G06F17/30964

    Abstract: The subject disclosure pertains to a class of object finder queries that return the best target objects that match a set of given keywords. Mechanisms are provided that facilitate identification of target objects related to search objects that match a set of query keywords. Scoring mechanisms/functions are also disclosed that compute relevance scores of target objects. Further, efficient early termination techniques are provided to compute the top K target objects based on a scoring function.
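
One generic form of early termination for top-K computation is sketched below. It is not the disclosed technique: it assumes each candidate target object comes with a cheap upper bound on its relevance score, visits candidates in decreasing order of that bound, and stops once the K-th best exact score is at least the next bound. `top_k`, `candidates`, and `exact_score` are hypothetical names.

```python
import heapq


def top_k(candidates, exact_score, k):
    """candidates: (upper_bound, obj) pairs with exact_score(obj) <= upper_bound."""
    heap = []        # min-heap holding the best k (score, obj) pairs seen so far
    evaluated = 0    # number of exact score computations performed
    for upper_bound, obj in sorted(candidates, key=lambda c: -c[0]):
        if len(heap) == k and heap[0][0] >= upper_bound:
            break    # no remaining candidate can enter the top k
        evaluated += 1
        score = exact_score(obj)
        if len(heap) < k:
            heapq.heappush(heap, (score, obj))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, obj))
    return sorted(heap, reverse=True), evaluated


bounds = [(10, "a"), (9, "b"), (5, "c"), (4, "d")]
exact = {"a": 8, "b": 9, "c": 5, "d": 1}
print(top_k(bounds, exact.get, 2))
```

With these inputs the loop stops after scoring two of the four candidates, because the second-best exact score already exceeds every remaining upper bound.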

    LEVERAGING CROSS-DOCUMENT CONTEXT TO LABEL ENTITY
    10.
    Invention application
    LEVERAGING CROSS-DOCUMENT CONTEXT TO LABEL ENTITY (status: in force)

    Publication No.: US20090282012A1

    Publication date: 2009-11-12

    Application No.: US12114824

    Filing date: 2008-05-05

    IPC classification: G06F7/06 G06F17/30

    Abstract: Entities, such as people, places and things, are labeled based on information collected across a possibly large number of documents. One or more documents are scanned to recognize the entities, and features are extracted from the context in which those entities occur in the documents. Observed entity-feature pairs are stored either in an in-memory store or an external store. A store manager optimizes use of the limited amount of space for an in-memory store by determining which store to put an entity-feature pair in, and when to evict features from the in-memory store to make room for new pairs. Features that may be observed in an entity's context may take forms such as specific word sequences or membership in a particular list.
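
A store manager of the kind described can be sketched with a bounded in-memory map that spills to a second store. The eviction policy here is least-recently-used, chosen only for illustration since the abstract does not name a policy. The class `StoreManager`, its methods, and the plain dict standing in for the external store are all hypothetical; `capacity` is assumed to be at least 1.

```python
from collections import OrderedDict


class StoreManager:
    """Keeps entity-feature counts in a bounded in-memory store, spilling the rest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = OrderedDict()   # (entity, feature) -> count, oldest first
        self.external = {}            # stand-in for an external (e.g. disk) store

    def observe(self, entity, feature):
        key = (entity, feature)
        if key in self.memory:
            self.memory[key] += 1
            self.memory.move_to_end(key)       # mark as most recently used
            return
        # Restore any count that was evicted earlier, then add this observation.
        self.memory[key] = self.external.pop(key, 0) + 1
        if len(self.memory) > self.capacity:
            old_key, old_count = self.memory.popitem(last=False)  # evict LRU pair
            self.external[old_key] = old_count

    def count(self, entity, feature):
        key = (entity, feature)
        return self.memory.get(key, self.external.get(key, 0))


manager = StoreManager(capacity=2)
manager.observe("Paris", "preceded by 'in'")
manager.observe("Paris", "member of city list")
manager.observe("Alice", "preceded by 'Ms.'")
print(manager.count("Paris", "preceded by 'in'"), len(manager.memory))
```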
