System, method and program product for answering questions using a search engine
    1.
    Granted invention patent
    Legal status: in force

    Publication number: US06665666B1

    Publication date: 2003-12-16

    Application number: US09495645

    Filing date: 2000-02-01

    IPC classification: G06F17/30

    Abstract: The present invention is a system, method, and program product that comprises a computer with a collection of documents to be searched. The documents contain free form (natural language) text. We define a set of labels called QA-Tokens, which function as abstractions of phrases or question-types. We define a pattern file, which consists of a number of pattern records, each of which has a question template, an associated question word pattern, and an associated set of QA-Tokens. We describe a query-analysis process which receives a query as input and matches it to one or more of the question templates, where a priority algorithm determines which match is used if there is more than one. The query-analysis process then replaces the associated question word pattern in the matching query with the associated set of QA-Tokens, and possibly some other words. This results in a processed query having some combination of original query tokens, new tokens from the pattern file, and QA-Tokens, possibly with weights. We describe a pattern-matching process that identifies patterns of text in the document collection and augments the location with corresponding QA-Tokens. We define a text index data structure which is an inverted list of the locations of all of the words in the document collection, together with the locations of all of the augmented QA-Tokens. A search process then matches the processed query against a window of a user-selected number of sentences that is slid across the document texts. A hit-list of top-scoring windows is returned to the user.
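
The flow in this abstract (rewrite the question words into QA-Tokens, tag document text with the same tokens, then score a sliding window of sentences) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the token names `PERSON$`, `YEAR$`, and `THING$`, the three pattern records, the regex-based tagging, and the unweighted overlap score. The inverted index is also omitted; sentences are scanned directly.

```python
import re

# Hypothetical pattern records: (question template, question-word pattern, QA-Tokens).
# Records are tried in order, so list position acts as the priority rule.
PATTERN_RECORDS = [
    (r"^what year\b", r"^what year", ["YEAR$"]),
    (r"^who\b", r"^who", ["PERSON$"]),
    (r"^what\b", r"^what", ["THING$"]),
]


def analyze_query(query):
    """Replace the matched question-word pattern with its QA-Tokens."""
    q = query.lower().rstrip("?").strip()
    for template, word_pattern, qa_tokens in PATTERN_RECORDS:
        if re.search(template, q):
            rest = re.sub(word_pattern, "", q).split()
            return qa_tokens + rest
    return q.split()


def augment(sentence):
    """Return the sentence's words plus QA-Tokens for patterns found in it."""
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    if re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", sentence):
        tokens.add("YEAR$")  # crude stand-in for a year recognizer
    if re.search(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", sentence):
        tokens.add("PERSON$")  # crude stand-in for a name recognizer
    return tokens


def best_windows(sentences, processed_query, window_size=2, top_n=1):
    """Slide a window of sentences over the text; return (score, start) pairs."""
    augmented = [augment(s) for s in sentences]
    scored = []
    for start in range(max(1, len(sentences) - window_size + 1)):
        window_tokens = set().union(*augmented[start:start + window_size])
        score = sum(1 for t in processed_query if t in window_tokens)
        scored.append((score, start))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]


if __name__ == "__main__":
    text = [
        "The telephone changed communication.",
        "Alexander Bell invented the telephone in 1876.",
        "It rained.",
    ]
    query = analyze_query("Who invented the telephone?")
    print(query, best_windows(text, query, window_size=1))
```

Because "who" is replaced by `PERSON$`, the sentence that contains a person-like name outranks one that merely repeats the query words.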

    System and method for hierarchically grouping and ranking a set of objects in a query context based on one or more relationships
    2.
    Granted invention patent
    Legal status: lapsed

    Publication number: US5875446A

    Publication date: 1999-02-23

    Application number: US804599

    Filing date: 1997-02-24

    IPC classification: G06F17/30

    Abstract: Topically relevant objects in an object database are first identified using any generally known methods to obtain a set of topically relevant objects (topically relevant set). Parents, and in alternative embodiments other ancestors, of one or more of the topically relevant objects are identified according to directional structural relationships that the parents have with respect to the topically relevant objects. These objects form a set of structurally relevant objects (structurally relevant set). In some embodiments, the user query identifies one or more of these structural relationships. The topically relevant objects are then organized under one or more of their respective parents to form a hierarchy level of both (topically relevant and structurally relevant) sets of objects. In some preferred embodiments, the process can iterate to create more than one hierarchy level.
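
One hierarchy level of the grouping described above can be sketched in Python as follows. The data shapes (a list of `(object_id, score)` pairs and a child-to-parents dictionary) and the rule of ranking parents by the summed score of their children are assumptions made for illustration; the abstract does not specify how the groups are ranked.

```python
from collections import defaultdict


def group_under_parents(relevant, parent_of):
    """Organize topically relevant objects under their structural parents.

    relevant:  list of (object_id, score) pairs from any ordinary search.
    parent_of: dict mapping an object_id to a list of its parent ids
               (the directional structural relationship).
    """
    groups = defaultdict(list)
    for obj, score in relevant:
        for parent in parent_of.get(obj, []):
            groups[parent].append((obj, score))
    # Assumed ranking rule: parents by total child score, ties by id.
    ranked = sorted(
        groups.items(),
        key=lambda item: (-sum(s for _, s in item[1]), item[0]),
    )
    return [
        (parent, sorted(children, key=lambda c: (-c[1], c[0])))
        for parent, children in ranked
    ]


if __name__ == "__main__":
    hits = [("ch1", 0.9), ("ch2", 0.4), ("ch3", 0.7)]
    parents = {"ch1": ["bookA"], "ch2": ["bookA"], "ch3": ["bookB"]}
    for parent, children in group_under_parents(hits, parents):
        print(parent, children)
```

Feeding the resulting parents back in as the next `relevant` list, with their own `parent_of` mapping, would produce a further hierarchy level, which is one way to read the iteration the abstract mentions.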

    Identifying duplicate documents from search results without comparing document content
    3.
    Granted invention patent
    Legal status: lapsed

    Publication number: US5913208A

    Publication date: 1999-06-15

    Application number: US677059

    Filing date: 1996-07-09

    IPC classification: G06F17/30

    Abstract: A computer system has a document collection of one or more documents and one or more indexes that each include an inverted file with one or more terms. Each of the terms is associated with one or more document identifiers. The index further includes a document catalog that associates each of the document identifiers with one or more attributes, either intrinsic or non-intrinsic. A search engine process produces a hit list having one or more hit list entries. Each hit list entry, with one or more hit list attributes, is associated with one of the documents that is determined by the search engine to be relevant to the query. A formatter processor selects one or more of the hit list attributes, identified by a hit list attribute selector, and then compares the selected attributes of two or more entries on the hit list to determine whether or not documents associated with these entries are duplicate instances of one another. The determination can be made without examining the content of the document associated with the entries.
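
The attribute comparison described above can be sketched in a few lines of Python. The entry layout (dictionaries with `doc_id`, `title`, `size`, and `date` keys) and the exact-match rule are assumptions made for illustration; the abstract says only that selected hit list attributes are compared and that document content is not examined.

```python
def find_duplicates(hit_list, selected_attributes):
    """Pair up hit list entries whose selected attributes all match.

    Only catalog attributes are compared; document content is never read.
    Returns (first_doc_id, duplicate_doc_id) pairs.
    """
    seen = {}
    duplicates = []
    for entry in hit_list:
        key = tuple(entry.get(name) for name in selected_attributes)
        if key in seen:
            duplicates.append((seen[key]["doc_id"], entry["doc_id"]))
        else:
            seen[key] = entry
    return duplicates


if __name__ == "__main__":
    hits = [
        {"doc_id": "a1", "title": "Annual Report", "size": 20480, "date": "1996-01-05"},
        {"doc_id": "b7", "title": "Annual Report", "size": 20480, "date": "1996-01-05"},
        {"doc_id": "c3", "title": "Annual Report", "size": 19000, "date": "1996-02-01"},
    ]
    print(find_duplicates(hits, ["title", "size", "date"]))
```

Which attributes to select is the main design choice: comparing on `title` alone would also flag `c3`, while adding `size` and `date` treats it as a distinct document.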

    Method and apparatus providing capitalization recovery for text
    4.
    Granted invention patent
    Legal status: lapsed

    Publication number: US06922809B2

    Publication date: 2005-07-26

    Application number: US09893158

    Filing date: 2001-06-27

    IPC classification: G06F17/27 G06F7/02

    CPC classification: G06F17/273

    Abstract: A method for capitalizing text in a document includes processing a reference corpus to construct a plurality of dictionaries of capitalized terms, where the plurality of dictionaries include a singleton dictionary and a phrase dictionary. Each record in the singleton dictionary contains a word in lowercase, a range of phrase lengths m:n for capitalized phrases that the word begins, where m is a minimum phrase length and n is a maximum phrase length, and where each record in the phrase dictionary includes a multi-word phrase in lowercase. The method adds proper capitalization to an input monocase document by capitalizing words found in mandatory capitalization positions; and by looking up each word in the singleton dictionary and, if the word is found in the singleton dictionary, testing the corresponding phrase length range. If the phrase length range indicates that the word does not start a multi-word phrase, the method capitalizes the word, while if the phrase length range indicates that the word does start a multi-word phrase, the method tests the word and an indicated plurality of next words as a candidate phrase to determine if the candidate phrase is found in the phrase dictionary and, if it is, capitalizes the words of the multi-word phrase. If the candidate phrase is not found in the phrase dictionary, the method changes the number of words in the candidate phrase (e.g., decrements by one) to form a revised candidate phrase, and determines whether the revised candidate phrase is found in the phrase dictionary.
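
The lookup procedure in this abstract can be sketched in Python under a few stated assumptions. The input is assumed to be a pre-tokenized lowercase list with sentence-ending punctuation as separate tokens; the only mandatory capitalization position handled is the start of a sentence; a range of `(1, 1)` is read as "the word is capitalized on its own"; and every word of a matched phrase is capitalized, which is a simplification for phrases containing words such as "of". The dictionaries below are hand-written examples, not output of the corpus-processing step.

```python
SENTENCE_END = {".", "!", "?"}


def recover_capitalization(tokens, singletons, phrases):
    """Capitalize a monocase token list using the two dictionaries.

    singletons: dict word -> (m, n), the shortest and longest capitalized
                phrase lengths that the word begins.
    phrases:    set of lowercase multi-word phrases.
    """
    out = list(tokens)
    i = 0
    while i < len(tokens):
        # Mandatory position: first token of the text or of a sentence.
        if i == 0 or tokens[i - 1] in SENTENCE_END:
            out[i] = tokens[i].capitalize()
        advance = 1
        length_range = singletons.get(tokens[i])
        if length_range is not None:
            shortest, longest = length_range
            longest = min(longest, len(tokens) - i)
            # Try the longest candidate first, shrinking by one word each time.
            for length in range(longest, max(shortest, 1) - 1, -1):
                if length == 1:
                    out[i] = tokens[i].capitalize()
                    break
                if " ".join(tokens[i:i + length]) in phrases:
                    for j in range(i, i + length):
                        out[j] = tokens[j].capitalize()
                    advance = length
                    break
        i += advance
    return out


if __name__ == "__main__":
    singletons = {"new": (2, 3), "boston": (1, 1)}
    phrases = {"new york", "new york city"}
    text = "i flew from new york to boston . new flights leave daily".split()
    print(" ".join(recover_capitalization(text, singletons, phrases)))
```

In the example, the second "new" is capitalized only because it starts a sentence: its range `(2, 3)` means it is never capitalized on its own, and neither "new flights leave" nor "new flights" is in the phrase dictionary.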

    System, method and apparatus providing collateral information for a video/audio stream
    5.
    Granted invention patent
    Legal status: in force

    Publication number: US06816858B1

    Publication date: 2004-11-09

    Application number: US09698894

    Filing date: 2000-10-27

    IPC classification: G06F17/30

    Abstract: A system and method are disclosed for performing Automatic Stream Analysis for Broadcast Information, which takes speech audio as input, converts the audio stream into text using a speech recognition system, applies a variety of analyzers to the text stream to identify information elements, automatically generates queries from these information elements, and extracts data from search results that is relevant to a current program. The data is multiplexed into the broadcast signal and transmitted along with the original audio/video program. The system is fully automatic and operates in real time, allowing broadcasters to add relevant collateral information to live programming.
