1. Method and apparatus for automatic document summarization
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5638543A

    Publication date: 1997-06-10

    Application number: US71114

    Filing date: 1993-06-03

    IPC classification: G06F17/21 G06F17/30

    CPC classification: G06F17/30719

    Abstract: Regions of a document, such as sentences and blocks of sentences, are scored and classified based upon their scores. An abstract of the document can be formed from the classified sentences. Sentences are classified by the use of words classified as stop words and vanish words. Sentences are scored based on the number of stop words and the number of strings of connected stop words, called stop-word runs, contained in the sentence. Passionate sentences, which usually contain information about which the writer has strong feelings, such as joy, admiration, or sadness, are identified. This method can also select sentences that are contrapassionate, which the writer may either have to strengthen or have inserted to complete the record and provide continuity or information.

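The stop-word counting described in the abstract above can be illustrated with a minimal Python sketch. The stop-word list, the tokenization, and the `min_run_length` parameter are assumptions made for illustration; the abstract does not specify them, nor how the two counts are combined into a score.

```python
# Hypothetical stop-word list; the patent's actual list is not given in the abstract.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "it",
              "that", "on", "for", "with", "as", "by"}


def stop_word_stats(sentence, min_run_length=1):
    """Return (stop-word count, stop-word-run count) for one sentence.

    A run is taken to be a maximal sequence of consecutive stop words.
    Whether a single isolated stop word counts as a run is an assumption,
    controlled here by min_run_length.
    """
    words = [w.strip(".,;:!?\"'()").lower() for w in sentence.split()]
    words = [w for w in words if w]
    stop_count = 0
    run_count = 0
    current_run = 0
    # The empty-string sentinel flushes a run that ends the sentence.
    for w in words + [""]:
        if w in STOP_WORDS:
            stop_count += 1
            current_run += 1
        else:
            if current_run > 0 and current_run >= min_run_length:
                run_count += 1
            current_run = 0
    return stop_count, run_count
```

Calling `stop_word_stats("The cat sat on the mat in the sun.")` counts the runs "the", "on the", and "in the"; passing `min_run_length=2` counts only the last two.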

2. Scatter-gather: a cluster-based method and apparatus for browsing large document collections
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5442778A

    Publication date: 1995-08-15

    Application number: US790316

    Filing date: 1991-11-12

    IPC classification: G06F17/30

    Abstract: Scatter-Gather is a computer-based document browsing method that operates in time proportional to the number of documents in a target corpus. The Scatter-Gather method includes: preparing an initial ordering of the corpus using, for example, an off-line computational method; determining a summary of the initial ordering of the corpus for interactive utility; and providing a further ordering of the corpus using, for example, an on-line non-deterministic method. The off-line preparation of an initial ordering of a corpus is non-time-dependent; thus an accurate initial ordering is prepared. The step of determining a summary includes determining a summary for presentation to a user without scrolling on a CRT. The step of providing a further ordering includes truncated group average agglomerative clustering, merging disjoint document sets, center finding, assign-to-nearest, and other refinement methods.


3. Iterative technique for phrase query formation and an information retrieval system employing same
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5278980A

    Publication date: 1994-01-11

    Application number: US745794

    Filing date: 1991-08-16

    Abstract: An information retrieval system and method are provided in which an operator inputs one or more query words which are used to determine a search key for searching through a corpus of documents, and which returns any matches between the search key and the corpus of documents as a phrase containing the word data matching the query word(s), a non-stop (content) word next adjacent to the matching word data, and all intervening stop-words between the matching word data and the next adjacent non-stop word. The operator, after reviewing one or more of the returned phrases, can then use one or more of the next adjacent non-stop-words as new query words to reformulate the search key and perform a subsequent search through the document corpus. This process can be conducted iteratively until the appropriate documents of interest are located. The additional non-stop-words from each phrase are preferably aligned with each other (e.g., by columnation) to ease viewing of the "new" content words.

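The phrase returned for each match, as described in the abstract above, can be sketched in Python as follows. The stop-word list and the function name `phrases_for_query` are hypothetical, and whitespace tokenization is an assumption; the sketch only shows how a match is extended across stop words to the next content word.

```python
# Hypothetical stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "on", "with"}


def phrases_for_query(tokens, query):
    """Return (phrase, next_content_word) for every occurrence of query.

    Each phrase contains the matching word, all intervening stop words,
    and the next adjacent non-stop (content) word.
    """
    results = []
    for i, tok in enumerate(tokens):
        if tok.lower() != query.lower():
            continue
        j = i + 1
        # Skip over stop words until the next content word.
        while j < len(tokens) and tokens[j].lower() in STOP_WORDS:
            j += 1
        if j < len(tokens):  # a content word exists after the match
            results.append((" ".join(tokens[i:j + 1]), tokens[j]))
    return results
```

The second element of each pair is the candidate new query word that an operator could feed into the next iteration of the search.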

4. Method of ordering document clusters without requiring knowledge of user interests
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5787420A

    Publication date: 1998-07-28

    Application number: US572558

    Filing date: 1995-12-14

    IPC classification: G06F17/30

    Abstract: A computerized method of ordering document clusters for presentation after browsing a corpus of documents that presents document clusters in a logical fashion in the absence of any indication of the computer user's interests. The method begins by grouping the corpus into a plurality of clusters, each having a centroid and including at least one document. Next, for each cluster, a degree of similarity between that cluster and every other cluster is determined by finding a dot product between each cluster centroid and every other cluster centroid. The similarity information is then used to determine an order of presentation for the plurality of clusters in a way that maximizes the degree of similarity between adjacent clusters.

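As a rough illustration of ordering clusters by centroid similarity, the Python sketch below uses a greedy nearest-neighbour chain. This greedy heuristic is an assumption chosen for brevity; the abstract does not say how the order that maximizes adjacent similarity is found, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def order_clusters(centroids):
    """Greedy ordering of cluster indices by centroid similarity.

    Starts at cluster 0 and repeatedly appends the not-yet-placed cluster
    whose centroid has the largest dot product with the last placed one.
    Ties are broken in favour of the lower index.
    """
    if not centroids:
        return []
    order = [0]
    remaining = set(range(1, len(centroids)))
    while remaining:
        last = centroids[order[-1]]
        nxt = max(remaining, key=lambda i: (dot(last, centroids[i]), -i))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

A greedy chain does not guarantee the globally best ordering; it only keeps each cluster next to a similar neighbour at low cost.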

5. Method and apparatus for information access employing overlapping clusters
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5999927A

    Publication date: 1999-12-07

    Application number: US65828

    Filing date: 1998-04-24

    IPC classification: G06F17/30

    Abstract: The present invention is a method and apparatus for document clustering-based browsing of a corpus of documents, and more particularly to the use of overlapping clusters to improve recall. The present invention is directed to improving the performance of information access methods and apparatus through the use of non-disjoint (overlapped) clustering operations. The present invention is further described in terms of two possible methods for expanding document clusters so as to achieve the overlap, and a method for increasing precision through the use of the overlapped clusters.


6. Method of ordering document clusters given some knowledge of user interests
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5911140A

    Publication date: 1999-06-08

    Application number: US572399

    Filing date: 1995-12-14

    IPC classification: G06F17/30

    Abstract: A method of automatically ordering the presentation of document clusters generated from a ranked corpus of documents. First, the corpus is ordered into a plurality of clusters. Next, a rank is determined for each cluster based upon the rank of a document within that cluster. Afterward, the clusters are presented to a computer user in the order determined by their rank.

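The rank-based ordering in the abstract above can be sketched in Python. The abstract says only that a cluster's rank is based on the rank of a document within it; using the best-ranked document (rank 1 being best) is an assumption, as are the function name and data layout.

```python
def order_clusters_by_rank(clusters, doc_rank):
    """Order cluster ids by the best document rank each cluster contains.

    clusters: dict mapping cluster id -> non-empty list of document ids.
    doc_rank: dict mapping document id -> rank in the corpus (1 = best).
    Ties between clusters are broken by cluster id.
    """
    cluster_rank = {
        cid: min(doc_rank[d] for d in docs) for cid, docs in clusters.items()
    }
    return sorted(cluster_rank, key=lambda cid: (cluster_rank[cid], cid))
```

For example, with clusters `{"a": ["d3", "d5"], "b": ["d1", "d4"], "c": ["d2"]}` and documents ranked d1 through d5 in that order, cluster "b" is presented first because it holds the top-ranked document.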

7. Method and apparatus for information access employing overlapping clusters
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5787422A

    Publication date: 1998-07-28

    Application number: US585075

    Filing date: 1996-01-11

    IPC classification: G06F17/30

    Abstract: The present invention is a method and apparatus for document clustering-based browsing of a corpus of documents, and more particularly to the use of overlapping clusters to improve recall. The present invention is directed to improving the performance of information access methods and apparatus through the use of non-disjoint (overlapped) clustering operations. The present invention is further described in terms of two possible methods for expanding document clusters so as to achieve the overlap, and a method for increasing precision through the use of the overlapped clusters.


8. Automatic method of identifying drop words in a document image without performing character recognition
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5850476A

    Publication date: 1998-12-15

    Application number: US572847

    Filing date: 1995-12-14

    CPC classification: G06K9/00442 G06K9/00852

    Abstract: A method of automatically identifying drop words in a document image without performing character recognition to generate an ASCII representation of the document text. First, the document image is analyzed to identify word equivalence classes, each of which represents at least one word of the multiplicity of words included in the document. Second, for each word equivalence class, the likelihood that it is not a drop word is determined. Third, document length is analyzed to determine whether the document is short. For a short document, the number of word equivalence classes identified as drop words based upon their likelihood is proportional to document length. For long documents, a fixed number of word equivalence classes are identified as drop words based upon the likelihood that they are not drop words.

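The length-dependent selection described in the abstract above might look like the following Python sketch. The threshold, the proportionality constant, and the fixed count are hypothetical values chosen only for illustration, as is the function name; the abstract gives none of them.

```python
def select_drop_words(not_drop_likelihood, doc_length,
                      short_threshold=500,       # hypothetical cutoff, in words
                      words_per_drop_class=50,   # hypothetical proportionality
                      fixed_count=10):           # hypothetical fixed number
    """Pick the word equivalence classes to treat as drop words.

    not_drop_likelihood: dict mapping class id -> likelihood that the class
    is NOT a drop word. Classes with the lowest likelihood are selected.
    For a short document the number selected grows with document length;
    for a long document it is fixed.
    """
    if doc_length < short_threshold:
        n = doc_length // words_per_drop_class
    else:
        n = fixed_count
    ranked = sorted(not_drop_likelihood,
                    key=lambda c: (not_drop_likelihood[c], c))
    return ranked[:n]
```

With these placeholder defaults, a 100-word document yields two drop-word classes, while any document of 500 words or more yields at most ten.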

9. Automatic method of generating thematic summaries from a document image without performing character recognition
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5848191A

    Publication date: 1998-12-08

    Application number: US572848

    Filing date: 1995-12-14

    Abstract: A method of automatically generating a thematic summary from a document image without performing character recognition to generate an ASCII representation of the document text. The method begins with decomposition of the document image into text blocks and text lines. Using the median x-height of text blocks, the main body of text is identified. Afterward, word image equivalence classes and sentence boundaries within the blocks of the main body of text are determined. The word image equivalence classes are used to identify thematic words. These, in turn, are used to score the sentences within the main body of text, and the highest scoring sentences are selected for extraction.

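The sentence-scoring step in the abstract above can be sketched in Python once each word image has been assigned an equivalence-class id. Treating the most frequent non-drop classes as thematic words, and scoring a sentence by how many thematic words it contains, are assumptions made for illustration; the function and parameter names are hypothetical, and the image analysis itself is out of scope here.

```python
from collections import Counter


def top_sentences(sentences, drop_classes, num_thematic=2, num_selected=1):
    """Return indices of the highest-scoring sentences, in document order.

    sentences: list of sentences, each a list of word-class ids
    (one id per word image). drop_classes: ids to ignore.
    """
    counts = Counter(c for s in sentences for c in s if c not in drop_classes)
    # Assumed rule: the most frequent non-drop classes are the thematic words.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    thematic = {c for c, _ in by_frequency[:num_thematic]}
    scores = [sum(1 for c in s if c in thematic) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:num_selected])
```

Because the function works on class ids rather than characters, it never needs to know what the words actually say, which mirrors the recognition-free approach the abstract describes.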

10. Detecting function words without converting a scanned document to character codes
    Type: Granted invention patent
    Legal status: Expired

    Publication number: US5455871A

    Publication date: 1995-10-03

    Application number: US242990

    Filing date: 1994-05-16

    IPC classification: G06K9/46 G06K9/00 G06K9/34

    CPC classification: G06K9/00

    Abstract: A method and apparatus detect function words in a first image of a scanned document without first converting the image to character codes. Function words include determiners, prepositions, articles, and other words that play a largely grammatical role, as opposed to words such as nouns and verbs that convey topic information. Non-content-based morphological characteristics of image units are predetermined, as well as the presence or omission of character ascenders and descenders in image units. Predetermined characteristics of function word image units are compared with the image units of an image, and when a match occurs, the image unit is identified as a function word. Conversely, when no matching characteristics occur, the image unit is identified as a non-function word. Additionally, image units are classified and identified as containing only upper case characters, only lower case characters, only digits, or mixed character types.
