Method of searching similar document, system for performing the same and program for processing the same
    1.
    Granted invention patent
    Status: Expired

    Publication No.: US07200587B2

    Publication date: 2007-04-03

    Application No.: US10081203

    Filing date: 2002-02-25

    IPC classification: G06F17/30 G06F17/00

    Abstract: A similar document search method includes: a step of extracting characteristic word candidates from a seeds document that contains the desired retrieval contents; a step of extracting, when an extracted candidate is a compound characteristic word made up of a plurality of characteristic words, both the compound characteristic word and its constituent characteristic words as characteristic words of the seeds document; a step of calculating, from the extracted characteristic words, the similarity between the seeds document and a registration document; and a step of outputting the calculated similarity as a retrieval result.

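
    The steps in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the abstract does not say how candidates are extracted, how a compound word is split, or which similarity measure is used, so the whitespace split, the term-frequency vectors, the cosine measure, and every function name here are assumptions made only for the example.

```python
import math
from collections import Counter

def expand_compound(candidate):
    """Return the candidate plus its constituents if it is a compound word.

    Assumption: a compound characteristic word is a space-separated phrase.
    """
    parts = candidate.split()
    return [candidate] + parts if len(parts) > 1 else [candidate]

def characteristic_words(candidates):
    """Count characteristic words, keeping compounds and their constituents."""
    words = Counter()
    for candidate in candidates:
        for word in expand_compound(candidate.lower()):
            words[word] += 1
    return words

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(seed_candidates, registered):
    """Rank registered documents by similarity to the seeds document.

    `registered` maps a document id to its list of word candidates.
    """
    seed = characteristic_words(seed_candidates)
    scored = [
        (doc_id, cosine_similarity(seed, characteristic_words(cands)))
        for doc_id, cands in registered.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

    Because the constituents are kept alongside the compound, a seeds document containing "document search" can still match a registration document that only mentions "search".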

    Data display method and apparatus for use in text mining
    2.
    Granted invention patent
    Status: Expired

    Publication No.: US06738786B2

    Publication date: 2004-05-18

    Application No.: US09874005

    Filing date: 2001-06-06

    IPC classification: G06F17/30

    Abstract: In text mining, if the system extracts only the characteristic words and phrases that frequently co-occur with each component of an analysis axis used as an analysis condition, similar words and phrases are extracted for every component. To clearly indicate characteristic words and phrases that do not appear as co-occurrence words and phrases for the other components of the analysis axis, it is desirable to present the distinguishing features of each component to the user. For this purpose, the frequency of appearance of a plurality of characteristic words and phrases in the documents satisfying each analysis condition is calculated. As a result, multiple co-occurrence words and phrases and component co-occurrence words and phrases are displayed so that they can be told apart. The user can therefore appropriately analyze the contents of a plurality of documents.

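
    One possible reading of this abstract is sketched below. It assumes that each document is tagged with exactly one component of the analysis axis, that a "multiple co-occurrence" word is one appearing with more than one component, and that a "component co-occurrence" word appears with only one. The abstract does not define these terms precisely, so the function names, the `min_count` threshold, and the whitespace tokenization are all hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_by_component(docs, axis, vocabulary):
    """Count, per axis component, the documents containing each characteristic word.

    `docs` is a list of (component, text) pairs; `axis` lists the components.
    """
    counts = {component: Counter() for component in axis}
    for component, text in docs:
        if component not in counts:
            continue  # document does not satisfy any analysis condition
        tokens = set(text.lower().split())
        for word in vocabulary:
            if word in tokens:
                counts[component][word] += 1
    return counts

def classify(counts, min_count=1):
    """Split words into those shared by several components and those unique to one."""
    owners = defaultdict(list)
    for component, counter in counts.items():
        for word, n in counter.items():
            if n >= min_count:
                owners[word].append(component)
    multiple = {w: cs for w, cs in owners.items() if len(cs) > 1}
    specific = {w: cs[0] for w, cs in owners.items() if len(cs) == 1}
    return multiple, specific
```

    A display layer could then render the two dictionaries in different colours or columns so that shared and component-specific words are visually distinct.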

    Similar document retrieving method and system
    3.
    Granted invention patent
    Status: In force

    Publication No.: US07231388B2

    Publication date: 2007-06-12

    Application No.: US10206595

    Filing date: 2002-07-29

    IPC classification: G06F10/30

    Abstract: A similar document retrieving method and system retrieve similar documents with high accuracy from a document database storing plural documents written in different languages, while suppressing retrieval noise even when the number of registered documents differs from one description language to another. Statistical information on the documents to be registered is collected on a language-by-language basis at registration. When documents similar to a query document are retrieved, the weights of the words extracted from the query document are taken into account on a language-by-language basis by referencing the statistical information.

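
    The idea of keeping statistics per language can be illustrated with the sketch below. The abstract does not state which statistics are collected or how weights are derived; the document-frequency counts, the smoothed IDF-style formula, and the `LanguageStats` class are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

class LanguageStats:
    """Per-language registration statistics (hypothetical structure)."""

    def __init__(self):
        self.doc_count = Counter()            # language -> registered documents
        self.doc_freq = defaultdict(Counter)  # language -> word -> documents containing it

    def register(self, language, words):
        """Update the statistics of one language when a document is registered."""
        self.doc_count[language] += 1
        for word in set(words):
            self.doc_freq[language][word] += 1

    def weight(self, language, word):
        """Smoothed IDF computed only from documents in the same language."""
        n = self.doc_count[language]
        df = self.doc_freq[language][word]
        return math.log((n + 1) / (df + 1)) + 1.0

def query_weights(stats, language, words):
    """Weight each query word by its term frequency times the per-language IDF."""
    tf = Counter(words)
    return {w: tf[w] * stats.weight(language, w) for w in tf}
```

    Because each weight is normalized by the document count of its own language, a word from a language with few registered documents is not automatically treated as rare relative to a language with many.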

    Text mining method and apparatus allowing a user to analyze contents of a document set from plural analysis axes
    6.
    Granted invention patent
    Status: Expired

    Publication No.: US06757676B1

    Publication date: 2004-06-29

    Application No.: US09649961

    Filing date: 2000-08-29

    IPC classification: G06F17/30

    Abstract: A text mining method whereby documents (texts) can be analyzed from a wide variety of viewpoints. The text mining method includes: a distinctive word and/or phrase extraction step of extracting words and/or phrases characteristically appearing in a processing subject document set obtained by taking out the whole or a part of a set of documents registered beforehand; a definition information setting step of setting definition information including a specified word or phrase or specified bibliography information; a coincident word and/or phrase acquisition step of acquiring, from among the words and/or phrases extracted at the distinctive word and/or phrase extraction step, coincident words and/or phrases coincident in a predetermined range with a word or phrase or bibliography information included in the definition information; and a multiplex coincident word and/or phrase acquisition step of acquiring coincident words and/or phrases coincident in a predetermined range with an individual word or phrase or bibliography information acquired from each of a plurality of different definition information pieces.

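
    The coincident-word steps might look like the following sketch. It assumes that "coincident in a predetermined range" means appearing within a fixed token window of a definition term, and that the multiplex step keeps the words coincident with every definition (a set intersection). Neither reading is confirmed by the abstract, bibliography information is left out, and all names and the default `window` are hypothetical.

```python
def coincident_words(docs, definition_terms, distinctive, window=5):
    """Distinctive words found within `window` tokens of any definition term."""
    found = set()
    for text in docs:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token in definition_terms:
                lo, hi = max(0, i - window), i + window + 1
                found.update(
                    t for t in tokens[lo:hi]
                    if t in distinctive and t not in definition_terms
                )
    return found

def multiplex_coincident_words(docs, definitions, distinctive, window=5):
    """Words coincident with every definition (assumed to be an intersection)."""
    sets = [coincident_words(docs, d, distinctive, window) for d in definitions]
    return set.intersection(*sets) if sets else set()
```

    Each definition corresponds to one analysis axis, so the intersection surfaces the distinctive words that are relevant to several axes at once.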