Method and apparatus for calculating similarity among documents
    11.
    Granted invention patent
    Status: lapsed

    Publication No.: US07440938B2

    Publication date: 2008-10-21

    Application No.: US10838231

    Filing date: 2004-05-05

    IPC classification: G06F7/00

    Abstract: Information indicating which individual elements (characteristic character strings, each indicative of a characteristic of a registered document) appear in that document is stored in advance. To calculate similarity to a registered document, a query designated by a searcher is analyzed. The query is represented by a characteristic vector whose elements take the relation between a plurality of words into consideration. The appearance information of the individual words contained in the query is counted, and the counted appearance information is compared with a search index to calculate the similarity between documents.
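
The flow in this abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: the "characteristic character strings" are taken to be single words plus adjacent word pairs (one simple way to reflect relations between words), the search index is an in-memory dictionary, and similarity is cosine similarity. All function names are hypothetical.

```python
import math
from collections import Counter

def features(text):
    """Count characteristic strings: single words plus adjacent word pairs (assumed)."""
    words = text.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return Counter(words + pairs)

def build_index(docs):
    """Store, in advance, the appearance counts for every registered document."""
    return {doc_id: features(text) for doc_id, text in docs.items()}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, index):
    """Analyze the query, then compare its counts with each indexed document."""
    q = features(query)
    return sorted(((cosine(q, vec), doc_id) for doc_id, vec in index.items()), reverse=True)

docs = {
    "d1": "document similarity search",
    "d2": "similarity of a document search",
    "d3": "image compression",
}
index = build_index(docs)
ranked = rank("document similarity", index)
```

Because the word pair "document similarity" is itself a vector element, d1 scores higher than d2 even though both contain the same two query words.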

    Registration method and search method for structured documents
    14.
    Granted invention patent
    Status: lapsed

    Publication No.: US06826567B2

    Publication date: 2004-11-30

    Application No.: US10218495

    Filing date: 2002-08-15

    IPC classification: G06F17/30

    Abstract: A registration/search method for structured documents. For each structured document, correspondence data is prepared between every fixed-length string in the document and its occurrence position within the document. A list of each character, all hierarchical elements containing that character, and the element lengths is also prepared. The occurrence frequency and occurrence position of a search term are obtained using a plurality of fixed-length substrings and the occurrence frequency extracting index. A search character is selected from the search term, and a hierarchical element containing that character is obtained from the element length index. The length of the element corresponding to the search range is extracted using the obtained occurrence position. A matching degree for the search term is calculated from the obtained occurrence frequency of the search term and the extracted length of the element corresponding to the search range.
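
As a rough illustration of the fixed-length-string index described here, the sketch below maps every fixed-length substring of a text to its occurrence positions and then recovers the positions of a longer search term from those entries. The substring length of 2, the plain-string input (no element hierarchy), and the function names are assumptions made for illustration only.

```python
from collections import defaultdict

N = 2  # assumed fixed substring length

def build_occurrence_index(text, n=N):
    """Map each fixed-length substring to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(text) - n + 1):
        index[text[pos:pos + n]].append(pos)
    return index

def find_occurrences(term, index, n=N):
    """Return start positions of `term` using only the fixed-length index."""
    if len(term) < n:
        raise ValueError("search term is shorter than the indexed substring length")
    grams = [term[i:i + n] for i in range(len(term) - n + 1)]
    # A start position p is valid when the i-th substring occurs at p + i for every i.
    candidates = set(index.get(grams[0], []))
    for offset, gram in enumerate(grams[1:], start=1):
        candidates &= {p - offset for p in index.get(gram, [])}
    return sorted(candidates)

text = "search term searching"
occ_index = build_occurrence_index(text)
positions = find_occurrences("search", occ_index)
frequency = len(positions)
```

The occurrence frequency falls out as the number of surviving start positions, so no scan of the original text is needed at search time.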

    Data display method and apparatus for use in text mining
    15.
    Granted invention patent
    Status: lapsed

    Publication No.: US06738786B2

    Publication date: 2004-05-18

    Application No.: US09874005

    Filing date: 2001-06-06

    IPC classification: G06F17/30

    Abstract: In a text mining technique, if the system extracts only the characteristic words and phrases that frequently cooccur with the respective components of an analysis axis given as an analysis condition, similar words and phrases are extracted for every component. To clearly indicate the existence of characteristic words and phrases that do not appear as cooccurrence words and phrases for other components of the analysis axis, it is desirable to present the distinguishing features of each component to the user. For this purpose, the frequency of appearance of a plurality of characteristic words and phrases in the documents satisfying each analysis condition is calculated. As a result, multiple-cooccurrence words and phrases and component-cooccurrence words and phrases are displayed so that they can be told apart. The user can therefore appropriately analyze the contents of a plurality of documents.
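
The distinction drawn in this abstract can be sketched as follows. This is only an illustration under stated assumptions: each document carries a single analysis-axis component as a label, "characteristic words" are the top-k words by document frequency within a component, and a word is treated as a multiple-cooccurrence word when it is in the top-k list of more than one component. The function names and sample data are hypothetical.

```python
from collections import Counter

def cooccurrence_words(docs, top_k=3):
    """For each component, return the top-k words by document frequency."""
    counts = {}
    for component, text in docs:
        # Count each word once per document.
        counts.setdefault(component, Counter()).update(set(text.lower().split()))
    return {
        c: [w for w, _ in sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]]
        for c, cnt in counts.items()
    }

def split_shared_and_specific(top_words):
    """Separate words shared by several components from component-specific ones."""
    usage = Counter(w for words in top_words.values() for w in words)
    shared = sorted(w for w, n in usage.items() if n > 1)
    specific = {c: [w for w in words if usage[w] == 1] for c, words in top_words.items()}
    return shared, specific

docs = [
    ("printer", "paper jam error"),
    ("printer", "paper feed error"),
    ("scanner", "lamp error"),
    ("scanner", "lamp flicker error"),
]
top_words = cooccurrence_words(docs)
shared, specific = split_shared_and_specific(top_words)
```

Here "error" cooccurs with both components and so says little about either, while "paper" and "lamp" are the words that actually distinguish the printer documents from the scanner documents.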

    Document retrieval method and system and computer readable storage medium
    17.
    Granted invention patent
    Status: lapsed

    Publication No.: US06665668B1

    Publication date: 2003-12-16

    Application No.: US09645561

    Filing date: 2000-08-24

    IPC classification: G06F17/30

    Abstract: A document retrieval system is provided whose document display interface makes the important portions easy to recognize, even when the displayed document was retrieved using a query expression designated by a document or a long sentence. When a text is registered, predetermined character strings and location information extracted from the text are stored in a location information file. A weight for each character string is calculated by a predetermined method and stored in a weight file. In retrieving a document, predetermined character strings are extracted from a designated query expression. A similarity between the query expression and the texts in the database is calculated using the location information and the weights acquired from the location information file and the weight file. In displaying the document, character strings having high weights are extracted from the character strings used for the retrieval. Then, the display format of each portion containing the extracted character strings is changed when the text is displayed.
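
A minimal sketch of the register, retrieve, and display steps follows. The abstract leaves the "predetermined" extraction and weighting methods unspecified, so this sketch assumes words as the character strings, an IDF-style weight, a similarity equal to the summed weights of matching strings, and square brackets as the changed display format. The two dictionaries stand in for the location information file and the weight file; all names are hypothetical.

```python
import math
import re
from collections import defaultdict

def register(texts):
    """Build the location table (word -> doc -> offsets) and the weight table."""
    locations = defaultdict(dict)
    for doc_id, text in texts.items():
        for m in re.finditer(r"\w+", text.lower()):
            locations[m.group()].setdefault(doc_id, []).append(m.start())
    n = len(texts)
    # Assumed weighting: rarer strings get higher weights.
    weights = {w: math.log(1 + n / len(docs)) for w, docs in locations.items()}
    return locations, weights

def similarity(query, doc_id, locations, weights):
    """Sum the weights of the query strings that occur in the document."""
    words = set(re.findall(r"\w+", query.lower()))
    return sum(weights[w] for w in words if doc_id in locations.get(w, {}))

def highlight(text, query, weights, top_n=2):
    """Mark the portions containing the highest-weighted query strings."""
    words = {w for w in re.findall(r"\w+", query.lower()) if w in weights}
    top = sorted(words, key=lambda w: (-weights[w], w))[:top_n]
    if not top:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, top)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"[{m.group()}]", text)

texts = {
    "d1": "Vector search ranks documents",
    "d2": "Search engines index documents",
    "d3": "Documents contain text",
}
locations, weights = register(texts)
query = "vector search over documents"
scores = {d: similarity(query, d, locations, weights) for d in texts}
shown = highlight(texts["d1"], query, weights)
```

With a long query, highlighting every matched string would mark most of the text; restricting the display change to the highest-weighted strings keeps the marked portions few and informative.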

    Method and search method for structured documents
    18.
    Granted invention patent
    Status: lapsed

    Publication No.: US06496820B1

    Publication date: 2002-12-17

    Application No.: US09300594

    Filing date: 1999-04-28

    IPC classification: G06F17/30

    Abstract: A registration method for structured documents includes the steps of: preparing, for each structured document, correspondence data between a string and its occurrence position within the structured document, and additionally storing the correspondence data in an occurrence frequency extracting index; and preparing a list of a character, an element containing the character, and a length of the element, and additionally storing the list in an element length index. A search method for structured documents includes the steps of: inputting search conditions including a search term and an element for specifying a search range; decomposing the search term into a plurality of substrings and obtaining an occurrence frequency and an occurrence position of the search term from the occurrence frequency extracting index using the plurality of substrings; selecting a character from the search term, obtaining an element containing the character from the element length index, and further extracting a length of the element within the search range; calculating a matching degree for the search conditions from the occurrence frequency and the occurrence position of the search term and the length of the element within the search range; and outputting the element containing the search term and the matching degree.
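
The matching-degree step can be sketched as below. The abstract does not give the formula, so this sketch assumes the simplest length-normalized form: the number of occurrences of the search term inside an element divided by that element's length. It also substitutes a plain `str.find` scan for the occurrence frequency extracting index, and represents elements as flat character spans. The `Element` class and all function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Element:
    name: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset

def occurrences(text, term):
    """Return every start position of `term` in `text` (stand-in for the index lookup)."""
    positions, pos = [], text.find(term)
    while pos != -1:
        positions.append(pos)
        pos = text.find(term, pos + 1)
    return positions

def matching_degrees(text, elements, term, target):
    """Score each element named `target` by occurrences per character (assumed formula)."""
    hits = occurrences(text, term)
    results = []
    for el in elements:
        if el.name != target:
            continue
        freq = sum(1 for p in hits if el.start <= p and p + len(term) <= el.end)
        length = el.end - el.start
        results.append((el, freq / length if length else 0.0))
    return results

title = "search engine"
body = "a search engine ranks documents by search term frequency"
text = title + body
elements = [Element("title", 0, len(title)), Element("body", len(title), len(text))]

title_score = matching_degrees(text, elements, "search", "title")[0][1]
body_score = matching_degrees(text, elements, "search", "body")[0][1]
```

Normalizing by element length means that one hit in a short title can outweigh two hits in a long body, which is one plausible reason for indexing element lengths alongside occurrence positions.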
