Method and apparatus for establishing topic word classes based on an
entropy cost function to retrieve documents represented by the topic
words
    Granted invention patent
    Method and apparatus for establishing topic word classes based on an entropy cost function to retrieve documents represented by the topic words (status: expired)

    Publication No.: US6128613A

    Publication date: 2000-10-03

    Application No.: US69618

    Filing date: 1998-04-29

    Applicants: Wing S. Wong, An Qin

    Inventors: Wing S. Wong, An Qin

    IPC classification: G06F17/30

    Abstract: A computer-based method and system for establishing topic words to represent a document, the topic words being suitable for use in document retrieval. The method includes determining document keywords from the document; classifying each of the document keywords into one of a plurality of preestablished keyword classes; and selecting words as the topic words, each selected word from a different one of the preestablished keyword classes, to minimize a cost function on proposed topic words. The cost function may be a metric of dissimilarity, such as cross-entropy, between a first distribution of likelihood of appearance by the plurality of document keywords in a typical document and a second distribution of likelihood of appearance by the plurality of document keywords in a typical document, the second distribution being approximated using proposed topic words. The cost function can be a basis for sorting the priority of the documents.
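
The selection step in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not give the exact cost formula or the way the second distribution is approximated, so everything below is assumed for illustration. Specifically, the cost is taken to be a Kullback-Leibler style divergence, the second distribution is assumed to be `q[w] = p[t] * cond[(w, t)]` (with `t` the proposed topic word of the class containing `w`, then normalized), and the names `kl_divergence`, `approximate`, `select_topic_words`, `p`, `cond`, and `classes`, as well as the toy probabilities, are hypothetical.

```python
import itertools
import math


def kl_divergence(p, q, eps=1e-12):
    """Dissimilarity between two keyword distributions (assumed cost function).

    Computes sum_w p[w] * log(p[w] / q[w]); terms with p[w] == 0 are skipped.
    """
    return sum(pw * math.log(pw / max(q[w], eps)) for w, pw in p.items() if pw > 0)


def approximate(p, cond, classes, topics):
    """Hypothetical second distribution built from the proposed topic words.

    Assumption: q[w] = p[t] * cond[(w, t)], where t is the topic word proposed
    for the class containing w; the result is normalized to sum to 1.
    """
    q = {}
    for words, topic in zip(classes, topics):
        for w in words:
            q[w] = p[topic] * cond[(w, topic)]
    total = sum(q.values())
    return {w: v / total for w, v in q.items()}


def cost(p, cond, classes, topics):
    """Cost of one proposal: divergence between p and its approximation."""
    return kl_divergence(p, approximate(p, cond, classes, topics))


def select_topic_words(p, cond, classes):
    """Pick one word per keyword class so that the cost is minimal.

    Exhaustive search over all one-word-per-class combinations; fine for a
    toy example, not meant to scale.
    """
    return min(itertools.product(*classes),
               key=lambda topics: cost(p, cond, classes, topics))


# Toy data (invented): two preestablished keyword classes.
classes = [["car", "engine"], ["bank", "loan"]]

# First distribution: likelihood of each document keyword in a typical document.
p = {"car": 0.4, "engine": 0.1, "bank": 0.3, "loan": 0.2}

# cond[(w, t)]: assumed likelihood that w appears given that t appears.
cond = {
    ("car", "car"): 1.0, ("engine", "car"): 0.25,
    ("car", "engine"): 0.9, ("engine", "engine"): 1.0,
    ("bank", "bank"): 1.0, ("loan", "bank"): 0.6,
    ("bank", "loan"): 0.8, ("loan", "loan"): 1.0,
}

topics = select_topic_words(p, cond, classes)
print(topics, cost(p, cond, classes, topics))
```

Under the same assumptions, the minimized cost could also serve as a sort key when ranking documents, in the spirit of the abstract's last sentence.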
