Finite-state transduction of related word forms for text indexing and
retrieval
    2.
    Patent type: granted invention patent
    Legal status: expired

    Publication number: US5594641A

    Publication date: 1997-01-14

    Application number: US255504

    Filing date: 1994-06-08

    IPC classification: G06F17/30

    Abstract: The present invention solves a number of problems in using stems (canonical indicators of word meanings) in full-text retrieval of natural language documents, and thus permits recall to be improved without sacrificing precision. It uses various arrangements of finite-state transducers to accurately encode a number of desirable ways of mapping back and forth between words and stems, taking into account both systematic aspects of a language's morphological rule system and also the word-by-word irregularities that also occur. The techniques described apply generally across the languages of the world and are not just limited to simple suffixing languages like English. Although the resulting transducers can have many states and transitions or arcs, they can be compacted by finite-state compression algorithms so that they can be used effectively in resource-limited applications. The invention contemplates the information retrieval system comprising the novel finite state transducer as a database and a processor for responding to user queries, for searching the database, and for outputting proper responses, if they exist, as well as the novel database used in such a system and methods for constructing the novel database.
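
The two-way mapping between words and stems described in the abstract can be illustrated with a small sketch. It is not the patented construction: a plain lookup table stands in for the compiled finite-state transducer, and the word/stem pairs, the function names `stems`, `expand` and `search`, and the rule that unknown words map to themselves are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical word/stem pairs, including irregular forms that simple
# suffix stripping would miss ("swam", "mice").
PAIRS = [
    ("swim", "swim"), ("swims", "swim"), ("swimming", "swim"),
    ("swam", "swim"), ("swum", "swim"),
    ("mouse", "mouse"), ("mice", "mouse"),
]

# The same relation indexed in both directions, mimicking a transducer
# that can be applied from words to stems and from stems to words.
word_to_stems = defaultdict(set)
stem_to_words = defaultdict(set)
for word, stem in PAIRS:
    word_to_stems[word].add(stem)
    stem_to_words[stem].add(word)


def stems(word):
    """Map a word to its stems (assumption: unknown words map to themselves)."""
    return set(word_to_stems.get(word, {word}))


def expand(word):
    """Map a word to its stems, then back to every word sharing a stem."""
    variants = set()
    for stem in stems(word):
        variants |= stem_to_words.get(stem, {stem})
    return variants


def search(query, documents):
    """Return indices of documents containing any variant of the query word."""
    variants = expand(query)
    return [i for i, doc in enumerate(documents)
            if variants & set(doc.lower().split())]
```

Calling `search("mice", ["A mouse ran", "Two cats"])` matches the first document because `mice` and `mouse` share a stem, which is the kind of recall improvement the abstract refers to.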

    Augmenting a lexical transducer by analogy
    3.
    Patent type: granted invention patent
    Legal status: expired

    Publication number: US5412567A

    Publication date: 1995-05-02

    Application number: US999736

    Filing date: 1992-12-31

    Applicant: Lauri Karttunen

    Inventor: Lauri Karttunen

    IPC classification: G06F17/27 G06F17/30 G06F15/38

    CPC classification: G06F17/30985 G06F17/2795

    Abstract: A method and apparatus for adding a word to a lexical transducer in a computer system. The invention allows a user of the computer system to specify a word to be added to the lexical transducer database. The lexical transducer represents words as ordered sequences of symbols, i.e., characters and morphological tags. "Upper" and "lower" symbols are associated with arcs. The arcs join states and form a path. Each path determines an upper and lower sequence of ordered symbols. The upper sequence of symbols represents a base form of a word and the lower sequence of symbols represents a surface form of the same word. The user adds a word to the lexical transducer by specifying a "model" word already existing in the lexical transducer, along with a new word that has surface forms analogous to the model word. The new word is added to the lexical transducer by sharing, as much as possible, the existing arcs of the path of the model word.
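
The arc-sharing idea in the abstract can be sketched as a toy data structure. This is a simplified illustration, not the patented method: the class `LexicalTransducer`, its methods, the tags `+Noun`, `+Sg`, `+Pl`, and the example words are hypothetical, and the sketch assumes that the state reached after the model word's stem leads only to that word's inflectional arcs.

```python
from collections import defaultdict


class LexicalTransducer:
    """Toy lexical transducer: every arc carries an (upper, lower) symbol pair."""

    def __init__(self):
        self.arcs = defaultdict(list)  # state -> [(upper, lower, target)]
        self.finals = set()
        self.n_states = 1              # state 0 is the start state

    def _new_state(self):
        self.n_states += 1
        return self.n_states - 1

    def add_path(self, source, pairs, target=None):
        """Add arcs for `pairs` starting at `source`, reusing matching arcs.
        If `target` is given, the last arc points at that existing state."""
        state = source
        for i, (up, lo) in enumerate(pairs):
            if i == len(pairs) - 1 and target is not None:
                if (up, lo, target) not in self.arcs[state]:
                    self.arcs[state].append((up, lo, target))
                return target
            nxt = next((t for u, l, t in self.arcs[state]
                        if (u, l) == (up, lo)), None)
            if nxt is None:
                nxt = self._new_state()
                self.arcs[state].append((up, lo, nxt))
            state = nxt
        return state

    def stem_end(self, stem):
        """Follow identity arcs (c, c) for each character of an existing stem."""
        state = 0
        for ch in stem:
            state = next(t for u, l, t in self.arcs[state] if u == l == ch)
        return state

    def add_by_analogy(self, model_stem, new_stem):
        """Attach the new stem to the state where the model stem ends,
        so every inflectional arc of the model word is shared."""
        shared = self.stem_end(model_stem)
        self.add_path(0, [(c, c) for c in new_stem], target=shared)

    def analyze(self, surface):
        """Return the upper (base) strings whose lower side spells `surface`."""
        results = []
        stack = [(0, 0, "")]  # (state, position in surface, upper so far)
        while stack:
            state, pos, upper = stack.pop()
            if state in self.finals and pos == len(surface):
                results.append(upper)
            for up, lo, target in self.arcs[state]:
                if surface.startswith(lo, pos):  # "" acts as an empty symbol
                    stack.append((target, pos + len(lo), upper + up))
        return sorted(results)


# Model word "dog" with singular and plural paths.
lt = LexicalTransducer()
stem_state = lt.add_path(0, [(c, c) for c in "dog"])
sg = lt.add_path(stem_state, [("+Noun", ""), ("+Sg", "")])
pl = lt.add_path(stem_state, [("+Noun", ""), ("+Pl", "s")])
lt.finals.update({sg, pl})

states_before = lt.n_states
lt.add_by_analogy("dog", "cat")  # only the new stem's own arcs are created
```

After the call, `lt.analyze("cats")` yields the base form with its tags, and only two states were added for the new word, since the final stem arc points into the model word's existing suffix network.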