Abstract:
The present invention solves a number of problems in using stems (canonical indicators of word meanings) in full-text retrieval of natural language documents, and thus permits recall to be improved without sacrificing precision. It uses various arrangements of finite-state transducers to accurately encode a number of desirable ways of mapping back and forth between words and stems, taking into account both the systematic aspects of a language's morphological rule system and the word-by-word irregularities that occur. The techniques described apply generally across the languages of the world and are not limited to simple suffixing languages like English. Although the resulting transducers can have many states and transitions (arcs), they can be compacted by finite-state compression algorithms so that they can be used effectively in resource-limited applications. The invention contemplates an information retrieval system comprising the novel finite-state transducer as a database and a processor for responding to user queries, searching the database, and outputting proper responses, if they exist, as well as the novel database used in such a system and methods for constructing it.
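The two-way mapping between words and stems can be illustrated with a minimal Python sketch. It is not the invention's implementation: the class `Transducer`, the helpers `add_pair`, `stems_of`, `words_of`, and `expand`, and the position-by-position alignment of letters are all assumptions made for illustration, and no compression step is shown. Each arc carries an (upper, lower) symbol pair, with the stem on the upper side and the word on the lower side, so one network covers both a regular suffixing verb and an irregular one.

```python
from collections import defaultdict

EPS = ""  # epsilon: the arc carries no symbol on that side


class Transducer:
    """Toy finite-state transducer; arcs carry (upper, lower) symbol pairs."""

    def __init__(self):
        self.arcs = defaultdict(list)  # state -> [(upper, lower, next_state)]
        self.finals = set()
        self.n_states = 1  # state 0 is the start state

    def add_pair(self, stem, word):
        """Add a path whose upper side spells `stem` and lower side `word`.

        Assumption: symbols are aligned position by position and the
        shorter string is padded with epsilons.
        """
        state = 0
        for i in range(max(len(stem), len(word))):
            up = stem[i] if i < len(stem) else EPS
            lo = word[i] if i < len(word) else EPS
            for u, l, nxt in self.arcs[state]:
                if (u, l) == (up, lo):  # reuse an existing arc (prefix sharing)
                    state = nxt
                    break
            else:
                new = self.n_states
                self.n_states += 1
                self.arcs[state].append((up, lo, new))
                state = new
        self.finals.add(state)

    def _walk(self, text, side):
        """Read `text` on `side` (0 = upper, 1 = lower); collect the other side."""
        results, stack = set(), [(0, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(text) and state in self.finals:
                results.add(out)
            for arc in self.arcs.get(state, ()):
                inp, outp, nxt = arc[side], arc[1 - side], arc[2]
                if inp == EPS:
                    stack.append((nxt, pos, out + outp))
                elif pos < len(text) and text[pos] == inp:
                    stack.append((nxt, pos + 1, out + outp))
        return results

    def stems_of(self, word):
        return self._walk(word, 1)

    def words_of(self, stem):
        return self._walk(stem, 0)

    def expand(self, word):
        """Query expansion: every word that shares a stem with `word`."""
        return {w for s in self.stems_of(word) for w in self.words_of(s)}


fst = Transducer()
for stem, word in [("walk", "walk"), ("walk", "walks"), ("walk", "walked"),
                   ("swim", "swim"), ("swim", "swam"), ("swim", "swum")]:
    fst.add_pair(stem, word)

walked_stems = fst.stems_of("walked")  # regular suffixing
swam_variants = fst.expand("swam")     # irregular vowel change
```

In this sketch a query for "swam" can be expanded to "swim", "swam", and "swum" without also matching unrelated words, which is the sense in which stem-based expansion can raise recall while keeping precision.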
Abstract:
A method and apparatus for adding a word to a lexical transducer in a computer system. The invention allows a user of the computer system to specify a word to be added to the lexical transducer database. The lexical transducer represents words as ordered sequences of symbols, i.e., characters and morphological tags. "Upper" and "lower" symbols are associated with arcs. The arcs join states and form a path. Each path determines an upper and lower sequence of ordered symbols. The upper sequence of symbols represents a base form of a word and the lower sequence of symbols represents a surface form of the same word. The user adds a word to the lexical transducer by specifying a "model" word already existing in the lexical transducer, along with a new word that has surface forms analogous to the model word. The new word is added to the lexical transducer by sharing, as much as possible, the existing arcs of the path of the model word.
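The idea of adding a word by analogy with a model word can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method claimed: the class `LexicalTransducer` and the methods `add_paradigm`, `add_stem`, `add_by_model`, and `analyze` are hypothetical names, the tags `+N`, `+Sg`, and `+Pl` are example tags, and only the inflectional tail of the model word's path is shared (a fuller treatment would also share stem-internal arcs where possible).

```python
EPS = ""  # epsilon: no symbol on that side of the arc


class LexicalTransducer:
    """Toy lexical transducer: each arc is (upper, lower, target_state)."""

    def __init__(self):
        self.arcs = {0: []}   # state 0 is the start state
        self.finals = set()
        self.entries = set()  # entry states of shared inflection sub-networks

    def _new_state(self):
        state = len(self.arcs)
        self.arcs[state] = []
        return state

    def add_paradigm(self, endings):
        """Build a shared inflection sub-network and return its entry state.

        `endings` is a list of (tags, surface_suffix) pairs.
        """
        entry = self._new_state()
        self.entries.add(entry)
        for tags, suffix in endings:
            state = entry
            for tag in tags:   # tags appear only on the upper side
                nxt = self._new_state()
                self.arcs[state].append((tag, EPS, nxt))
                state = nxt
            for ch in suffix:  # suffix letters appear only on the lower side
                nxt = self._new_state()
                self.arcs[state].append((EPS, ch, nxt))
                state = nxt
            self.finals.add(state)
        return entry

    def add_stem(self, stem, entry):
        """Add identity arcs spelling `stem`; the last arc enters `entry`."""
        state = 0
        for i, ch in enumerate(stem):
            nxt = entry if i == len(stem) - 1 else self._new_state()
            self.arcs[state].append((ch, ch, nxt))
            state = nxt

    def _stem_end(self, stem):
        """Find the paradigm entry state reached by the path spelling `stem`."""
        stack = [(0, 0)]
        while stack:
            state, pos = stack.pop()
            if pos == len(stem):
                if state in self.entries:
                    return state
                continue
            for up, lo, nxt in self.arcs[state]:
                if up == lo == stem[pos]:
                    stack.append((nxt, pos + 1))
        raise KeyError(f"model word {stem!r} is not in the lexicon")

    def add_by_model(self, new_stem, model_stem):
        """Add `new_stem` so that it reuses the model word's inflection arcs."""
        self.add_stem(new_stem, self._stem_end(model_stem))

    def analyze(self, word):
        """Map a surface form (lower side) to its base-form analyses (upper side)."""
        results, stack = set(), [(0, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(word) and state in self.finals:
                results.add(out)
            for up, lo, nxt in self.arcs[state]:
                if lo == EPS:
                    stack.append((nxt, pos, out + up))
                elif pos < len(word) and word[pos] == lo:
                    stack.append((nxt, pos + 1, out + up))
        return results

    def arc_count(self):
        return sum(len(a) for a in self.arcs.values())


lex = LexicalTransducer()
es_nouns = lex.add_paradigm([(["+N", "+Sg"], ""), (["+N", "+Pl"], "es")])
lex.add_stem("fox", es_nouns)

before = lex.arc_count()
lex.add_by_model("box", "fox")        # "box" inflects like "fox"
new_arcs = lex.arc_count() - before   # only the three stem arcs are new
```

Here the new word costs one arc per stem letter, while the arcs for the tags and the plural suffix are the ones already built for the model word.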