Abstract:
Computer methods, apparatus and articles of manufacture therefor are disclosed for text characterization using a finite-state transducer that, along each path, accepts on a first side an n-gram of a text-characterization (e.g., a language or a topic) and outputs on a second side a sequence of symbols identifying one or more text-characterizations from a set of text-characterizations. The finite-state transducer is applied to input data. For each n-gram accepted by the finite-state transducer, the frequency counter of each text-characterization identified for that n-gram is incremented. The input data is then classified as one or more text-characterizations from the set, using the frequency counters associated with them.
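The counting scheme in this abstract can be illustrated with a minimal Python sketch. It makes several assumptions not found in the abstract: a plain dictionary (`NGRAM_TO_LABELS`, a hypothetical toy table) stands in for the transducer, the n-grams are character trigrams, and classification simply picks the label or labels with the highest count.

```python
from collections import Counter

# Hypothetical toy table standing in for the transducer: each accepted
# trigram (first side) maps to the characterizations it identifies
# (second side). A real system would compile this into an FST.
NGRAM_TO_LABELS = {
    "the": {"en"},
    "ing": {"en"},
    "und": {"de"},
    "sch": {"de"},
    "ent": {"en", "fr"},
    "que": {"fr"},
}


def classify(text, n=3):
    """Return (best_labels, counters) for the given input text."""
    counters = Counter()
    # Slide a window of n characters over the input.
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Increment one counter per characterization the n-gram identifies.
        for label in NGRAM_TO_LABELS.get(gram, ()):
            counters[label] += 1
    if not counters:
        return [], counters
    best = max(counters.values())
    # Assumed decision rule: keep every label tied for the highest count.
    return sorted(l for l, c in counters.items() if c == best), counters


labels, counters = classify("the thing")
print(labels, dict(counters))
```

The decision rule is the simplest possible one; the abstract only states that the frequency counters are used for classification, not how.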
Abstract:
Computer methods, apparatus and articles of manufacture therefor are disclosed for developing a region-matching transducer for marking language data having delimited strings. The region-matching transducer defines one or more patterns of one or more sequences of delimited strings, with at least one of the patterns having an arrangement of a plurality of class-matching networks. The plurality of class-matching networks defines a combination of two or more entity classes from one or both of part-of-speech classes and application-specific classes. For each of the one or more patterns, the region-matching transducer has an arc leading from a penultimate state with a transition label that identifies the entity class of the pattern. It shares states between patterns leading to a penultimate state when segments of the delimited strings making up two or more patterns overlap.
Abstract:
A system and method for generating tag glossaries and use thereof is provided. A set of tags is accessed. Each tag is associated with a glossary that includes one or more terms and definitions for the terms. A new tag is generated and a new glossary is generated for the new tag based on the glossaries associated with the set of tags. The tag glossaries can be used to provide context for documents associated with the tags, to determine appropriate tags for untagged documents, to help in search for other documents, and to build indices for documents or collections of documents.
Abstract:
A technique is disclosed for using the path numbers of an acyclic finite-state transducer to index a database. Each entry in the database is associated with one or more keys. A finite-state transducer is provided that defines the keys for the database. For each key, a path number is determined; the path number defines a mapping between that key and the corresponding entry or entries in the database.
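A minimal Python sketch of path-number indexing follows. As a simplifying assumption, a trie of nested dictionaries stands in for the acyclic finite-state network, and paths are numbered in lexicographic order of their keys; the function names and the sample records are hypothetical.

```python
def build_trie(keys):
    """Build an acyclic network (here a trie) that accepts exactly the keys."""
    root = {"final": False, "arcs": {}}
    for key in keys:
        node = root
        for ch in key:
            node = node["arcs"].setdefault(ch, {"final": False, "arcs": {}})
        node["final"] = True
    return root


def count_paths(node):
    """Store in each node the number of accepting paths that start there."""
    total = 1 if node["final"] else 0
    for child in node["arcs"].values():
        total += count_paths(child)
    node["count"] = total
    return total


def path_number(root, key):
    """Return the 0-based path number of key, or None if it is not accepted."""
    number = 0
    node = root
    for ch in key:
        if ch not in node["arcs"]:
            return None
        # A key ending at this state sorts before every longer key.
        if node["final"]:
            number += 1
        # Skip all paths that leave through lexicographically smaller arcs.
        for label in sorted(node["arcs"]):
            if label == ch:
                break
            number += node["arcs"][label]["count"]
        node = node["arcs"][ch]
    return number if node["final"] else None


keys = ["car", "card", "cat", "dog"]
root = build_trie(keys)
count_paths(root)

# Database entries stored in path-number order (hypothetical sample data).
records = ["vehicle", "playing card", "feline", "canine"]
print(records[path_number(root, "cat")])
```

Because each key maps to a distinct integer in a dense range, the records can live in a plain array with no stored copy of the keys.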
Abstract:
A processor-implemented method of modifying a string of a regular language, which includes at least two symbols and at least two predetermined substrings. Upon receipt of the string, the processor determines an initial position within the string of a substring matching one of the predetermined substrings. To make this determination, the processor matches symbols of the string either from left to right or from right to left. After identifying the initial position, the processor selects either the longest or the shortest of the matching predetermined substrings. The processor then replaces the matching substring with the string of the lower language associated with the selected predetermined substring and outputs the modified string.
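The two choices described here (scan direction, and longest versus shortest match) can be sketched directly on strings. The Python sketch below is an illustration under assumptions: `directed_replace` and the rule table are hypothetical, the rules are a plain dictionary from each predetermined substring to its replacement, and no transducer is compiled.

```python
def directed_replace(text, rules, left_to_right=True, longest=True):
    """Replace rule substrings in text, scanning in one direction only.

    rules maps each predetermined (non-empty) substring to its replacement.
    """
    pick = max if longest else min
    out = []
    if left_to_right:
        i = 0
        while i < len(text):
            # Candidates are the rule substrings that start at position i.
            matches = [s for s in rules if s and text.startswith(s, i)]
            if matches:
                chosen = pick(matches, key=len)
                out.append(rules[chosen])
                i += len(chosen)
            else:
                out.append(text[i])
                i += 1
        return "".join(out)
    j = len(text)
    while j > 0:
        # Candidates are the rule substrings that end at position j.
        matches = [s for s in rules if s and text.endswith(s, 0, j)]
        if matches:
            chosen = pick(matches, key=len)
            out.append(rules[chosen])
            j -= len(chosen)
        else:
            out.append(text[j - 1])
            j -= 1
    # Pieces were collected from right to left, so restore the order.
    return "".join(reversed(out))


rules = {"ab": "x", "abc": "y", "bc": "z"}
for ltr in (True, False):
    for longest in (True, False):
        print(ltr, longest, directed_replace("abcd", rules, ltr, longest))
```

On the input `abcd` the four combinations do not all agree, which is why the direction and the match-length preference have to be fixed in advance.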
Abstract:
Valid positions for hyphens in input strings are determined by reading in and processing the symbols of the input string through a finite-state transducer whose state-transition data structure is determined by compiling a set of hyphenation rules. The system can output a hyphenated string, or it can accept a hyphenated string and output an indication of whether the input hyphenation is proper according to the set of hyphenation rules.
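The two modes of use (producing a hyphenated string, and checking one) can be shown with a toy Python sketch. Everything specific in it is an assumption: the single rule (a hyphen is allowed between two consonants flanked by vowels) is invented for illustration, and a direct scan stands in for the compiled transducer.

```python
VOWELS = set("aeiou")


def hyphen_positions(word):
    """Return the offsets before which a hyphen is allowed.

    Toy rule (an assumption, not from the source): split VC-CV, i.e.
    between two consonants that are flanked by vowels.
    """
    positions = set()
    for i in range(1, len(word) - 2):
        a, b, c, d = word[i - 1], word[i], word[i + 1], word[i + 2]
        if a in VOWELS and b not in VOWELS and c not in VOWELS and d in VOWELS:
            positions.add(i + 1)  # the hyphen goes before word[i + 1]
    return positions


def hyphenate(word):
    """Mode 1: output the word with every allowed hyphen inserted."""
    cuts = hyphen_positions(word)
    return "".join(("-" if i in cuts else "") + ch for i, ch in enumerate(word))


def is_properly_hyphenated(hyphenated):
    """Mode 2: check that every hyphen in the input sits at an allowed offset."""
    word = hyphenated.replace("-", "")
    used, offset = set(), 0
    for ch in hyphenated:
        if ch == "-":
            used.add(offset)
        else:
            offset += 1
    return used <= hyphen_positions(word)


print(hyphenate("butter"))
print(is_properly_hyphenated("but-ter"), is_properly_hyphenated("bu-tter"))
```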
Abstract:
A technique is disclosed for using an electronic dictionary in conjunction with electronically encoded running text that gives the user the most relevant information rather than burdening the user with all possible information about a selected word. The technique maps the selected word from its inflected form to its citation form, analyzes the selected word in the context of neighboring and surrounding words to resolve ambiguities, and displays the information determined to be most likely relevant. The dictionary preferably has information about multi-word combinations that include the selected word, and the context determination typically entails checking whether the selected word is part of a predefined multi-word combination.
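A rough Python sketch of this lookup order is given below. The lemma table, the dictionary entries and the function names are all hypothetical, and the only context analysis it performs is the multi-word check, preferring the longest combination that covers the selected word.

```python
# Hypothetical toy data: inflected form -> citation form.
LEMMAS = {"kicked": "kick", "kicks": "kick", "buckets": "bucket"}

# Hypothetical dictionary keyed by tuples of citation forms, so that
# single words and multi-word combinations share one table.
ENTRIES = {
    ("kick",): "to strike with the foot",
    ("bucket",): "a container for liquids",
    ("kick", "the", "bucket"): "(idiom) to die",
}


def lemma(token):
    """Map an inflected form to its citation form (identity if unknown)."""
    return LEMMAS.get(token.lower(), token.lower())


def lookup(tokens, index, max_len=3):
    """Return (key, definition) for the word selected at tokens[index]."""
    lemmas = [lemma(t) for t in tokens]
    # Try multi-word combinations first, longest first.
    for length in range(max_len, 1, -1):
        first = max(0, index - length + 1)
        last = min(index, len(lemmas) - length)
        # Every window considered here contains the selected word.
        for start in range(first, last + 1):
            window = tuple(lemmas[start:start + length])
            if window in ENTRIES:
                return window, ENTRIES[window]
    # Otherwise fall back to the single-word entry, if any.
    key = (lemmas[index],)
    return key, ENTRIES.get(key)


print(lookup("He kicked the bucket yesterday".split(), 1))
print(lookup("She kicks the ball".split(), 1))
```

Selecting "kicked" in the first sentence surfaces the idiom entry, while "kicks" in the second falls back to the ordinary sense of "kick".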