Abstract:
A system for stemming words of Semitic languages, the system including an affix scanner configured to scan a word of a Semitic language for at least one affix according to a predefined scanning sequence and determine if at least one predefined scanning criterion is met, and a stemmer configured to remove the affix from the word if the predefined scanning criterion is met.
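The scanner-plus-stemmer arrangement in the abstract above might be sketched as follows. This is a minimal illustration, not the claimed system: the transliterated affix lists, the prefix-then-suffix scanning sequence, and the minimum-stem-length criterion (`MIN_STEM_LEN`) are all assumptions introduced here, since the abstract does not specify them.

```python
from typing import Sequence

# Hypothetical, transliterated affix lists (assumptions for illustration only).
# A real system would use the script and affix inventory of the target language.
PREFIXES: Sequence[str] = ("wal", "al", "wa", "bi")
SUFFIXES: Sequence[str] = ("at", "un", "ha")

# Assumed scanning criterion: an affix is removed only if at least
# this many characters remain afterwards.
MIN_STEM_LEN = 3


def stem(word: str) -> str:
    """Scan for one prefix, then one suffix, removing each only if the criterion is met."""
    # Assumed scanning sequence: prefixes first, in listed order.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    # Then suffixes, in listed order.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word
```

Under these assumptions, `stem("walkitabun")` strips `wal` and `un`, while `stem("alab")` is left unchanged because removing `al` would leave fewer than three characters.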
Abstract:
A hybrid n-gram/lexical analysis tokenization system including a lexicon and a hybrid tokenizer operative to perform both N-gram tokenization of a text and lexical analysis tokenization of a text using the lexicon, and to construct either of an index and a classifier from the results of both of the N-gram tokenization and the lexical analysis tokenization, where the hybrid tokenizer is implemented in at least one of computer hardware and computer software and is embodied within a computer-readable medium.
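One way to picture the hybrid tokenizer described above is an inverted index fed by both token streams. The sketch below is an assumption-laden illustration: character bigrams stand in for the N-gram tokenization, a greedy longest-match lookup stands in for the lexical analysis, and the function names (`char_ngrams`, `lexical_tokens`, `build_index`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lexical_tokens(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy longest-match against the lexicon; unmatched characters are skipped."""
    tokens: List[str] = []
    max_len = max(map(len, lexicon), default=0)
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
        else:
            i += 1  # no lexicon entry starts here; move on one character
    return tokens


def build_index(docs: Dict[int, str], lexicon: Set[str], n: int = 2) -> Dict[str, Set[int]]:
    """Build one inverted index from the results of both tokenizations."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in char_ngrams(text, n) + lexical_tokens(text, lexicon):
            index[token].add(doc_id)
    return index
```

Merging both streams into one index means a query can still match through n-grams when a word is missing from the lexicon, while lexicon hits supply whole-word entries.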
Abstract:
Illustrative embodiments provide a computer implemented method, an apparatus, and a computer program product for unsupervised stemming schema learning and lexicon acquisition from corpora. In one illustrative embodiment, the computer implemented method obtains a corpus from corpora, analyzes the corpus to deduce a set of possible stemming schema, and reviews and revises that set to create a pruned set of stemming schema. The computer implemented method further deduces a lexicon from the corpus using the pruned set of stemming schema.
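The deduce, prune, and acquire steps in the abstract above can be illustrated with a deliberately simple suffix-only sketch. Everything specific here is an assumption rather than the patented method: a "schema" is reduced to a single suffix, a candidate suffix counts only when the bare stem is itself attested in the corpus, and pruning keeps suffixes supported by at least `min_support` distinct stems.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def learn_suffix_schema(words: Iterable[str], min_support: int = 2,
                        min_stem_len: int = 3) -> Set[str]:
    """Deduce candidate suffixes from the corpus, then prune weakly supported ones."""
    vocab = set(words)
    stems_by_suffix: Dict[str, Set[str]] = defaultdict(set)
    for word in vocab:
        for cut in range(min_stem_len, len(word)):
            stem, suffix = word[:cut], word[cut:]
            if stem in vocab:  # assumed evidence: the bare stem also occurs
                stems_by_suffix[suffix].add(stem)
    # Pruning step: keep suffixes seen with enough distinct stems.
    return {s for s, stems in stems_by_suffix.items() if len(stems) >= min_support}


def deduce_lexicon(words: Iterable[str], schema: Set[str],
                   min_stem_len: int = 3) -> Set[str]:
    """Strip pruned-schema suffixes to collect the stems that form the lexicon."""
    vocab = set(words)
    lexicon: Set[str] = set()
    for word in vocab:
        stem = word
        for suffix in sorted(schema, key=len, reverse=True):  # longest suffix first
            cut = len(word) - len(suffix)
            if word.endswith(suffix) and cut >= min_stem_len and word[:cut] in vocab:
                stem = word[:cut]
                break
        lexicon.add(stem)
    return lexicon
```

With a toy corpus of `walk`, `walks`, `walked`, `talk`, `talks`, `talked`, `jump`, `jumping`, the suffixes `s` and `ed` survive pruning, whereas `ing` is dropped because only one stem supports it.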
Abstract:
Illustrative embodiments provide a computer implemented method, apparatus, and computer program product for learning word segmentation from non-white space language corpora. In one illustrative embodiment, the computer implemented method receives text input characters and calculates a ratio-measure for each pair of characters in the input characters. The computer implemented method further determines whether the ratio-measure of each pair of characters is less than a predetermined threshold value. Responsive to determining that the ratio-measure is less than the predetermined threshold value and is a local-minimum value, the computer implemented method further identifies the pair as a weak pair and breaks the weak pair of characters.
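The weak-pair test described above might look like the sketch below. The abstract does not define the ratio-measure, so the formula used here (bigram count divided by the product of the two unigram counts) is purely an assumption, as are the restriction to adjacent characters, the treatment of ties as local minima, and the function names.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def train(corpus_lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count single characters and adjacent character pairs in the corpus."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for line in corpus_lines:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams


def ratio_measure(a: str, b: str, unigrams: Counter, bigrams: Counter) -> float:
    """Assumed measure: count(ab) / (count(a) * count(b)); 0.0 for unseen characters."""
    denom = unigrams[a] * unigrams[b]
    return bigrams[(a, b)] / denom if denom else 0.0


def segment(text: str, unigrams: Counter, bigrams: Counter, threshold: float) -> List[str]:
    """Break the text at every pair that is below the threshold and a local minimum."""
    ratios = [ratio_measure(a, b, unigrams, bigrams) for a, b in zip(text, text[1:])]
    pieces: List[str] = []
    start = 0
    for i, r in enumerate(ratios):
        left = ratios[i - 1] if i > 0 else float("inf")
        right = ratios[i + 1] if i + 1 < len(ratios) else float("inf")
        # Weak pair: below the threshold and no larger than either neighbor.
        if r < threshold and r <= left and r <= right:
            pieces.append(text[start:i + 1])
            start = i + 1
    pieces.append(text[start:])
    return pieces
```

Requiring a local minimum as well as a sub-threshold value keeps a run of several low-scoring pairs from being split at every position.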
Abstract:
A method for constructing an automaton for automated analysis of agglutinative languages, the method including constructing an affix automaton for each of a plurality of affix types of an agglutinative language, where each of the affix types is associated with one or more affixes associated with a morphological concept, combining any of the affix automatons to form a plurality of template automatons, where each of the template automatons is patterned after any of a plurality of agglutination templates of any of the affix types for the language, and combining the template automatons into a master automaton.
Abstract:
Constructing an automaton for automated analysis of agglutinative languages comprises: constructing an affix automaton for each of a plurality of affix types of an agglutinative language, where each of the affix types is associated with one or more affixes associated with a morphological concept; combining any of the affix automatons to form a plurality of template automatons, where each of the template automatons is patterned after any of a plurality of agglutination templates of any of the affix types for the language; and combining the template automatons into a master automaton.
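The three-level construction in the two abstracts above (affix automata, template automata, master automaton) can be approximated with regular expressions. This sketch uses Python `re` patterns as a stand-in for explicit finite automata, which is a substitution made here for brevity. The affix types, affixes, and templates are loosely Turkish-like placeholders chosen only for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical affix types, each tied to one morphological concept.
AFFIX_TYPES: Dict[str, List[str]] = {
    "PLURAL": ["lar", "ler"],
    "POSSESSIVE": ["im", "um"],
    "CASE": ["da", "de"],
}

# Hypothetical agglutination templates: allowed orderings of affix types.
TEMPLATES: List[List[str]] = [
    ["PLURAL"],
    ["CASE"],
    ["PLURAL", "CASE"],
    ["PLURAL", "POSSESSIVE", "CASE"],
]


def affix_automaton(affix_type: str) -> str:
    """One automaton per affix type: an alternation over its affixes."""
    alternatives = "|".join(map(re.escape, AFFIX_TYPES[affix_type]))
    return f"(?:{alternatives})"


def template_automaton(template: List[str]) -> str:
    """Concatenate affix automata in the order given by the template."""
    return "".join(affix_automaton(t) for t in template)


def master_automaton(templates: List[List[str]]) -> "re.Pattern[str]":
    """Union of all template automata, preceded by a non-empty stem."""
    union = "|".join(template_automaton(t) for t in templates)
    return re.compile(f"(?P<stem>.+?)(?P<affixes>{union})")


MASTER = master_automaton(TEMPLATES)


def analyze(word: str) -> Optional[Tuple[str, str]]:
    """Return (stem, affix string) if the whole word fits some template, else None."""
    match = MASTER.fullmatch(word)
    return (match.group("stem"), match.group("affixes")) if match else None
```

Here `analyze("evlerimde")` yields `("ev", "lerimde")` via the plural, possessive, case template, and `analyze("ev")` returns `None` because no template matches an unsuffixed word.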