摘要:
A method of grammar learning from a corpus comprises, for the other non-context words, generating frequency vectors for each non-context token in a corpus based upon counted occurrences of a predetermined relationship of the non-context tokens to identified context tokens. Clusters are grown from the frequency vectors according to a lexical correlation among the non-context tokens.
摘要:
A method, system and machine-readable medium are provided for watermarking natural language digital text. A deep structure may be generated and a group of features may be extracted from natural language digital text input. The deep structure may be modified based, at least partly, on a watermark. Natural language digital text output may be generated based on the modified deep structure.
摘要:
A system for recognizing patterns is disclosed. Grammar learning from a corpus includes, for the other non-context words, generating frequency vectors for each non-context token in a corpus based upon counted occurrences of a predetermined relationship of the non-context tokens to identified context tokens. Clusters are grown from the frequency vectors according to a lexical correlation or a cluster tree among the non-context tokens. The cluster tree is used for pattern recognition.
摘要:
A method iteratively integrates clustering techniques with phrase acquisition techniques to build complex linguistic models from a corpus. A set of features is initialized by the corpus. Thereafter, the method determines, according to a predetermined cost function, to process the features by one of phrase clustering processing or phrase grammar learning processing. If phrase clustering processing is performed, the method processes an interstitial set of features comprising both the old features and newly established clusters by phrase grammar learning processing. The features obtained as an output of phrase grammar learning is re-indexed as a set of features for a subsequent iteration. The method may be repeated over several iterations to build a hierarchical linguistic model.
摘要:
A method, system and machine-readable medium are provided for watermarking natural language digital text. A deep structure may be generated and a group of features may be extracted from natural language digital text input. The deep structure may be modified based, at least partly, on a watermark. Natural language digital text output may be generated based on the modified deep structure.
摘要:
In a method of learning grammar from a corpus, context words are identified from a corpus. For the other non-context words, the method counts the occurrence of predetermined relationships which the context words, and maps the counted occurrences to a multidimensional frequency space. Clusters are grown from the frequency vectors. The clusters represent classes of words; words in the same cluster possess the same lexical significancy and provide an indicator of grammatical structure.
摘要:
A method and apparatus for stochastic finite-state machine translation is provided. The method may include receiving a speech input and translating the speech input in a source language into one or more symbols in a target language based on stochastic language model. Subsequently, all possible sequences of the translated symbols may be generated. One of the generated sequences may be selected based on a monolingual target language model.
摘要:
In a method of learning grammar from a corpus, context words are identified from a corpus. For the other non-context words, the method counts the occurrence of predetermined relationships which the context words, and maps the counted occurrences to a multidimensional frequency space. Clusters are grown from the frequency vectors. The clusters represent classes of words; words in the same cluster possess the same lexical significancy and provide an indicator of grammatical structure.
摘要:
Utterance data that includes at least a small amount of manually transcribed data is provided. Automatic speech recognition is performed on ones of the utterance data not having a corresponding manual transcription to produce automatically transcribed utterances. A model is trained using all of the manually transcribed data and the automatically transcribed utterances. A predetermined number of utterances not having a corresponding manual transcription are intelligently selected and manually transcribed. Ones of the automatically transcribed data as well as ones having a corresponding manual transcription are labeled. In another aspect of the invention, audio data is mined from at least one source, and a language model is trained for call classification from the mined audio data to produce a language model.
摘要:
A system and a method are provided. A speech recognition processor receives unconstrained input speech and outputs a string of words. The speech recognition processor is based on a numeric language that represents a subset of a vocabulary. The subset includes a set of words identified as being for interpreting and understanding number strings. A numeric understanding processor contains classes of rules for converting the string of words into a sequence of digits. The speech recognition processor utilizes an acoustic model database. A validation database stores a set of valid sequences of digits. A string validation processor outputs validity information based on a comparison of a sequence of digits output by the numeric understanding processor with valid sequences of digits in the validation database.