摘要:
In order to determine a next event based upon available data, a binary decision tree is constructed having true or false questions at each node and a probability distribution of the unknown next event based upon available data at each leaf. Starting at the root of the tree, the construction process proceeds from node-to-node towards a leaf by answering the question at each node encountered and following either the true or false path depending upon the answer. The questions are phrased in terms of the available data and are designed to provide as much information as possible about the next unknown event. The process is particularly useful in speech recognition when the next word to be spoken is determined on the basis of the previously spoken words.
摘要:
A continuous speech recognition system includes an automatic phonological rules generator which determines variations in the pronunciation of phonemes based on the context in which they occur. This phonological rules generator associates sequences of labels derived from vocalizations of a training text with respective phonemes inferred from the training text. These sequences are then annotated with their pheneme context from the training text and clustered into groups representing similar pronunciations of each phoneme. A decision tree is generated using the context information of the sequences to predict the clusters to which the sequences belong. The training data is processed by the decision tree to divide the sequences into leaf-groups representing similar pronunciations of each phoneme. The sequences in each leaf-group are clustered into sub-groups representing respectively different pronunciations of their corresponding phoneme in a give context. A Markov model is generated for each sub-group. The various Markov models of a leaf-group are combined into a single compound model by assigning common initial and final states to each model. The compound Markov models are used by a speech recognition system to analyze an unknown sequence of labels given its context.
摘要:
In a word, or speech, recognition system for decoding a vocabulary word from outputs selected from an alphabet of outputs in response to a communicated word input wherein each word in the vocabulary is represented by a baseform of at least one probabilistic finite state model and wherein each probabilistic model has transition probability items and output probability items and wherein a value is stored for each of at least some probability items, the present invention relates to apparatus and method for determining probability values for probability items by biassing at least some of the stored values to enhance the likelihood that outputs generated in response to communication of a known word input are produced by the baseform for the known word relative to the respective likelihood of the generated outputs being produced by the baseform for at least one other word. Specifically, the current values of counts --from which probability items are derived--are adjusted by uttering a known word and determining how often probability events occur relative to (a) the model corresponding to the known uttered "correct" word and (b) the model of at least one other "incorrect" word. The current count values are increased based on the event occurrences relating to the correct word and are reduced based on the event occurrences relating to the incorrect word or words.
摘要:
In a speech recognition system, apparatus and method for modelling words with label-based Markov models is disclosed. The modelling includes: entering a first speech input, corresponding to words in a vocabulary, into an acoustic processor which converts each spoken word into a sequence of standard labels, where each standard label corresponds to a sound type assignable to an interval of time; representing each standard label as a probabilistic model which has a plurality of states, at least one transition from a state to a state, and at least one settable output probability at some transitions; entering selected acoustic inputs into an acoustic processor which converts the selected acoustic inputs into personalized labels, each personalized label corresponding to a sound type assigned to an interval of time; and setting each output probability as the probability of the standard label represented by a given model producing a particular personalized label at a given transition in the given model. The present invention addresses the problem of generating models of words simply and automatically in a speech recognition system.
摘要:
Apparatus and method for constructing word baseforms which can be matched against a string of generated acoustic labels. A set of phonetic phone machines are formed, wherein each phone machine has (i) a plurality of states, (ii) a plurality of transitions each of which extends from a state to a state, (iii) a stored probability for each transition, and (iv) stored label output probabilities, each label output probability corresponding to the probability of each phone machine producing a corresponding label. The set of phonetic machines is formed to include a subset of onset phone machines. The stored probabilities of each onset phone macine correspond to at least one phonetic element being uttered at the beginning of a speech segment. The set of phonetic machines is formed to include a subset of trailing phone machines. The stored probabilities of each trailing phone machine correspond to at least one single phonetic element being uttered at the end of a speech segment. Word baseforms are constructed by concatenating phone machines selected from the set.
摘要:
Apparatus and method for synthesizing word baseforms for words not spoken during a training session, wherein each synthesized baseform represents a series of models from a first set of models, which include: (a) uttering speech during a training session and representing the uttered speech as a sequence of models from a second set of models; (b) for each of at least some of the second set models spoken in a given phonetic model context during the training session, storing a respective string of first set models; and (c) constructing a word baseform of first set models for a word not spoken during the training session, including the step of representing each piece of a word that corresponds to a second set model in a given context by the stored respective string, if any, corresponding thereto.
摘要:
In a system that (i) defines each word in a vocabulary by a fenemic baseform of fenemic phones, (ii) defines an alphabet of composite phones each of which corresponds to at least one fenemic phone, and (iii) generates a string of fenemes in response to speech input, the method provides for converting a word baseform comprised of fenemic phones into a stunted word baseform of composite phones by (a) replacing each fenemic phone in the fenemic phone word baseform by the composite phone corresponding thereto; and (b) merging together at least one pair of adjacent composite phones by a single composite phone where the adverse effect of the merging is below a predefined threshold.
摘要:
The present invention relates to apparatus and method for segmenting multiple utterances of a vocabulary word in a consistent and coherent manner and determining a Markov model sequence for each segment. A fenemic Markov model corresponds to each label.
摘要:
In a speech recognition system, discrimination between similar-sounding uttered words is improved by weighting the probability vector data stored for the Markov model representing the reference word sequence of phones. The weighting vector is derived for each reference word by comparing similar sounding utterances using Viterbi alignment and multivariate analysis which maximizes the differences between correct and incorrect recognition multivariate distributions.
摘要:
Speech words are recognized by first recognizing each spectral vector identified by a label (feneme), then identifying the word by matching the string of labels against phones using simplified phone machines based on label and transition probabilities and Merkov chains. In one embodiment, a detailed acoustic match word score is combined with an approximate acoustic match word score to provide a total word score for a subject word. In another embodiment, a polling word score is combined with an acoustic match word score to provide a total word score for a subject word. The acoustic models employed in the acoustic matching may correspond, alternatively, to phonetic elements or to fenemes. Fenemes represent labels generated by an acoustic processor in response to a spoken input. Apparatus and method for determining word scores according to approximate acoustic matching and for determining word scores according to a polling methodology are disclosed.