Abstract:
A method and apparatus for estimating the probability of phones, a-posteriori, in the context of not only the acoustic feature at that time, but also the acoustic features in the vicinity of the current time, and its use in cutting down the search-space in a speech recognition system. The method constructs and uses a decision tree, with the predictors of the decision tree being the vector-quantized acoustic feature vectors at the current time, and in the vicinity of the current time. The process starts with an enumeration of all (predictor, class) events in the training data at the root node, and successively partitions the data at a node according to the most informative split at that node. An iterative algorithm is used to design the binary partitioning. After the construction of the tree is completed, the probability distribution of the predicted class is stored at all of its terminal leaves. The decision tree is used during the decoding process by tracing a path down to one of its leaves, based on the answers to binary questions about the vector-quantized acoustic feature vector at the current time and its vicinity.
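The decoding step described above (tracing a path to a leaf by answering binary questions about vector-quantized labels at and near the current time) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` layout, the clamping at utterance edges, and the toy tree and phone names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Sequence


@dataclass
class Node:
    # Internal node: asks "is the VQ label at time t + offset in label_set?"
    offset: int = 0
    label_set: FrozenSet[int] = frozenset()
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    # Leaf node: stored probability distribution over phones.
    phone_probs: Optional[Dict[str, float]] = None


def phone_posterior(root: Node, labels: Sequence[int], t: int) -> Dict[str, float]:
    """Trace a path from the root to a leaf for time t and return its distribution."""
    node = root
    while node.phone_probs is None:
        # Clamp the index at utterance boundaries (an assumption of this sketch).
        idx = min(max(t + node.offset, 0), len(labels) - 1)
        node = node.yes if labels[idx] in node.label_set else node.no
    return node.phone_probs


# Hypothetical toy tree with two questions and three leaves.
leaf_a = Node(phone_probs={"AA": 0.8, "IY": 0.2})
leaf_b = Node(phone_probs={"AA": 0.3, "IY": 0.7})
leaf_c = Node(phone_probs={"AA": 0.1, "IY": 0.9})
root = Node(
    offset=0,
    label_set=frozenset({1, 2}),
    yes=Node(offset=-1, label_set=frozenset({1}), yes=leaf_a, no=leaf_b),
    no=leaf_c,
)

labels = [1, 2, 5]  # vector-quantized labels, one per time frame
posterior_t1 = phone_posterior(root, labels, 1)
posterior_t2 = phone_posterior(root, labels, 2)
```

A recognizer could then drop phones whose probability in the returned distribution falls below a threshold, which is one way to cut down the search space.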
Abstract:
A speech coding apparatus and method for use in a speech recognition apparatus and method. The value of at least one feature of an utterance is measured during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values. A plurality of prototype vector signals, each having at least one parameter value and a unique identification value, are stored. The feature value of a feature vector signal is compared to the parameter values of the prototype vector signals to obtain prototype match scores for the feature vector signal and each prototype vector signal. The identification value of the prototype vector signal having the best prototype match score is output as a coded representation signal of the feature vector signal. Speaker-dependent prototype vector signals are generated from both synthesized training vector signals and measured training vector signals. The synthesized training vector signals are transformed reference feature vector signals representing the values of features of one or more utterances of one or more speakers in a reference set of speakers. The measured training feature vector signals represent the values of features of one or more utterances of a new speaker/user not in the reference set.
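The coding step above (output the identification value of the best-matching prototype) amounts to nearest-prototype labeling. The sketch below assumes Euclidean distance as the match score and uses made-up prototype IDs and values; the abstract does not specify either.

```python
from math import dist


def encode(feature_vector, prototypes):
    """Return the ID of the prototype whose parameters are closest to the vector.

    prototypes maps an identification value to a parameter vector.
    Euclidean distance stands in for the prototype match score (assumption).
    """
    return min(prototypes, key=lambda pid: dist(feature_vector, prototypes[pid]))


# Hypothetical prototypes and feature vectors.
prototypes = {"P1": (0.0, 0.0), "P2": (1.0, 1.0), "P3": (4.0, 0.0)}
utterance = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.3)]

coded = [encode(v, prototypes) for v in utterance]
```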
Abstract:
A speech coding apparatus and method measures the values of at least first and second different features of an utterance during each of a series of successive time intervals. For each time interval, a feature vector signal has a first component value equal to a first weighted combination of the values of only one feature of the utterance for at least two time intervals. The feature vector signal has a second component value equal to a second weighted combination, different from the first weighted combination, of the values of only one feature of the utterance for at least two time intervals. The resulting feature vector signals for a series of successive time intervals form a coded representation of the utterance. In one embodiment, a first weighted mixture signal has a value equal to a first weighted mixture of the values of the features of the utterance during a single time interval. A second weighted mixture signal has a value equal to a second weighted mixture, different from the first weighted mixture, of the values of the features of the utterance during a single time interval. The first component value of each feature vector signal is equal to a first weighted combination of the values of only the first weighted mixture signals for at least two time intervals, and the second component value of each feature vector signal is equal to a second weighted combination, different from the first weighted combination, of the values of only the second weighted mixture for at least two time intervals.
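As a rough illustration of the first arrangement above, each component of the feature vector can be computed as a weighted combination of one feature's values over neighboring time intervals. The feature names, the three-interval window, the particular weights, and the edge clamping are assumptions chosen for the example.

```python
def combine(track, weights, t):
    """Weighted combination of one feature's values over a window centered at t."""
    half = len(weights) // 2
    total = 0.0
    for k, w in enumerate(weights):
        # Clamp at the ends of the utterance (an assumption of this sketch).
        idx = min(max(t + k - half, 0), len(track) - 1)
        total += w * track[idx]
    return total


# Two hypothetical measured features, one value per time interval.
energy = [1.0, 2.0, 4.0, 8.0]
pitch = [100.0, 110.0, 130.0, 120.0]

smooth = [0.25, 0.5, 0.25]  # first weighted combination
slope = [-0.5, 0.0, 0.5]    # second, different weighted combination

# Each component draws on only one feature, across three time intervals.
vectors = [
    (combine(energy, smooth, t), combine(pitch, slope, t))
    for t in range(len(energy))
]
```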
Abstract:
A speech coding apparatus and method uses classification rules to code an utterance while consuming fewer computing resources. The value of at least one feature of an utterance is measured during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values. The classification rules comprise at least first and second sets of classification rules. The first set of classification rules maps each feature vector signal from a set of all possible feature vector signals to exactly one of at least two disjoint subsets of feature vector signals. The second set of classification rules maps each feature vector signal in a subset of feature vector signals to exactly one of at least two different classes of prototype vector signals. Each class contains a plurality of prototype vector signals. According to the classification rules, a first feature vector signal is mapped to a first class of prototype vector signals. The closeness of the feature value of the first feature vector signal is compared to the parameter values of only the prototype vector signals in the first class of prototype vector signals to obtain prototype match scores for the first feature vector signal and each prototype vector signal in the first class. At least the identification value of at least the prototype vector signal having the best prototype match score is output as a coded utterance representation signal of the first feature vector signal.
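The two-stage scheme above can be sketched as two small rule functions followed by a search restricted to one class of prototypes. The thresholds, class names, and prototype values below are hypothetical, and Euclidean distance is assumed as the match score.

```python
from math import dist


def first_rules(v):
    """First rule set: map any vector to exactly one of two disjoint subsets."""
    return "low" if v[0] < 2.0 else "high"


def second_rules(subset, v):
    """Second rule set: map a vector within a subset to one prototype class."""
    if subset == "low":
        return "A" if v[1] < 0.0 else "B"
    return "C" if v[1] < 0.0 else "D"


# Hypothetical classes, each holding several prototypes.
classes = {
    "A": {"a1": (0.0, -1.0), "a2": (1.0, -2.0)},
    "B": {"b1": (0.0, 1.0), "b2": (1.5, 2.0)},
    "C": {"c1": (3.0, -1.0), "c2": (5.0, -1.0)},
    "D": {"d1": (3.0, 1.0), "d2": (5.0, 2.0)},
}


def encode(v):
    """Score only the prototypes in the class selected by the rules."""
    cls = second_rules(first_rules(v), v)
    candidates = classes[cls]
    best = min(candidates, key=lambda pid: dist(v, candidates[pid]))
    return cls, best


result_low = encode((1.2, 1.8))
result_high = encode((4.6, -0.5))
```

In this toy setup each vector is compared against two prototypes instead of eight, which is where the savings in computation would come from.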
Abstract:
Symbol feature values and contextual feature values of each event in a training set of events are measured. At least two pairs of complementary subsets of observed events are selected. In each pair of complementary subsets of observed events, one subset has contextual features with values in a set $C_n$, and the other subset has contextual features with values in a set $\bar{C}_n$, where the sets $C_n$ and $\bar{C}_n$ are complementary sets of contextual feature values. For each subset of observed events, the similarity values of the symbol features of the observed events in the subset are calculated. For each pair of complementary subsets of observed events, a "goodness of fit" is the sum of the symbol feature value similarities of the subsets. The sets of contextual feature values associated with the subsets of observed events having the best "goodness of fit" are identified and form context-dependent bases for grouping the observed events into two output sets.
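The split selection above can be illustrated with a brute-force search over candidate context sets. The similarity measure here (negative squared deviation from the subset mean) is an assumption, as is the exhaustive enumeration and the toy event data; the abstract names neither.

```python
from itertools import combinations


def similarity(values):
    """Similarity of symbol feature values; higher means more alike (assumed measure)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return -sum((x - mean) ** 2 for x in values)


def best_split(events):
    """events: list of (context_value, symbol_value).

    Returns (goodness_of_fit, context_set) for the best complementary split.
    """
    contexts = sorted({c for c, _ in events})
    best = None
    for r in range(1, len(contexts)):
        for subset in combinations(contexts, r):
            c_n = set(subset)
            inside = [s for c, s in events if c in c_n]
            outside = [s for c, s in events if c not in c_n]
            # Goodness of fit: sum of the similarities of the two subsets.
            fit = similarity(inside) + similarity(outside)
            if best is None or fit > best[0]:
                best = (fit, c_n)
    return best


# Hypothetical events: (contextual feature value, symbol feature value).
events = [("p", 1.0), ("b", 1.2), ("s", 5.0), ("z", 5.2)]
fit, context_set = best_split(events)
```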
Abstract:
Modeling a word is done by concatenating a series of elemental models to form a word model. At least one elemental model in the series is a composite elemental model formed by combining the starting states of at least first and second primitive elemental models. Each primitive elemental model represents a speech component. The primitive elemental models are combined by a weighted combination of their parameters in proportion to the values of the weighting factors. To tailor the word model to closely represent variations in the pronunciation of the word, the word is uttered a plurality of times by a plurality of different speakers. Constructing word models from composite elemental models, and constructing composite elemental models from primitive elemental models enables word models to represent many variations in the pronunciation of a word. Providing a relatively small set of primitive elemental models for a relatively large vocabulary of words enables models to be trained to the voice of a new speaker by having the new speaker utter only a small subset of the words in the vocabulary.
Abstract:
A speech coding apparatus and method uses a hierarchy of prototype sets to code an utterance while consuming fewer computing resources. The value of at least one feature of an utterance is measured during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values. A plurality of level subsets of prototype vector signals is computed, wherein each prototype vector signal in a higher level subset is associated with at least one prototype vector signal in a lower level subset. Each level subset contains a plurality of prototype vector signals, with lower level subsets containing more prototypes than higher level subsets. The closeness of the feature value of the first feature vector signal is compared to the parameter values of prototype vector signals in the first level subset of prototype vector signals to obtain a ranked list of prototype match scores for the first feature vector signal and each prototype vector signal in the first level subset. The closeness of the feature value of the first feature vector signal is compared to the parameter values of each prototype vector signal in a second (lower) level subset that is associated with the highest ranking prototype vectors in the first level subset, to obtain a second ranked list of prototype match scores. The identification value of the prototype vector signal in the second ranked list having the best prototype match score is output as a coded utterance representation signal of the first feature vector signal.
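A two-level version of the hierarchical search above might look like the following sketch. The prototype IDs and values, the `keep` parameter controlling how many top-ranked higher-level prototypes are expanded, and the use of Euclidean distance as the match score are all assumptions.

```python
from math import dist

# Hypothetical higher-level prototypes and their associated lower-level prototypes.
top = {"T1": (0.0, 0.0), "T2": (10.0, 10.0)}
children = {
    "T1": {"L1": (-1.0, 0.0), "L2": (1.0, 0.5)},
    "T2": {"L3": (9.0, 10.0), "L4": (11.0, 11.0)},
}


def encode(v, keep=1):
    """Rank the first-level prototypes, then search only the children of the best ones."""
    ranked = sorted(top, key=lambda pid: dist(v, top[pid]))
    candidates = {}
    for pid in ranked[:keep]:
        candidates.update(children[pid])
    return min(candidates, key=lambda pid: dist(v, candidates[pid]))


label_a = encode((0.8, 0.4))
label_b = encode((9.2, 9.9))
```

Raising `keep` trades extra distance computations for a lower chance of missing the globally closest lower-level prototype.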
Abstract:
In a speech recognition system, the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances. The speech recognition system models the posterior probability of linguistic units relevant to speech recognition using a log-linear model. The posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model. The posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. Log-linear models are used with features derived from sparse or incomplete data. The speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features. Not all features used in training need to appear in testing/recognition.
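A log-linear posterior of the kind described above scores each candidate unit by a weighted sum of its active features and normalizes with a softmax. The sketch below uses binary indicator features and invented feature names and weights; these are assumptions, not details taken from the abstract.

```python
from math import exp


def posterior(candidates, active_features, weights):
    """P(unit | features) under a log-linear model with binary indicator features.

    weights maps (unit, feature) pairs to trained parameters. A feature that is
    absent at recognition time simply contributes nothing to the score.
    """
    scores = {
        unit: sum(weights.get((unit, f), 0.0) for f in active_features)
        for unit in candidates
    }
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {u: exp(s - m) for u, s in scores.items()}
    z = sum(exps.values())
    return {u: e / z for u, e in exps.items()}


# Hypothetical trained weights; "high_pitch_end" is not observed below.
weights = {
    ("yes", "high_pitch_end"): 1.2,
    ("yes", "short_duration"): 0.4,
    ("no", "short_duration"): 0.9,
}
probs = posterior(["yes", "no"], ["short_duration"], weights)
```

Because the score is a plain sum over whichever features fire, the features may overlap or be statistically dependent without changing the form of the model.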
Abstract:
A method of automatically aligning a written transcript with speech in video and audio clips. The disclosed technique involves as a basic component an automatic speech recognizer. The automatic speech recognizer decodes speech (recorded on a tape) and produces a file with a decoded text. This decoded text is then matched with the original written transcript via identification of similar words or clusters of words. The result of this matching is an alignment of the speech with the original transcript. The method can be used (a) to create indexing of video clips, (b) for "teleprompting" (i.e., showing the next portion of text when someone is reading from a television screen), or (c) to enhance editing of a text that was dictated to a stenographer or recorded on a tape for its subsequent textual reproduction by a typist.
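The matching step above can be approximated with Python's standard `difflib.SequenceMatcher`, which finds runs of identical words shared by the transcript and the decoded text. This is a stand-in technique rather than the method of the disclosure, and the decoded words and timestamps are invented for the example.

```python
from difflib import SequenceMatcher


def align(transcript_words, decoded):
    """Map transcript word indices to times using matching runs of words.

    decoded is a list of (word, start_seconds) pairs from a recognizer.
    Transcript words with no match in the decoded text stay unaligned.
    """
    decoded_words = [w for w, _ in decoded]
    matcher = SequenceMatcher(a=transcript_words, b=decoded_words, autojunk=False)
    times = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.a + k] = decoded[block.b + k][1]
    return times


transcript = "the quick brown fox jumps over the lazy dog".split()
# Hypothetical recognizer output with two errors ("crown", "a").
decoded = [
    ("the", 0.0), ("quick", 0.4), ("crown", 0.8), ("fox", 1.2), ("jumps", 1.6),
    ("over", 2.0), ("a", 2.3), ("lazy", 2.5), ("dog", 2.9),
]
times = align(transcript, decoded)
```

Misrecognized words leave gaps in the mapping; a fuller system could interpolate times for them from the aligned neighbors.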