Abstract:
A first speech recognition method receives an acoustic description of an utterance to be recognized and scores a portion of that description against each of a plurality of cluster models representing similar sounds from different words. The resulting score for each cluster is used to calculate a word score for each word represented by that cluster. Preferably these word scores are used to prefilter vocabulary words, and the description of the utterance includes a succession of acoustic descriptions which are compared by linear time alignment against a succession of acoustic models. A second speech recognition method is also provided which matches an acoustic model with each of a succession of acoustic descriptions of an utterance to be recognized. Each of these models has a probability score for each vocabulary word. The probability scores for each word associated with the matching acoustic models are combined to form a total score for that word. The preferred speech recognition method calculates two separate word scores for each currently active vocabulary word from a common succession of sounds. Preferably the first score is calculated by a time alignment method, while the second score is calculated by a time-independent method. Preferably this calculation of two separate word scores is used in one of multiple word-selecting phases of a recognition process, such as in the prefiltering phase.
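The second, time-independent method can be illustrated with a minimal Python sketch. It assumes that each acoustic model is summarized by a mean vector, that "matching" means choosing the nearest mean by squared Euclidean distance, and that the per-word scores are negative log probabilities combined by addition. None of these details come from the abstract, and all names are hypothetical.

```python
def nearest_model(frame, model_means):
    """Return the name of the model whose mean is closest to the frame.

    Assumption: squared Euclidean distance stands in for the matching step.
    """
    return min(
        model_means,
        key=lambda name: sum((f - m) ** 2 for f, m in zip(frame, model_means[name])),
    )


def total_word_scores(frames, model_means, word_scores):
    """Sum, per word, the scores attached to the model matched by each frame.

    word_scores[model][word] is assumed to be a negative log probability,
    so lower totals indicate more likely words. Frame order does not matter,
    which is what makes the score time independent.
    """
    totals = {}
    for frame in frames:
        matched = nearest_model(frame, model_means)
        for word, score in word_scores[matched].items():
            totals[word] = totals.get(word, 0.0) + score
    return totals
```

Because the totals depend only on which models were matched and not on their order, such a score could plausibly complement a time-aligned score during prefiltering.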
Abstract:
A method is provided for deriving acoustic word representations for use in speech recognition. Initial word models are created, each formed of a sequence of acoustic sub-models. The acoustic sub-models from a plurality of word models are clustered, so as to group acoustically similar sub-models from different words, using, for example, the Kullback-Leibler information as a metric of similarity. Then each word is represented by a cluster spelling representing the clusters into which its acoustic sub-models were placed by the clustering. Speech recognition is performed by comparing sequences of frames from speech to be recognized against sequences of acoustic models associated with the clusters of the cluster spelling of individual word models. The invention also provides a method for deriving a word representation which involves receiving a first set of frame sequences for a word, using dynamic programming to derive a corresponding initial sequence of probabilistic acoustic sub-models for the word independently of any previously derived acoustic model particular to the word, using dynamic programming to time align each of a second set of frame sequences for the word into a succession of new sub-sequences corresponding to the initial sequence of models, and using these new sub-sequences to calculate new probabilistic sub-models.
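The idea of a cluster spelling can be sketched in Python as follows. This is only an illustration under stated assumptions: each sub-model is reduced to a discrete probability distribution with strictly positive entries, similarity is the symmetric Kullback-Leibler divergence, and clustering is a simple greedy threshold pass. The abstract does not specify the clustering procedure, so the greedy pass and all names are hypothetical.

```python
import math


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler divergence, KL(p||q) + KL(q||p).

    Assumes p and q are same-length distributions with positive entries.
    """
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def cluster_spellings(word_models, threshold):
    """Assign every sub-model to a cluster and spell each word by cluster ids.

    Assumption: a sub-model joins the first cluster whose representative
    (its first member) lies within `threshold`; otherwise it starts a new
    cluster.
    """
    representatives = []
    spellings = {}
    for word, sub_models in word_models.items():
        spelling = []
        for dist in sub_models:
            for cluster_id, rep in enumerate(representatives):
                if symmetric_kl(dist, rep) < threshold:
                    spelling.append(cluster_id)
                    break
            else:
                representatives.append(dist)
                spelling.append(len(representatives) - 1)
        spellings[word] = spelling
    return spellings
```

Words that share a cluster id at some position then share one acoustic model for that sound, which is the point of grouping similar sub-models from different words.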
Abstract:
A method determines if a portion of speech corresponds to a speech pattern by time aligning both the speech and a plurality of speech pattern models against a common time-aligning model. This compensates for speech variation between the speech and the pattern models. The method then compares the resulting time-aligned speech model against the resulting time-aligned pattern models to determine which of the patterns most probably corresponds to the speech. Preferably there are a plurality of time-aligning models, each representing a group of somewhat similar sound sequences which occur in different words. Each of these time-aligning models is scored for similarity against a portion of speech, and the time-aligned speech model and time-aligned pattern models produced by time alignment with the best scoring time-aligning model are compared to determine the likelihood that each speech pattern corresponds to the portion of speech. This is performed for each successive portion of speech. When a portion of speech appears to correspond to a given speech pattern model, a range of likely start times is calculated for the vocabulary word associated with that model, and a word score is calculated to indicate the likelihood of that word starting in that range. The method uses a more computationally intensive comparison between the speech and selected vocabulary words, so as to more accurately determine which words correspond with which portions of the speech. When this more intensive comparison indicates the ending of a word at a given point in the speech, the method selects the best scoring vocabulary words whose range of start times overlaps that ending time, and performs the computationally intensive comparison on those selected words starting at that point in the speech.
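The final selection step, choosing the best scoring words whose range of start times overlaps a detected word ending, can be sketched in Python. The data layout (a dictionary mapping each word to an earliest start, a latest start, and a score), the convention that higher scores are better, and the `top_n` cutoff are all assumptions made for illustration.

```python
def select_candidates(start_ranges, end_time, top_n):
    """Pick the best scoring words whose start-time range covers end_time.

    start_ranges maps word -> (earliest_start, latest_start, score).
    Assumption: higher scores mean a more likely word; ties are broken
    alphabetically so the result is deterministic.
    """
    eligible = [
        (score, word)
        for word, (earliest, latest, score) in start_ranges.items()
        if earliest <= end_time <= latest
    ]
    eligible.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in eligible[:top_n]]
```

Only the words returned here would then be passed to the more computationally intensive comparison starting at `end_time`.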
Abstract:
A method of speech analysis calculates one or more difference parameters for each of a sequence of acoustic frames, where each difference parameter is a function of the difference between an acoustic parameter in one frame and an acoustic parameter in a nearby frame. The method is used in speech recognition which compares the difference parameters of each frame against acoustic models representing speech units, where each speech-unit model has a model of the difference parameters associated with the frames of its speech unit. The difference parameters can be slope parameters or energy difference parameters. Slope parameters are derived by finding the difference between the energy of a given spectral parameter of a given frame and the energy, in a nearby frame, of a spectral parameter associated with a different frequency band. The resulting parameter indicates the extent to which the frequency of energy in the part of the spectrum represented by the given parameter is going up or going down. Energy difference parameters are calculated as a function of the difference between a given spectral parameter in one frame and a spectral parameter in a nearby frame representing the same frequency band. In one embodiment of the invention, dynamic programming compares the difference parameters of a sequence of frames to be recognized against a sequence of dynamic programming elements associated with each of a plurality of speech-unit models. In another embodiment of the invention, each speech-unit model represents one phoneme, and the speech-unit models for a plurality of phonemes are compared against individual frames, to associate with each such frame the one or more phonemes whose models compare most closely with it.
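The two kinds of difference parameters can be illustrated with a short Python sketch. It assumes each frame is a list of band energies ordered from low to high frequency, that the "nearby frame" is `lag` frames away, and that the slope parameter compares a band with the next higher band. The abstract leaves these choices open, so they are illustrative only.

```python
def energy_differences(frames, lag=1):
    """Difference between each band's energy and the same band `lag` frames earlier."""
    return [
        [cur - prev for cur, prev in zip(frames[t], frames[t - lag])]
        for t in range(lag, len(frames))
    ]


def slope_parameters(frames, lag=1, band_shift=1):
    """Difference between a band's energy and a different band in a nearby frame.

    Assumption: band k of frame t is compared with band k + band_shift of
    frame t + lag.
    """
    slopes = []
    for t in range(len(frames) - lag):
        current, nearby = frames[t], frames[t + lag]
        slopes.append(
            [current[k] - nearby[k + band_shift] for k in range(len(current) - band_shift)]
        )
    return slopes
```

In a recognizer these values would be appended to, or used alongside, the ordinary spectral parameters of each frame before comparison with the speech-unit models.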
Abstract:
According to certain aspects of the invention, a mobile voice communication device includes a wireless transceiver circuit for transmitting and receiving auditory information and data, a processor, and a memory storing executable instructions which, when executed on the processor, cause the mobile voice communication device to provide a selectable personality associated with a user interface to a user of the mobile voice communication device. The executable instructions include implementing on the device a user interface that employs different user prompts having the selectable personality, wherein each selectable personality of the different user prompts is defined and mapped to data stored in at least one database in the mobile voice communication device. The mobile voice communication device may include a decoder that recognizes a spoken user input and provides a corresponding recognized word, and a speech synthesizer that synthesizes a word corresponding to the recognized word. The device includes user-selectable personalities that are either transmitted wirelessly to the device, transmitted through a computer interface, or provided as memory cards to the device.
Abstract:
A method of constructing a list of alternate transcripts from a recognized transcript includes generating a list of close call records, matching partial sub-histories from the recognized transcript with one of the history pairs stored in each of the records, and substituting the other of the history pairs for the partial sub-history of the recognized transcript. A close call record is generated each time a pair of partial hypotheses attempts to seed a common word. Each close call record includes history information and scoring information associated with a particular pair of partial hypotheses seeding a common word. Alternate transcripts are constructed by substituting close call histories for partial histories of the recognized transcript, and also by substituting close call histories for partial histories of other alternate transcripts.
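The substitution step can be sketched in Python. The sketch assumes a history is a non-empty sequence of words, that matching means finding one history of a pair as a contiguous run inside a transcript, and that the scoring information is a single number. The record layout, the `score_gap` field, and the `max_rounds` limit are hypothetical.

```python
from collections import namedtuple

# Hypothetical record: two competing histories plus one scoring value.
CloseCall = namedtuple("CloseCall", ["history_a", "history_b", "score_gap"])


def find_run(sequence, run):
    """Return the index of the first occurrence of `run` in `sequence`, or -1."""
    for i in range(len(sequence) - len(run) + 1):
        if sequence[i:i + len(run)] == run:
            return i
    return -1


def alternate_transcripts(transcript, records, max_rounds=2):
    """Build alternates by swapping one close call history for the other.

    Later rounds apply the same substitutions to alternates found in
    earlier rounds, so alternates can be derived from other alternates.
    """
    seen = {tuple(transcript)}
    frontier = [tuple(transcript)]
    alternates = []
    for _ in range(max_rounds):
        next_frontier = []
        for current in frontier:
            for rec in records:
                pairs = ((rec.history_a, rec.history_b), (rec.history_b, rec.history_a))
                for source, target in pairs:
                    start = find_run(current, tuple(source))
                    if start < 0:
                        continue
                    alt = current[:start] + tuple(target) + current[start + len(source):]
                    if alt not in seen:
                        seen.add(alt)
                        alternates.append(alt)
                        next_frontier.append(alt)
        frontier = next_frontier
    return [list(alt) for alt in alternates]
```

A fuller version would presumably use the stored scoring information to rank the alternates, which this sketch leaves out.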
Abstract:
In one aspect, a method of processing a voice signal to extract information to facilitate training a speech synthesis model is provided. The method comprises acts of detecting a plurality of candidate features in the voice signal, performing at least one comparison between one or more combinations of the plurality of candidate features and the voice signal, and selecting a set of features from the plurality of candidate features based, at least in part, on the at least one comparison. In another aspect, the method is performed by executing a program encoded on a computer readable medium. In another aspect, a speech synthesis model is provided by, at least in part, performing the method.
Abstract:
Statistics are measured from an initial portion of a speech utterance. Feature normalization parameters are estimated based on the measured statistics and a statistically derived mapping relating measured statistics and feature normalization parameters.
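One way to picture this is the Python sketch below. It assumes the measured statistic is the per-dimension mean of the first few frames, that the statistically derived mapping is a per-dimension linear function fitted offline, and that normalization subtracts the estimated offset from every frame. The abstract does not state the form of the statistics or of the mapping, so all of these are assumptions.

```python
def initial_mean(frames, n_initial):
    """Per-dimension mean of the first n_initial frames of the utterance."""
    head = frames[:n_initial]
    dims = len(head[0])
    return [sum(frame[d] for frame in head) / len(head) for d in range(dims)]


def estimate_offsets(stats, slopes, intercepts):
    """Map measured statistics to normalization offsets.

    Assumption: the mapping is linear per dimension, with slopes and
    intercepts derived beforehand from training data.
    """
    return [a * s + b for s, a, b in zip(stats, slopes, intercepts)]


def normalize(frames, offsets):
    """Subtract the estimated offset from every frame of the utterance."""
    return [[x - o for x, o in zip(frame, offsets)] for frame in frames]
```

The appeal of such a scheme is that normalization can begin after only a short initial portion of the utterance rather than after the whole utterance has been observed.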
Abstract:
A method of generating an alternative pronunciation for a word or phrase, given an initial pronunciation and a spoken example of the word or phrase, includes providing the initial pronunciation of the word or phrase, and generating the alternative pronunciation by searching a neighborhood of pronunciations about the initial pronunciation via a constrained hypothesis, wherein the neighborhood includes pronunciations that differ from the initial pronunciation by at most one phoneme. The method further includes selecting a highest scoring pronunciation within the neighborhood of pronunciations.
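The neighborhood search can be sketched in Python. The sketch assumes that "differ by at most one phoneme" covers one substitution, one insertion, or one deletion, and that the spoken example is represented by a caller-supplied `score` function where higher is better. Both assumptions, along with the phoneme inventory passed in, are illustrative rather than taken from the abstract.

```python
def one_phoneme_neighborhood(pronunciation, inventory):
    """All pronunciations within one phoneme edit of the initial one.

    Includes the initial pronunciation itself. Empty pronunciations are
    discarded.
    """
    base = tuple(pronunciation)
    neighbors = {base}
    for i in range(len(base)):
        neighbors.add(base[:i] + base[i + 1:])  # delete phoneme i
        for phoneme in inventory:
            neighbors.add(base[:i] + (phoneme,) + base[i + 1:])  # substitute
    for i in range(len(base) + 1):
        for phoneme in inventory:
            neighbors.add(base[:i] + (phoneme,) + base[i:])  # insert before i
    neighbors.discard(())
    return neighbors


def best_pronunciation(pronunciation, inventory, score):
    """Return the highest scoring pronunciation in the neighborhood.

    `score` is a hypothetical stand-in for scoring a candidate against
    the spoken example.
    """
    return max(one_phoneme_neighborhood(pronunciation, inventory), key=score)
```

For an initial pronunciation of length n and an inventory of size k, the neighborhood holds at most 1 + n + nk + (n + 1)k candidates, which keeps the search small compared with an unconstrained phoneme search.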
Abstract:
A method of operating a mobile communication device having a set of one or more applications, each with its own associated user-configurable customization, the method comprising detecting whether the user-configurable customization of any of the applications has changed since an earlier time, and for all applications for which the user-configurable customization has changed since said earlier time, wirelessly transmitting those changes to a remote server. The method further comprises maintaining a set of flags indicating whether changes have occurred to the user-configurable customization, wherein detecting whether the user-configurable customization of any of the applications has changed since said earlier time includes reading the set of flags. The remote server is one of a carrier server and a third party provider server.
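The flag-based change detection can be sketched in Python. The class below is a minimal illustration: the class name, the per-application settings dictionary, and the `send` callable that stands in for wireless transmission to the remote server are all hypothetical.

```python
class CustomizationTracker:
    """Tracks per-application settings with one changed flag per application."""

    def __init__(self, app_names):
        self.settings = {name: {} for name in app_names}
        self.changed = {name: False for name in app_names}

    def update(self, app, key, value):
        """Store a setting and raise the application's flag if it changed."""
        current = self.settings[app]
        if key not in current or current[key] != value:
            current[key] = value
            self.changed[app] = True

    def sync(self, send):
        """Transmit settings of flagged applications, then clear their flags.

        `send(app, settings)` is a hypothetical stand-in for the wireless
        transmission. Returns the names of the applications that were sent.
        """
        sent = []
        for app, flag in self.changed.items():
            if flag:
                send(app, dict(self.settings[app]))
                self.changed[app] = False
                sent.append(app)
        return sent
```

Reading the flags avoids comparing full settings snapshots on every synchronization, so unchanged applications cost nothing to skip.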