摘要:
Disclosed herein are methods, systems, and computer-readable storage media for automatic speech recognition. The method includes selecting a speaker independent model, and selecting a quantity of speaker dependent models, the quantity of speaker dependent models being based on available computing resources, the selected models including the speaker independent model and the quantity of speaker dependent models. The method also includes recognizing an utterance using each of the selected models in parallel, and selecting a dominant speech model from the selected models based on recognition accuracy using the group of selected models. The system includes a processor and modules configured to control the processor to perform the method. The computer-readable storage medium includes instructions for causing a computing device to perform the steps of the method.
摘要:
Disclosed herein are systems, computer-implemented methods, and computer-readable storage media for handling expected repeat speech queries or other inputs. The method causes a computing device to detect a misrecognized speech query from a user, determine a tendency of the user to repeat speech queries based on previous user interactions, and adapt a speech recognition model based on the determined tendency before an expected repeat speech query. The method can further include recognizing the expected repeat speech query from the user based on the adapted speech recognition model. Adapting the speech recognition model can include modifying an acoustic model, a language model, and a semantic model. Adapting the speech recognition model can also include preparing a personalized search speech recognition model for the expected repeat query based on usage history and entries in a recognition lattice. The method can include retaining unmodified speech recognition models with adapted speech recognition models.
摘要:
A system and method for performing speech recognition is disclosed. The method comprises receiving an utterance, applying the utterance to a recognizer with a language model having pronunciation probabilities associated with unique word identifiers for words given their pronunciations and presenting a recognition result for the utterance. Recognition improvement is found by moving a pronunciation model from a dictionary to the langue model.
摘要:
A method is disclosed for applying a multi-state barge-in acoustic model in a spoken dialogue system. The method includes receiving an audio speech input from the user during the presentation of a prompt, accumulating the audio speech input from the user, applying a non-speech component having at least two one-state Hidden Markov Models (HMMs) to the audio speech input from the user, applying a speech component having at least five three-state HMMs to the audio speech input from the user, in which each of the five three-state HMMs represents a different phonetic category, determining whether the audio speech input is a barge-in-speech input from the user, and if the audio speech input is determined to be the barge-in-speech input from the user, terminating the presentation of the prompt.
摘要:
Disclosed are systems and methods for training a barge-in-model for speech processing in a spoken dialogue system comprising the steps of (1) receiving an input having at least one speech segment and at least one non-speech segment, (2) establishing a restriction of recognizing only speech states during speech segments of the input and non-speech states during non-speech segments of the input, (2) generating a hypothesis lattice by allowing any sequence of speech Hidden Markov Models (HMMs) and non-speech HMMs, (4) generating a reference lattice by only allowing speech HMMs for at least one speech segment and non-speech HMMs for at least one non-speech segment, wherein different iterations of training generates at least one different reference lattice and at least one reference transcription, and (5) employing the generated reference lattice as the barge-in-model for speech processing.
摘要:
Disclosed herein are systems, computer-implemented methods, and computer-readable storage media for handling expected repeat speech queries or other inputs. The method causes a computing device to detect a misrecognized speech query from a user, determine a tendency of the user to repeat speech queries based on previous user interactions, and adapt a speech recognition model based on the determined tendency before an expected repeat speech query. The method can further include recognizing the expected repeat speech query from the user based on the adapted speech recognition model. Adapting the speech recognition model can include modifying an acoustic model, a language model, and/or a semantic model. Adapting the speech recognition model can also include preparing a personalized search speech recognition model for the expected repeat query based on usage history and entries in a recognition lattice. The method can include retaining unmodified speech recognition models with adapted speech recognition models.
摘要:
A method of correlating received communication data with operational communication characteristics is provided. The method includes receiving audible input from a source in a communication over a communications network, recording the received audible input, and transcribing the recorded audible input into a transcript. The method further includes outputting the transcript, specifying features of the transcript to be analyzed, specifying and recording operational communication characteristics particular to the communication, analyzing the transcript for the specified features to identify patterns associated with the audible input, computing statistical correlations of the identified patterns with the operational communication characteristics, and outputting results of the computed statistical correlations on a user interface.
摘要:
Disclosed herein are systems, computer-implemented methods, and computer-readable media for speech recognition. The method includes receiving speech utterances, assigning a pronunciation weight to each unit of speech in the speech utterances, each respective pronunciation weight being normalized at a unit of speech level to sum to 1, for each received speech utterance, optimizing the pronunciation weight by (1) identifying word and phone alignments and corresponding likelihood scores, and (2) discriminatively adapting the pronunciation weight to minimize classification errors, and recognizing additional received speech utterances using the optimized pronunciation weights. A unit of speech can be a sentence, a word, a context-dependent phone, a context-independent phone, or a syllable. The method can further include discriminatively adapting pronunciation weights based on an objective function. The objective function can be maximum mutual information (MMI), maximum likelihood (MLE) training, minimum classification error (MCE) training, or other functions known to those of skill in the art. Speech utterances can be names. The speech utterances can be received as part of a multimodal search or input. The step of discriminatively adapting pronunciation weights can further include stochastically modeling pronunciations.
摘要:
A method and system for training an automatic speech recognition system are provided. The method includes separating training data into speaker specific segments, and for each speaker specific segment, performing the following acts: generating spectral data, selecting a first warping factor and warping the spectral data, and comparing the warped spectral data with a speech model. The method also includes iteratively performing the steps of selecting another warping factor and generating another warped spectral data, comparing the other warped spectral data with the speech model, and if the other warping factor produces a closer match to the speech model, saving the other warping factor as the best warping factor for the speaker specific segment. The system includes modules configured to control a processor in the system to perform the steps of the method.
摘要:
Disclosed are systems, methods and computer readable media for training acoustic models for an automatic speech recognition systems (ASR) system. The method includes receiving a speech signal, defining at least one syllable boundary position in the received speech signal, based on the at least one syllable boundary position, generating for each consonant in a consonant phoneme inventory a pre-vocalic position label and a post-vocalic position label to expand the consonant phoneme inventory, reformulating a lexicon to reflect an expanded consonant phoneme inventory, and training a language model for an automated speech recognition (ASR) system based on the reformulated lexicon.