Abstract:
Disclosed herein are systems, computer-implemented methods, and tangible computer-readable media for handling missing speech data. The computer-implemented method includes receiving speech with a missing segment, generating a plurality of hypotheses for the missing segment, identifying a best hypothesis for the missing segment, and recognizing the received speech by inserting the identified best hypothesis for the missing segment. In another method embodiment, the final step is replaced with synthesizing the received speech by inserting the identified best hypothesis for the missing segment. In one aspect, the method further includes identifying a duration for the missing segment and generating the plurality of hypotheses of the identified duration for the missing segment. The step of identifying the best hypothesis for the missing segment can be based on speech context, a pronouncing lexicon, and/or a language model. Each hypothesis can have an identical acoustic score.
Abstract:
In an embodiment, a method of providing modified media content is disclosed and includes receiving media content that includes audio data and video data having a first number of video frames. The method also includes generating abstracted media content that includes portions of the video data and audio elements of the audio data, where the abstracted media content includes less than all of the video data and includes fewer video frames than the first number of video frames.
Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for assigning saliency weights to words of an ASR model. The saliency values assigned to words within an ASR model are based on human perception judgments of previous transcripts. These saliency values are applied as weights to modify an ASR model such that the results of the weighted ASR model in converting a spoken document to a transcript provide a more accurate and useful transcription to the user.
Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for assigning saliency weights to words of an ASR model. The saliency values assigned to words within an ASR model are based on human perception judgments of previous transcripts. These saliency values are applied as weights to modify an ASR model such that the results of the weighted ASR model in converting a spoken document to a transcript provide a more accurate and useful transcription to the user.
Abstract:
Disclosed herein are systems, methods, and computer-readable storage media for training adaptation-specific acoustic models. A system practicing the method receives speech and generates a full size model and a reduced size model, the reduced size model starting with a single distribution for each speech sound in the received speech. The system finds speech segment boundaries in the speech using the full size model and adapts features of the speech data using the reduced size model based on the speech segment boundaries and an overall centroid for each speech sound. The system then recognizes speech using the adapted features of the speech. The model can be a Hidden Markov Model (HMM). The reduced size model can also be of a reduced complexity, such as having fewer mixture components than a model of full complexity. Adapting features of speech can include moving the features closer to an overall feature distribution center.
Abstract:
A system and method for performing speech recognition is disclosed. The method comprises receiving an utterance, applying the utterance to a recognizer with a language model having pronunciation probabilities associated with unique word identifiers for words given their pronunciations and presenting a recognition result for the utterance. Recognition improvement is found by moving a pronunciation model from a dictionary to the language model.
Abstract:
A method of providing modified media content is disclosed that includes providing media content to a destination device via a network, where the media content comprises video data and audio data have a first viewing rate. The method further includes receiving data indicating a selection of a second viewing rate via the network and modifying the media content to produce modified media content having approximately the second viewing rate. The modified media content includes modified video data and modified audio data synchronized at approximately the second viewing rate.
Abstract:
A system and method for performing speech recognition is disclosed. The method comprises receiving an utterance, applying the utterance to a recognizer with a language model having pronunciation probabilities associated with unique word identifiers for words given their pronunciations and presenting a recognition result for the utterance. Recognition improvement is found by moving a pronunciation model from a dictionary to the language model.
Abstract:
Disclosed herein are systems, computer-implemented methods, and computer-readable media for speech recognition. The method includes receiving speech utterances, assigning a pronunciation weight to each unit of speech in the speech utterances, each respective pronunciation weight being normalized at a unit of speech level to sum to 1, for each received speech utterance, optimizing the pronunciation weight by (1) identifying word and phone alignments and corresponding likelihood scores, and (2) discriminatively adapting the pronunciation weight to minimize classification errors, and recognizing additional received speech utterances using the optimized pronunciation weights. A unit of speech can be a sentence, a word, a context-dependent phone, a context-independent phone, or a syllable. The method can further include discriminatively adapting pronunciation weights based on an objective function. The objective function can be maximum mutual information (MMI), maximum likelihood (MLE) training, minimum classification error (MCE) training, or other functions known to those of skill in the art. Speech utterances can be names. The speech utterances can be received as part of a multimodal search or input. The step of discriminatively adapting pronunciation weights can further include stochastically modeling pronunciations.
Abstract:
The present disclosure relates to systems, methods, and computer-readable media for generating a lexicon for use with speech recognition. The method includes receiving symbolic input as labeled speech data, overgenerating potential pronunciations based on the symbolic input, identifying potential pronunciations in a speech recognition context, and storing the identified potential pronunciations in a lexicon. Overgenerating potential pronunciations can include establishing a set of conversion rules for short sequences of letters, converting portions of the symbolic input into a number of possible lexical pronunciation variants based on the set of conversion rules, modeling the possible lexical pronunciation variants in one of a weighted network and a list of phoneme lists, and iteratively retraining the set of conversion rules based on improved pronunciations. Symbolic input can include multiple examples of a same spoken word. Speech data can be labeled explicitly or implicitly and can include words as text and recorded audio.