Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for modeling phonemes. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; for each of the plurality of time steps: processing the acoustic feature representation through each of one or more recurrent neural network layers to generate a recurrent output; processing the recurrent output using a softmax output layer to generate a set of scores, the set of scores comprising a respective score for each of a plurality of context dependent vocabulary phonemes, the score for each context dependent vocabulary phoneme representing a likelihood that the context dependent vocabulary phoneme represents the utterance at the time step; and determining, from the scores for the plurality of time steps, a context dependent phoneme representation of the sequence.
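The scoring and decoding steps can be illustrated with a minimal Python sketch. It assumes the recurrent outputs are already available as one vector per time step, and it uses greedy per-step selection with merging of repeated labels, which is only one possible way to determine the phoneme representation; the abstract does not specify the rule. The label names, the example values, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_phonemes(recurrent_outputs, phoneme_labels):
    """Pick the highest-scoring phoneme at each time step and merge repeats.

    Greedy decoding is an assumption made for this sketch.
    """
    sequence = []
    for output in recurrent_outputs:
        scores = softmax(output)
        best = phoneme_labels[scores.index(max(scores))]
        if not sequence or sequence[-1] != best:
            sequence.append(best)
    return sequence

# Hypothetical context dependent labels and recurrent outputs,
# one output vector per time step.
LABELS = ["k-ae+t", "ae-t+sil", "sil"]
OUTPUTS = [[2.0, 0.1, 0.3], [1.5, 0.2, 0.1], [0.1, 3.0, 0.2], [0.0, 0.1, 2.5]]

print(decode_phonemes(OUTPUTS, LABELS))
```

The first two time steps select the same label and are merged, so the decoded sequence has three entries.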
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for transcribing utterances into written text are disclosed. The methods, systems, and apparatus include actions of obtaining a lexicon model that maps phones to spoken text and obtaining a language model that assigns probabilities to written text. The actions further include generating a transducer that maps the written text to the spoken text, the transducer mapping multiple items of the written text to an item of the spoken text. Additionally, the actions include constructing a decoding network for transcribing utterances into written text, by composing the lexicon model, the inverse of the transducer, and the language model.
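The many-to-one mapping from written text to spoken text, and the use of its inverse during decoding, can be sketched in Python. This is a simplified illustration only: plain dictionaries stand in for the finite-state transducers, the lexicon model is omitted, and the word table, the unigram probabilities, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical verbalization table: written item -> spoken item.
# Several written items map to the same spoken item.
WRITTEN_TO_SPOKEN = {
    "10": "ten",
    "ten": "ten",
    "Dr.": "doctor",
    "doctor": "doctor",
}

def invert(mapping):
    """Build the inverse relation: spoken item -> all written items."""
    inverse = defaultdict(list)
    for written, spoken in mapping.items():
        inverse[spoken].append(written)
    return dict(inverse)

def best_written(spoken_word, inverse, written_lm):
    """Choose the written form with the highest language model probability."""
    candidates = inverse.get(spoken_word, [spoken_word])
    return max(candidates, key=lambda w: written_lm.get(w, 0.0))

# Hypothetical unigram probabilities over written text.
WRITTEN_LM = {"10": 0.6, "ten": 0.4, "Dr.": 0.7, "doctor": 0.3}

SPOKEN_TO_WRITTEN = invert(WRITTEN_TO_SPOKEN)
print([best_written(w, SPOKEN_TO_WRITTEN, WRITTEN_LM) for w in ["doctor", "ten"]])
```

Because the inverse relation is one-to-many, the written-domain language model is what resolves the ambiguity between candidates such as "10" and "ten".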
Abstract:
An automatic speech recognition system and method are provided for written-domain language modeling. According to one implementation, a process includes: accessing decomposed training data that results from applying rewrite grammar rules to original training data, the decomposed training data comprising (i) regular words from the original training data that have not been rewritten using the rewrite grammar rules, and (ii) decomposed segments that result from rewriting non-lexical entities from the original training data using the rewrite grammar rules; generating a restriction model that (i) maps language model paths for regular words to themselves, and (ii) restricts language model paths for decomposed segments for non-lexical entities; training an n-gram language model over the training data; composing the restriction model and the language model to obtain a restricted language model; and constructing a decoding network by composing a context dependency model and a pronunciation lexicon with the restricted language model.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving a plurality of audio frames that collectively represent at least a portion of a spoken utterance; processing the plurality of audio frames using a long short-term memory (LSTM) neural network to generate a respective language score for each of a plurality of languages, wherein the respective language score for each of the plurality of languages represents a likelihood that the spoken utterance was spoken in the language; and classifying the spoken utterance as being spoken in one of the plurality of languages using the language scores.
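The final classification step can be sketched in Python. The LSTM itself is not modeled here. The sketch assumes the network has already produced one score vector per audio frame and combines those by simple averaging, which is an assumption made for illustration and not a rule stated in the abstract. The language codes and score values are hypothetical.

```python
def classify_language(frame_scores, languages):
    """Average per-frame scores and return the top language with all averages.

    Averaging across frames is an assumption made for this sketch.
    """
    n = len(frame_scores)
    averages = [
        sum(frame[i] for frame in frame_scores) / n
        for i in range(len(languages))
    ]
    best = max(range(len(languages)), key=lambda i: averages[i])
    return languages[best], dict(zip(languages, averages))

# Hypothetical per-frame language scores (one row per audio frame).
LANGUAGES = ["en", "es", "fr"]
FRAME_SCORES = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.7, 0.2, 0.1]]

LABEL, AVERAGES = classify_language(FRAME_SCORES, LANGUAGES)
print(LABEL)
```

Here the second frame favors "es", but the averaged scores still favor "en" for the utterance as a whole.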
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for language modeling of complete language sequences. Training data indicating language sequences is accessed, and counts for a number of times each language sequence occurs in the training data are determined. A proper subset of the language sequences is selected, and a first component of a language model is trained. The first component includes first probability data for assigning scores to the selected language sequences. A second component of the language model is trained based on the training data, where the second component includes second probability data for assigning scores to language sequences that are not included in the selected language sequences. Adjustment data that normalizes the second probability data with respect to the first probability data is generated, and the first component, the second component, and the adjustment data are stored.
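The relationship between the two components and the adjustment data can be sketched in Python. The sketch assumes that the selected subset is the most frequent sequences, that the first component stores relative frequencies, and that the adjustment is a single scaling factor that makes the combined scores sum to one; none of these choices is confirmed by the abstract. The second component is represented by a hypothetical fixed probability table rather than a trained model.

```python
from collections import Counter

def train_two_component_model(training_sequences, fallback_prob, num_selected):
    """Return (first_component, adjustment) for a two-component model."""
    counts = Counter(training_sequences)
    total = sum(counts.values())
    # Assumption: the proper subset is the most frequent sequences.
    selected = [seq for seq, _ in counts.most_common(num_selected)]
    # First component: relative frequencies of the selected sequences.
    first = {seq: counts[seq] / total for seq in selected}
    # Probability mass left over for sequences outside the selected set.
    remaining_mass = 1.0 - sum(first.values())
    # Mass the second component assigns outside the selected set.
    fallback_mass = 1.0 - sum(fallback_prob(seq) for seq in selected)
    adjustment = remaining_mass / fallback_mass
    return first, adjustment

def score(sequence, first, adjustment, fallback_prob):
    """Use the first component when possible, else the scaled second one."""
    if sequence in first:
        return first[sequence]
    return adjustment * fallback_prob(sequence)

# Hypothetical training data and second-component probabilities.
DATA = ["weather today"] * 5 + ["news"] * 3 + ["play music", "set alarm"]
FALLBACK = {"weather today": 0.4, "news": 0.2, "play music": 0.3, "set alarm": 0.1}

def fallback_prob(seq):
    return FALLBACK.get(seq, 0.0)

FIRST, ADJUSTMENT = train_two_component_model(DATA, fallback_prob, num_selected=2)
TOTAL = sum(score(s, FIRST, ADJUSTMENT, fallback_prob) for s in FALLBACK)
print(sorted(FIRST), round(ADJUSTMENT, 3), round(TOTAL, 3))
```

In this toy setup the adjustment scales the second component so that the scores over the four possible sequences still sum to one.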
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for learning pronunciations from acoustic sequences. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; processing the sequence of modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final CTC output layer to generate a neural network output, wherein processing the sequence of modified frames of acoustic data comprises: subsampling the modified frames of acoustic data; and processing each subsampled modified frame of acoustic data through the acoustic modeling neural network.
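Frame stacking and subsampling can be sketched in a few lines of Python. The stack size of 3, the subsampling step of 3, and the two-dimensional toy features are hypothetical values chosen for illustration; the abstract does not state them. The neural network itself is not modeled.

```python
def stack_frames(frames, stack_size):
    """Concatenate each frame with the following stack_size - 1 frames."""
    stacked = []
    for i in range(len(frames) - stack_size + 1):
        merged = []
        for frame in frames[i:i + stack_size]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked

def subsample(frames, step):
    """Keep every step-th frame, starting with the first."""
    return frames[::step]

# Hypothetical input: 8 frames, each a 2-dimensional feature vector.
FRAMES = [[float(t), float(t) + 0.5] for t in range(8)]

STACKED = stack_frames(FRAMES, stack_size=3)
SUBSAMPLED = subsample(STACKED, step=3)

print(len(STACKED), len(STACKED[0]), len(SUBSAMPLED))
```

Stacking widens each input vector while subsampling reduces the number of time steps the network has to process, so the network sees fewer but wider frames.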
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for recognizing speech using neural networks. One of the methods includes receiving an audio input; processing the audio input using an acoustic model to generate a respective phoneme score for each of a plurality of phoneme labels; processing one or more of the phoneme scores using an inverse pronunciation model to generate a respective grapheme score for each of a plurality of grapheme labels; and processing one or more of the grapheme scores using a language model to generate a respective text label score for each of a plurality of text labels.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for learning pronunciations from acoustic sequences. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; processing the sequence of modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final CTC output layer to generate a neural network output, wherein processing the sequence of modified frames of acoustic data comprises: subsampling the modified frames of acoustic data; and processing each subsampled modified frame of acoustic data through the acoustic modeling neural network.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating representations of input sequences. One of the methods includes receiving a grapheme sequence, the grapheme sequence comprising a plurality of graphemes arranged according to an input order; processing the sequence of graphemes using a long short-term memory (LSTM) neural network to generate an initial phoneme sequence from the grapheme sequence, the initial phoneme sequence comprising a plurality of phonemes arranged according to an output order; and generating a phoneme representation of the grapheme sequence from the initial phoneme sequence generated by the LSTM neural network, wherein generating the phoneme representation comprises removing, from the initial phoneme sequence, phonemes in one or more positions in the output order.
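The post-processing step, removing phonemes at given positions in the output order, can be sketched in Python. Which positions are removed is not stated in the abstract. The example assumes, purely for illustration, that the first two outputs are placeholder symbols. The "<pad>" symbol, the example sequence, and the function name are hypothetical.

```python
def remove_positions(initial_phonemes, positions_to_remove):
    """Drop the phonemes at the given positions, keeping the output order."""
    drop = set(positions_to_remove)
    return [p for i, p in enumerate(initial_phonemes) if i not in drop]

# Hypothetical initial output for the grapheme sequence "c", "a", "t".
INITIAL = ["<pad>", "<pad>", "k", "ae", "t"]

print(remove_positions(INITIAL, [0, 1]))
```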