Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for automatic speech recognition using multi-dimensional models. In some implementations, audio data that describes an utterance is received. A transcription for the utterance is determined using an acoustic model that includes a neural network having first memory blocks for time information and second memory blocks for frequency information. The transcription for the utterance is provided as output of an automated speech recognizer.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a deep neural network. One of the methods for training a deep neural network that includes a low rank hidden input layer and an adjoining hidden layer, the low rank hidden input layer including a first matrix A and a second matrix B with dimensions i×m and m×o, respectively, to identify a keyword includes receiving a feature vector including i values that represent features of an audio signal encoding an utterance, determining, using the low rank hidden input layer, an output vector including o values using the feature vector, determining, using the adjoining hidden layer, another vector using the output vector, determining a confidence score that indicates whether the utterance includes the keyword using the other vector, and adjusting weights for the low rank hidden input layer using the confidence score.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for recognizing keywords using a long short term memory neural network. One of the methods includes receiving, by a device for each of multiple variable length enrollment audio signals, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, processing each of the plurality of enrollment feature vectors using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector, and generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether another audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the enrollment LSTM output vectors for the enrollment audio signal.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for detecting voice activity. In one aspect, a method include actions of receiving, by a neural network included in an automated voice activity detection system, a raw audio waveform, processing, by the neural network, the raw audio waveform to determine whether the audio waveform includes speech, and provide, by the neural network, a classification of the raw audio waveform indicating whether the raw audio waveform includes speech.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for implementing long-short term memory layers with compressed gating functions. One of the systems includes a first long short-term memory (LSTM) layer, wherein the first LSTM layer is configured to, for each of the plurality of time steps, generate a new layer state and a new layer output by applying a plurality of gates to a current layer input, a current layer state, and a current layer output, each of the plurality of gates being configured to, for each of the plurality of time steps, generate a respective intermediate gate output vector by multiplying a gate input vector and a gate parameter matrix. The gate parameter matrix for at least one of the plurality of gates is a structured matrix or is defined by a compressed parameter matrix and a projection matrix.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing audio waveforms. In some implementations, a time-frequency feature representation is generated based on audio data. The time-frequency feature representation is input to an acoustic model comprising a trained artificial neural network. The trained artificial neural network comprising a frequency convolution layer, a memory layer, and one or more hidden layers. An output that is based on output of the trained artificial neural network is received. A transcription is provided, where the transcription is determined based on the output of the acoustic model.
Abstract:
This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a deep neural network. One of the methods for training a deep neural network that includes a low rank hidden input layer and an adjoining hidden layer, the low rank hidden input layer including a first matrix A and a second matrix B with dimensions i×m and m×o, respectively, to identify a keyword includes receiving a feature vector including i values that represent features of an audio signal encoding an utterance, determining, using the low rank hidden input layer, an output vector including o values using the feature vector, determining, using the adjoining hidden layer, another vector using the output vector, determining a confidence score that indicates whether the utterance includes the keyword using the other vector, and adjusting weights for the low rank hidden input layer using the confidence score.