Multichannel raw-waveform neural networks

    公开(公告)号:US10339921B2

    公开(公告)日:2019-07-02

    申请号:US14987146

    申请日:2016-01-04

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using neural networks. One of the methods includes receiving, by a neural network in a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal, the first raw audio signal and the second raw audio signal for the same period of time, generating, by a spatial filtering convolutional layer in the neural network, a spatial filtered output the first data and the second data, generating, by a spectral filtering convolutional layer in the neural network, a spectral filtered output using the spatial filtered output, and processing, by one or more additional layers in the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.

    Voice activity detection
    63.
    发明授权

    公开(公告)号:US10229700B2

    公开(公告)日:2019-03-12

    申请号:US14986985

    申请日:2016-01-04

    Applicant: GOOGLE LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for detecting voice activity. In one aspect, a method include actions of receiving, by a neural network included in an automated voice activity detection system, a raw audio waveform, processing, by the neural network, the raw audio waveform to determine whether the audio waveform includes speech, and provide, by the neural network, a classification of the raw audio waveform indicating whether the raw audio waveform includes speech.

    ADAPTIVE AUDIO ENHANCEMENT FOR MULTICHANNEL SPEECH RECOGNITION

    公开(公告)号:US20180197534A1

    公开(公告)日:2018-07-12

    申请号:US15848829

    申请日:2017-12-20

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.

    TWO-PASS END TO END SPEECH RECOGNITION

    公开(公告)号:US20240420687A1

    公开(公告)日:2024-12-19

    申请号:US18815537

    申请日:2024-08-26

    Applicant: GOOGLE LLC

    Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transformer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a listen attend spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.

    Universal Monolingual Output Layer for Multilingual Speech Recognition

    公开(公告)号:US20240135923A1

    公开(公告)日:2024-04-25

    申请号:US18485271

    申请日:2023-10-11

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/005 G10L15/02

    Abstract: A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.

Patent Agency Ranking