Deliberation by Text-Only and Semi-Supervised Training

    公开(公告)号:US20230298563A1

    公开(公告)日:2023-09-21

    申请号:US18186157

    申请日:2023-03-18

    Applicant: Google LLC

    CPC classification number: G10L13/08 G10L15/16 G10L15/063

    Abstract: A method of text-only and semi-supervised training for deliberation includes receiving training data including unspoken textual utterances that are each not paired with any corresponding spoken utterance of non-synthetic speech, and training a deliberation model that includes a text encoder and a deliberation decoder on the unspoken textual utterances. The method also includes receiving, at the trained deliberation model, first-pass hypotheses and non-causal acoustic embeddings. The first-pass hypotheses is generated by a recurrent neural network-transducer (RNN-T) decoder for the non-causal acoustic embeddings encoded by a non-causal encoder. The method also includes encoding, using the text encoder, the first-pass hypotheses generated by the RNN-T decoder, and generating, using the deliberation decoder attending to both the first-pass hypotheses and the non-causal acoustic embeddings, second-pass hypotheses.

    CONTEXTUAL BIASING FOR SPEECH RECOGNITION

    公开(公告)号:US20220366897A1

    公开(公告)日:2022-11-17

    申请号:US17815049

    申请日:2022-07-26

    Applicant: Google LLC

    Abstract: A method includes receiving audio data encoding an utterance and obtaining a set of bias phrases corresponding to a context of the utterance. Each bias phrase includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio to generate an output from the speech recognition model. The speech recognition model includes a first encoder configured to receive the acoustic features, a bias encoder configured to receive data indicating the obtained set of bias phrases, a bias encoder, and a decoder configured to determine likelihoods of sequences of speech elements based on output of the first attention module and output of the bias attention module. The method also includes determining a transcript for the utterance based on the likelihoods of sequences of speech elements.

    Large-scale multilingual speech recognition with a streaming end-to-end model

    公开(公告)号:US11468244B2

    公开(公告)日:2022-10-11

    申请号:US16834342

    申请日:2020-03-30

    Applicant: Google LLC

    Abstract: A method of transcribing speech using a multilingual end-to-end (E2E) speech recognition model includes receiving audio data for an utterance spoken in a particular native language, obtaining a language vector identifying the particular language, and processing, using the multilingual E2E speech recognition model, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance. The multilingual E2E speech recognition model includes a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language. The method also includes providing the transcription for output.

    ADAPTIVE AUDIO ENHANCEMENT FOR MULTICHANNEL SPEECH RECOGNITION

    公开(公告)号:US20220148582A1

    公开(公告)日:2022-05-12

    申请号:US17649058

    申请日:2022-01-26

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.

    COMPLEX LINEAR PROJECTION FOR ACOUSTIC MODELING

    公开(公告)号:US20200286468A1

    公开(公告)日:2020-09-10

    申请号:US16879322

    申请日:2020-05-20

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech recognition using complex linear projection are disclosed. In one aspect, a method includes the actions of receiving audio data corresponding to an utterance. The method further includes generating frequency domain data using the audio data. The method further includes processing the frequency domain data using complex linear projection. The method further includes providing the processed frequency domain data to a neural network trained as an acoustic model. The method further includes generating a transcription for the utterance that is determined based at least on output that the neural network provides in response to receiving the processed frequency domain data.

    CONVOLUTIONAL NEURAL NETWORKS
    18.
    发明申请

    公开(公告)号:US20200051551A1

    公开(公告)日:2020-02-13

    申请号:US16654041

    申请日:2019-10-16

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for keyword spotting. One of the methods includes training, by a keyword detection system, a convolutional neural network for keyword detection by providing a two-dimensional set of input values to the convolutional neural network, the input values including a first dimension in time and a second dimension in frequency, and performing convolutional multiplication on the two-dimensional set of input values for a filter using a frequency stride greater than one to generate a feature map.

    Adaptive audio enhancement for multichannel speech recognition

    公开(公告)号:US10515626B2

    公开(公告)日:2019-12-24

    申请号:US15848829

    申请日:2017-12-20

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.

Patent Agency Ranking