Deliberation by Text-Only and Semi-Supervised Training

    公开(公告)号:US20230298563A1

    公开(公告)日:2023-09-21

    申请号:US18186157

    申请日:2023-03-18

    Applicant: Google LLC

    CPC classification number: G10L13/08 G10L15/16 G10L15/063

    Abstract: A method of text-only and semi-supervised training for deliberation includes receiving training data including unspoken textual utterances that are each not paired with any corresponding spoken utterance of non-synthetic speech, and training a deliberation model that includes a text encoder and a deliberation decoder on the unspoken textual utterances. The method also includes receiving, at the trained deliberation model, first-pass hypotheses and non-causal acoustic embeddings. The first-pass hypotheses is generated by a recurrent neural network-transducer (RNN-T) decoder for the non-causal acoustic embeddings encoded by a non-causal encoder. The method also includes encoding, using the text encoder, the first-pass hypotheses generated by the RNN-T decoder, and generating, using the deliberation decoder attending to both the first-pass hypotheses and the non-causal acoustic embeddings, second-pass hypotheses.

    Text Injection For Training Auxiliary Tasks In Speech Recognition Models

    公开(公告)号:US20240296840A1

    公开(公告)日:2024-09-05

    申请号:US18592590

    申请日:2024-03-01

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/02 G10L15/063

    Abstract: A joint auxiliary task and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher-order feature representation for a corresponding acoustic frame. The model also includes a multi-output HAT decoder to generate at each of the plurality of output steps a probability distribution over possible speech recognition hypotheses, and an indication of whether the output step corresponds to an auxiliary token associated with a particular auxiliary task. The model is trained by a JEIT training process based on: a paired training data set including paired audio data and transcriptions, the transcriptions annotated with ground-truth auxiliary tokens associated with the particular auxiliary task; and an unpaired training data set including textual utterances not paired with any corresponding audio data, the textual utterances annotated with the ground-truth auxiliary tokens associated with the particular auxiliary task.

    Optimizing Inference Performance for Conformer

    公开(公告)号:US20230130634A1

    公开(公告)日:2023-04-27

    申请号:US17936547

    申请日:2022-09-29

    Applicant: Google LLC

    Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.

    Joint Speech and Text Streaming Model for ASR

    公开(公告)号:US20240028829A1

    公开(公告)日:2024-01-25

    申请号:US18346232

    申请日:2023-07-01

    Applicant: Google LLC

    CPC classification number: G06F40/284 G06F40/40

    Abstract: A method includes receiving training data that includes a set of unspoken textual utterances. For each respective unspoken textual utterance, the method includes, tokenizing the respective textual utterance into a sequence of sub-word units, generating a first higher order textual feature representation for a corresponding sub-word unit tokenized from the respective unspoken textual utterance, receiving the first higher order textual feature representation generated by a text encoder, and generating a first probability distribution over possible text units. The method also includes training an encoder based on the first probability distribution over possible text units generated by a first-pass decoder for each respective unspoken textual utterance in the set of unspoken textual utterances.

    Optimizing inference performance for conformer

    公开(公告)号:US12190869B2

    公开(公告)日:2025-01-07

    申请号:US17936547

    申请日:2022-09-29

    Applicant: Google LLC

    Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.

    CONNECTING DIFFERENT ASR APPLICATION DOMAINS WITH SPEAKER-TAGS

    公开(公告)号:US20240304181A1

    公开(公告)日:2024-09-12

    申请号:US18598523

    申请日:2024-03-07

    Applicant: Google LLC

    CPC classification number: G10L15/063

    Abstract: A method includes receiving a plurality of training samples spanning multiple different domains. Each corresponding training sample includes audio data characterizing an utterance paired with a corresponding transcription of the utterance. The method also includes re-labeling each corresponding training sample of the plurality of training samples by annotating the corresponding transcription of the utterance with one or more speaker tags. Each speaker tag indicates a respective segment of the transcription for speech that was spoken by a particular type of speaker. The method also includes training a multi-domain speech recognition model on the re-labeled training samples to teach the multi-domain speech recognition model to learn to share parameters for recognizing speech across each of the different multiple different domains.

Patent Agency Ranking