Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection

    Publication Number: US20240029719A1

    Publication Date: 2024-01-25

    Application Number: US18340093

    Application Date: 2023-06-23

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/063 G10L25/93

    Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to switch between a voice activity detection (VAD) mode and an end-of-query (EOQ) detection mode. During the VAD mode, the endpointer model receives input audio frames and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder and determines, for each latent representation, whether the latent representation includes final silence.
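
    The two operating modes can be pictured with a small sketch. The Python/NumPy fragment below is an illustration only, not the patented design: the Endpointer class, the per-mode linear heads, the weight shapes, and the 0.5 decision threshold are all assumptions.

        import numpy as np

        class Endpointer:
            """Toy dual-mode classifier: one linear head per operating mode."""

            def __init__(self, vad_weights, eoq_weights):
                # The VAD head scores raw audio frames; the EOQ head scores encoder latents.
                self.heads = {"vad": vad_weights, "eoq": eoq_weights}

            def detect(self, frame, mode):
                # Returns True for "speech" in VAD mode, "final silence" in EOQ mode.
                logit = float(frame @ self.heads[mode])
                return 1.0 / (1.0 + np.exp(-logit)) > 0.5

        ep = Endpointer(vad_weights=np.random.randn(16), eoq_weights=np.random.randn(32))
        is_speech = ep.detect(np.random.randn(16), mode="vad")         # VAD mode: audio frame
        is_final_silence = ep.detect(np.random.randn(32), mode="eoq")  # EOQ mode: encoder latent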

    EFFICIENT STREAMING NON-RECURRENT ON-DEVICE END-TO-END MODEL

    Publication Number: US20230343328A1

    Publication Date: 2023-10-26

    Application Number: US18336211

    Application Date: 2023-06-16

    Applicant: Google LLC

    CPC classification number: G10L15/063 G10L15/02 G10L15/22 G10L15/30

    Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a first higher-order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher-order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher-order feature representation for a corresponding first higher-order feature frame. The ASR model also includes a decoder configured to receive the second higher-order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypotheses. The ASR model also includes a language model configured to receive the first probability distribution over possible speech recognition hypotheses and generate a rescored probability distribution.
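
    Read as a dataflow, the cascade is: first encoder, then second encoder, then decoder, then external language-model rescoring. A minimal sketch under assumed names and shapes follows; the tanh stubs stand in for what would really be streaming encoder stacks, and log-space interpolation is just one plausible rescoring choice.

        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def cascaded_asr_step(frame, w_enc1, w_enc2, w_dec, lm_log_probs, lm_weight=0.3):
            h1 = np.tanh(w_enc1 @ frame)      # first higher-order feature representation
            h2 = np.tanh(w_enc2 @ h1)         # second higher-order feature representation
            first_pass = softmax(w_dec @ h2)  # first probability distribution over hypotheses
            # Rescore with an external language model by log-space interpolation.
            return softmax(np.log(first_pass + 1e-9) + lm_weight * lm_log_probs)

        d0, d1, d2, vocab = 80, 64, 64, 128
        rescored = cascaded_asr_step(
            np.random.randn(d0),
            np.random.randn(d1, d0), np.random.randn(d2, d1),
            np.random.randn(vocab, d2), np.random.randn(vocab))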

    Optimizing Personal VAD for On-Device Speech Recognition

    Publication Number: US20230298591A1

    Publication Date: 2023-09-21

    Application Number: US18123060

    Application Date: 2023-03-17

    Applicant: Google LLC

    CPC classification number: G10L17/06 G10L17/22

    Abstract: A computer-implemented method includes receiving a sequence of acoustic frames corresponding to an utterance and generating a reference speaker embedding for the utterance. The method also includes receiving a target speaker embedding for a target speaker and generating feature-wise linear modulation (FiLM) parameters including a scaling vector and a shifting vector based on the target speaker embedding. The method also includes generating an affine transformation output that scales and shifts the reference speaker embedding based on the FiLM parameters. The method also includes generating a classification output indicating whether the utterance was spoken by the target speaker based on the affine transformation output.
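
    The FiLM step described here has a well-known closed form: the modulated embedding is the element-wise product of the scaling vector with the reference embedding, plus the shifting vector. A hedged sketch follows; the projection matrices and the sigmoid classifier head are assumptions, not the patented layers.

        import numpy as np

        def film_verify(reference_emb, target_emb, w_scale, w_shift, w_cls):
            gamma = w_scale @ target_emb              # FiLM scaling vector
            beta = w_shift @ target_emb               # FiLM shifting vector
            modulated = gamma * reference_emb + beta  # affine transformation output
            logit = float(w_cls @ modulated)
            return 1.0 / (1.0 + np.exp(-logit))       # P(utterance spoken by target speaker)

        d = 32
        score = film_verify(np.random.randn(d), np.random.randn(d),
                            np.random.randn(d, d), np.random.randn(d, d),
                            np.random.randn(d))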

    4-bit Conformer with Accurate Quantization Training for Speech Recognition

    Publication Number: US20230298569A1

    Publication Date: 2023-09-21

    Application Number: US18186774

    Application Date: 2023-03-20

    Applicant: Google LLC

    CPC classification number: G10L15/063 G10L15/16

    Abstract: A method for training a model includes obtaining a plurality of training samples. Each respective training sample of the plurality of training samples includes a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The method includes training, using quantization-aware training with native integer operations, an automatic speech recognition (ASR) model on the plurality of training samples. The method also includes quantizing the trained ASR model to an integer target fixed-bit width. The quantized trained ASR model includes a plurality of weights. Each weight of the plurality of weights includes an integer with the target fixed-bit width. The method includes providing the quantized trained ASR model to a user device.
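
    The final quantization step can be made concrete. The sketch below uses symmetric per-tensor quantization to a signed 4-bit range; the abstract does not specify the scheme, so treat this as one plausible reading rather than the trained recipe.

        import numpy as np

        def quantize_int4(weights):
            """Map float weights onto signed 4-bit integers [-8, 7] plus one scale."""
            qmax = 7
            scale = max(np.abs(weights).max() / qmax, 1e-8)
            q = np.clip(np.round(weights / scale), -8, qmax).astype(np.int8)
            return q, scale                   # dequantized weight is q * scale

        w = np.random.randn(4, 4).astype(np.float32)
        q, s = quantize_int4(w)
        print("max abs quantization error:", np.abs(w - q * s).max())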

    Deliberation by Text-Only and Semi-Supervised Training

    Publication Number: US20230298563A1

    Publication Date: 2023-09-21

    Application Number: US18186157

    Application Date: 2023-03-18

    Applicant: Google LLC

    CPC classification number: G10L13/08 G10L15/16 G10L15/063

    Abstract: A method of text-only and semi-supervised training for deliberation includes receiving training data including unspoken textual utterances that are each not paired with any corresponding spoken utterance of non-synthetic speech, and training a deliberation model that includes a text encoder and a deliberation decoder on the unspoken textual utterances. The method also includes receiving, at the trained deliberation model, first-pass hypotheses and non-causal acoustic embeddings. The first-pass hypotheses are generated by a recurrent neural network-transducer (RNN-T) decoder for the non-causal acoustic embeddings encoded by a non-causal encoder. The method also includes encoding, using the text encoder, the first-pass hypotheses generated by the RNN-T decoder, and generating, using the deliberation decoder attending to both the first-pass hypotheses and the non-causal acoustic embeddings, second-pass hypotheses.
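
    The second pass can be pictured as a decoder that attends to two memories at once: the text-encoded first-pass hypotheses and the non-causal acoustic embeddings. The single-head dot-product sketch below uses assumed shapes; a real deliberation decoder is an attention stack, not one step.

        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def attend(query, memory):
            # memory: (T, d); returns a context vector of size d.
            scores = softmax(memory @ query / np.sqrt(query.size))
            return scores @ memory

        def deliberation_step(state, text_enc, acoustic_emb, w_out):
            text_ctx = attend(state, text_enc)       # attend to first-pass hypotheses
            audio_ctx = attend(state, acoustic_emb)  # attend to non-causal acoustics
            return softmax(w_out @ np.concatenate([text_ctx, audio_ctx]))  # second pass

        d, vocab = 32, 128
        dist = deliberation_step(np.random.randn(d), np.random.randn(10, d),
                                 np.random.randn(50, d), np.random.randn(vocab, 2 * d))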

    VOICE SHORTCUT DETECTION WITH SPEAKER VERIFICATION

    Publication Number: US20230169984A1

    Publication Date: 2023-06-01

    Application Number: US18103324

    Application Date: 2023-01-30

    Applicant: Google LLC

    CPC classification number: G10L17/24 G10L17/06 G10L21/028

    Abstract: Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text-independent speaker identification model to determine whether a verified and/or registered user spoke the utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance. Additionally or alternatively, the text representation of the utterance can be processed to determine whether at least a portion of the text representation of the utterance captures a particular keyphrase. When the system determines the registered and/or verified user spoke the utterance and the text representation of the utterance captures the particular keyphrase, the system can cause a computing device to perform one or more actions corresponding to the particular keyphrase.
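
    The control flow reduces to a short pipeline: separate, verify, transcribe, match. The sketch below only shows that ordering; the model callables and the keyphrase-to-action table are hypothetical stand-ins, not the disclosed models.

        def detect_voice_shortcut(audio, separator, speaker_id, asr, actions):
            separated = separator(audio)       # isolate the target speaker's utterance
            if not speaker_id(separated):      # text-independent speaker verification
                return None                    # not the registered/verified user
            text = asr(separated).lower()      # text representation of the utterance
            for keyphrase, action in actions.items():
                if keyphrase in text:          # does the transcript capture a keyphrase?
                    return action              # device action for that keyphrase
            return None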

    VOICE SHORTCUT DETECTION WITH SPEAKER VERIFICATION

    Publication Number: US20220335953A1

    Publication Date: 2022-10-20

    Application Number: US17233253

    Application Date: 2021-04-16

    Applicant: Google LLC

    Abstract: Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text-independent speaker identification model to determine whether a verified and/or registered user spoke the utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance. Additionally or alternatively, the text representation of the utterance can be processed to determine whether at least a portion of the text representation of the utterance captures a particular keyphrase. When the system determines the registered and/or verified user spoke the utterance and the text representation of the utterance captures the particular keyphrase, the system can cause a computing device to perform one or more actions corresponding to the particular keyphrase.

    Learning Word-Level Confidence for Subword End-To-End Automatic Speech Recognition

    Publication Number: US20220270597A1

    Publication Date: 2022-08-25

    Application Number: US17182592

    Application Date: 2021-02-23

    Applicant: Google LLC

    Abstract: A method includes receiving a speech recognition result and, using a confidence estimation module (CEM), for each sub-word unit in a sequence of hypothesized sub-word units for the speech recognition result: obtaining a respective confidence embedding that represents a set of confidence features; generating, using a first attention mechanism, a confidence feature vector; generating, using a second attention mechanism, an acoustic context vector; and generating, as output from an output layer of the CEM, a respective confidence output score for each corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by the output layer of the CEM. For each of the one or more words formed by the sequence of hypothesized sub-word units, the method also includes determining a respective word-level confidence score for the word. The method also includes determining an utterance-level confidence score by aggregating the word-level confidence scores.
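
    Only the aggregation path is sketched here. Grouping sub-word scores into words and taking the minimum per word is a common but assumed choice, and the mean over words for the utterance-level score is likewise an assumption; the abstract does not name either reduction.

        import numpy as np

        def word_confidences(subword_scores, word_ids):
            """word_ids[i] is the index of the word that sub-word i belongs to."""
            words = {}
            for score, w in zip(subword_scores, word_ids):
                words.setdefault(w, []).append(score)
            # A word is only as confident as its least confident sub-word.
            return [min(scores) for scores in words.values()]

        subword = [0.98, 0.91, 0.60, 0.99]            # e.g. "hel", "lo", "wor", "ld"
        word_scores = word_confidences(subword, [0, 0, 1, 1])
        utterance_score = float(np.mean(word_scores))  # aggregate to utterance level
        print(word_scores, utterance_score)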
