Flickering Reduction with Partial Hypothesis Re-ranking for Streaming ASR

    Publication Number: US20240029718A1

    Publication Date: 2024-01-25

    Application Number: US18352211

    Application Date: 2023-07-13

    Applicant: Google LLC

    CPC classification number: G10L15/10 G10L15/26

    Abstract: A method includes processing, using a speech recognizer, a first portion of audio data to generate a first lattice, and generating a first partial transcription for an utterance based on the first lattice. The method includes processing, using the recognizer, a second portion of the data to generate, based on the first lattice, a second lattice representing a plurality of partial speech recognition hypotheses for the utterance and a plurality of corresponding speech recognition scores. For each particular partial speech recognition hypothesis, the method includes generating a corresponding re-ranked score based on the corresponding speech recognition score and whether the particular partial speech recognition hypothesis shares a prefix with the first partial transcription. The method includes generating a second partial transcription for the utterance by selecting the partial speech recognition hypothesis of the plurality of partial speech recognition hypotheses having the highest corresponding re-ranked score.
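
    The following is a minimal Python sketch of the kind of prefix-aware re-ranking the abstract describes, assuming a simple additive score bonus; the function names, the bonus value, and the example hypotheses are illustrative and not taken from the patent.

        # Minimal sketch of prefix-biased re-ranking of streaming partial hypotheses.
        # All names (rerank_partials, prefix_bonus) are illustrative, not from the patent.

        def shares_prefix(hypothesis: str, previous_partial: str) -> bool:
            """True if the hypothesis keeps the previously displayed partial as a prefix."""
            return hypothesis.startswith(previous_partial)

        def rerank_partials(hypotheses, previous_partial, prefix_bonus=2.0):
            """Re-score (text, asr_score) pairs, boosting those consistent with the
            previous partial transcription, and return the best text."""
            reranked = []
            for text, asr_score in hypotheses:
                bonus = prefix_bonus if shares_prefix(text, previous_partial) else 0.0
                reranked.append((asr_score + bonus, text))
            return max(reranked)[1]

        # Example: the raw top hypothesis would flip the already-displayed words,
        # but re-ranking keeps the stable prefix "send a" on screen.
        partials = [("send the message", 10.1), ("send a message", 9.8)]
        print(rerank_partials(partials, previous_partial="send a"))  # -> "send a message"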

    Optimizing Inference Performance for Conformer

    Publication Number: US20230130634A1

    Publication Date: 2023-04-27

    Application Number: US17936547

    Application Date: 2022-09-29

    Applicant: Google LLC

    Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
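
    To illustrate why linear attention matters for streaming inference, here is a Python/NumPy sketch of causal linear attention computed with running prefix sums; the elu-plus-one feature map is a simplification and is not the Performer's FAVOR+ random-feature map or the patented module.

        # Minimal sketch of causal linear attention, the kind of kernelized attention
        # a Performer-style module can apply so streaming cost stays linear in time.
        import numpy as np

        def causal_linear_attention(q, k, v):
            """q, k, v: (T, d). Returns (T, d) outputs using running prefix sums,
            so each step reuses state instead of re-attending to all past frames."""
            def phi(x):  # simple positive feature map (elu + 1), assumed for illustration
                return np.where(x > 0, x + 1.0, np.exp(x))
            q, k = phi(q), phi(k)
            d = q.shape[-1]
            kv_state = np.zeros((d, v.shape[-1]))   # running sum of phi(k_s) v_s^T
            k_state = np.zeros(d)                   # running sum of phi(k_s)
            outputs = []
            for t in range(q.shape[0]):
                kv_state += np.outer(k[t], v[t])
                k_state += k[t]
                outputs.append(q[t] @ kv_state / (q[t] @ k_state + 1e-6))
            return np.stack(outputs)

        frames = np.random.randn(5, 8)  # 5 acoustic frames, 8-dim features
        print(causal_linear_attention(frames, frames, frames).shape)  # (5, 8)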

    ENABLING NATURAL CONVERSATIONS WITH SOFT ENDPOINTING FOR AN AUTOMATED ASSISTANT

    Publication Number: US20230053341A1

    Publication Date: 2023-02-23

    Application Number: US17532819

    Application Date: 2021-11-22

    Applicant: GOOGLE LLC

    Abstract: As part of a dialog session between a user and an automated assistant, implementations can process, using a streaming ASR model, a stream of audio data that captures a portion of a spoken utterance to generate ASR output, process, using an NLU model, the ASR output to generate NLU output, and cause, based on the NLU output, a stream of fulfillment data to be generated. Further, implementations can determine, based on processing the stream of audio data, audio-based characteristics associated with the portion of the spoken utterance captured in the stream of audio data. Based on the audio-based characteristics and/or the stream of NLU output, implementations can determine whether the user has paused in providing the spoken utterance or has completed providing the spoken utterance. If the user has paused, implementations can cause natural conversation output to be provided for presentation to the user.
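
    A hedged Python sketch of the pause-versus-completion decision the abstract describes, combining audio-based characteristics with NLU output; the thresholds, field names, and filler response below are assumptions for illustration only.

        # Illustrative sketch of the pause-vs-complete decision; thresholds, field
        # names, and the filler response are assumptions, not values from the patent.
        from dataclasses import dataclass

        @dataclass
        class AudioCharacteristics:
            trailing_silence_ms: int   # silence after the last spoken word
            final_pitch_falling: bool  # falling intonation often marks completion

        @dataclass
        class NLUOutput:
            intent: str
            missing_slots: list        # required parameters not yet provided

        def user_finished(audio: AudioCharacteristics, nlu: NLUOutput) -> bool:
            """Treat the utterance as complete only when both the acoustics and the
            NLU output suggest there is nothing more to come."""
            acoustically_done = audio.trailing_silence_ms > 700 and audio.final_pitch_falling
            semantically_done = not nlu.missing_slots
            return acoustically_done and semantically_done

        audio = AudioCharacteristics(trailing_silence_ms=900, final_pitch_falling=False)
        nlu = NLUOutput(intent="set_timer", missing_slots=["duration"])
        if not user_finished(audio, nlu):
            # Natural conversation output keeps the mic open instead of cutting the user off.
            print("Mm-hmm, for how long?")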

    Emitting word timings with end-to-end models

    Publication Number: US11580956B2

    Publication Date: 2023-02-14

    Application Number: US17204852

    Application Date: 2021-03-17

    Applicant: Google LLC

    Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
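
    The following Python sketch illustrates only the data-preparation step the abstract describes: inserting a placeholder symbol before each word and pinning its beginning and ending word pieces to the ground-truth frame alignments. The toy tokenizer, placeholder token, and frame tolerance are assumptions, not details from the patent.

        # Data-preparation sketch: mark each word with a placeholder symbol and pin its
        # first and last word pieces to the ground-truth start/end frames.
        WORD_START = "<w>"   # placeholder symbol inserted before every word (illustrative)

        def tokenize(word):
            """Toy word-piece tokenizer standing in for a real one (e.g. SentencePiece)."""
            return [word[:2], word[2:]] if len(word) > 2 else [word]

        def constrained_alignments(words, ground_truth, tolerance=2):
            """words: list of str; ground_truth: list of (start_frame, end_frame).
            Returns (symbol sequence, {position: allowed frame window}) so a decoder's
            attention head can be constrained to emit word boundaries on time."""
            symbols, constraints = [], {}
            for word, (start, end) in zip(words, ground_truth):
                symbols.append(WORD_START)                   # placeholder before the word
                pieces = tokenize(word)
                first_pos = len(symbols)                     # index of beginning word piece
                symbols.extend(pieces)
                last_pos = len(symbols) - 1                  # index of ending word piece
                constraints[first_pos] = (start - tolerance, start + tolerance)
                constraints[last_pos] = (end - tolerance, end + tolerance)
            return symbols, constraints

        syms, cons = constrained_alignments(["hello", "world"], [(3, 12), (14, 25)])
        print(syms)   # ['<w>', 'he', 'llo', '<w>', 'wo', 'rld']
        print(cons)   # frame windows for each beginning/ending word piece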

    Disfluency Detection Models for Natural Conversational Voice Systems

    Publication Number: US20250140239A1

    Publication Date: 2025-05-01

    Application Number: US19010299

    Application Date: 2025-01-06

    Applicant: Google LLC

    Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances. At each of a plurality of output steps, the method also includes generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame of the sequence of acoustic frames, generating, by a prediction network of the speech recognition model, a hidden representation for a corresponding sequence of non-blank symbols output by a final softmax layer of the speech recognition model, and generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a probability distribution indicating whether the corresponding output step corresponds to a pause or an end of speech.
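
    A structural Python/NumPy sketch of a joint network that fuses the encoder's higher order feature representation with the prediction network's hidden representation and outputs a pause/end-of-speech distribution; the dimensions, the extra "speech" class, and the random weights are assumptions for illustration, not the patented model.

        # Structural sketch of an RNN-T-style joint network emitting a distribution
        # over {speech, pause, end_of_speech}. Weights here are random placeholders.
        import numpy as np

        rng = np.random.default_rng(0)
        ENC_DIM, PRED_DIM, JOINT_DIM, CLASSES = 16, 8, 12, 3  # speech, pause, end_of_speech

        W_enc = rng.normal(size=(ENC_DIM, JOINT_DIM))
        W_pred = rng.normal(size=(PRED_DIM, JOINT_DIM))
        W_out = rng.normal(size=(JOINT_DIM, CLASSES))

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def pause_eos_joint(encoder_feature, prediction_hidden):
            """Combine the two representations (as in an RNN-T joint network) and
            return P(speech), P(pause), P(end_of_speech) for the current output step."""
            joint = np.tanh(encoder_feature @ W_enc + prediction_hidden @ W_pred)
            return softmax(joint @ W_out)

        probs = pause_eos_joint(rng.normal(size=ENC_DIM), rng.normal(size=PRED_DIM))
        print(dict(zip(["speech", "pause", "end_of_speech"], probs.round(3))))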

    Adapter Finetuning with Teacher Pseudo-Labeling for Tail Languages in Streaming Multilingual ASR

    Publication Number: US20250078830A1

    Publication Date: 2025-03-06

    Application Number: US18826743

    Application Date: 2024-09-06

    Applicant: Google LLC

    Abstract: A method includes receiving a sequence of acoustic frames characterizing a spoken utterance in a particular native language. The method also includes generating a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames by a causal encoder that includes an initial stack of multi-head attention layers. The method also includes generating a second higher order feature representation for a corresponding first higher order feature representation by a non-causal encoder that includes a final stack of multi-head attention layers. The method also includes receiving, as input at each corresponding language-dependent adapter (LDA) module, a language ID vector identifying the particular native language to activate corresponding language-dependent weights specific to the particular native language. The method also includes generating a first probability distribution over possible speech recognition hypotheses by a decoder.
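
    A minimal Python/NumPy sketch of a language-dependent adapter (LDA) module in which a one-hot language ID vector activates per-language weights applied to an encoder feature; the sizes, bottleneck form, and residual connection are assumptions rather than the patented design.

        # Sketch of a language-dependent adapter: a one-hot language ID selects
        # per-language bottleneck weights applied to an encoder layer output.
        import numpy as np

        rng = np.random.default_rng(1)
        MODEL_DIM, BOTTLENECK, NUM_LANGUAGES = 16, 4, 3

        # One small adapter (down-projection + up-projection) per supported language.
        down = rng.normal(size=(NUM_LANGUAGES, MODEL_DIM, BOTTLENECK)) * 0.1
        up = rng.normal(size=(NUM_LANGUAGES, BOTTLENECK, MODEL_DIM)) * 0.1

        def language_dependent_adapter(hidden, language_id_vector):
            """hidden: (MODEL_DIM,) encoder feature; language_id_vector: one-hot (NUM_LANGUAGES,).
            Activates only the weights of the identified language; a residual connection
            keeps the shared encoder representation intact."""
            lang = int(np.argmax(language_id_vector))
            adapted = np.maximum(hidden @ down[lang], 0.0) @ up[lang]  # ReLU bottleneck
            return hidden + adapted

        language_id = np.eye(NUM_LANGUAGES)[2]          # e.g. the third (tail) language
        frame_feature = rng.normal(size=MODEL_DIM)
        print(language_dependent_adapter(frame_feature, language_id).shape)  # (16,)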
