Cascaded encoders for simplified streaming and non-streaming ASR

    Publication number: US12154581B2

    Publication date: 2024-11-26

    Application number: US17237021

    Filing date: 2021-04-21

    Applicant: Google LLC

    Abstract: An automated speech recognition (ASR) model includes a first encoder, a second encoder, and a decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The second encoder receives, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame. The decoder receives, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps, and generates, at each of the plurality of time steps, a first probability distribution over possible speech recognition hypotheses.
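
The data flow in this abstract (acoustic frames, then a first encoder, then a second encoder, then a decoder that emits one probability distribution per output step) can be illustrated with a minimal sketch. Everything below is hypothetical toy arithmetic, not the patented model: the function names, the running-mean "causal" first encoder, the full-sequence "non-causal" second encoder, and the four-symbol vocabulary are all assumptions chosen only to make the cascade visible. The causal/non-causal split is suggested by the title and is not stated in the abstract.

```python
import math
from typing import List

Frame = List[float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_encoder(frames: List[Frame]) -> List[Frame]:
    """Toy stand-in (assumption): running mean, so step t sees only frames 0..t."""
    if not frames:
        return []
    running = [0.0] * len(frames[0])
    outputs = []
    for t, frame in enumerate(frames, start=1):
        running = [r + x for r, x in zip(running, frame)]
        outputs.append([r / t for r in running])
    return outputs


def second_encoder(first_features: List[Frame]) -> List[Frame]:
    """Toy stand-in (assumption): mixes each step with the mean of the whole sequence."""
    if not first_features:
        return []
    n = len(first_features)
    dim = len(first_features[0])
    mean = [sum(f[d] for f in first_features) / n for d in range(dim)]
    return [[0.5 * (x + m) for x, m in zip(f, mean)] for f in first_features]


def decoder(features: List[Frame], vocab_size: int = 4) -> List[List[float]]:
    """Toy decoder (assumption): one distribution over vocab_size symbols per step."""
    dists = []
    for f in features:
        logits = [sum(f) * (k + 1) / vocab_size for k in range(vocab_size)]
        dists.append(softmax(logits))
    return dists


def recognize(frames: List[Frame]) -> List[List[float]]:
    """Cascade: frames -> first encoder -> second encoder -> decoder."""
    first = first_encoder(frames)
    second = second_encoder(first)
    return decoder(second)
```

Calling `recognize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])` returns three distributions, one per input frame, each summing to 1.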

    Universal Monolingual Output Layer for Multilingual Speech Recognition

    Publication number: US20240135923A1

    Publication date: 2024-04-25

    Application number: US18485271

    Filing date: 2023-10-11

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/005 G10L15/02

    Abstract: A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR model, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR model, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.

    Emitting Word Timings with End-to-End Models

    Publication number: US20210350794A1

    Publication date: 2021-11-11

    Application number: US17204852

    Filing date: 2021-03-17

    Applicant: Google LLC

    Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
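
The per-word preprocessing described in this abstract (insert a placeholder before each word, pick the word's beginning and ending word pieces, and tie them to the ground-truth start and end) might be sketched as follows. This is only an illustration under stated assumptions: the placeholder token `<wb>`, the function names, and the use of integer frame indices for alignments are hypothetical and do not come from the patent, and the step that constrains the second-pass decoder's attention head is not modeled.

```python
from typing import List, Tuple

PLACEHOLDER = "<wb>"  # hypothetical placeholder symbol; the abstract does not name one


def insert_placeholders(transcript: str) -> List[str]:
    """Return the transcript's words with a placeholder inserted before each word."""
    tokens: List[str] = []
    for word in transcript.split():
        tokens.append(PLACEHOLDER)
        tokens.append(word)
    return tokens


def boundary_word_pieces(pieces: List[str]) -> Tuple[str, str]:
    """Return the beginning and ending word piece of a single word."""
    if not pieces:
        raise ValueError("a word needs at least one word piece")
    return pieces[0], pieces[-1]


def constrained_alignments(
    pieces: List[str], start_frame: int, end_frame: int
) -> Tuple[Tuple[str, int], Tuple[str, int]]:
    """Pair the first piece with the word's start frame and the last piece with its end frame.

    Frame indices are an assumed representation of the ground-truth alignment.
    """
    first, last = boundary_word_pieces(pieces)
    return (first, start_frame), (last, end_frame)
```

For example, `insert_placeholders("play some music")` yields `['<wb>', 'play', '<wb>', 'some', '<wb>', 'music']`, and `constrained_alignments(["_mu", "sic"], 40, 75)` pairs `_mu` with frame 40 and `sic` with frame 75.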

    Ephemeral Learning of Machine Learning Model(s)

    Publication number: US20250016387A1

    Publication date: 2025-01-09

    Application number: US18890050

    Filing date: 2024-09-19

    Applicant: Google LLC

    Abstract: Implementations disclosed herein are directed to ephemeral learning of machine learning (“ML”) model(s) based on gradient(s) generated at a remote system (e.g., remote server(s)). Processor(s) of the remote system can receive stream(s) of audio data capturing spoken utterance(s) from a client device of a user. A fulfillment pipeline can process the stream(s) of audio data to cause certain fulfillment(s) of the spoken utterance(s) to be performed. Meanwhile, a training pipeline can process the stream(s) of audio data to generate gradient(s) using unsupervised learning techniques. Subsequent to the processing by the fulfillment pipeline and/or the training pipeline, the stream(s) of audio data are discarded by the remote system. Accordingly, the ML model(s) can be trained at the remote system without storing or logging of the stream(s) of audio data by non-transient memory thereof, thereby providing more efficient training mechanisms for training the ML model(s) and also increasing security of user data.
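
The pattern in this abstract (run a fulfillment pipeline and a training pipeline over the same audio stream, keep only the fulfillment result and the gradient, then discard the audio) can be sketched in a few lines. This is a toy illustration, not the disclosed system: `process_ephemerally`, `toy_fulfill`, and `toy_gradient` are hypothetical names, and the "unsupervised" objective is an assumed scalar reconstruction loss 0.5 * (w * x - x)^2 chosen only so a gradient can be computed by hand.

```python
from typing import Callable, Iterable, List, Tuple

Chunk = List[float]


def toy_fulfill(chunks: List[Chunk]) -> str:
    """Hypothetical fulfillment pipeline: report how much audio was handled."""
    return f"handled {len(chunks)} chunk(s)"


def toy_gradient(chunks: List[Chunk], weight: float = 0.5) -> float:
    """Hypothetical training pipeline.

    Mean gradient, with respect to `weight`, of the assumed reconstruction
    loss 0.5 * (weight * x - x) ** 2 over every sample x in the stream.
    """
    samples = [x for chunk in chunks for x in chunk]
    if not samples:
        return 0.0
    return sum((weight * x - x) * x for x in samples) / len(samples)


def process_ephemerally(
    audio_stream: Iterable[Chunk],
    fulfill: Callable[[List[Chunk]], str] = toy_fulfill,
    compute_gradient: Callable[[List[Chunk]], float] = toy_gradient,
) -> Tuple[str, float]:
    """Run both pipelines on the stream, then drop the audio.

    Only the fulfillment result and the gradient are returned; the buffered
    audio is cleared and never written to disk or a log.
    """
    buffered = list(audio_stream)  # held in this function's memory only
    try:
        result = fulfill(buffered)
        gradient = compute_gradient(buffered)
    finally:
        buffered.clear()  # discard the audio after processing
    return result, gradient
```

With the stream `[[1.0, 2.0], [3.0]]` and the default weight of 0.5, the per-sample gradients are -0.5, -2.0, and -4.5, so the returned gradient is their mean, about -2.333.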
