Efficient Streaming Non-Recurrent On-Device End-to-End Model

    Publication Number: US20220310062A1

    Publication Date: 2022-09-29

    Application Number: US17316198

    Application Date: 2021-05-10

    Applicant: Google LLC

    Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypotheses. The ASR model also includes a language model configured to receive the first probability distribution over possible speech recognition hypotheses and generate a rescored probability distribution.
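
    The sketch below illustrates the cascaded data flow the abstract describes: a first (streaming) encoder feeds a second encoder, a decoder turns the second encoder's output into a probability distribution over hypotheses at each output step, and an external language model rescores that distribution. It is a minimal sketch, not the patented implementation; the layer sizes, the plain linear projections, and the interpolation-style rescoring are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
FEAT, H1, H2, VOCAB = 80, 64, 64, 128            # assumed dimensions

W1 = rng.normal(size=(FEAT, H1)) * 0.01          # stands in for the first (streaming) encoder
W2 = rng.normal(size=(H1, H2)) * 0.01            # stands in for the second encoder
Wd = rng.normal(size=(H2, VOCAB)) * 0.01         # stands in for the decoder

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def lm_rescore(probs, lm_log_probs, weight=0.3):
    # Toy rescoring: interpolate decoder scores with external language-model scores.
    return softmax(np.log(probs + 1e-9) + weight * lm_log_probs)

acoustic_frames = rng.normal(size=(10, FEAT))    # one acoustic frame per output step
lm_log_probs = np.zeros(VOCAB)                   # placeholder language-model scores

for frame in acoustic_frames:                    # streaming: one output step at a time
    h1 = np.tanh(frame @ W1)                     # first higher order feature representation
    h2 = np.tanh(h1 @ W2)                        # second higher order feature representation
    p_first = softmax(h2 @ Wd)                   # first probability distribution over hypotheses
    p_rescored = lm_rescore(p_first, lm_log_probs)  # rescored probability distribution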

    Context-aware Neural Confidence Estimation for Rare Word Speech Recognition

    Publication Number: US20240029720A1

    Publication Date: 2024-01-25

    Application Number: US18340175

    Application Date: 2023-06-23

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/02 G10L15/22 G10L15/063 G10L15/19

    Abstract: An automatic speech recognition (ASR) system includes an ASR model, a neural associative memory (NAM) biasing model, and a confidence estimation model (CEM). The ASR model includes an audio encoder configured to encode a sequence of audio frames characterizing a spoken utterance into a sequence of higher-order feature representations, and a decoder configured to receive the sequence of higher-order feature representations and output a final speech recognition result. The NAM biasing model is configured to receive biasing contextual information and modify the sequence of higher-order feature representations based on the biasing contextual information to generate, as output, biasing context vectors. The CEM is configured to compute a confidence of the final speech recognition result output by the decoder. The CEM is connected to the biasing context vectors generated by the NAM biasing model.
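
    A minimal sketch of the data flow described above: an attention step over contextual phrase embeddings stands in for the NAM biasing model, producing biasing context vectors that modify the encoder features, and a small logistic layer stands in for the CEM, consuming both the modified features and the biasing context vectors. The dimensions, the additive feature modification, and the pooling into the CEM are all assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
T, D, N_PHRASES = 6, 32, 4                          # assumed: frames, feature dim, biasing phrases

features = rng.normal(size=(T, D))                  # higher-order feature representations
phrase_keys = rng.normal(size=(N_PHRASES, D))       # biasing contextual information (embeddings)
phrase_values = rng.normal(size=(N_PHRASES, D))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention of each frame over the biasing phrases yields biasing context vectors.
attn = softmax(features @ phrase_keys.T / np.sqrt(D))   # (T, N_PHRASES)
context_vectors = attn @ phrase_values                  # (T, D) biasing context vectors
biased_features = features + context_vectors            # modified feature representations

# Toy CEM: a logistic layer over pooled [biased features ; biasing context vectors].
w_cem = rng.normal(size=2 * D) * 0.01
cem_input = np.concatenate([biased_features.mean(axis=0), context_vectors.mean(axis=0)])
confidence = 1.0 / (1.0 + np.exp(-(cem_input @ w_cem)))  # confidence of the final result
print(f"estimated confidence: {confidence:.3f}")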

    Flickering Reduction with Partial Hypothesis Re-ranking for Streaming ASR

    Publication Number: US20240029718A1

    Publication Date: 2024-01-25

    Application Number: US18352211

    Application Date: 2023-07-13

    Applicant: Google LLC

    CPC classification number: G10L15/10 G10L15/26

    Abstract: A method includes processing, using a speech recognizer, a first portion of audio data to generate a first lattice, and generating a first partial transcription for an utterance based on the first lattice. The method includes processing, using the recognizer, a second portion of the audio data to generate, based on the first lattice, a second lattice representing a plurality of partial speech recognition hypotheses for the utterance and a plurality of corresponding speech recognition scores. For each partial speech recognition hypothesis, the method includes generating a corresponding re-ranked score based on the corresponding speech recognition score and on whether that hypothesis shares a prefix with the first partial transcription. The method includes generating a second partial transcription for the utterance by selecting the partial speech recognition hypothesis of the plurality of partial speech recognition hypotheses having the highest corresponding re-ranked score.
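
    A minimal sketch of the prefix-aware re-ranking described above: hypotheses from the second lattice that share a prefix with the previously displayed partial transcription are boosted, so the selected second partial transcription is less likely to flicker away from what the user has already seen. The example hypotheses, scores, and the additive prefix bonus are assumptions for illustration.

def rerank(hypotheses, prev_partial, prefix_bonus=2.0):
    """hypotheses: list of (text, speech_recognition_score); higher score is better."""
    reranked = []
    for text, score in hypotheses:
        shares_prefix = text.startswith(prev_partial) or prev_partial.startswith(text)
        reranked.append((text, score + (prefix_bonus if shares_prefix else 0.0)))
    # The second partial transcription is the hypothesis with the highest re-ranked score.
    return max(reranked, key=lambda pair: pair[1])[0]

first_partial = "play the next"
second_lattice_hypotheses = [
    ("play the next song", 4.8),
    ("lay the next song", 5.0),   # slightly better raw score, but contradicts the shown prefix
]
print(rerank(second_lattice_hypotheses, first_partial))   # -> "play the next song"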

    Joint Endpointing And Automatic Speech Recognition

    Publication Number: US20200335091A1

    Publication Date: 2020-10-22

    Application Number: US16809403

    Application Date: 2020-03-04

    Applicant: Google LLC

    Abstract: A method includes receiving audio data of an utterance and processing the audio data to obtain, as output from a speech recognition model configured to jointly perform speech decoding and endpointing of utterances: partial speech recognition results for the utterance; and an endpoint indication indicating when the utterance has ended. While processing the audio data, the method also includes detecting, based on the endpoint indication, the end of the utterance. In response to detecting the end of the utterance, the method also includes terminating the processing of any subsequent audio data received after the end of the utterance was detected.
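
    A minimal sketch of the control flow described above: a single model emits both partial recognition results and an endpoint indication for each audio chunk, and the caller stops processing audio as soon as the endpoint is detected. The fake recognizer and the chunk format are assumptions standing in for the joint decoding-and-endpointing model.

def fake_joint_recognizer(chunk):
    # Stand-in for a model that jointly decodes and endpoints each audio chunk.
    return {"partial": chunk["text_so_far"], "endpoint": chunk["is_last"]}

audio_chunks = [
    {"text_so_far": "turn", "is_last": False},
    {"text_so_far": "turn on the", "is_last": False},
    {"text_so_far": "turn on the lights", "is_last": True},
    {"text_so_far": "(noise)", "is_last": False},   # arrives after the endpoint
]

for chunk in audio_chunks:
    result = fake_joint_recognizer(chunk)
    print("partial:", result["partial"])            # partial speech recognition results
    if result["endpoint"]:                          # endpoint indication from the model
        print("end of utterance detected; subsequent audio will not be processed")
        break                                       # terminate processing of later audio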
