Multi-dialect and multilingual speech recognition

    Publication Number: US12254865B2

    Publication Date: 2025-03-18

    Application Number: US18418246

    Filing Date: 2024-01-20

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer-readable medium, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
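
    For illustration, the cluster adaptive training idea can be sketched as a set of basis ("cluster") weight matrices combined by per-language or per-dialect interpolation weights before scoring linguistic units. The following is a minimal numpy sketch under that reading; the layer shapes, the softmax output, and the helper name adapt_layer are illustrative assumptions, not the patented model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adapt_layer(cluster_bases, interp_weights):
    """Combine per-cluster basis matrices into one adapted weight matrix.

    cluster_bases: (num_clusters, in_dim, out_dim) bases learned with
        cluster adaptive training (shapes are illustrative).
    interp_weights: (num_clusters,) interpolation vector for the current
        language or dialect.
    """
    return np.tensordot(interp_weights, cluster_bases, axes=1)  # (in_dim, out_dim)

# Toy sizes: 40-dim input features, 30 linguistic-unit scores, 4 dialect clusters.
rng = np.random.default_rng(0)
bases = rng.normal(size=(4, 40, 30))
dialect_weights = softmax(rng.normal(size=4))   # interpolation weights for one dialect
features = rng.normal(size=(1, 40))             # input features from the audio data

adapted_w = adapt_layer(bases, dialect_weights)
scores = softmax(features @ adapted_w)          # likelihood scores over linguistic units
print(scores.shape)                             # (1, 30)
```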

    Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection

    Publication Number: US20240029719A1

    Publication Date: 2024-01-25

    Application Number: US18340093

    Filing Date: 2023-06-23

    Applicant: Google LLC

    CPC classification number: G10L15/16; G10L15/063; G10L25/93

    Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to operate between a voice activity detection (VAD) mode and an end-of-query (EOQ) detection mode. During the VAD mode, the endpointer model receives input audio frames and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder and determines, for each latent representation, whether the latent representation includes final silence.
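
    The switch connection can be read as routing one of two inputs into a shared endpointer network: raw audio frames in VAD mode, encoder latents in EOQ detection mode. The sketch below captures only that routing; the single sigmoid layer standing in for the endpointer and the function names are assumptions for illustration.

```python
import numpy as np

def endpointer(features):
    """Toy stand-in for the shared endpointer network: a per-frame sigmoid score."""
    w = np.ones(features.shape[-1]) / features.shape[-1]
    return 1.0 / (1.0 + np.exp(-(features @ w)))

def run_endpointing(audio_frames, encoder_latents, mode):
    """Switch connection: choose which representation feeds the endpointer."""
    if mode == "vad":
        # VAD mode: score raw input audio frames for speech presence.
        return endpointer(audio_frames)
    if mode == "eoq":
        # EOQ detection mode: score encoder latents for final silence.
        return endpointer(encoder_latents)
    raise ValueError(f"unknown mode: {mode}")

# Toy inputs: 100 frames of 80-dim log-mel features and 100 512-dim encoder latents.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 80))
latents = rng.normal(size=(100, 512))
print(run_endpointing(frames, latents, "vad").shape)   # (100,)
print(run_endpointing(frames, latents, "eoq").shape)   # (100,)
```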

    Backplane adaptable to drive emissive pixel arrays of differing pitches

    Publication Number: US11568802B2

    Publication Date: 2023-01-31

    Application Number: US17584668

    Filing Date: 2022-01-26

    Applicant: Google LLC

    Inventor: Bo Li; Kaushik Sheth

    Abstract: A backplane suitable to pulse-width modulate an array of emissive pixels with a current that is substantially constant over a wide range of temperatures. A current control circuit provides a means of supplying a constant current to an array of current mirror pixel drive elements. The current control circuit comprises a thermally stable bias resistor and a thermally stable band-gap voltage source to provide thermally stable controls, and a large L p-channel reference current FET with an associated large L n-channel bias FET configured to provide a reference current at a required voltage to the gate of a large L p-channel current source FET. The current control circuit and the current mirror pixel drive elements are similar circuits, with one current control circuit able to control a substantial number of pixel drive elements.
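
    As background for how current mirror pixel drive elements hold a pixel current, the standard ideal current-mirror relation for matched FETs in saturation is I_out = I_ref x (W/L)_out / (W/L)_ref. The small sketch below only evaluates that textbook relation; it is not taken from the patent, whose contribution is keeping I_ref thermally stable via the bias resistor and band-gap reference.

```python
def mirrored_current(i_ref_amps, wl_ref, wl_out):
    """Ideal current-mirror relation for matched FETs in saturation:
    I_out = I_ref * (W/L)_out / (W/L)_ref.  Illustrative textbook formula only.
    """
    return i_ref_amps * (wl_out / wl_ref)

# Example: a thermally stable 2 uA reference mirrored into a pixel drive FET
# sized at half the reference FET's W/L.
i_pixel = mirrored_current(2e-6, wl_ref=4.0, wl_out=2.0)
print(f"{i_pixel * 1e6:.1f} uA")  # 1.0 uA per pixel drive element
```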

    Learning Word-Level Confidence for Subword End-To-End Automatic Speech Recognition

    Publication Number: US20220270597A1

    Publication Date: 2022-08-25

    Application Number: US17182592

    Filing Date: 2021-02-23

    Applicant: Google LLC

    Abstract: A method includes receiving a speech recognition result, and using a confidence estimation module (CEM), for each sub-word unit in a sequence of hypothesized sub-word units for the speech recognition result: obtaining a respective confidence embedding that represents a set of confidence features; generating, using a first attention mechanism, a confidence feature vector; generating, using a second attention mechanism, an acoustic context vector; and generating, as output from an output layer of the CEM, a respective confidence output score for each corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by the output layer of the CEM. For each of the one or more words formed by the sequence of hypothesized sub-word units, the method also includes determining a respective word-level confidence score for the word. The method also includes determining an utterance-level confidence score by aggregating the word-level confidence scores.
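
    The word- and utterance-level aggregation can be illustrated with a short sketch. Here a word's score is taken from its final sub-word (a sub-word prefixed with '_' opens a new word) and the utterance score is the mean of the word scores; both conventions and the hard-coded CEM scores are assumptions for illustration rather than the patent's exact rules.

```python
from typing import List, Tuple

def word_level_confidences(subwords: List[Tuple[str, float]]) -> List[float]:
    """Collapse per-sub-word confidence scores into word-level scores.

    `subwords` pairs each hypothesized sub-word with its CEM output score.
    A word's score is taken from its final sub-word; sub-words starting with
    '_' begin a new word (an illustrative convention).
    """
    word_scores: List[float] = []
    for piece, score in subwords:
        if piece.startswith("_") or not word_scores:
            word_scores.append(score)      # first sub-word of a new word
        else:
            word_scores[-1] = score        # keep the final sub-word's score
    return word_scores

def utterance_confidence(word_scores: List[float]) -> float:
    """Aggregate word-level scores; the mean is an assumed choice."""
    return sum(word_scores) / len(word_scores)

# "good morning" hypothesized as sub-words with CEM scores.
hyp = [("_go", 0.91), ("od", 0.88), ("_morning", 0.97)]
words = word_level_confidences(hyp)
print(words, utterance_confidence(words))  # [0.88, 0.97] 0.925
```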

    ADAPTIVE AUDIO ENHANCEMENT FOR MULTICHANNEL SPEECH RECOGNITION

    Publication Number: US20220148582A1

    Publication Date: 2022-05-12

    Application Number: US17649058

    Filing Date: 2022-01-26

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data, and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the combined channel of audio data to a neural network. The actions further include providing a transcription for the utterance.
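
    The enhancement step can be pictured as filter-and-sum beamforming: each channel is convolved with its own filter and the results are summed into the single combined channel fed to the recognizer. In the method above the filter parameters come from a neural network conditioned on both channels; the fixed random filters and array sizes below are stand-ins for that, so the sketch shows only the combine step.

```python
import numpy as np

def filter_and_sum(ch1, ch2, h1, h2):
    """Apply a per-channel FIR filter to each channel and sum into one channel.

    In the approach above, h1 and h2 would be produced by a neural network
    from both input channels; fixed arrays stand in for them here.
    """
    y1 = np.convolve(ch1, h1, mode="same")
    y2 = np.convolve(ch2, h2, mode="same")
    return y1 + y2  # single combined channel passed on to the recognizer

# Toy two-channel capture of the same utterance (random stand-ins).
rng = np.random.default_rng(0)
ch1 = rng.normal(size=16000)          # 1 s at 16 kHz, channel 1
ch2 = rng.normal(size=16000)          # channel 2
h1 = rng.normal(size=25) * 0.1        # illustrative 25-tap filters
h2 = rng.normal(size=25) * 0.1

combined = filter_and_sum(ch1, ch2, h1, h2)
print(combined.shape)  # (16000,) -- one channel for the acoustic model
```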
