-
Publication No.: US20250095634A1
Publication Date: 2025-03-20
Application No.: US18965193
Filing Date: 2024-12-02
Applicant: Google LLC
Inventor: Bo Li , Tara N. Sainath , Ruoming Pang , Shuo-yiin Chang , Qiumin Xu , Trevor Strohman , Vince Chen , Qiao Liang , Heguang Liu , Yanzhang He , Parisa Haghani , Sameer Bidichandani
Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.
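For illustration, a minimal Python sketch of the frame classification and EOU prediction described above, assuming a simple per-frame voice-activity decision as input; the FrameLabel names, the <eou> token string, and the helper functions are hypothetical stand-ins, not the patented model:

```python
from enum import Enum

class FrameLabel(Enum):
    SPEECH = 0
    INITIAL_SILENCE = 1
    INTERMEDIATE_SILENCE = 2
    FINAL_SILENCE = 3

EOU = "<eou>"  # hypothetical end-of-utterance token

def classify_frames(is_speech: list[bool]) -> list[FrameLabel]:
    """Label each frame: silence before any speech is 'initial', between
    speech regions 'intermediate', and after the last speech frame 'final'."""
    if not any(is_speech):
        return [FrameLabel.INITIAL_SILENCE] * len(is_speech)
    first = is_speech.index(True)
    last = len(is_speech) - 1 - is_speech[::-1].index(True)
    labels = []
    for i, speech in enumerate(is_speech):
        if speech:
            labels.append(FrameLabel.SPEECH)
        elif i < first:
            labels.append(FrameLabel.INITIAL_SILENCE)
        elif i > last:
            labels.append(FrameLabel.FINAL_SILENCE)
        else:
            labels.append(FrameLabel.INTERMEDIATE_SILENCE)
    return labels

def append_eou(hypothesis: list[str], labels: list[FrameLabel]) -> list[str]:
    """Append the EOU token once final silence is observed."""
    if labels and labels[-1] is FrameLabel.FINAL_SILENCE:
        return hypothesis + [EOU]
    return hypothesis

frames = [False, False, True, True, False, True, False, False]
print(append_eou(["hello", "world"], classify_frames(frames)))  # ends with '<eou>'
```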
-
Publication No.: US12254865B2
Publication Date: 2025-03-18
Application No.: US18418246
Filing Date: 2024-01-20
Applicant: Google LLC
Inventor: Zhifeng Chen , Bo Li , Eugene Weinstein , Yonghui Wu , Pedro J. Moreno Mengibar , Ron J. Weiss , Khe Chai Sim , Tara N. Sainath , Patrick An Phu Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihoods of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
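A minimal sketch of the cluster adaptive training idea the abstract mentions, assuming the common formulation in which a layer's weights are an interpolation of per-cluster bases under a language- or dialect-specific interpolation vector; all shapes, names, and values here are hypothetical:

```python
import numpy as np

def cluster_adapted_weights(bases: np.ndarray, interp: np.ndarray) -> np.ndarray:
    """Combine per-cluster weight bases into one weight matrix.

    bases:  (num_clusters, out_dim, in_dim) cluster parameter bases
    interp: (num_clusters,) language/dialect-specific interpolation vector
    """
    return np.tensordot(interp, bases, axes=1)  # -> (out_dim, in_dim)

rng = np.random.default_rng(0)
num_clusters, out_dim, in_dim = 4, 8, 16
bases = rng.normal(size=(num_clusters, out_dim, in_dim))
# One learned interpolation vector per dialect (hypothetical values).
dialect_interp = {"en-US": np.array([0.7, 0.1, 0.1, 0.1]),
                  "en-IN": np.array([0.2, 0.5, 0.2, 0.1])}
features = rng.normal(size=in_dim)
for dialect, vec in dialect_interp.items():
    layer = cluster_adapted_weights(bases, vec)
    print(dialect, (layer @ features)[:3])  # dialect-adapted activations
```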
-
Publication No.: US12073824B2
Publication Date: 2024-08-27
Application No.: US17616135
Filing Date: 2020-12-03
Applicant: Google LLC
Inventor: Tara N. Sainath , Yanzhang He , Bo Li , Arun Narayanan , Ruoming Pang , Antoine Jean Bruguier , Shuo-Yiin Chang , Wei Li
CPC classification number: G10L15/16 , G06N3/08 , G10L15/05 , G10L15/063 , G10L15/22 , G10L2015/0635
Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transducer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a Listen, Attend and Spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.
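A toy sketch of the two-pass control flow, assuming the shared-encoder arrangement the abstract describes; the stand-in "decoders" below are placeholders (a character emitter and a keep-the-last-candidate rule), not the RNN-T or LAS models:

```python
import numpy as np

def shared_encode(frames: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Shared encoder (a toy linear layer) feeding both passes."""
    return np.maximum(frames @ weights.T, 0.0)

def rnnt_first_pass(encodings: np.ndarray) -> list[str]:
    """Stand-in for the streaming first pass: emits one growing
    candidate prefix per frame as audio arrives."""
    hyp, candidates = "", []
    for enc in encodings:
        hyp += chr(ord("a") + int(abs(enc.sum())) % 26)
        candidates.append(hyp)
    return candidates

def las_second_pass(encodings: np.ndarray, candidates: list[str]) -> str:
    """Stand-in for the second pass: revises the streaming candidates
    using the full utterance encoding (here it just keeps the last)."""
    return candidates[-1]

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 16))        # 5 acoustic frames
weights = rng.normal(size=(32, 16))      # shared encoder weights
enc = shared_encode(frames, weights)
streaming = rnnt_first_pass(enc)         # shown to the user as it streams
final = las_second_pass(enc, streaming)  # final text after the second pass
print(streaming, final)
```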
-
Publication No.: US12027154B2
Publication Date: 2024-07-02
Application No.: US18167050
Filing Date: 2023-02-09
Applicant: Google LLC
Inventor: Tara N. Sainath , Basilio Garcia Castillo , David Rybach , Trevor Strohman , Ruoming Pang
CPC classification number: G10L15/063 , G10L25/30 , G10L25/78
Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting, before the respective word, a placeholder symbol that identifies a respective ground truth alignment for the beginning and end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the end of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
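As a rough illustration of constraining an attention head with ground-truth alignments, a sketch that builds an additive mask allowing each word-piece query to attend only near its aligned frame; the tolerance window and all names are assumptions, not the patented constraint:

```python
import numpy as np

def constrained_attention_mask(num_frames: int,
                               aligned_frames: list[int],
                               tolerance: int = 2) -> np.ndarray:
    """Additive mask forcing each word-piece query to attend only within
    `tolerance` frames of its ground-truth-aligned frame."""
    mask = np.full((len(aligned_frames), num_frames), -np.inf)
    for row, frame in enumerate(aligned_frames):
        lo = max(0, frame - tolerance)
        hi = min(num_frames, frame + tolerance + 1)
        mask[row, lo:hi] = 0.0
    return mask

def constrained_attention(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply the mask before the softmax of the constrained head."""
    z = logits + mask
    z -= z.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

logits = np.zeros((2, 10))                     # 2 word-piece queries, 10 frames
mask = constrained_attention_mask(10, [3, 7])  # begin/end pieces at frames 3, 7
print(constrained_attention(logits, mask))     # nonzero only inside each window
```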
-
Publication No.: US12014725B2
Publication Date: 2024-06-18
Application No.: US17643861
Filing Date: 2021-12-13
Applicant: Google LLC
Inventor: Ronny Huang , Tara N. Sainath
IPC: G10L15/16 , G06N3/02 , G10L15/06 , G10L15/197 , G10L15/22
CPC classification number: G10L15/063 , G06N3/02 , G10L15/16 , G10L15/197 , G10L15/22
Abstract: A method of training a language model for rare-word speech recognition includes obtaining a set of training text samples and obtaining a set of training utterances used for training a speech recognition model. Each training utterance in the set of training utterances includes audio data corresponding to an utterance and a corresponding transcription of the utterance. The method also includes applying rare-word filtering on the set of training text samples to identify a subset of rare-word training text samples that include words that do not appear in the transcriptions from the set of training utterances or that appear in those transcriptions fewer than a threshold number of times. The method further includes training the language model on the transcriptions from the set of training utterances and the identified subset of rare-word training text samples.
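The rare-word filtering step lends itself to a short sketch. Assuming whitespace tokenization and a hypothetical count threshold, it keeps a text sample if any of its words is absent from, or infrequent in, the ASR training transcriptions:

```python
from collections import Counter

def filter_rare_word_samples(text_samples: list[str],
                             transcriptions: list[str],
                             threshold: int = 5) -> list[str]:
    """Keep only text samples containing at least one word that is absent
    from the training transcriptions or appears fewer than `threshold`
    times in them (Counter returns 0 for unseen words)."""
    counts = Counter(w for t in transcriptions for w in t.lower().split())

    def has_rare_word(sample: str) -> bool:
        return any(counts[w] < threshold for w in sample.lower().split())

    return [s for s in text_samples if has_rare_word(s)]

transcripts = ["turn on the lights", "turn off the lights"] * 5
texts = ["turn on the lights", "call doctor strohman"]
print(filter_rare_word_samples(texts, transcripts, threshold=3))
# -> ['call doctor strohman']
```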
-
Publication No.: US20240161732A1
Publication Date: 2024-05-16
Application No.: US18418246
Filing Date: 2024-01-20
Applicant: Google LLC
Inventor: Zhifeng Chen , Bo Li , Eugene Weinstein , Yonghui Wu , Pedro J. Moreno Mengibar , Ron J. Weiss , Khe Chai Sim , Tara N. Sainath , Patrick An Phu Nguyen
CPC classification number: G10L15/005 , G10L15/07 , G10L15/16 , G10L2015/0631
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihoods of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
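To complement the training-time sketch under the granted patent above (US12254865B2), a minimal inference-side sketch: assuming the shared model emits a score per linguistic unit for every language or dialect, transcription reduces to slicing out the target dialect's scores and decoding greedily; shapes and names are hypothetical:

```python
import numpy as np

def transcribe(scores: np.ndarray, units: list[str], dialect: int) -> str:
    """Greedy transcription from the dialect's slice of the shared
    model's outputs.  scores: (frames, dialects, units)."""
    best = scores[:, dialect, :].argmax(axis=-1)
    return "".join(units[i] for i in best)

units = list("abc") + [" "]
scores = np.random.default_rng(0).normal(size=(6, 2, len(units)))
print(transcribe(scores, units, dialect=0))
print(transcribe(scores, units, dialect=1))  # same model, different dialect slice
```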
-
Publication No.: US11942076B2
Publication Date: 2024-03-26
Application No.: US17651315
Filing Date: 2022-02-16
Applicant: Google LLC
Inventor: Ke Hu , Golan Pundak , Rohit Prakash Prabhavalkar , Antoine Jean Bruguier , Tara N. Sainath
IPC: G10L15/30 , G10L15/02 , G10L15/06 , G10L15/187 , G10L15/193 , G10L15/28 , G10L15/32 , G10L25/30
CPC classification number: G10L15/063 , G10L15/02 , G10L15/187 , G10L15/193 , G10L15/285 , G10L15/32 , G10L25/30 , G10L2015/025
Abstract: A method includes receiving audio data encoding an utterance spoken by a native speaker of a first language, and receiving a biasing term list including one or more terms in a second language different than the first language. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data to generate speech recognition scores for both wordpieces and corresponding phoneme sequences in the first language. The method also includes rescoring the speech recognition scores for the phoneme sequences based on the one or more terms in the biasing term list, and executing, using the speech recognition scores for the wordpieces and the rescored speech recognition scores for the phoneme sequences, a decoding graph to generate a transcription for the utterance.
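A minimal sketch of the rescoring step, assuming each biasing term comes with a phoneme-sequence pronunciation and that boosting is a simple additive log-score bonus; the lexicon, scores, and bonus value are all hypothetical:

```python
def rescore_with_biasing(phoneme_scores: dict[tuple, float],
                         biasing_lexicon: dict[str, tuple],
                         bonus: float = 2.0) -> dict[tuple, float]:
    """Add a log-score bonus to phoneme sequences that match the
    pronunciation of a term in the biasing list."""
    biased = set(biasing_lexicon.values())
    return {seq: score + (bonus if seq in biased else 0.0)
            for seq, score in phoneme_scores.items()}

# Hypothetical second-language biasing term and candidate phoneme sequences.
lexicon = {"São Paulo": ("s", "au", "p", "au", "l", "u")}
scores = {("s", "au", "p", "au", "l", "u"): -4.2,
          ("s", "o", "p", "o", "l", "o"): -3.9}
print(rescore_with_biasing(scores, lexicon))  # biased sequence now wins
```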
-
Publication No.: US11922932B2
Publication Date: 2024-03-05
Application No.: US18194586
Filing Date: 2023-03-31
Applicant: Google LLC
Inventor: Rohit Prakash Prabhavalkar , Tara N. Sainath , Yonghui Wu , Patrick An Phu Nguyen , Zhifeng Chen , Chung-Cheng Chiu , Anjuli Patricia Kannan
IPC: G10L15/197 , G10L15/02 , G10L15/06 , G10L15/16 , G10L15/22
CPC classification number: G10L15/197 , G10L15/02 , G10L15/063 , G10L15/16 , G10L15/22 , G10L2015/025
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses a set of speech recognition hypothesis samples, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
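The "loss function that uses a set of speech recognition hypothesis samples" reads like a minimum word error rate (MWER) style objective; that reading is an assumption. A minimal sketch of such an expected-error loss over N sampled hypotheses:

```python
import numpy as np

def expected_error_loss(hyp_log_probs: np.ndarray,
                        hyp_word_errors: np.ndarray) -> float:
    """Expected word-error loss over sampled hypotheses (MWER-style):
    renormalize the model's log-probabilities over the sample set, then
    weight each hypothesis's error (mean-subtracted, a common variance
    reduction) by its renormalized probability."""
    log_z = np.logaddexp.reduce(hyp_log_probs)
    probs = np.exp(hyp_log_probs - log_z)
    errors = hyp_word_errors - hyp_word_errors.mean()
    return float(probs @ errors)

log_probs = np.array([-1.0, -2.0, -3.0])  # scores of 3 sampled hypotheses
errors = np.array([0.0, 1.0, 3.0])        # word errors vs. the reference
print(expected_error_loss(log_probs, errors))  # negative: best hyp is likeliest
```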
-
Publication No.: US20240029719A1
Publication Date: 2024-01-25
Application No.: US18340093
Filing Date: 2023-06-23
Applicant: Google LLC
Inventor: Shaan Jagdeep Patrick Bijwadia , Shuo-yiin Chang , Bo Li , Yanzhang He , Tara N. Sainath , Chao Zhang
CPC classification number: G10L15/16 , G10L15/063 , G10L25/93
Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to switch between a voice activity detection (VAD) mode and an end-of-query (EOQ) detection mode. During the VAD mode, the endpointer model receives input audio frames and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder and determines, for each latent representation, whether the latent representation includes final silence.
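A toy sketch of the two endpointer modes, with a threshold rule standing in for the learned model; the class, method names, and thresholds are hypothetical:

```python
from enum import Enum

class Mode(Enum):
    VAD = "vad"   # frame-level voice activity detection
    EOQ = "eoq"   # end-of-query detection on encoder latents

class Endpointer:
    """Toy endpointer: a threshold rule stands in for the learned model."""
    def __init__(self, mode: Mode, threshold: float = 0.5):
        self.mode, self.threshold = mode, threshold

    def frame_is_speech(self, frame_energy: float) -> bool:
        """VAD mode: decide per input audio frame."""
        assert self.mode is Mode.VAD
        return frame_energy > self.threshold

    def latent_is_final_silence(self, latent_score: float) -> bool:
        """EOQ mode: decide per encoder latent representation."""
        assert self.mode is Mode.EOQ
        return latent_score < self.threshold

vad = Endpointer(Mode.VAD)
print([vad.frame_is_speech(e) for e in (0.1, 0.9, 0.7, 0.2)])
eoq = Endpointer(Mode.EOQ)
print(eoq.latent_is_final_silence(0.05))  # True -> close the microphone
```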
-
Publication No.: US20230326461A1
Publication Date: 2023-10-12
Application No.: US18182925
Filing Date: 2023-03-13
Applicant: Google LLC
Inventor: Shaojin Ding , Yanzhang He , Xin Wang , Weiran Wang , Trevor Strohman , Tara N. Sainath , Rohit Prakash Prabhavalkar , Robert David , Rina Panigrahy , Rami Botros , Qiao Liang , Ian McGraw , Ding Zhao , Dongseong Hwang
CPC classification number: G10L15/32 , G10L15/16 , G10L15/22 , G10L2015/223
Abstract: An automated speech recognition (ASR) model includes a first encoder, a first decoder, a second encoder, and a second decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The first decoder receives, as input, the first higher order feature representation generated by the first encoder, and generates a first probability distribution over possible speech recognition hypotheses. The second encoder receives, as input, the first higher order feature representation generated by the first encoder, and generates a second higher order feature representation for a corresponding first higher order feature frame. The second decoder receives, as input, the second higher order feature representation generated by the second encoder, and generates a second probability distribution over possible speech recognition hypotheses.
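A minimal numpy sketch of the cascaded shape of this model, with toy linear layers standing in for the encoders and per-frame softmaxes standing in for the decoders; all dimensions and names are hypothetical:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class CascadedASR:
    """Toy cascade: frames -> encoder1 -> decoder1 (first pass) and
    encoder1 output -> encoder2 -> decoder2 (second pass)."""
    def __init__(self, in_dim=16, d1=32, d2=32, vocab=30, seed=0):
        g = np.random.default_rng(seed)
        self.e1 = g.normal(size=(d1, in_dim))
        self.d1 = g.normal(size=(vocab, d1))
        self.e2 = g.normal(size=(d2, d1))
        self.d2 = g.normal(size=(vocab, d2))

    def forward(self, frames: np.ndarray):
        h1 = relu(frames @ self.e1.T)  # first higher-order features
        p1 = softmax(h1 @ self.d1.T)   # first probability distributions
        h2 = relu(h1 @ self.e2.T)      # second higher-order features
        p2 = softmax(h2 @ self.d2.T)   # second probability distributions
        return p1, p2

p1, p2 = CascadedASR().forward(np.random.default_rng(1).normal(size=(5, 16)))
print(p1.shape, p2.shape)  # (5, 30) (5, 30)
```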