-
Publication No.: US20240153495A1
Publication Date: 2024-05-09
Application No.: US18494984
Application Date: 2023-10-26
Applicant: Google LLC
Inventor: Weiran Wang , Ding Zhao , Shaojin Ding , Hao Zhang , Shuo-yiin Chang , David Johannes Rybach , Tara N. Sainath , Yanzhang He , Ian McGraw , Shankar Kumar
IPC: G10L15/06 , G06F40/284 , G10L15/26
CPC classification number: G10L15/063 , G06F40/284 , G10L15/26
Abstract: A method includes receiving a training dataset that includes one or more spoken training utterances for training an automatic speech recognition (ASR) model. Each spoken training utterance in the training dataset is paired with a corresponding transcription and a corresponding target sequence of auxiliary tokens. For each spoken training utterance, the method includes generating a speech recognition hypothesis for a corresponding spoken training utterance, determining a speech recognition loss based on the speech recognition hypothesis and the corresponding transcription, generating a predicted auxiliary token for the corresponding spoken training utterance, and determining an auxiliary task loss based on the predicted auxiliary token and the corresponding target sequence of auxiliary tokens. The method also includes training the ASR model jointly on the speech recognition loss and the auxiliary task loss determined for each spoken training utterance.
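A minimal sketch of the joint objective the abstract describes, using generic PyTorch modules; the architecture, vocabulary sizes, and the auxiliary-loss weight are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class JointASRAuxModel(nn.Module):
    def __init__(self, feat_dim=80, vocab=128, aux_vocab=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 256, batch_first=True)
        self.asr_head = nn.Linear(256, vocab)       # speech recognition hypotheses
        self.aux_head = nn.Linear(256, aux_vocab)   # predicted auxiliary tokens

    def forward(self, frames):
        enc, _ = self.encoder(frames)
        return self.asr_head(enc), self.aux_head(enc)

model = JointASRAuxModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

frames = torch.randn(2, 50, 80)                    # toy batch of acoustic frames
transcripts = torch.randint(0, 128, (2, 50))       # corresponding transcriptions
aux_targets = torch.randint(0, 4, (2, 50))         # target sequences of auxiliary tokens

asr_logits, aux_logits = model(frames)
speech_loss = ce(asr_logits.transpose(1, 2), transcripts)
aux_loss = ce(aux_logits.transpose(1, 2), aux_targets)
loss = speech_loss + 0.3 * aux_loss                # joint objective; weight is a guess
loss.backward()
opt.step()
```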
-
Publication No.: US20230326461A1
Publication Date: 2023-10-12
Application No.: US18182925
Application Date: 2023-03-13
Applicant: Google LLC
Inventor: Shaojin Ding , Yangzhang He , Xin Wang , Weiran Wang , Trevor Strohman , Tara N. Sainath , Rohit Parkash Prabhavalkar , Robert David , Rina Panigrahy , Rami Botros , Qiao Liang , Ian Mcgraw , Ding Zhao , Dongseong Hwang
CPC classification number: G10L15/32 , G10L15/16 , G10L15/22 , G10L2015/223
Abstract: An automated speech recognition (ASR) model includes a first encoder, a first decoder, a second encoder, and a second decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The first decoder receives, as input, the first higher order feature representation generated by the first encoder, and generates a first probability distribution over possible speech recognition hypotheses. The second encoder receives, as input, the first higher order feature representation generated by the first encoder, and generates a second higher order feature representation for a corresponding first higher order feature frame. The second decoder receives, as input, the second higher order feature representation generated by the second encoder, and generates a second probability distribution over possible speech recognition hypotheses.
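A rough sketch of the cascaded encoder/decoder layout described above, assuming simple PyTorch layers; dimensions and layer types are placeholders, not the claimed architecture.

```python
import torch
import torch.nn as nn

class CascadedASR(nn.Module):
    def __init__(self, feat_dim=80, vocab=128, d=256):
        super().__init__()
        self.first_encoder = nn.GRU(feat_dim, d, batch_first=True)
        self.first_decoder = nn.Linear(d, vocab)     # first probability distribution
        self.second_encoder = nn.GRU(d, d, batch_first=True)
        self.second_decoder = nn.Linear(d, vocab)    # second probability distribution

    def forward(self, frames):
        h1, _ = self.first_encoder(frames)           # first higher-order features
        p1 = self.first_decoder(h1).log_softmax(-1)
        h2, _ = self.second_encoder(h1)              # second encoder consumes first encoder output
        p2 = self.second_decoder(h2).log_softmax(-1)
        return p1, p2

p1, p2 = CascadedASR()(torch.randn(1, 40, 80))       # toy sequence of acoustic frames
```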
-
Publication No.: US20230298563A1
Publication Date: 2023-09-21
Application No.: US18186157
Application Date: 2023-03-18
Applicant: Google LLC
Inventor: Ke Hu , Tara N. Sainath , Yanzhang He , Rohit Prabhavalkar , Sepand Mavandadi , Weiran Wang , Trevor Strohman
CPC classification number: G10L13/08 , G10L15/16 , G10L15/063
Abstract: A method of text-only and semi-supervised training for deliberation includes receiving training data including unspoken textual utterances that are each not paired with any corresponding spoken utterance of non-synthetic speech, and training a deliberation model that includes a text encoder and a deliberation decoder on the unspoken textual utterances. The method also includes receiving, at the trained deliberation model, first-pass hypotheses and non-causal acoustic embeddings. The first-pass hypotheses are generated by a recurrent neural network-transducer (RNN-T) decoder for the non-causal acoustic embeddings encoded by a non-causal encoder. The method also includes encoding, using the text encoder, the first-pass hypotheses generated by the RNN-T decoder, and generating, using the deliberation decoder attending to both the first-pass hypotheses and the non-causal acoustic embeddings, second-pass hypotheses.
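An illustrative sketch of the deliberation step: encode the first-pass hypothesis text and let a decoder attend over both the text embeddings and the non-causal acoustic embeddings. The modules, shapes, and the simple concatenation of the two attention sources are assumptions for illustration only.

```python
import torch
import torch.nn as nn

d = 256
text_encoder = nn.Embedding(128, d)                        # encodes first-pass hypothesis tokens
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
second_pass_head = nn.Linear(d, 128)

first_pass_tokens = torch.randint(0, 128, (1, 20))         # RNN-T first-pass hypothesis
acoustic_embeddings = torch.randn(1, 60, d)                # non-causal encoder output

text_emb = text_encoder(first_pass_tokens)
context = torch.cat([text_emb, acoustic_embeddings], dim=1)   # attend to both sources
deliberated, _ = cross_attn(text_emb, context, context)
second_pass_logits = second_pass_head(deliberated)         # second-pass hypotheses
```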
-
Publication No.: US20240296840A1
Publication Date: 2024-09-05
Application No.: US18592590
Application Date: 2024-03-01
Applicant: Google LLC
Inventor: Shaan Jagdeep Patrick Bijwadia , Shuo-yiin Chang , Tara N. Sainath , Weiran Wang , Zhong Meng
IPC: G10L15/197 , G10L15/02 , G10L15/06
CPC classification number: G10L15/197 , G10L15/02 , G10L15/063
Abstract: A joint auxiliary task and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher-order feature representation for a corresponding acoustic frame. The model also includes a multi-output HAT decoder to generate at each of the plurality of output steps a probability distribution over possible speech recognition hypotheses, and an indication of whether the output step corresponds to an auxiliary token associated with a particular auxiliary task. The model is trained by a JEIT training process based on: a paired training data set including paired audio data and transcriptions, the transcriptions annotated with ground-truth auxiliary tokens associated with the particular auxiliary task; and an unpaired training data set including textual utterances not paired with any corresponding audio data, the textual utterances annotated with the ground-truth auxiliary tokens associated with the particular auxiliary task.
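A toy sketch of a JEIT-style update that mixes a paired (audio, annotated transcription) loss with a text-only loss over unpaired textual utterances; the models, token inventory, and mixing weight are placeholders, not the claimed method.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
encoder = nn.GRU(80, 256, batch_first=True)
decoder = nn.Linear(256, 132)                       # word pieces plus auxiliary tokens (e.g. <caps>)
embed = nn.Embedding(132, 256)
text_head = nn.Linear(256, 132)

# Paired batch: audio frames with a transcription annotated with ground-truth auxiliary tokens.
frames = torch.randn(2, 40, 80)
annotated_transcripts = torch.randint(0, 132, (2, 40))
audio_features, _ = encoder(frames)
paired_loss = ce(decoder(audio_features).transpose(1, 2), annotated_transcripts)

# Unpaired batch: textual utterances (no audio) annotated with the same auxiliary tokens,
# trained here as a next-token prediction over the annotated text.
text_tokens = torch.randint(0, 132, (2, 40))
text_logits = text_head(embed(text_tokens[:, :-1]))
text_loss = ce(text_logits.transpose(1, 2), text_tokens[:, 1:])

joint_loss = paired_loss + 0.5 * text_loss          # mixing weight is illustrative
joint_loss.backward()
```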
-
Publication No.: US20230130634A1
Publication Date: 2023-04-27
Application No.: US17936547
Application Date: 2022-09-29
Applicant: Google LLC
Inventor: Tara N. Sainath , Rami Botros , Anmol Gulati , Krzysztof Choromanski , Ruoming Pang , Trevor Strohman , Weiran Wang , Jiahui Yu
Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
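A compact sketch of the linear-attention idea behind a Performer-style module: applying a positive feature map to queries and keys and reordering the matrix products makes the cost linear in sequence length, since the T×T attention matrix is never formed. The elu+1 feature map and the non-causal formulation shown here are simplifying assumptions (a causal variant would accumulate running prefix sums instead).

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, time, dim)
    q = F.elu(q) + 1.0                        # positive feature map
    k = F.elu(k) + 1.0
    kv = torch.einsum("btd,bte->bde", k, v)   # summarize keys/values once: O(T)
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + eps)
    return torch.einsum("btd,bde,bt->bte", q, kv, z)

q = k = v = torch.randn(1, 100, 64)
out = linear_attention(q, k, v)               # (1, 100, 64), no T x T matrix formed
```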
-
Publication No.: US20240028829A1
Publication Date: 2024-01-25
Application No.: US18346232
Application Date: 2023-07-01
Applicant: Google LLC
Inventor: Tara N. Sainath , Zhouyuan Huo , Zhehuai Chen , Yu Zhang , Weiran Wang , Trevor Strohman , Rohit Prakash Prabhavalkar , Bo Li , Ankur Bapna
IPC: G06F40/284 , G06F40/40
CPC classification number: G06F40/284 , G06F40/40
Abstract: A method includes receiving training data that includes a set of unspoken textual utterances. For each respective unspoken textual utterance, the method includes tokenizing the respective textual utterance into a sequence of sub-word units, generating a first higher order textual feature representation for a corresponding sub-word unit tokenized from the respective unspoken textual utterance, receiving the first higher order textual feature representation generated by a text encoder, and generating a first probability distribution over possible text units. The method also includes training an encoder based on the first probability distribution over possible text units generated by a first-pass decoder for each respective unspoken textual utterance in the set of unspoken textual utterances.
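A minimal sketch of the text-only flow in the abstract: tokenize an unspoken textual utterance into sub-word units, encode it, and train against a first-pass decoder's distribution over text units. The toy character-level tokenizer and the modules are stand-ins, not the actual system.

```python
import torch
import torch.nn as nn

def tokenize(text):
    # toy "sub-word" tokenizer: map characters to ids (stand-in for a real wordpiece model)
    return torch.tensor([[ord(c) % 100 for c in text]])

embed = nn.Embedding(100, 256)
text_encoder = nn.GRU(256, 256, batch_first=True)
first_pass_decoder = nn.Linear(256, 100)
ce = nn.CrossEntropyLoss()

ids = tokenize("play the next song")               # unspoken textual utterance
features, _ = text_encoder(embed(ids))             # first higher-order textual features
logits = first_pass_decoder(features)              # distribution over possible text units
loss = ce(logits.transpose(1, 2), ids)             # trains the encoder from text alone
loss.backward()
```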
-
Publication No.: US20230298570A1
Publication Date: 2023-09-21
Application No.: US18187222
Application Date: 2023-03-21
Applicant: Google LLC
Inventor: Weiran Wang , Tongzhou Chen , Tara N. Sainath , Ehsan Variani , Rohit Prakash Prabhavalkar , Ronny Huang , Bhuvana Ramabhadran , Neeraj Gaur , Sepand Mavandadi , Charles Caleb Peyser , Trevor Strohman , Yangzhang He , David Rybach
CPC classification number: G10L15/063 , G10L15/19 , G10L15/22 , G10L15/16 , G10L15/02
Abstract: A method includes generating, using an audio encoder, a higher-order feature representation for each acoustic frame in a sequence of acoustic frames; generating, using a decoder, based on the higher-order feature representation, a plurality of speech recognition hypotheses, each hypothesis corresponding to a candidate transcription of an utterance and having an associated first likelihood score; generating, using an external language model, for each speech recognition hypothesis, a second likelihood score; determining, using a learnable fusion module, for each speech recognition hypothesis, a set of fusion weights based on the higher-order feature representation and the speech recognition hypothesis; and generating, using the learnable fusion module, for each speech recognition hypothesis, a third likelihood score based on the first likelihood score, the second likelihood score, and the set of fusion weights, the audio encoder and decoder trained using minimum additive error rate training in the presence of the external language model.
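A sketch of the learnable-fusion idea: predict per-hypothesis fusion weights from the audio features, then combine the ASR and external language model likelihood scores into a fused score. The weight predictor, the mean pooling, and the example scores are assumptions for illustration.

```python
import torch
import torch.nn as nn

fusion_module = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

audio_features = torch.randn(1, 80, 256)         # higher-order feature representation
pooled = audio_features.mean(dim=1)              # summarize the audio per utterance

first_score = torch.tensor([[-12.3]])            # ASR (first) likelihood score
lm_score = torch.tensor([[-8.7]])                # external LM (second) likelihood score

weights = fusion_module(pooled).softmax(-1)      # set of fusion weights per hypothesis
third_score = weights[:, :1] * first_score + weights[:, 1:] * lm_score
```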
-
Publication No.: US20250078815A1
Publication Date: 2025-03-06
Application No.: US18826135
Application Date: 2024-09-05
Applicant: Google LLC
Inventor: Shaojin Ding , David Qiu , David Rim , Amir Yazdanbakhsh , Yanzhang He , Zhonglin Han , Rohit Prakash Prabhavalkar , Weiran Wang , Bo Li , Jian Li , Tara N. Sainath , Shivani Agrawal , Oleg Rybakov
IPC: G10L15/06
Abstract: A method includes obtaining a plurality of training samples that each include a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The method also includes fine-tuning, using quantization and sparsity aware training with native integer operations, a pre-trained automatic speech recognition (ASR) model on the plurality of training samples. Here, the pre-trained ASR model includes a plurality of weights and the fine-tuning includes pruning one or more weights of the plurality of weights using a sparsity mask and quantizing each weight of the plurality of weights based on an integer with a fixed-bit width. The method also includes providing the fine-tuned ASR model to a user device.
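A toy illustration of the two operations named in the abstract: pruning weights with a sparsity mask and quantizing the surviving weights to a fixed-bit integer grid (int8 here). The magnitude-based pruning threshold and the symmetric quantizer are assumptions, not the patented fine-tuning recipe.

```python
import torch

def prune_and_quantize(weight, sparsity=0.5, bits=8):
    # sparsity mask: zero out the smallest-magnitude fraction of weights
    threshold = weight.abs().flatten().kthvalue(int(sparsity * weight.numel())).values
    mask = (weight.abs() > threshold).float()
    pruned = weight * mask

    # symmetric fixed-bit quantization to integers, then dequantize for use
    qmax = 2 ** (bits - 1) - 1
    scale = pruned.abs().max() / qmax
    quantized = torch.clamp(torch.round(pruned / scale), -qmax, qmax)
    return quantized * scale, mask

w = torch.randn(256, 256)
w_quantized, sparsity_mask = prune_and_quantize(w)
```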
-
Publication No.: US12190869B2
Publication Date: 2025-01-07
Application No.: US17936547
Application Date: 2022-09-29
Applicant: Google LLC
Inventor: Tara N. Sainath , Rami Botros , Anmol Gulati , Krzysztof Choromanski , Ruoming Pang , Trevor Strohman , Weiran Wang , Jiahui Yu
Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
-
Publication No.: US20240304181A1
Publication Date: 2024-09-12
Application No.: US18598523
Application Date: 2024-03-07
Applicant: Google LLC
Inventor: Guru Prakash Arumugam , Shuo-yiin Chang , Shaan Jagdeep Patrick Bijwadia , Weiran Wang , Quan Wang , Rohit Prakash Prabhavalkar , Tara N. Sainath
IPC: G10L15/06
CPC classification number: G10L15/063
Abstract: A method includes receiving a plurality of training samples spanning multiple different domains. Each corresponding training sample includes audio data characterizing an utterance paired with a corresponding transcription of the utterance. The method also includes re-labeling each corresponding training sample of the plurality of training samples by annotating the corresponding transcription of the utterance with one or more speaker tags. Each speaker tag indicates a respective segment of the transcription for speech that was spoken by a particular type of speaker. The method also includes training a multi-domain speech recognition model on the re-labeled training samples to teach the multi-domain speech recognition model to learn to share parameters for recognizing speech across each of the multiple different domains.
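A small sketch of the re-labeling step: wrap each segment of a transcription in a tag for the type of speaker who produced it. The tag format and the sample data are made up for illustration.

```python
def relabel(segments):
    # segments: list of (speaker_type, text) pairs for one utterance
    return " ".join(f"<{spk}> {text} </{spk}>" for spk, text in segments)

sample = [("user", "set a timer for ten minutes"), ("assistant", "for how long")]
print(relabel(sample))
# <user> set a timer for ten minutes </user> <assistant> for how long </assistant>
```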