-
Publication No.: US12183322B2
Publication Date: 2024-12-31
Application No.: US17934555
Filing Date: 2022-09-22
Applicant: Google LLC
Inventor: Bo Li , Tara N. Sainath , Ruoming Pang , Shuo-yiin Chang , Qiumin Xu , Trevor Strohman , Vince Chen , Qiao Liang , Heguang Liu , Yanzhang He , Parisa Haghani , Sameer Bidichandani
Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.
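To make the abstract's architecture concrete, below is a minimal, hypothetical sketch of an RNN-T style joint step that fuses an encoder feature with a prediction-network hidden state and reserves an extra end-of-utterance token in the output vocabulary; all module names and dimensions are illustrative assumptions, not the patented implementation.

```python
# Sketch only: joint network combining an encoder feature with a
# prediction-network hidden state, with output units reserved for the
# RNN-T blank and an <eou> token. Dimensions are assumptions.
import torch
import torch.nn as nn

class JointWithEOU(nn.Module):
    def __init__(self, enc_dim=512, pred_dim=640, joint_dim=640, vocab_size=4096):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        # +2 output units: one for blank, one for the <eou> token
        self.out = nn.Linear(joint_dim, vocab_size + 2)

    def forward(self, enc_t, pred_u):
        # enc_t:  (batch, enc_dim)  higher order feature for one acoustic frame
        # pred_u: (batch, pred_dim) hidden representation from the prediction network
        joint = torch.tanh(self.enc_proj(enc_t) + self.pred_proj(pred_u))
        return torch.log_softmax(self.out(joint), dim=-1)  # distribution incl. blank and <eou>

# usage with random tensors standing in for real features
logits = JointWithEOU()(torch.randn(2, 512), torch.randn(2, 640))
print(logits.shape)  # torch.Size([2, 4098])
```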
-
Publication No.: US12154581B2
Publication Date: 2024-11-26
Application No.: US17237021
Filing Date: 2021-04-21
Applicant: Google LLC
Inventor: Arun Narayanan , Tara Sainath , Chung-Cheng Chiu , Ruoming Pang , Rohit Prabhavalkar , Jiahui Yu , Ehsan Variani , Trevor Strohman
Abstract: An automated speech recognition (ASR) model includes a first encoder, a second encoder, and a decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The second encoder receives, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame. The decoder receives, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps, and generates, at each of the plurality of time steps, a first probability distribution over possible speech recognition hypotheses.
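The cascaded arrangement described above can be illustrated with a small sketch in which a second encoder consumes the first encoder's higher order features before a decoder produces hypotheses; the LSTM layers and shapes below are illustrative assumptions only.

```python
# Sketch only: two stacked ("cascaded") encoders, the second operating on
# the first encoder's output features. Layer types and sizes are assumptions.
import torch
import torch.nn as nn

class CascadedEncoders(nn.Module):
    def __init__(self, feat_dim=80, dim=512):
        super().__init__()
        self.first = nn.LSTM(feat_dim, dim, num_layers=2, batch_first=True)
        self.second = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) sequence of acoustic frames
        first_feats, _ = self.first(frames)          # first higher order representation
        second_feats, _ = self.second(first_feats)   # second higher order representation
        return first_feats, second_feats

first, second = CascadedEncoders()(torch.randn(2, 100, 80))
print(first.shape, second.shape)  # both torch.Size([2, 100, 512])
```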
-
Publication No.: US20240135923A1
Publication Date: 2024-04-25
Application No.: US18485271
Filing Date: 2023-10-11
Applicant: Google LLC
Inventor: Chao Zhang , Bo Li , Tara N. Sainath , Trevor Strohman , Shuo-yiin Chang
IPC: G10L15/197 , G10L15/00 , G10L15/02
CPC classification number: G10L15/197 , G10L15/005 , G10L15/02
Abstract: A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.
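A hedged sketch of the per-frame language-identification (LID) step follows: the predictor's output is concatenated with the audio features that the decoder conditions on. The layer sizes, language count, and concatenation scheme are assumptions for illustration, not the disclosed model.

```python
# Sketch only: per-frame LID predictor whose softmax output is appended to
# the encoder features before decoding. Sizes are assumptions.
import torch
import torch.nn as nn

class LIDPredictor(nn.Module):
    def __init__(self, enc_dim=512, num_languages=12):
        super().__init__()
        self.classifier = nn.Linear(enc_dim, num_languages)

    def forward(self, enc_feats):
        # enc_feats: (batch, time, enc_dim) higher order features from the audio encoder
        lid_probs = torch.softmax(self.classifier(enc_feats), dim=-1)  # language prediction representation
        return torch.cat([enc_feats, lid_probs], dim=-1)               # features the decoder conditions on

conditioned = LIDPredictor()(torch.randn(2, 50, 512))
print(conditioned.shape)  # torch.Size([2, 50, 524])
```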
-
Publication No.: US20230335117A1
Publication Date: 2023-10-19
Application No.: US18186872
Filing Date: 2023-03-20
Applicant: Google LLC
Inventor: Shuo-yiin Chang , Guru Prakash Arumugam , Zelin Wu , Tara N. Sainath , Bo Li , Qiao Liang , Adam Stambler , Shyam Upadhyay , Manaal Faruqui , Trevor Strohman
CPC classification number: G10L15/16 , G10L15/22 , G10L15/063 , G10L2015/223
Abstract: A method includes receiving, as input to a speech recognition model, audio data corresponding to a spoken utterance. The method also includes performing, using the speech recognition model, speech recognition on the audio data by, at each of a plurality of time steps, encoding, using an audio encoder, the audio data corresponding to the spoken utterance into a corresponding audio encoding, and decoding, using a speech recognition joint network, the corresponding audio encoding into a probability distribution over possible output labels. At each of the plurality of time steps, the method also includes determining, using an intended query (IQ) joint network configured to receive a label history representation associated with a sequence of non-blank symbols output by a final softmax layer, an intended query decision indicating whether or not the spoken utterance includes a query intended for a digital assistant.
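The intended-query decision can be pictured as a small joint network that fuses an audio encoding with a label-history representation and emits a per-step accept/reject probability; the dimensions and fusion scheme below are illustrative assumptions, not the disclosed model.

```python
# Sketch only: "intended query" style joint network producing a binary
# decision from audio and label-history representations. Sizes are assumptions.
import torch
import torch.nn as nn

class IntendedQueryJoint(nn.Module):
    def __init__(self, audio_dim=512, label_dim=640, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + label_dim, hidden)
        self.decision = nn.Linear(hidden, 2)  # 0 = not for assistant, 1 = intended query

    def forward(self, audio_enc, label_hist):
        fused = torch.relu(self.fuse(torch.cat([audio_enc, label_hist], dim=-1)))
        return torch.softmax(self.decision(fused), dim=-1)

probs = IntendedQueryJoint()(torch.randn(2, 512), torch.randn(2, 640))
print(probs.shape)  # torch.Size([2, 2])
```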
-
Publication No.: US11715458B2
Publication Date: 2023-08-01
Application No.: US17316198
Filing Date: 2021-05-10
Applicant: Google LLC
Inventor: Tara Sainath , Arun Narayanan , Rami Botros , Yanzhang He , Ehsan Variani , Cyril Allauzen , David Rybach , Ruoming Pang , Trevor Strohman
CPC classification number: G10L15/063 , G10L15/02 , G10L15/22 , G10L15/30
Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypotheses. The ASR model also includes a language model configured to receive the first probability distribution over possible speech hypotheses and generate a rescored probability distribution.
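The language-model rescoring step can be sketched as interpolating an LM score with the first-pass decoder score when ranking hypotheses; the interpolation weight and toy scores below are assumptions for illustration.

```python
# Sketch only: rescoring an n-best list by interpolating decoder and LM
# log-probabilities. The weight and example scores are assumptions.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    asr_log_prob: float   # log-probability from the first-pass decoder
    lm_log_prob: float    # log-probability assigned by the language model

def rescore(hyps, lm_weight=0.3):
    # Combine scores and return hypotheses sorted best-first.
    return sorted(
        hyps,
        key=lambda h: h.asr_log_prob + lm_weight * h.lm_log_prob,
        reverse=True,
    )

nbest = [
    Hypothesis("play the song", -4.1, -8.0),
    Hypothesis("play this on", -4.0, -15.0),
]
print(rescore(nbest)[0].text)  # "play the song"
```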
-
Publication No.: US11594212B2
Publication Date: 2023-02-28
Application No.: US17155010
Filing Date: 2021-01-21
Applicant: Google LLC
Inventor: Tara N. Sainath , Ruoming Pang , Ron Weiss , Yanzhang He , Chung-Cheng Chiu , Trevor Strohman
IPC: G10L15/06 , G06N3/08 , G10L15/16 , G10L15/197
Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
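The paired/unpaired branching of the training loop might look roughly like the sketch below, where unpaired text contributes a cross-entropy term computed from decoder log-probabilities; the tensor shapes and loss wiring are assumptions, not the patented training recipe.

```python
# Sketch only: branch on whether the example is a supervised audio-text pair
# or unpaired text, and compute a cross-entropy loss from decoder outputs.
import torch
import torch.nn.functional as F

def training_step(decoder_logits, targets, is_paired):
    # decoder_logits: (batch, seq_len, vocab) output of the LAS decoder
    # targets:        (batch, seq_len) token ids of the transcript or unpaired text
    loss = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        targets.reshape(-1),
    )
    # A real recipe would treat paired and unpaired losses differently;
    # here the flag only labels which branch the example came from.
    return ("paired" if is_paired else "unpaired"), loss

kind, loss = training_step(torch.randn(2, 7, 100), torch.randint(0, 100, (2, 7)), is_paired=False)
print(kind, float(loss) > 0)
```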
-
Publication No.: US20210350794A1
Publication Date: 2021-11-11
Application No.: US17204852
Filing Date: 2021-03-17
Applicant: Google LLC
Inventor: Tara N. Sainath , Basi Garcia , David Rybach , Trevor Strohman , Ruoming Pang
Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
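Constraining attention with word-boundary alignments can be illustrated by building a boolean mask that keeps only frames near each word's ground-truth start and end attendable; the window size and mask convention below are assumptions, not the disclosed method.

```python
# Sketch only: per-word attention mask built from ground-truth word
# boundaries. Window size and True/False convention are assumptions.
import torch

def constrained_attention_mask(word_starts, word_ends, num_frames, window=5):
    # word_starts / word_ends: ground-truth frame indices per word
    # Returns a (num_words, num_frames) boolean mask; True = frame may be attended.
    mask = torch.zeros(len(word_starts), num_frames, dtype=torch.bool)
    for w, (start, end) in enumerate(zip(word_starts, word_ends)):
        lo = max(0, start - window)
        hi = min(num_frames, end + window + 1)
        mask[w, lo:hi] = True
    return mask

mask = constrained_attention_mask([3, 20], [10, 35], num_frames=50)
print(mask.shape, mask[0].sum().item())  # torch.Size([2, 50]) 16
```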
-
Publication No.: US20250095634A1
Publication Date: 2025-03-20
Application No.: US18965193
Filing Date: 2024-12-02
Applicant: Google LLC
Inventor: Bo Li , Tara N. Sainath , Ruoming Pang , Shuo-yiin Chang , Qiumin Xu , Trevor Strohman , Vince Chen , Qiao Liang , Heguang Liu , Yanzhang He , Parisa Haghani , Sameer Bidichandani
Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.
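Complementing the joint-network sketch under the parent publication above, the frame-classification step of this abstract can be pictured as a small per-frame classifier over the four named categories; the layer sizes and class ordering are illustrative assumptions.

```python
# Sketch only: per-frame classifier over the four frame categories named in
# the abstract. Layer sizes and category order are assumptions.
import torch
import torch.nn as nn

FRAME_CLASSES = ["speech", "initial_silence", "intermediate_silence", "final_silence"]

class FrameClassifier(nn.Module):
    def __init__(self, enc_dim=512):
        super().__init__()
        self.head = nn.Linear(enc_dim, len(FRAME_CLASSES))

    def forward(self, enc_feats):
        # enc_feats: (batch, time, enc_dim) encoder features, one per acoustic frame
        return torch.argmax(self.head(enc_feats), dim=-1)  # class index per frame

labels = FrameClassifier()(torch.randn(1, 20, 512))
print([FRAME_CLASSES[i] for i in labels[0][:3].tolist()])
```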
-
Publication No.: US20250016387A1
Publication Date: 2025-01-09
Application No.: US18890050
Filing Date: 2024-09-19
Applicant: GOOGLE LLC
Inventor: Françoise Beaufays , Khe Chai Sim , Trevor Strohman , Oren Litvin
IPC: H04N21/233 , G06F18/214 , G06N20/00 , H04N21/232
Abstract: Implementations disclosed herein are directed to ephemeral learning of machine learning (“ML”) model(s) based on gradient(s) generated at a remote system (e.g., remote server(s)). Processor(s) of the remote system can receive stream(s) of audio data capturing spoken utterance(s) from a client device of a user. A fulfillment pipeline can process the stream(s) of audio data to cause certain fulfillment(s) of the spoken utterance(s) to be performed. Meanwhile, a training pipeline can process the stream(s) of audio data to generate gradient(s) using unsupervised learning techniques. Subsequent to the processing by the fulfillment pipeline and/or the training pipeline, the stream(s) of audio data are discarded by the remote system. Accordingly, the ML model(s) can be trained at the remote system without storing or logging the stream(s) of audio data in non-transient memory thereof, thereby providing more efficient training mechanisms for training the ML model(s) and also increasing security of user data.
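The process-train-discard flow can be sketched as holding the audio buffer only long enough to compute a gradient before dropping it; the stand-in model, placeholder loss, and pipeline details below are assumptions, not the disclosed system.

```python
# Sketch only: compute a gradient from an in-memory audio buffer, then
# discard the buffer so only gradients leave this scope. Model and loss
# are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(80, 40)  # stand-in for an on-server ML model

def handle_stream(audio_frames):
    # audio_frames: (time, 80) tensor standing in for a decoded audio stream
    # 1) fulfillment pipeline would act on the recognized request here (omitted)
    # 2) training pipeline: unsupervised-style objective on the same buffer
    output = model(audio_frames)
    loss = output.pow(2).mean()  # placeholder self-supervised loss
    grads = torch.autograd.grad(loss, list(model.parameters()))
    # 3) drop the raw audio before returning; only gradients are kept
    del audio_frames
    return grads

grads = handle_stream(torch.randn(100, 80))
print(len(grads))  # one gradient tensor per model parameter -> 2
```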
-
Publication No.: US12051416B2
Publication Date: 2024-07-30
Application No.: US18228948
Filing Date: 2023-08-01
Applicant: GOOGLE LLC
Inventor: Lior Alon , Rafael Goldfarb , Dekel Auster , Dan Rasin , Michael Andrew Goodman , Trevor Strohman , Nino Tasca , Valerie Nygaard , Jaclyn Konzelmann
CPC classification number: G10L15/22 , G06F3/167 , G10L15/083 , G10L15/1815 , G10L15/285 , G10L2015/223
Abstract: Implementations described herein relate to reducing latency in automated assistant interactions. In some implementations, a client device can receive audio data that captures a spoken utterance of a user. The audio data can be processed to determine an assistant command to be performed by an automated assistant. The assistant command can be processed, using a latency prediction model, to generate a predicted latency to fulfill the assistant command. Further, the client device (or the automated assistant) can determine, based on the predicted latency, whether to audibly render pre-cached content for presentation to the user prior to audibly rendering content that is responsive to the spoken utterance. The pre-cached content can be tailored to the assistant command and audibly rendered for presentation to the user while the content is being obtained, and the content can be audibly rendered for presentation to the user subsequent to the pre-cached content.
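The latency-gated decision can be illustrated by comparing a predicted fulfillment latency against a threshold to decide whether tailored pre-cached content plays first; the threshold value and the toy predictor below are assumptions for illustration.

```python
# Sketch only: decide whether to play pre-cached content based on a
# predicted fulfillment latency. Threshold and predictor are assumptions.
PRECACHE_THRESHOLD_S = 1.5  # assumed cutoff, in seconds

def predict_latency(assistant_command: str) -> float:
    # Stand-in for the latency prediction model: commands that imply a
    # slower backend call get a higher predicted latency.
    slow_keywords = ("order", "book", "navigate")
    return 2.4 if any(k in assistant_command for k in slow_keywords) else 0.6

def plan_response(assistant_command: str) -> list[str]:
    steps = []
    if predict_latency(assistant_command) > PRECACHE_THRESHOLD_S:
        steps.append("render pre-cached content tailored to the command")
    steps.append("render content responsive to the spoken utterance")
    return steps

print(plan_response("order a pizza"))
print(plan_response("what time is it"))
```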