-
公开(公告)号:US20240135923A1
公开(公告)日:2024-04-25
申请号:US18485271
申请日:2023-10-11
申请人: Google LLC
发明人: Chao Zhang , Bo Li , Tara N. Sainath , Trevor Strohman , Shuo-yiin Chang
IPC分类号: G10L15/197 , G10L15/00 , G10L15/02
CPC分类号: G10L15/197 , G10L15/005 , G10L15/02
摘要: A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.
-
公开(公告)号:US20230335117A1
公开(公告)日:2023-10-19
申请号:US18186872
申请日:2023-03-20
申请人: Google LLC
发明人: Shuo-yiin Chang , Guru Prakash Arumugam , Zelin Wu , Tara N. Sainath , Bo LI , Qiao Liang , Adam Stambler , Shyam Upadhyay , Manaal Faruqui , Trevor Strohman
CPC分类号: G10L15/16 , G10L15/22 , G10L15/063 , G10L2015/223
摘要: A method includes receiving, as input to a speech recognition model, audio data corresponding to a spoken utterance. The method also includes performing, using the speech recognition model, speech recognition on the audio data by, at each of a plurality of time steps, encoding, using an audio encoder, the audio data corresponding to the spoken utterance into a corresponding audio encoding, and decoding, using a speech recognition joint network, the corresponding audio encoding into a probability distribution over possible output labels. At each of the plurality of time steps, the method also includes determining, using an intended query (IQ) joint network configured to receive a label history representation associated with a sequence of non-blank symbols output by a final softmax layer, an intended query decision indicating whether or not the spoken utterance includes a query intended for a digital assistant.
-
公开(公告)号:US11715458B2
公开(公告)日:2023-08-01
申请号:US17316198
申请日:2021-05-10
申请人: Google LLC
发明人: Tara Sainath , Arun Narayanan , Rami Botros , Yanzhang He , Ehsan Variani , Cyril Allauzen , David Rybach , Ruoming Pang , Trevor Strohman
CPC分类号: G10L15/063 , G10L15/02 , G10L15/22 , G10L15/30
摘要: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypothesis. The ASR model also includes a language model configured to receive the first probability distribution over possible speech hypothesis and generate a rescored probability distribution.
-
公开(公告)号:US11594212B2
公开(公告)日:2023-02-28
申请号:US17155010
申请日:2021-01-21
申请人: Google LLC
发明人: Tara N. Sainath , Ruoming Pang , Ron Weiss , Yanzhang He , Chung-Cheng Chiu , Trevor Strohman
IPC分类号: G10L15/06 , G06N3/08 , G10L15/16 , G10L15/197
摘要: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
-
公开(公告)号:US20210350794A1
公开(公告)日:2021-11-11
申请号:US17204852
申请日:2021-03-17
申请人: Google LLC
发明人: Tara N. Sainath , Basi Garcia , David Rybach , Trevor Strohman , Ruoming Pang
摘要: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
-
公开(公告)号:US20240311575A1
公开(公告)日:2024-09-19
申请号:US18123141
申请日:2023-03-17
申请人: GOOGLE LLC
摘要: Implementations relate to dialog management of a large language model (LLM) utilized in generating natural language (NL) output during an ongoing dialog. Processor(s) of a system can: receive NL based input as part of the ongoing dialog, generate NL based output utilizing the LLM, and cause the NL based output to be rendered. Further, the processor(s) can receive subsequent NL based input as part of the ongoing dialog. In some implementations, the processor(s) can determine whether to modify a corresponding dialog context in generating subsequent NL based output, and modify the corresponding dialog context accordingly. For example, the processor(s) can restrict the corresponding dialog context, or supplant the corresponding dialog context with a corresponding curated dialog context. In additional or alternative implementations, the processor(s) can modify a corresponding NL based output threshold utilized in generating the subsequent NL based response to ensure the resulting NL based output is desirable.
-
公开(公告)号:US20240311402A1
公开(公告)日:2024-09-19
申请号:US18136634
申请日:2023-04-19
申请人: GOOGLE LLC
发明人: Martin Baeuml , Yanping Huang , Wenhao Jia , Chang Lan , Yuanzhong Xu , Junwhan Ahn , Alexander Bailey , Leif Schelin , Trevor Strohman , Emanuel Taropa , Sidharth Mudgal , Yanyan Zheng , Zhifeng Chen , Ahmad Beirami
IPC分类号: G06F16/332 , G06F40/40
CPC分类号: G06F16/3322 , G06F16/3329 , G06F40/40
摘要: Implementations relate to reducing latency in generating and/or rendering natural language (NL) output generated using a large language model (LLM). Processor(s) of a system can: receive NL based input associated with a client device, and generate the NL based output utilizing the LLM. The NL based output can be a stream of NL based output in that it includes a plurality of segments, and is generated on a segment-by-segment basis. In some implementations, a first segment of the stream of NL based output is selected for inclusion in the stream of NL based output as a second segment (and any subsequent segment) is being generated to reduce latency in evaluating the NL based output as a whole prior to rendering thereof. In some versions of those implementations, the first segment is rendered as the second segment (and any subsequent segment) is being generated to further reduce latency in rendering thereof.
-
公开(公告)号:US20240028829A1
公开(公告)日:2024-01-25
申请号:US18346232
申请日:2023-07-01
申请人: Google LLC
发明人: Tara N. Sainath , Zhouyuan Huo , Zhehuai Chen , Yu Zhang , Weiran Wang , Trevor Strohman , Rohit Prakash Prabhavalkar , Bo Li , Ankur Bapna
IPC分类号: G06F40/284 , G06F40/40
CPC分类号: G06F40/284 , G06F40/40
摘要: A method includes receiving training data that includes a set of unspoken textual utterances. For each respective unspoken textual utterance, the method includes, tokenizing the respective textual utterance into a sequence of sub-word units, generating a first higher order textual feature representation for a corresponding sub-word unit tokenized from the respective unspoken textual utterance, receiving the first higher order textual feature representation generated by a text encoder, and generating a first probability distribution over possible text units. The method also includes training an encoder based on the first probability distribution over possible text units generated by a first-pass decoder for each respective unspoken textual utterance in the set of unspoken textual utterances.
-
公开(公告)号:US20230410803A1
公开(公告)日:2023-12-21
申请号:US18228948
申请日:2023-08-01
申请人: GOOGLE LLC
发明人: Lior Alon , Rafael Goldfarb , Dekel Auster , Dan Rasin , Michael Andrew Goodman , Trevor Strohman , Nino Tasca , Valerie Nygaard , Jaclyn Konzelmann
CPC分类号: G10L15/22 , G06F3/167 , G10L2015/223 , G10L15/1815 , G10L15/285 , G10L15/083
摘要: Implementations described herein relate to reducing latency in automated assistant interactions. In some implementations, a client device can receive audio data that captures a spoken utterance of a user. The audio data can be processed to determine an assistant command to be performed by an automated assistant. The assistant command can be processed, using a latency prediction model, to generate a predicted latency to fulfill the assistant command. Further, the client device (or the automated assistant) can determine, based on the predicted latency, whether to audibly render pre-cached content for presentation to the user prior to audibly rendering content that is responsive to the spoken utterance. The pre-cached content can be tailored to the assistant command and audibly rendered for presentation to the user while the content is being obtained, and the content can be audibly rendered for presentation to the user subsequent to the pre-cached content.
-
公开(公告)号:US20230298570A1
公开(公告)日:2023-09-21
申请号:US18187222
申请日:2023-03-21
申请人: Google LLC
发明人: Weiran Wang , Tongzhou Chen , Tara N. Sainath , Ehsan Variani , Rohit Prakash Prabhavalkar , Ronny Huang , Bhuvana Ramabhadran , Neeraj Gaur , Sepand Mavandadi , Charles Caleb Peyser , Trevor Strohman , Yangzhang He , David Rybach
CPC分类号: G10L15/063 , G10L15/19 , G10L15/22 , G10L15/16 , G10L15/02
摘要: A method includes generating, using an audio encoder, a higher-order feature representation for each acoustic frame in a sequence of acoustic frames; generating, using a decoder, based on the higher-order feature representation, a plurality of speech recognition hypotheses, each hypotheses corresponding to a candidate transcription of an utterance and having an associated first likelihood score; generating, using an external language model, for each speech recognition hypothesis, a second likelihood score; determining, using a learnable fusion module, for each speech recognition hypothesis, a set of fusion weights based on the higher-order feature representation and the speech recognition hypothesis; and generating, using the learnable fusion module, for each speech recognition hypothesis, a third likelihood score based on the first likelihood score, the second likelihood score, and the set of fusion weights, the audio encoder and decoder trained using minimum additive error rate training in the presence of the external language model.
-
-
-
-
-
-
-
-
-