-
Publication number: US12027158B2
Publication date: 2024-07-02
Application number: US18164923
Application date: 2023-02-06
Applicant: Google LLC
Inventor: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prakash Prabhavalkar
CPC classification number: G10L15/1815, G06N3/049, G10L15/063, G10L15/16, G10L15/187, G10L19/0018
Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector. The method also includes decoding the first context vector and the second context vector at a context vector decoder to form a second-pass hypothesis.
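As a rough illustration of the two attention mechanisms described in this abstract, the sketch below computes one context vector over encoded acoustic frames and one over an encoded first-pass hypothesis, then joins them for a downstream decoder. The function names (`attend`, `deliberation_context`), the dot-product scoring, and the use of concatenation are assumptions made for illustration; the abstract does not specify any of them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Dot-product attention (assumed form): weighted average of `keys`."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]

def deliberation_context(query, acoustic_enc, hypothesis_enc):
    """Build the input for a second-pass decoder step.

    One context vector attends to the encoded acoustic frames, the other
    to the encoded first-pass hypothesis. Concatenating them is an
    assumption; the abstract only says both are decoded together.
    """
    acoustic_ctx = attend(query, acoustic_enc)
    hypothesis_ctx = attend(query, hypothesis_enc)
    return acoustic_ctx + hypothesis_ctx

# Toy usage with made-up 2-dimensional encodings.
ctx = deliberation_context(
    query=[1.0, 0.0],
    acoustic_enc=[[1.0, 0.0], [0.0, 1.0]],
    hypothesis_enc=[[2.0, 2.0]],
)
```

In this toy call the first two entries of `ctx` come from the acoustic attention and the last two from the hypothesis attention.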
-
Publication number: US20240185841A1
Publication date: 2024-06-06
Application number: US18490808
Application date: 2023-10-20
Applicant: Google LLC
Inventor: Bo Li, Yu Zhang, Nanxin Chen, Rohit Prakash Prabhavalkar, Chao-Han Huck Yang, Tara N. Sainath, Trevor Strohman
IPC: G10L15/065, G10L15/00
CPC classification number: G10L15/065, G10L15/005
Abstract: A method includes obtaining an ASR model trained to recognize speech in a first language and receiving transcribed training utterances in a second language. The method also includes integrating the ASR model with an input reprogramming module and a latent reprogramming module. The method also includes adapting the ASR model to learn how to recognize speech in the second language by training the input reprogramming module and the latent reprogramming module while parameters of the ASR model are frozen.
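The idea of adapting a frozen model by training only an added module can be sketched with a toy example. Here a fixed linear scorer stands in for the frozen ASR model, and a learnable offset added to the input stands in for an input reprogramming module. Everything in this sketch (the linear model, the additive offset, the squared-error loss, the names `frozen_model` and `train_input_reprogramming`) is an assumption for illustration, and the latent reprogramming module is not modeled at all.

```python
def frozen_model(x, w):
    """Stand-in for a frozen model: a fixed linear scorer w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_input_reprogramming(w, examples, steps=200, lr=0.05):
    """Learn an additive input offset `delta` while `w` stays untouched.

    Minimizes 0.5 * (frozen_model(x + delta) - y)^2 by gradient descent.
    The gradient with respect to delta[i] is err * w[i].
    """
    delta = [0.0] * len(w)
    for _ in range(steps):
        for x, y in examples:
            shifted = [xi + di for xi, di in zip(x, delta)]
            err = frozen_model(shifted, w) - y
            delta = [di - lr * err * wi for di, wi in zip(delta, w)]
    return delta

# Toy usage: the frozen weights never change; only delta is trained.
w = [1.0, 2.0]
x, y = [1.0, 1.0], 5.0
delta = train_input_reprogramming(w, [(x, y)])
adapted = frozen_model([xi + di for xi, di in zip(x, delta)], w)
```

The design point the sketch tries to show is that the update rule only ever writes to `delta`, so the original model remains usable for its first language.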
-
Publication number: US20240169981A1
Publication date: 2024-05-23
Application number: US18512110
Application date: 2023-11-17
Applicant: Google LLC
Inventor: Wenqian Ronny Huang, Shuo-yiin Chang, Tara N. Sainath, Yanzhang He
IPC: G10L15/197, G10L15/02, G10L15/05, G10L15/06, G10L15/16
CPC classification number: G10L15/197, G10L15/02, G10L15/05, G10L15/063, G10L15/16, G10L2015/025, G10L15/22
Abstract: A unified end-to-end segmenter and two-pass automatic speech recognition (ASR) model includes a first encoder, a first decoder, a second encoder, and a second decoder. The first encoder is configured to receive a sequence of acoustic frames and generate a first higher order feature representation. The first decoder is configured to receive the first higher order feature representation and generate, at each of a plurality of output steps, a first probability distribution and an indication of whether the output step corresponds to an end of speech segment, and emit an end of speech timestamp. The second encoder is configured to receive the first higher order feature representation and the end of speech timestamp, and generate a second higher order feature representation. The second decoder is configured to receive the second higher order feature representation and generate a second probability distribution.
-
Publication number: US20240153498A1
Publication date: 2024-05-09
Application number: US18490861
Application date: 2023-10-20
Applicant: Google LLC
Inventor: Tara N. Sainath, Rohit Prakash Prabhavalkar, Diamantino Antonio Caseiro, Patrick Maxim Rondon, Cyril Allauzen
IPC: G10L15/16, G10L15/06, G10L15/183
CPC classification number: G10L15/16, G10L15/063, G10L15/183
Abstract: A method includes receiving context biasing data that includes a set of unspoken textual utterances corresponding to a particular context. The method also includes obtaining a list of carrier phrases associated with the particular context. For each respective unspoken textual utterance, the method includes generating a corresponding training data pair that includes the respective unspoken textual utterance and a carrier phrase. For each respective training data pair, the method includes tokenizing the respective training data pair into a sequence of sub-word units, generating a first higher order textual feature representation for a corresponding sub-word unit, receiving the first higher order textual feature representation, and generating a first probability distribution over possible text units. The method also includes training a speech recognition model based on the first probability distribution over possible text units.
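The data preparation steps in this abstract (pair each unspoken textual utterance with a carrier phrase, then tokenize the pair into sub-word units) can be sketched as follows. The pairing policy (cycling through the carrier phrases) and the fixed-length chunking tokenizer are placeholders chosen for illustration; the abstract does not say how a carrier phrase is selected, and a real system would likely use a learned word-piece vocabulary. The names `build_training_pairs` and `toy_subword_tokenize` are hypothetical.

```python
from itertools import cycle

def build_training_pairs(utterances, carrier_phrases):
    """Pair each utterance with one carrier phrase (assumed: round-robin)."""
    return [(carrier, utt) for utt, carrier in zip(utterances, cycle(carrier_phrases))]

def toy_subword_tokenize(text, max_len=3):
    """Placeholder sub-word tokenizer: split words into fixed-size chunks.

    Continuation pieces are marked with "##". This is not a real
    word-piece model, only a stand-in to show the data flow.
    """
    pieces = []
    for word in text.lower().split():
        for i in range(0, len(word), max_len):
            piece = word[i:i + max_len]
            pieces.append(piece if i == 0 else "##" + piece)
    return pieces

# Toy usage with made-up contact names and carrier phrases.
pairs = build_training_pairs(["Alice Smith", "Bob"], ["call", "text"])
tokens = toy_subword_tokenize("call Alice")
```

Each tokenized pair would then be fed to the text encoder and decoder that the abstract describes; those model components are outside the scope of this sketch.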
-
Publication number: US20240144917A1
Publication date: 2024-05-02
Application number: US18494763
Application date: 2023-10-25
Applicant: Google LLC
Inventor: Rami Magdi Fahmi Botros, Rohit Prakash Prabhavalkar, Johan Schalkwyk, Tara N. Sainath, Ciprian Ioan Chelba, Francoise Beaufays
IPC: G10L15/16
CPC classification number: G10L15/16
Abstract: A method includes obtaining a base encoder from a pre-trained model, and receiving training data comprising a sequence of acoustic frames characterizing an utterance paired with a ground-truth transcription of the utterance. At each of a plurality of output steps, the method includes: generating, by the base encoder, a first encoded representation for a corresponding acoustic frame; generating, by an exporter network configured to receive a continuous sequence of first encoded representations generated by the base encoder, a second encoded representation for a corresponding acoustic frame; generating, by an exporter decoder, a probability distribution over possible logits; and determining an exporter decoder loss based on the probability distribution over possible logits generated by the exporter decoder at the corresponding output step and the ground-truth transcription. The method also includes training the exporter network based on the exporter decoder losses while parameters of the base encoder are frozen.
-
Publication number: US11900915B2
Publication date: 2024-02-13
Application number: US17572238
Application date: 2022-01-10
Applicant: Google LLC
Inventor: Zhifeng Chen, Bo Li, Eugene Weinstein, Yonghui Wu, Pedro J. Moreno Mengibar, Ron J. Weiss, Khe Chai Sim, Tara N. Sainath, Patrick An Phu Nguyen
CPC classification number: G10L15/005, G10L15/07, G10L15/16, G10L2015/0631
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
-
Publication number: US11783849B2
Publication date: 2023-10-10
Application number: US17303822
Application date: 2021-06-08
Applicant: Google LLC
Inventor: Ehsan Variani, Kevin William Wilson, Ron J. Weiss, Tara N. Sainath, Arun Narayanan
IPC: G10L15/16, G10L25/30, G10L21/028, G10L21/0388, G10L19/008, G10L15/20, G10L21/0208, G10L21/0216
CPC classification number: G10L25/30, G10L15/16, G10L15/20, G10L19/008, G10L21/028, G10L21/0388, G10L2021/02087, G10L2021/02166
Abstract: This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
-
Publication number: US20230186907A1
Publication date: 2023-06-15
Application number: US18164923
Application date: 2023-02-06
Applicant: Google LLC
Inventor: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prakash Prabhavalkar
CPC classification number: G10L15/1815, G06N3/049, G10L15/063, G10L15/16, G10L15/187, G10L19/0018
Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector. The method also includes decoding the first context vector and the second context vector at a context vector decoder to form a second-pass hypothesis.
-
Publication number: US20230186901A1
Publication date: 2023-06-15
Application number: US18167454
Application date: 2023-02-10
Applicant: Google LLC
Inventor: Tara N. Sainath, Ruoming Pang, Ron Weiss, Yanzhang He, Chung-Cheng Chiu, Trevor Strohman
IPC: G10L15/06, G06N3/08, G10L15/16, G10L15/197
CPC classification number: G10L15/063, G06N3/08, G10L15/16, G10L15/197, G10L2015/0635
Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
-
Publication number: US20230130634A1
Publication date: 2023-04-27
Application number: US17936547
Application date: 2022-09-29
Applicant: Google LLC
Inventor: Tara N. Sainath, Rami Botros, Anmol Gulati, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu
Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
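To show why linear attention fits a causal, recurrent-style encoder layer, the sketch below computes causal linear attention with two running sums that are updated once per frame, so each step costs the same regardless of how many frames came before. This is a generic linear-attention recurrence, not the module claimed in the application: the feature map used here (`elu(x) + 1`) is a simple stand-in assumed for illustration, whereas Performer-style attention uses random feature maps, and the function names are hypothetical.

```python
import math

def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed stand-in)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def causal_linear_attention(queries, keys, values):
    """Causal linear attention computed as a recurrence over time steps.

    State carried between steps:
      S[i][j] = sum over past t of phi(k_t)[i] * v_t[j]
      z[i]    = sum over past t of phi(k_t)[i]
    Output at step t is (phi(q_t) . S) / (phi(q_t) . z).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(d_k):
            z[i] += phi_k[i]
            for j in range(d_v):
                S[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        denom = sum(a * b for a, b in zip(phi_q, z))
        outputs.append(
            [sum(phi_q[i] * S[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs

# Toy usage with two frames of made-up 2-dimensional vectors.
outs = causal_linear_attention(
    queries=[[0.5, -0.5], [1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

Because the first step has seen only one key, its output equals the first value vector; later outputs are weighted averages of all values seen so far.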