-
公开(公告)号:US11804212B2
公开(公告)日:2023-10-31
申请号:US17348118
申请日:2021-06-15
Applicant: Google LLC
Inventor: Thibault Doutre , Wei Han , Min Ma , Zhiyun Lu , Chung-Cheng Chiu , Ruoming Pang , Arun Narayanan , Ananya Misra , Yu Zhang , Liangliang Cao
CPC classification number: G10L15/063 , G06N3/045 , G10L15/083 , G10L15/18
Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.
-
公开(公告)号:US11774596B2
公开(公告)日:2023-10-03
申请号:US17901224
申请日:2022-09-01
Applicant: Google LLC
Inventor: Jonathon Shlens , Vijay Vasudevan , Jiquan Ngiam , Wei Han , Zhifeng Chen , Brandon Chauloon Yang , Benjamin James Caine , Zhengdong Zhang , Christoph Sprunk , Ouais Alsharif , Junhua Mao , Chen Wu
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing data generated by a sensing system that rotationally senses an environment. In one aspect, a method comprises partitioning a predetermined period of time into a plurality of sub-periods, wherein the predetermined period of time is a period of time for which data generated by the sensing system constitutes a complete rotational sensing of the environment; for each sub-period: receiving current data generated by the sensing system during the sub-period and characterizing a respective partial scene of the environment; processing the current data using an object detection neural network to generate a current object detection output that is specific to the respective partial scene of the environment.
-
公开(公告)号:US12094453B2
公开(公告)日:2024-09-17
申请号:US17447285
申请日:2021-09-09
Applicant: Google LLC
Inventor: Jiahui Yu , Chung-cheng Chiu , Bo Li , Shuo-yiin Chang , Tara Sainath , Wei Han , Anmol Gulati , Yanzhang He , Arun Narayanan , Yonghui Wu , Ruoming Pang
IPC: G10L15/06 , G10L15/16 , G10L15/187 , G10L15/22 , G10L15/30
CPC classification number: G10L15/063 , G10L15/16 , G10L15/22 , G10L15/30 , G10L15/187
Abstract: A computer-implemented method of training a streaming speech recognition model that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
-
公开(公告)号:US20240290321A1
公开(公告)日:2024-08-29
申请号:US18585168
申请日:2024-02-23
Applicant: Google LLC
Inventor: Yongqiang Wang , Yu Zhang , Wei Han , Parisa Haghani , Pedro J. Moreno Mengibar
CPC classification number: G10L15/063 , G10L15/26
Abstract: A method includes receiving training data including a corpus of multilingual unspoken textual utterances, a corpus of multilingual un-transcribed non-synthetic speech utterances, and a corpus of multilingual transcribed non-synthetic speech utterances. For each un-transcribed non-synthetic speech utterance, the method includes generating a target quantized vector token and a target token index, generating contrastive context vectors from corresponding masked audio features, and deriving a contrastive loss term. The method also includes generating an alignment output, generating a first probability distribution over possible speech recognition hypotheses for the alignment output, and determining an alignment output loss term. The method also includes generating a second probability distribution over possible speech recognition hypotheses and determining a non-synthetic speech loss term. The method also includes pre-training an audio encoder based on the contrastive loss term, the alignment output loss term, and the non-synthetic speech loss term.
-
公开(公告)号:US20240029716A1
公开(公告)日:2024-01-25
申请号:US18480827
申请日:2023-10-04
Applicant: Google LLC
Inventor: Thibault Doutre , Wei Han , Min Ma , Zhiyun Lu , Chung-Cheng Chiu , Ruoming Pang , Arun Narayanan , Ananya Misra , Yu Zhang , Liangliang Cao
CPC classification number: G10L15/063 , G10L15/083 , G10L15/18 , G06N3/045
Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.
-
公开(公告)号:US20230237993A1
公开(公告)日:2023-07-27
申请号:US18011571
申请日:2021-10-01
Applicant: Google LLC
Inventor: Jiahui Yu , Ruoming Pang , Wei Han , Anmol Gulati , Chung-Cheng Chiu , Bo Li , Tara N. Sainath , Yonghui Hu
Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label and the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
-
公开(公告)号:US11508147B2
公开(公告)日:2022-11-22
申请号:US16812154
申请日:2020-03-06
Applicant: Google LLC
Inventor: Jonathon Shlens , Vijay Vasudevan , Jiquan Ngiam , Wei Han , Zhifeng Chen , Brandon Chauloon Yang , Benjamin James Caine , Zhengdong Zhang , Christoph Sprunk , Ouais Alsharif , Junhua Mao , Chen Wu
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing data generated by a sensing system that rotationally senses an environment. In one aspect, a method comprises partitioning a predetermined period of time into a plurality of sub-periods, wherein the predetermined period of time is a period of time for which data generated by the sensing system constitutes a complete rotational sensing of the environment; for each sub-period: receiving current data generated by the sensing system during the sub-period and characterizing a respective partial scene of the environment; processing the current data using an object detection neural network to generate a current object detection output that is specific to the respective partial scene of the environment.
-
公开(公告)号:US20220207321A1
公开(公告)日:2022-06-30
申请号:US17139525
申请日:2020-12-31
Applicant: Google LLC
Inventor: Anmol Gulati , Ruoming Pang , Niki Parmar , Jiahui Yu , Wei Han , Chung-Cheng Chiu , Yu Zhang , Yonghui Wu , Shibo Wang , Weikeng Qin , Zhengdong Zhang
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
-
公开(公告)号:US11335333B2
公开(公告)日:2022-05-17
申请号:US16717746
申请日:2019-12-17
Applicant: Google LLC
Inventor: Wei Han , Chung-Cheng Chiu , Yu Zhang , Yonghui Wu , Patrick Nguyen , Sergey Kishchenko
Abstract: A method includes obtaining audio data for a long-form utterance and segmenting the audio data for the long-form utterance into a plurality of overlapping segments. The method also includes, for each overlapping segment of the plurality of overlapping segments: providing features indicative of acoustic characteristics of the long-form utterance represented by the corresponding overlapping segment as input to an encoder neural network; processing an output of the encoder neural network using an attender neural network to generate a context vector; and generating word elements using the context vector and a decoder neural network. The method also includes generating a transcription for the long-form utterance by merging the word elements from the plurality of overlapping segments and providing the transcription as an output of the automated speech recognition system.
-
-
-
-
-
-
-
-