-
公开(公告)号:US20250095630A1
公开(公告)日:2025-03-20
申请号:US18966088
申请日:2024-12-02
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Ignacio Lopez Moreno , Fei Ren , Yu Zhang , Quan Wang , Patrick An Phu Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
-
公开(公告)号:US12183322B2
公开(公告)日:2024-12-31
申请号:US17934555
申请日:2022-09-22
Applicant: Google LLC
Inventor: Bo Li , Tara N. Sainath , Ruoming Pang , Shuo-yiin Chang , Qiumin Xu , Trevor Strohman , Vince Chen , Qiao Liang , Heguang Liu , Yanzhang He , Parisa Haghani , Sameer Bidichandani
Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.
-
公开(公告)号:US20240420687A1
公开(公告)日:2024-12-19
申请号:US18815537
申请日:2024-08-26
Applicant: GOOGLE LLC
Inventor: Tara N. Sainath , Yanzhang He , Bo Li , Arun Narayanan , Ruoming Pang , Antoine Jean Bruguier , Shuo-yiin Chang , Wei Li
Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transformer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a listen attend spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.
-
公开(公告)号:US12154581B2
公开(公告)日:2024-11-26
申请号:US17237021
申请日:2021-04-21
Applicant: Google LLC
Inventor: Arun Narayanan , Tara Sainath , Chung-Cheng Chiu , Ruoming Pang , Rohit Prabhavalkar , Jiahui Yu , Ehsan Variani , Trevor Strohman
Abstract: An automated speech recognition (ASR) model includes a first encoder, a second encoder, and a decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The second encoder receives, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame. The decoder receives, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps, and generates, at each of the plurality of time steps, a first probability distribution over possible speech recognition hypotheses.
-
公开(公告)号:US20240104352A1
公开(公告)日:2024-03-28
申请号:US18012391
申请日:2022-07-28
Applicant: Google LLC
Inventor: Yu Zhang , Yu-An Chung , Wei Han , Chung-Cheng Chiu , Weikeng Qin , Ruoming Pang , Yonghui Wu
IPC: G06N3/0455
CPC classification number: G06N3/0455
Abstract: Provided are improved end-to-end self-supervised pre-training frameworks that leverage a combination of contrastive and masked modeling loss terms. In particular, the present disclosure provides framework that combines contrastive learning and masked modeling, where the former trains the model to discretize input data (e.g., continuous signals such as continuous speech signals) into a finite set of discriminative tokens, and the latter trains the model to learn contextualized representations via solving a masked prediction task consuming the discretized tokens. In contrast to certain existing masked modeling-based pre-training frameworks which rely on an iterative re-clustering and re-training process or other existing frameworks which concatenate two separately trained modules, the proposed framework can enable a model to be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and masked modeling) simultaneously.
-
公开(公告)号:US11908461B2
公开(公告)日:2024-02-20
申请号:US17149018
申请日:2021-01-14
Applicant: Google LLC
Inventor: Ke Hu , Tara N. Sainath , Ruoming Pang , Rohit Prakash Prabhavalkar
CPC classification number: G10L15/1815 , G06N3/049 , G10L15/063 , G10L15/16 , G10L15/187 , G10L19/0018
Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector. The method also includes decoding the first context vector and the second context vector at a context vector decoder to form a second-pass hypothesis.
-
公开(公告)号:US11848002B2
公开(公告)日:2023-12-19
申请号:US17813361
申请日:2022-07-19
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Ignacio Lopez Moreno , Fei Ren , Yu Zhang , Quan Wang , Patrick An Phu Nguyen
CPC classification number: G10L13/04 , G10L17/04 , G10L19/00 , G06N3/08 , G10L2013/021
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
-
公开(公告)号:US11804212B2
公开(公告)日:2023-10-31
申请号:US17348118
申请日:2021-06-15
Applicant: Google LLC
Inventor: Thibault Doutre , Wei Han , Min Ma , Zhiyun Lu , Chung-Cheng Chiu , Ruoming Pang , Arun Narayanan , Ananya Misra , Yu Zhang , Liangliang Cao
CPC classification number: G10L15/063 , G06N3/045 , G10L15/083 , G10L15/18
Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.
-
公开(公告)号:US20230244904A1
公开(公告)日:2023-08-03
申请号:US18154321
申请日:2023-01-13
Applicant: Google LLC
Inventor: Mingxing Tan , Quoc Le , Bo Chen , Vijay Vasudevan , Ruoming Pang
Abstract: The present disclosure is directed to an automated neural architecture search approach for designing new neural network architectures such as, for example, resource-constrained mobile CNN models. In particular, the present disclosure provides systems and methods to perform neural architecture search using a novel factorized hierarchical search space that permits layer diversity throughout the network, thereby striking the right balance between flexibility and search space size. The resulting neural architectures are able to be run relatively faster and using relatively fewer computing resources (e.g., less processing power, less memory usage, less power consumption, etc.), all while remaining competitive with or even exceeding the performance (e.g., accuracy) of current state-of-the-art mobile-optimized models.
-
公开(公告)号:US11715458B2
公开(公告)日:2023-08-01
申请号:US17316198
申请日:2021-05-10
Applicant: Google LLC
Inventor: Tara Sainath , Arun Narayanan , Rami Botros , Yanzhang He , Ehsan Variani , Cyril Allauzen , David Rybach , Ruoming Pang , Trevor Strohman
CPC classification number: G10L15/063 , G10L15/02 , G10L15/22 , G10L15/30
Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypothesis. The ASR model also includes a language model configured to receive the first probability distribution over possible speech hypothesis and generate a rescored probability distribution.
-
-
-
-
-
-
-
-
-