-
Publication No.: US20210350786A1
Publication Date: 2021-11-11
Application No.: US16869552
Filing Date: 2020-05-07
Applicant: Google LLC
Inventor: Zhehuai Chen , Andrew M. Rosenberg , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from a set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
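The generator-side update described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the logistic `discriminator` and its weights are hypothetical stand-ins for the GAN's adversarial discriminator, and the loss shown is the standard generator-side GAN objective, which grows when the discriminator easily detects that the synthetic speech differs acoustically from real speech.

```python
import math

def discriminator(features, weights, bias):
    """Toy adversarial discriminator: probability that the input acoustic
    features come from a non-synthetic (real) spoken utterance."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def adversarial_loss_term(synthetic_features, weights, bias):
    """Generator-side GAN loss for the TTS model: large when the
    discriminator confidently flags the synthetic representation as fake,
    i.e. when the acoustic disparity from real speech is easy to detect."""
    p_real = discriminator(synthetic_features, weights, bias)
    return -math.log(max(p_real, 1e-12))
```

Updating the TTS parameters to reduce this term pushes the synthetic speech representations toward the acoustic characteristics of the non-synthetic ones.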
-
Publication No.: US20210233510A1
Publication Date: 2021-07-29
Application No.: US17152760
Filing Date: 2021-01-19
Applicant: Google LLC
Inventor: Arindrima Datta , Bhuvana Ramabhadran , Jesse Emond , Brian Roark
Abstract: A method includes obtaining a plurality of training data sets, each associated with a respective native language and each including a plurality of respective training data samples. For each respective training data sample of each training data set in the respective native language, the method includes transliterating the corresponding transcription in the respective native script into corresponding transliterated text representing the respective native language of the corresponding audio in a target script and associating the corresponding transliterated text in the target script with the corresponding audio in the respective native language to generate a respective normalized training data sample. The method also includes training, using the normalized training data samples, a multilingual end-to-end speech recognition model to predict speech recognition results in the target script for corresponding speech utterances spoken in any of the different native languages associated with the plurality of training data sets.
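The normalization step above can be illustrated with a small sketch. The character-level mapping table and the Cyrillic-to-Latin example below are hypothetical; real transliteration systems handle context-dependent and many-to-many mappings, but the pooling of per-language data into one target-script training set follows the abstract directly.

```python
def transliterate(text, table):
    """Map each character of a native-script transcription into the target
    script; characters without a mapping pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

def normalize_training_sets(training_sets, table):
    """training_sets: {language: [(audio, native_script_transcription), ...]}.
    Returns one pooled list of (audio, target_script_transcription) pairs
    for training a single multilingual end-to-end recognizer."""
    normalized = []
    for language, samples in training_sets.items():
        for audio, transcription in samples:
            normalized.append((audio, transliterate(transcription, table)))
    return normalized
```

Because every normalized sample pairs audio in its native language with text in the shared target script, one model can be trained across all of the languages at once.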
-
Publication No.: US20250095639A1
Publication Date: 2025-03-20
Application No.: US18962686
Filing Date: 2024-11-27
Applicant: Google LLC
Inventor: Andrew M. Rosenberg , Gary Wang , Bhuvana Ramabhadran , Fadi Biadsy
IPC: G10L15/06 , G10L13/02 , G10L15/16 , G10L15/197 , G10L15/22 , G10L19/00 , G10L19/038 , G10L21/003
Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation. The method also includes determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term.
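A consistency term over the two hypothesis distributions can be sketched as below. The symmetric KL divergence shown is one common choice, used here as an assumed stand-in; the patent's exact loss formulation may differ.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over two discrete distributions with equal support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistent_loss(p_non_synthetic, p_synthetic):
    """Symmetric divergence between the hypothesis distribution for the
    non-synthetic utterance and the one for its synthetic counterpart;
    zero when the recognizer treats both representations identically."""
    return 0.5 * (kl_divergence(p_non_synthetic, p_synthetic)
                  + kl_divergence(p_synthetic, p_non_synthetic))
```

Minimizing this term encourages the speech recognition model to behave the same on real speech and on voice-converted synthetic speech of the same utterance.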
-
Publication No.: US20250078813A1
Publication Date: 2025-03-06
Application No.: US18817181
Filing Date: 2024-08-27
Applicant: Google LLC
Inventor: Kartik Audhkhasi , Gowtham Ramesh , Bhuvana Ramabhadran
IPC: G10L15/06
Abstract: A method includes training, using an un-supervised learning technique, an auxiliary ASR model based on a first set of un-transcribed source task speech utterances to determine a first task vector, training, using the un-supervised learning technique, the auxiliary ASR model based on a second set of un-transcribed speech utterances to determine a second task vector, and training, using the un-supervised learning technique, the auxiliary ASR model based on un-transcribed target task speech utterances to determine a target task vector. The method also includes determining a first correlation between the first and target task vectors, determining a second correlation between the second and target task vectors, and adapting parameters of a trained primary ASR model based on the first and second task vectors and the first and second correlations to teach the primary ASR model to learn how to recognize speech associated with the target task.
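The adaptation step can be sketched as correlation-weighted task arithmetic. Using cosine similarity as the correlation measure is an assumption for illustration; the idea is that source task vectors more correlated with the target task contribute more to the adapted parameters.

```python
import math

def cosine(u, v):
    """Cosine similarity between two task vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def adapt_parameters(primary, task_vectors, target_vector):
    """Weight each source task vector by its correlation with the target
    task vector, then add the weighted combination to the primary model's
    parameters."""
    weights = [cosine(tv, target_vector) for tv in task_vectors]
    return [p + sum(w * tv[i] for w, tv in zip(weights, task_vectors))
            for i, p in enumerate(primary)]
```

A source task orthogonal to the target receives weight zero and leaves the primary parameters untouched in its direction.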
-
Publication No.: US20240420692A1
Publication Date: 2024-12-19
Application No.: US18818010
Filing Date: 2024-08-28
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/00 , G10L15/16 , G10L15/22
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
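The second-pass selection can be sketched as a log-linear combination of the three scores. The specific combination below (adding the weighted external LM score and subtracting the weighted standalone prior, a common way to discount an internal language model) and the weight values are assumptions for illustration, not the patented scoring function.

```python
def rescore(candidates, lm_weight, prior_weight):
    """candidates: list of (hypothesis, likelihood, lm_score, prior_score)
    tuples, all scores in log space. Computes an overall score per
    candidate and returns the hypothesis with the highest overall score."""
    def overall(candidate):
        _, likelihood, lm_score, prior_score = candidate
        return likelihood + lm_weight * lm_score - prior_weight * prior_score
    return max(candidates, key=overall)[0]
```

With a strong external LM, a hypothesis with a slightly lower first-pass likelihood can overtake the first-pass best, as in the test below.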
-
Publication No.: US20240304178A1
Publication Date: 2024-09-12
Application No.: US18439630
Filing Date: 2024-02-12
Applicant: Google LLC
Inventor: Andrew M Rosenberg , Yacob Yochai Blau , Bhuvana Ramabhadran , Genady Beryozkin , Gary Wang , Zhehuai Chen , Rohan Agrawal , Parisa Haghani
CPC classification number: G10L15/063 , G10L15/22 , G10L15/26
Abstract: A method includes receiving training data including transcribed speech utterances spoken in a general domain, modified speech utterances in a target domain, and unspoken textual utterances corresponding to the transcriptions of the modified speech utterances in the target domain. The modified speech utterances include utterances spoken in the target domain that have been modified to obfuscate one or more classes of sensitive information recited in the utterances. The method also includes generating a corresponding alignment output for each unspoken textual utterance of the received training data using an alignment model. The method also includes training a speech recognition model on the alignment outputs corresponding to the unspoken textual utterances, the un-transcribed speech utterances, and the transcribed speech utterances to teach the speech recognition model to learn to recognize speech in the target domain and phrases within the one or more classes of sensitive information.
-
Publication No.: US12087273B2
Publication Date: 2024-09-10
Application No.: US18161217
Filing Date: 2023-01-30
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L21/00 , G10L13/00 , G10L13/047
CPC classification number: G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
-
Publication No.: US12087272B2
Publication Date: 2024-09-10
Application No.: US17756995
Filing Date: 2019-12-13
Applicant: Google LLC
Inventor: Andrew Rosenberg , Bhuvana Ramabhadran , Fadi Biadsy , Yu Zhang
IPC: G10L15/16 , G10L13/047 , G10L13/08 , G10L15/06
CPC classification number: G10L13/047 , G10L13/086 , G10L15/063 , G10L15/16
Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
-
Publication No.: US20240296832A1
Publication Date: 2024-09-05
Application No.: US18590918
Filing Date: 2024-02-28
Applicant: Google LLC
Inventor: Andrew M. Rosenberg , Murali Karthick Baskar , Bhuvana Ramabhadran
IPC: G10L15/06 , G10L15/01 , G10L15/16 , G10L15/197
CPC classification number: G10L15/063 , G10L15/01 , G10L15/16 , G10L15/197
Abstract: A method includes, for each training sample of a plurality of training samples, processing, using an RNN-T model, a corresponding sequence of acoustic frames to obtain an n-best list of speech recognition hypotheses, and, for each speech recognition hypothesis of the n-best list, determining a corresponding number of word errors relative to a corresponding ground-truth transcription. For a top-ranked hypothesis from the n-best list, the method includes determining a first loss based on the corresponding ground-truth transcription. The method includes identifying, as an oracle hypothesis, the speech recognition hypothesis from the n-best list having the smallest corresponding number of word errors relative to the corresponding ground-truth transcription, and determining a second loss for the oracle hypothesis based on the corresponding ground-truth transcription. The method includes determining a corresponding self-training combined loss based on the first and second losses, and training the RNN-T model based on the corresponding self-training combined loss.
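The oracle selection and loss combination can be sketched as below. The word-error count is standard word-level Levenshtein distance; the linear mixing of the two losses with a weight `alpha` is an assumption for illustration, as the abstract does not specify how the first and second losses are combined.

```python
def word_errors(hyp, ref):
    """Word-level Levenshtein distance between a hypothesis and the
    ground-truth transcription (substitutions, insertions, deletions)."""
    h, r = hyp.split(), ref.split()
    d = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        prev, d[0] = d[0], i
        for j, rw in enumerate(r, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (hw != rw))
            prev, d[j] = d[j], cur
    return d[len(r)]

def combined_loss(nbest, ref, loss_fn, alpha=0.5):
    """nbest is ranked best-first. Mixes the loss of the top-ranked
    hypothesis with the loss of the oracle (fewest-word-errors) hypothesis
    into a single self-training combined loss."""
    top = nbest[0]
    oracle = min(nbest, key=lambda hyp: word_errors(hyp, ref))
    return alpha * loss_fn(top, ref) + (1 - alpha) * loss_fn(oracle, ref)
```

For simplicity the test below reuses `word_errors` as the per-hypothesis loss; in training, `loss_fn` would be a differentiable model loss against the ground truth.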
-
Publication No.: US11990117B2
Publication Date: 2024-05-21
Application No.: US17451613
Filing Date: 2021-10-20
Applicant: Google LLC
Inventor: Zhehuai Chen , Bhuvana Ramabhadran , Andrew Rosenberg , Yu Zhang , Pedro J. Moreno Mengibar
IPC: G10L13/047 , G10L13/08 , G10L13/10
CPC classification number: G10L13/047 , G10L13/086 , G10L13/10
Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.