-
Publication No.: US12087272B2
Publication Date: 2024-09-10
Application No.: US17756995
Filing Date: 2019-12-13
Applicant: Google LLC
Inventor: Andrew Rosenberg , Bhuvana Ramabhadran , Fadi Biadsy , Yu Zhang
IPC: G10L15/16 , G10L13/047 , G10L13/08 , G10L15/06
CPC classification number: G10L13/047 , G10L13/086 , G10L15/063 , G10L15/16
Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
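As a rough illustration of the loss described in this abstract (an interpretation only, not the patented implementation; the `phone_label_loss` helper and its dict-based label distributions are hypothetical), the per-time-step penalty for a predicted phone label disagreeing with the aligned reference label could be sketched as:

```python
import math

def phone_label_loss(predicted_probs, reference_labels):
    """Per-time-step phone label loss: negative log-likelihood of each
    reference phone label under the predicted label distribution, after
    the two sequences have been aligned one-to-one.

    predicted_probs: list of {phone: probability} dicts, one per time step.
    reference_labels: list of reference phone labels, same length.
    """
    assert len(predicted_probs) == len(reference_labels)
    losses = []
    for probs, ref in zip(predicted_probs, reference_labels):
        # Small floor avoids log(0) when the model assigns no mass to ref.
        losses.append(-math.log(probs.get(ref, 1e-9)))
    return sum(losses) / len(losses)
```

A loss of zero corresponds to the phone label mapping network placing all probability on the correct reference label at every step; the TTS model parameters would then be updated to reduce this quantity.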
-
Publication No.: US12020685B2
Publication Date: 2024-06-25
Application No.: US17643684
Filing Date: 2021-12-10
Applicant: Google LLC
Inventor: Ye Jia , Byungha Chun , Yu Zhang , Jonathan Shen , Yonghui Wu
IPC: G10L13/08 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
CPC classification number: G10L13/086 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
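The phoneme-to-grapheme pairing this abstract describes can be sketched in a few lines of pure Python (a simplified illustration, not the claimed encoder; the tuple-based token representation and the `pair_phonemes_with_graphemes` helper are hypothetical), assuming each token carries the index of the word it represents:

```python
def pair_phonemes_with_graphemes(grapheme_tokens, phoneme_tokens):
    """For each phoneme token, find the grapheme token representing the
    same word in the input sequence.

    Tokens are (token_string, word_index) pairs; the word index ties a
    token back to its word in the sequence of words.
    """
    grapheme_by_word = {word_idx: tok for tok, word_idx in grapheme_tokens}
    return [(ph_tok, grapheme_by_word[word_idx])
            for ph_tok, word_idx in phoneme_tokens]
```

The output encoder embedding in the abstract is then built from exactly this relationship: every phoneme token is tied to the grapheme token determined to represent the same word.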
-
Publication No.: US20240185841A1
Publication Date: 2024-06-06
Application No.: US18490808
Filing Date: 2023-10-20
Applicant: Google LLC
Inventor: Bo Li , Yu Zhang , Nanxin Chen , Rohit Prakash Prabhavalkar , Chao-Han Huck Yang , Tara N. Sainath , Trevor Strohman
IPC: G10L15/065 , G10L15/00
CPC classification number: G10L15/065 , G10L15/005
Abstract: A method includes obtaining an ASR model trained to recognize speech in a first language and receiving transcribed training utterances in a second language. The method also includes integrating the ASR model with an input reprogramming module and a latent reprogramming module. The method also includes adapting the ASR model to learn how to recognize speech in the second language by training the input reprogramming module and the latent reprogramming module while parameters of the ASR model are frozen.
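The key mechanism here is that only the reprogramming modules are trained while the ASR model's own parameters stay frozen. A minimal sketch of one such update step (purely illustrative; the flat parameter dicts and the `adaptation_step` helper are hypothetical, standing in for real optimizer machinery):

```python
def adaptation_step(base_params, reprogram_params, grads, lr=0.1):
    """One adaptation step: the base ASR parameters stay frozen, while
    only the input/latent reprogramming parameters receive gradient
    updates. Parameters and gradients are name -> float dicts.
    """
    updated = dict(reprogram_params)
    for name, grad in grads.items():
        if name in updated:
            updated[name] -= lr * grad   # reprogramming params: trained
        # any gradient on a base ASR param is simply ignored (frozen)
    return base_params, updated
```

Because the second-language knowledge lives entirely in the small reprogramming modules, the original first-language model is left intact.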
-
Publication No.: US11990117B2
Publication Date: 2024-05-21
Application No.: US17451613
Filing Date: 2021-10-20
Applicant: Google LLC
Inventor: Zhehuai Chen , Bhuvana Ramabhadran , Andrew Rosenberg , Yu Zhang , Pedro J. Moreno Mengibar
IPC: G10L13/047 , G10L13/08 , G10L13/10
CPC classification number: G10L13/047 , G10L13/086 , G10L13/10
Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
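One plausible reading of the consistent loss term is a symmetric distance between the two recognition results, encouraging the recognizer to behave the same on native and cross-lingual renderings of the same text. A toy sketch (the per-token probability-dict representation and the `consistency_loss` helper are assumptions, not the claimed formulation):

```python
def consistency_loss(native_probs, cross_lingual_probs):
    """Symmetric per-token consistency term between the recognition
    distributions for the native and cross-lingual synthesized speech:
    mean absolute difference of the two probability distributions.

    Each argument is a list of {token: probability} dicts, aligned
    position by position.
    """
    total, count = 0.0, 0
    for p_nat, p_cl in zip(native_probs, cross_lingual_probs):
        vocab = set(p_nat) | set(p_cl)
        for tok in vocab:
            total += abs(p_nat.get(tok, 0.0) - p_cl.get(tok, 0.0))
            count += 1
    return total / count
```

Identical recognition results give a loss of zero; the farther the two results diverge, the larger the term used to update the speech recognition model.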
-
Publication No.: US20230359898A1
Publication Date: 2023-11-09
Application No.: US18350464
Filing Date: 2023-07-11
Applicant: Google LLC
Inventor: Daniel Sung-Joon Park , Quoc Le , William Chan , Ekin Dogus Cubuk , Barret Zoph , Yu Zhang , Chung-Cheng Chiu
CPC classification number: G06N3/084 , G06N20/00 , G10L15/16 , G10L15/063 , G10L15/12 , G06V10/7747 , G10L15/28 , G06V10/82 , G06F18/2148
Abstract: Generally, the present disclosure is directed to systems and methods that generate augmented training data for machine-learned models via application of one or more augmentation techniques to audiographic images that visually represent audio signals. In particular, the present disclosure provides a number of novel augmentation operations which can be performed directly upon the audiographic image (e.g., as opposed to the raw audio data) to generate augmented training data that results in improved model performance. As an example, the audiographic images can be or include one or more spectrograms or filter bank sequences.
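Masking operations applied directly to the audiographic image, rather than the raw audio, can be sketched as follows (a generic illustration of spectrogram masking under assumed parameters; the `mask_spectrogram` helper and its fixed mask widths are not taken from the disclosure):

```python
import random

def mask_spectrogram(spec, max_time_width=2, max_freq_width=2, seed=0):
    """Augment an audiographic image (a time x frequency grid of floats)
    by zeroing out one random block of consecutive time steps and one
    random block of consecutive frequency channels, operating directly
    on the image rather than the raw audio signal.
    """
    rng = random.Random(seed)
    n_time, n_freq = len(spec), len(spec[0])
    out = [row[:] for row in spec]          # leave the original untouched
    # Time mask: blank a contiguous span of time steps.
    t_w = rng.randint(1, max_time_width)
    t0 = rng.randint(0, n_time - t_w)
    for t in range(t0, t0 + t_w):
        out[t] = [0.0] * n_freq
    # Frequency mask: blank a contiguous span of frequency channels.
    f_w = rng.randint(1, max_freq_width)
    f0 = rng.randint(0, n_freq - f_w)
    for row in out:
        for f in range(f0, f0 + f_w):
            row[f] = 0.0
    return out
```

Each call with a fresh seed yields a differently masked copy of the same spectrogram, which is the sense in which the technique multiplies the effective training data.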
-
Publication No.: US20230325658A1
Publication Date: 2023-10-12
Application No.: US18010426
Filing Date: 2021-09-02
Applicant: Google LLC
Inventor: Nanxin Chen , Byungha Chun , William Chan , Ron J. Weiss , Mohammad Norouzi , Yu Zhang , Yonghui Wu
CPC classification number: G06N3/08 , G06V10/26 , G06V10/764 , G06V10/82 , G10L13/02 , G10L25/18 , G10L25/30
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating outputs conditioned on network inputs using neural networks. In one aspect, a method comprises obtaining the network input; initializing a current network output; and generating the final network output by updating the current network output at each of a plurality of iterations, wherein each iteration corresponds to a respective noise level, and wherein the updating comprises, at each iteration: processing a model input for the iteration comprising (i) the current network output and (ii) the network input using a noise estimation neural network that is configured to process the model input to generate a noise output, wherein the noise output comprises a respective noise estimate for each value in the current network output; and updating the current network output using the noise estimate and the noise level for the iteration.
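The iterative update loop in this abstract can be reduced to a toy numeric sketch (the `iterative_refinement` helper, the callable noise estimator, and the simple subtraction-based update rule are all assumptions standing in for the neural network and the claimed update):

```python
def iterative_refinement(network_input, noise_estimator, noise_levels, init):
    """Generate an output conditioned on `network_input` by repeatedly
    refining a current estimate: at each iteration, a noise estimator
    predicts the noise present in the current output, and that estimate
    is subtracted in proportion to the iteration's noise level.
    """
    current = list(init)
    for level in noise_levels:
        noise = noise_estimator(current, network_input)
        current = [c - level * n for c, n in zip(current, noise)]
    return current
```

With an estimator that reports the true residual, the loop contracts toward the target at a rate set by the noise levels, which mirrors the role of the per-iteration noise schedule in the abstract.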
-
Publication No.: US20230317059A1
Publication Date: 2023-10-05
Application No.: US18168470
Filing Date: 2023-02-13
Applicant: Google LLC
Inventor: Andrew M Rosenberg , Zhehuai Chen , Yu Zhang , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G06F40/289 , G10L15/16 , G10L15/06
CPC classification number: G10L15/063 , G06F40/289 , G10L15/16 , G10L15/197 , G10L2015/0635
Abstract: A method includes receiving training data that includes unspoken textual utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding alignment output for each unspoken textual utterance of the received training data using an alignment model. The method also includes pre-training an audio encoder on the alignment outputs generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
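One simple way to picture pre-training on three data streams at once is a weighted sum of their per-stream losses (a schematic sketch only; the `pretraining_loss` helper, the equal default weights, and the mean-based aggregation are illustrative assumptions, not the claimed objective):

```python
def pretraining_loss(text_alignment_losses, untranscribed_losses,
                     transcribed_losses, weights=(1.0, 1.0, 1.0)):
    """Combine per-example losses from the three training streams
    (unspoken text via alignment outputs, un-transcribed speech, and
    transcribed speech) into a single objective, so the audio encoder
    learns shared speech and text representations jointly.
    """
    w_text, w_untr, w_tr = weights
    mean = lambda xs: sum(xs) / len(xs)
    return (w_text * mean(text_alignment_losses)
            + w_untr * mean(untranscribed_losses)
            + w_tr * mean(transcribed_losses))
```

Because the text-only stream contributes through alignment outputs rather than paired audio, all three streams can drive the same encoder parameters in one pre-training run.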
-
Publication No.: US20230064749A1
Publication Date: 2023-03-02
Application No.: US18054604
Filing Date: 2022-11-11
Applicant: Google LLC
Inventor: Lev Finkelstein , Chun-an Chan , Byungha Chun , Ye Jia , Yu Zhang , Robert Andrew James Clark , Vincent Wan
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
-
Publication No.: US11580952B2
Publication Date: 2023-02-14
Application No.: US16855042
Filing Date: 2020-04-22
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L13/00 , G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
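Conditioning the TTS model on a speaker embedding is often realized by attaching the fixed speaker vector to every encoded text position; a minimal sketch of that idea (the list-based encodings and the `condition_on_speaker` helper are hypothetical simplifications of the model described):

```python
def condition_on_speaker(text_encodings, speaker_embedding):
    """Condition decoder input on a target speaker: the speaker
    embedding (a fixed vector of voice characteristics) is concatenated
    onto the encoding of every input text position, so the same text
    sequence can be rendered in the target speaker's voice.
    """
    return [enc + speaker_embedding for enc in text_encodings]
```

Swapping in a different speaker embedding changes the voice of the synthesized output without changing the text encodings, which is what allows cloning the voice of a native speaker of a different language.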
-
Publication No.: US11514888B2
Publication Date: 2022-11-29
Application No.: US16992410
Filing Date: 2020-08-13
Applicant: Google LLC
Inventor: Lev Finkelstein , Chun-An Chan , Byungha Chun , Ye Jia , Yu Zhang , Robert Andrew James Clark , Vincent Wan
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
-