Patent search ap:("GOOGLE LLC") AND inv:"Byungha Chun" Page 2

11.

Invention publication
Parallel Tacotron Non-Autoregressive and Controllable TTS

Legal status: Under examination (published)

Publication (announcement) number: US20240161730A1

Publication (announcement) date: 2024-05-16

Application number: US18421116

Application date: 2024-01-24

Applicant: Google LLC

Inventor: Isaac Elias, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu, Byungha Chun

IPC: G10L13/08, G06F40/126, G06N3/044, G06N3/045, G06N3/08, G06N3/088, G10L13/047, G10L21/10

CPC classification number: G10L13/08, G06F40/126, G06N3/044, G06N3/045, G06N3/08, G06N3/088, G10L13/047, G10L21/10, G06N3/048

Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.
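The abstract above trains on two signals: a final spectrogram loss and a phoneme duration loss. A minimal sketch of how such a combined objective could be computed follows. The abstract does not name the loss functions or any weighting, so the mean absolute error for spectrograms, the mean squared error for durations, the summation over predicted sequences, and the `duration_weight` parameter are all assumptions made for illustration.

```python
from typing import List

Spectrogram = List[List[float]]  # frames x mel bins


def mean_abs_error(pred: Spectrogram, ref: Spectrogram) -> float:
    """Mean absolute difference over all frames and mel bins (assumed loss form)."""
    diffs = [abs(p - r) for pf, rf in zip(pred, ref) for p, r in zip(pf, rf)]
    return sum(diffs) / len(diffs)


def mean_squared_error(pred: List[float], ref: List[float]) -> float:
    """Mean squared difference between predicted and reference durations (assumed loss form)."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)


def combined_loss(
    predicted_spectrograms: List[Spectrogram],
    reference_spectrogram: Spectrogram,
    predicted_durations: List[float],
    reference_durations: List[float],
    duration_weight: float = 1.0,  # hypothetical weighting, not from the abstract
) -> float:
    # "Final spectrogram loss": here, the sum of per-sequence losses (an assumption).
    spectrogram_loss = sum(
        mean_abs_error(pred, reference_spectrogram) for pred in predicted_spectrograms
    )
    duration_loss = mean_squared_error(predicted_durations, reference_durations)
    return spectrogram_loss + duration_weight * duration_loss


loss = combined_loss(
    predicted_spectrograms=[[[0.0, 1.0]], [[1.0, 1.0]]],
    reference_spectrogram=[[1.0, 1.0]],
    predicted_durations=[2.0, 3.0],
    reference_durations=[2.0, 5.0],
)
```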

12.

Invention publication
MASSIVE MULTILINGUAL SPEECH-TEXT JOINT SEMI-SUPERVISED LEARNING FOR TEXT-TO-SPEECH

Legal status: Under examination (published)

Publication (announcement) number: US20240153484A1

Publication (announcement) date: 2024-05-09

Application number: US18494324

Application date: 2023-10-25

Applicant: Google LLC

Inventor: Andrew M. Rosenberg, Takaaki Saeki, Zhehuai Chen, Byungha Chun, Bhuvana Ramabhadran

IPC: G10L13/047, G10L15/06, G10L15/16

CPC classification number: G10L13/047, G10L15/063, G10L15/16

Abstract: A method includes receiving training data that includes a plurality of sets of text-to-speech (TTS) spoken utterances each associated with a respective language and including TTS utterances of synthetic speech spoken that includes a corresponding reference speech representation paired with a corresponding input text sequence. For each TTS utterance in each set of the TTS spoken training utterances of the received training data, the method includes generating a corresponding TTS encoded textual representation for the corresponding input text sequence, generating a corresponding speech encoding for the corresponding TTS utterance of synthetic speech, generating a shared encoder output, generating a predicted speech representation for the corresponding TTS utterance of synthetic speech, and determining a reconstruction loss. The method also includes training a TTS model based on the reconstruction losses determined for the TTS utterances in each set of the TTS spoken training utterances.

13.

Invention publication
Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech

Legal status: Under examination (published)

Publication (announcement) number: US20240062743A1

Publication (announcement) date: 2024-02-22

Application number: US18499031

Application date: 2023-10-31

Applicant: Google LLC

Inventor: Isaac Elias, Byungha Chun, Jonathan Shen, Ye Jia, Yu Zhang, Yonghui Wu

IPC: G10L13/08, G10L13/04

CPC classification number: G10L13/08, G10L13/04

Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.

14.

Invention application
Phonemes And Graphemes for Neural Text-to-Speech

Legal status: In force

Publication (announcement) number: US20220310059A1

Publication (announcement) date: 2022-09-29

Application number: US17643684

Application date: 2021-12-10

Applicant: Google LLC

Inventor: Ye Jia, Byungha Chun, Yu Zhang, Jonathan Shen, Yonghui Wu

IPC: G10L13/08, G06F40/279, G06F40/263, G06N3/08

Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
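The word-level pairing in the abstract above (each phoneme token is matched to the grapheme token that represents the same word) can be illustrated with a small lookup. This is only a sketch: the `(symbol, word_index)` token format and the function name are hypothetical, and the abstract does not say how the encoder represents or uses the relationship.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # (symbol, index of the word it belongs to); hypothetical format


def pair_phonemes_with_graphemes(
    grapheme_tokens: List[Token], phoneme_tokens: List[Token]
) -> List[Tuple[str, List[str]]]:
    """For each phoneme token, list the grapheme tokens that share its word index."""
    graphemes_by_word: Dict[int, List[str]] = {}
    for symbol, word_index in grapheme_tokens:
        graphemes_by_word.setdefault(word_index, []).append(symbol)
    return [
        (symbol, graphemes_by_word.get(word_index, []))
        for symbol, word_index in phoneme_tokens
    ]


pairs = pair_phonemes_with_graphemes(
    grapheme_tokens=[("hi", 0), ("there", 1)],
    phoneme_tokens=[("HH", 0), ("AY", 0), ("DH", 1)],
)
```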

15.

Invention application
TEXT-TO-SPEECH USING DURATION PREDICTION

Legal status: In force

Publication (announcement) number: US20220108680A1

Publication (announcement) date: 2022-04-07

Application number: US17492543

Application date: 2021-10-01

Applicant: Google LLC

Inventor: Yu Zhang, Isaac Elias, Byungha Chun, Ye Jia, Yonghui Wu, Mike Chrzanowski, Jonathan Shen

IPC: G10L13/027, G10L13/04

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
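The upsampling step in the abstract above can be illustrated with the simplest possible scheme: repeat each element's representation for as many intermediate time steps as its predicted duration. The abstract does not state how the upsampling is performed, so plain repetition with integer durations is an assumption here, and the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def upsample_by_duration(representations: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each representation `duration` times to form the intermediate sequence."""
    if len(representations) != len(durations):
        raise ValueError("need exactly one duration per input time step")
    intermediate: List[T] = []
    for representation, duration in zip(representations, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        intermediate.extend([representation] * duration)
    return intermediate


frames = upsample_by_duration(["h", "i"], [2, 3])
```

With this scheme the length of the intermediate sequence equals the sum of the predicted durations, so the durations alone determine how long the output is.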

16.

Invention application
MULTILINGUAL SPEECH SYNTHESIS AND CROSS-LANGUAGE VOICE CLONING

Legal status: Under examination (published)

Publication (announcement) number: US20200380952A1

Publication (announcement) date: 2020-12-03

Application number: US16855042

Application date: 2020-04-22

Applicant: Google LLC

Inventor: Yu Zhang, Ron J. Weiss, Byungha Chun, Yonghui Wu, Zhifeng Chen, Russell John Wyatt Skerry-Ryan, Ye Jia, Andrew M. Rosenberg, Bhuvana Ramabhadran

IPC: G10L13/047

Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

17.

Invention grant
Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech

Legal status: In force

Publication (announcement) number: US12249315B2

Publication (announcement) date: 2025-03-11

Application number: US18499031

Application date: 2023-10-31

Applicant: Google LLC

Inventor: Isaac Elias, Byungha Chun, Jonathan Shen, Ye Jia, Yu Zhang, Yonghui Wu

IPC: G10L13/08, G10L13/04

Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.

18.

Invention application
MULTILINGUAL SPEECH SYNTHESIS AND CROSS-LANGUAGE VOICE CLONING

Legal status: In force

Publication (announcement) number: US20240404506A1

Publication (announcement) date: 2024-12-05

Application number: US18797760

Application date: 2024-08-08

Applicant: Google LLC

Inventor: Yu Zhang, Ron J. Weiss, Byungha Chun, Yonghui Wu, Zhifeng Chen, Russell John Wyatt Skerry-Ryan, Ye Jia, Andrew M. Rosenberg, Bhuvana Ramabhadran

IPC: G10L13/047

Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

19.

Invention publication
RESIDUAL ADAPTERS FOR FEW-SHOT TEXT-TO-SPEECH SPEAKER ADAPTATION

Legal status: Under examination (published)

Publication (announcement) number: US20240233704A9

Publication (announcement) date: 2024-07-11

Application number: US18493770

Application date: 2023-10-24

Applicant: Google LLC

Inventor: Nobuyuki Morioka, Byungha Chun, Nanxin Chen, Yu Zhang, Yifan Ding

IPC: G10L13/027

CPC classification number: G10L13/027

Abstract: A method for residual adapters for few-shot text-to-speech speaker adaptation includes obtaining a text-to-speech (TTS) model configured to convert text into representations of synthetic speech, the TTS model pre-trained on an initial training data set. The method further includes augmenting the TTS model with a stack of residual adapters. The method includes receiving an adaption training data set including one or more spoken utterances spoken by a target speaker, each spoken utterance in the adaptation training data set paired with corresponding input text associated with a transcription of the spoken utterance. The method also includes adapting, using the adaption training data set, the TTS model augmented with the stack of residual adapters to learn how to synthesize speech in a voice of the target speaker by optimizing the stack of residual adapters while parameters of the TTS model are frozen.
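The abstract above does not describe the internal form of a residual adapter. A common design, assumed here purely for illustration, is a small bottleneck (down-projection, ReLU, up-projection) whose output is added back to the frozen layer's activation. The sketch below shows only the forward pass; the matrix shapes, the ReLU, and the function names are assumptions, and the training loop that updates the adapter weights while the base model stays frozen is not shown.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # rows x columns


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def residual_adapter(activation: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Return activation + w_up @ relu(w_down @ activation)."""
    hidden = [max(0.0, h) for h in matvec(w_down, activation)]
    update = matvec(w_up, hidden)
    return [a + u for a, u in zip(activation, update)]


x = [1.0, 2.0]
w_down = [[1.0, 0.0]]          # 2 -> 1 bottleneck (trainable in this sketch)
w_up = [[0.5], [0.0]]          # 1 -> 2 projection (trainable in this sketch)
adapted = residual_adapter(x, w_down, w_up)

# With a zero up-projection the adapter is an identity map, so the augmented
# model initially reproduces the frozen base model's output.
unchanged = residual_adapter(x, w_down, [[0.0], [0.0]])
```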

20.

Invention grant
Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech

Legal status: In force

Publication (announcement) number: US11823656B2

Publication (announcement) date: 2023-11-21

Application number: US17326542

Application date: 2021-05-21

Applicant: Google LLC

Inventor: Isaac Elias, Byungha Chun, Jonathan Shen, Ye Jia, Yu Zhang, Yonghui Wu

IPC: G10L13/08, G10L13/04

CPC classification number: G10L13/08, G10L13/04

Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
