-
Publication Number: US12175963B2
Publication Date: 2024-12-24
Application Number: US18525475
Filing Date: 2023-11-30
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Ignacio Lopez Moreno , Fei Ren , Yu Zhang , Quan Wang , Patrick An Phu Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
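A minimal PyTorch sketch of the pipeline this abstract describes: a speaker encoder condenses reference audio into a fixed speaker vector, and a spectrogram generator conditions on that vector plus the input text. All module names, dimensions, and architectural choices here are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps reference audio frames to a fixed-size speaker vector.
    In practice this would be trained on a speaker-discrimination
    objective so vectors distinguish speakers from one another."""
    def __init__(self, n_mels=80, hidden=256, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, mel_frames):                    # (B, T, n_mels)
        _, (h, _) = self.lstm(mel_frames)
        d_vector = self.proj(h[-1])                   # last layer's final state
        return nn.functional.normalize(d_vector, dim=-1)

class SpectrogramGenerator(nn.Module):
    """Toy stand-in for the spectrogram generation engine: encodes text
    tokens, broadcasts the speaker vector across time, decodes mel frames."""
    def __init__(self, vocab=100, embed_dim=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, embed_dim)
        self.decoder = nn.GRU(embed_dim * 2, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, text_ids, speaker_vec):         # (B, L), (B, E)
        txt = self.text_emb(text_ids)                 # (B, L, E)
        spk = speaker_vec.unsqueeze(1).expand(-1, txt.size(1), -1)
        out, _ = self.decoder(torch.cat([txt, spk], dim=-1))
        return self.to_mel(out)                       # (B, L, n_mels)

# Synthesis in a target voice: one reference clip, new input text.
ref_audio = torch.randn(1, 120, 80)                   # reference mel frames
text = torch.randint(0, 100, (1, 30))                 # token ids for input text
spk_vec = SpeakerEncoder()(ref_audio)
mel_out = SpectrogramGenerator()(text, spk_vec)
print(mel_out.shape)                                  # torch.Size([1, 30, 80])
```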
-
Publication Number: US20240404506A1
Publication Date: 2024-12-05
Application Number: US18797760
Filing Date: 2024-08-08
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
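A hedged sketch of cross-lingual conditioning as the abstract frames it: the text (in a first language) and the speaker embedding (from a native speaker of a second language) enter the model as independent inputs. The separate language embedding and all sizes are assumptions for illustration, not details from the patent.

```python
import torch
import torch.nn as nn

class CrossLingualTTS(nn.Module):
    """Toy cross-lingual synthesizer: text in language A, speaker embedding
    from a native speaker of language B. Keeping language and speaker as
    separate conditioning inputs is one common way to let voice and
    language vary independently."""
    def __init__(self, vocab=200, n_langs=8, dim=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(vocab, dim)
        self.lang_emb = nn.Embedding(n_langs, dim)
        self.decoder = nn.GRU(dim * 3, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phonemes, lang_id, speaker_emb):
        ph = self.phoneme_emb(phonemes)                    # (B, L, D)
        lg = self.lang_emb(lang_id).unsqueeze(1).expand_as(ph)
        sp = speaker_emb.unsqueeze(1).expand_as(ph)
        out, _ = self.decoder(torch.cat([ph, lg, sp], -1))
        return self.to_mel(out)                            # (B, L, n_mels)

tts = CrossLingualTTS()
phonemes = torch.randint(0, 200, (1, 40))   # input text sequence, language A
lang_id = torch.tensor([0])                 # language of the input text
speaker_emb = torch.randn(1, 256)           # embedding of a language-B native
mel = tts(phonemes, lang_id, speaker_emb)
print(mel.shape)                            # torch.Size([1, 40, 80])
```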
-
Publication Number: US20240312449A1
Publication Date: 2024-09-19
Application Number: US18676743
Filing Date: 2024-05-29
Applicant: Google LLC
Inventor: Ruoming Pang , Andros Tjandra , Yu Zhang , Shigeki Karita
IPC: G10L13/027 , G10L21/0308
CPC classification number: G10L13/027 , G10L21/0308
Abstract: A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech as output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.
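The three-module structure the abstract names maps directly onto code. Below is a minimal sketch; the disentanglement itself would come from training objectives (e.g., bottlenecks or adversarial losses) that the abstract does not detail, so none are shown, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Frame-level latent intended to keep linguistic content only."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, speech):                  # (B, T, n_mels)
        z, _ = self.rnn(speech)
        return z                                # (B, T, dim) per-frame content

class StyleEncoder(nn.Module):
    """Single utterance-level vector intended to keep speaking style only."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, speech):
        _, h = self.rnn(speech)
        return h[-1]                            # (B, dim) global style

class Decoder(nn.Module):
    """Reconstructs speech from the content of one utterance plus the
    style of the same or a different utterance (style transfer)."""
    def __init__(self, dim=128, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(dim * 2, 128, batch_first=True)
        self.out = nn.Linear(128, n_mels)

    def forward(self, content, style):
        style_seq = style.unsqueeze(1).expand(-1, content.size(1), -1)
        y, _ = self.rnn(torch.cat([content, style_seq], -1))
        return self.out(y)

# Style transfer: content from utterance A, style from utterance B.
utt_a, utt_b = torch.randn(1, 200, 80), torch.randn(1, 150, 80)
speech = Decoder()(ContentEncoder()(utt_a), StyleEncoder()(utt_b))
print(speech.shape)  # torch.Size([1, 200, 80])
```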
-
Publication Number: US20240233704A9
Publication Date: 2024-07-11
Application Number: US18493770
Filing Date: 2023-10-24
Applicant: Google LLC
Inventor: Nobuyuki Morioka , Byungha Chun , Nanxin Chen , Yu Zhang , Yifan Ding
IPC: G10L13/027
CPC classification number: G10L13/027
Abstract: A method for residual adapters for few-shot text-to-speech speaker adaptation includes obtaining a text-to-speech (TTS) model configured to convert text into representations of synthetic speech, the TTS model pre-trained on an initial training data set. The method further includes augmenting the TTS model with a stack of residual adapters. The method includes receiving an adaptation training data set including one or more spoken utterances spoken by a target speaker, each spoken utterance in the adaptation training data set paired with corresponding input text associated with a transcription of the spoken utterance. The method also includes adapting, using the adaptation training data set, the TTS model augmented with the stack of residual adapters to learn how to synthesize speech in a voice of the target speaker by optimizing the stack of residual adapters while parameters of the TTS model are frozen.
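A minimal sketch of the freeze-and-adapt recipe the abstract describes: bottleneck residual adapters are inserted into a pre-trained backbone, and only the adapter parameters receive gradient updates. The bottleneck MLP shape and zero-initialization are common conventions assumed here, not details from the patent.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Small bottleneck MLP added residually around a pre-trained layer.
    Only these parameters are updated during speaker adaptation."""
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class AdaptedLayer(nn.Module):
    """Wraps one frozen pre-trained layer with a trainable adapter."""
    def __init__(self, frozen_layer, dim=256):
        super().__init__()
        self.layer = frozen_layer
        self.adapter = ResidualAdapter(dim)

    def forward(self, x):
        return self.adapter(self.layer(x))

# Pretend pre-trained TTS backbone: a stack of linear layers.
backbone = nn.ModuleList([nn.Linear(256, 256) for _ in range(4)])
for p in backbone.parameters():
    p.requires_grad = False              # freeze the pre-trained model

stack = nn.ModuleList([AdaptedLayer(l) for l in backbone])
trainable = [p for l in stack for p in l.adapter.parameters()]
opt = torch.optim.Adam(trainable, lr=1e-3)

# One few-shot adaptation step on a (text features, target speech) pair.
x, target = torch.randn(8, 256), torch.randn(8, 256)
h = x
for layer in stack:
    h = layer(h)
loss = nn.functional.mse_loss(h, target)
loss.backward()
opt.step()                               # updates adapters only
```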
-
Publication Number: US12027151B2
Publication Date: 2024-07-02
Application Number: US17455667
Filing Date: 2021-11-18
Applicant: Google LLC
Inventor: Ruoming Pang , Andros Tjandra , Yu Zhang , Shigeki Karita
IPC: G10L13/027 , G10L21/0308
CPC classification number: G10L13/027 , G10L21/0308
Abstract: A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech as output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.
-
Publication Number: US20240028829A1
Publication Date: 2024-01-25
Application Number: US18346232
Filing Date: 2023-07-01
Applicant: Google LLC
Inventor: Tara N. Sainath , Zhouyuan Huo , Zhehuai Chen , Yu Zhang , Weiran Wang , Trevor Strohman , Rohit Prakash Prabhavalkar , Bo Li , Ankur Bapna
IPC: G06F40/284 , G06F40/40
CPC classification number: G06F40/284 , G06F40/40
Abstract: A method includes receiving training data that includes a set of unspoken textual utterances. For each respective unspoken textual utterance, the method includes tokenizing the respective textual utterance into a sequence of sub-word units, generating a first higher order textual feature representation for a corresponding sub-word unit tokenized from the respective unspoken textual utterance, receiving the first higher order textual feature representation generated by a text encoder, and generating a first probability distribution over possible text units. The method also includes training an encoder based on the first probability distribution over possible text units generated by a first-pass decoder for each respective unspoken textual utterance in the set of unspoken textual utterances.
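A toy sketch of training on unspoken (text-only) utterances as the abstract outlines: tokenize into sub-word units, encode into higher order textual features, decode into a distribution over possible text units, and train the encoder on that distribution. The tokenizer and loss below are stand-ins chosen for illustration, not the patent's components.

```python
import torch
import torch.nn as nn

# Hypothetical sub-word tokenizer stand-in (a real system would use
# e.g. a trained wordpiece model).
def tokenize(text, vocab_size=512):
    return torch.tensor([[hash(w) % vocab_size for w in text.split()]])

class TextEncoder(nn.Module):
    """Maps sub-word ids to higher order textual feature representations."""
    def __init__(self, vocab=512, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return h                                     # (B, L, dim)

class FirstPassDecoder(nn.Module):
    """Produces a probability distribution over possible text units."""
    def __init__(self, dim=256, vocab=512):
        super().__init__()
        self.out = nn.Linear(dim, vocab)

    def forward(self, feats):
        return self.out(feats).log_softmax(-1)       # (B, L, vocab)

encoder, decoder = TextEncoder(), FirstPassDecoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# One training step on an unspoken (text-only) utterance: the encoder is
# trained so the decoder's distribution matches the sub-word targets.
ids = tokenize("play the next song")
log_probs = decoder(encoder(ids))
loss = nn.functional.nll_loss(log_probs.view(-1, 512), ids.view(-1))
loss.backward()
opt.step()
```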
-
Publication Number: US20240013777A1
Publication Date: 2024-01-11
Application Number: US18320458
Filing Date: 2023-05-19
Applicant: Google LLC
Inventor: Zhiyun Lu , Yu Zhang , Wei Han , Yongqiang Wang , Parisa Haghani , Zhehuai Chen
CPC classification number: G10L15/16 , G10L15/063
Abstract: A method includes obtaining a corpus of unlabeled training data including a plurality of spoken utterances, each corresponding spoken utterance of the plurality of spoken utterances including audio data characterizing the corresponding spoken utterance. The method also includes receiving a target domain. The method also includes selecting, using a contrastive data selection model, a subset of the utterances from the corpus of unlabeled training data that correspond to the target domain. The method also includes training an automatic speech recognition (ASR) model on the subset of utterances.
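The abstract does not detail the contrastive data selection model; a common formulation, sketched below under that assumption, scores each utterance under a target-domain model and a background model and keeps utterances where the score difference exceeds a margin. Both scoring models and the score itself are illustrative stand-ins.

```python
import torch
import torch.nn as nn

def score(model, feats):
    """Likelihood proxy: higher means the model 'expects' this utterance.
    Stand-in for a real density or language-model score."""
    with torch.no_grad():
        return -((model(feats) - feats) ** 2).mean().item()

# Two small scoring models (illustrative assumptions):
target_model = nn.Linear(80, 80)      # assume: adapted to the target domain
background_model = nn.Linear(80, 80)  # assume: trained on generic speech

corpus = [torch.randn(100, 80) for _ in range(1000)]   # unlabeled utterances

# Contrastive selection: keep utterances the target-domain model scores
# higher than the background model by some margin.
margin = 0.0
selected = [
    utt for utt in corpus
    if score(target_model, utt) - score(background_model, utt) > margin
]
print(f"selected {len(selected)} of {len(corpus)} utterances")
# `selected` would then be used to train the ASR model.
```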
-
Publication Number: US11823656B2
Publication Date: 2023-11-21
Application Number: US17326542
Filing Date: 2021-05-21
Applicant: Google LLC
Inventor: Isaac Elias , Byungha Chun , Jonathan Shen , Ye Jia , Yu Zhang , Yonghui Wu
Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
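A simplified sketch of the duration-driven upsampling step at the heart of this abstract. The patent learns a soft interval representation and an auxiliary attention context; the toy version below rounds predicted durations to integers and repeats frames, which captures only the basic non-autoregressive idea, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class DurationUpsampler(nn.Module):
    """Predicts a per-phoneme duration and repeats each phoneme's encoding
    for that many frames, so the spectrogram decoder can run in parallel
    rather than autoregressively."""
    def __init__(self, dim=256):
        super().__init__()
        self.duration = nn.Linear(dim, 1)

    def forward(self, phoneme_enc):                    # (L, dim), one utterance
        log_dur = self.duration(phoneme_enc).squeeze(-1)
        frames = torch.clamp(log_dur.exp().round().long(), min=1)
        upsampled = torch.repeat_interleave(phoneme_enc, frames, dim=0)
        return upsampled, frames                       # (sum(frames), dim)

enc = torch.randn(12, 256)              # encoded text sequence (+ variational embedding)
upsampled, frames = DurationUpsampler()(enc)
mel = nn.Linear(256, 80)(upsampled)     # toy mel-spectrogram projection
# Spectrogram loss against a reference mel sequence of matching length.
loss = nn.functional.mse_loss(mel, torch.randn_like(mel))
print(upsampled.shape, frames.sum().item())
```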
-
Publication Number: US20230178068A1
Publication Date: 2023-06-08
Application Number: US18161217
Filing Date: 2023-01-30
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L13/047
CPC classification number: G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
-
Publication Number: US11610586B2
Publication Date: 2023-03-21
Application Number: US17182592
Filing Date: 2021-02-23
Applicant: Google LLC
Inventor: David Qiu , Qiujia Li , Yanzhang He , Yu Zhang , Bo Li , Liangliang Cao , Rohit Prabhavalkar , Deepti Bhatia , Wei Li , Ke Hu , Tara Sainath , Ian Mcgraw
Abstract: A method includes receiving a speech recognition result, and using a confidence estimation module (CEM), for each sub-word unit in a sequence of hypothesized sub-word units for the speech recognition result: obtaining a respective confidence embedding that represents a set of confidence features; generating, using a first attention mechanism, a confidence feature vector; generating, using a second attention mechanism, an acoustic context vector; and generating, as output from an output layer of the CEM, a respective confidence output score for each corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by the output layer of the CEM. For each of the one or more words formed by the sequence of hypothesized sub-word units, the method also includes determining a respective word-level confidence score for the word. The method also includes determining an utterance-level confidence score by aggregating the word-level confidence scores.
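A toy confidence estimation module (CEM) following the abstract's shape: one attention mechanism over per-sub-word confidence features, a second attending into acoustic frames, a per-sub-word score, then word- and utterance-level aggregation. The min/mean aggregations, word spans, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceEstimator(nn.Module):
    """Toy CEM: self-attention over sub-word confidence embeddings plus
    cross-attention into acoustic frames, then a per-sub-word score."""
    def __init__(self, dim=64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.out = nn.Linear(dim * 2, 1)

    def forward(self, conf_feats, acoustic):       # (B, U, D), (B, T, D)
        c, _ = self.self_attn(conf_feats, conf_feats, conf_feats)
        a, _ = self.cross_attn(conf_feats, acoustic, acoustic)
        return torch.sigmoid(self.out(torch.cat([c, a], -1))).squeeze(-1)

cem = ConfidenceEstimator()
conf_feats = torch.randn(1, 10, 64)   # confidence embeddings per sub-word unit
acoustic = torch.randn(1, 80, 64)     # encoder acoustic frames
sub_word_scores = cem(conf_feats, acoustic)        # (1, 10)

# Word-level score: aggregate sub-word scores per word (here: minimum);
# utterance-level score: aggregate word scores (here: mean).
word_bounds = [(0, 3), (3, 7), (7, 10)]            # hypothetical word spans
word_scores = torch.stack([sub_word_scores[0, s:e].min() for s, e in word_bounds])
utterance_score = word_scores.mean()
print(word_scores, utterance_score)
```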
-