-
Publication No.: US20240153484A1
Publication date: 2024-05-09
Application No.: US18494324
Filing date: 2023-10-25
Applicant: Google LLC
Inventor: Andrew M. Rosenberg, Takaaki Saeki, Zhehuai Chen, Byungha Chun, Bhuvana Ramabhadran
IPC: G10L13/047, G10L15/06, G10L15/16
CPC classification number: G10L13/047, G10L15/063, G10L15/16
Abstract: A method includes receiving training data that includes a plurality of sets of text-to-speech (TTS) spoken training utterances, each set associated with a respective language and including TTS utterances of synthetic speech, each of which includes a corresponding reference speech representation paired with a corresponding input text sequence. For each TTS utterance in each set of the TTS spoken training utterances of the received training data, the method includes generating a corresponding TTS encoded textual representation for the corresponding input text sequence, generating a corresponding speech encoding for the corresponding TTS utterance of synthetic speech, generating a shared encoder output, generating a predicted speech representation for the corresponding TTS utterance of synthetic speech, and determining a reconstruction loss. The method also includes training a TTS model based on the reconstruction losses determined for the TTS utterances in each set of the TTS spoken training utterances.
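The per-utterance steps in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the toy encoders, the averaging used for the shared encoder, the tiling decoder, the mean-squared-error form of the reconstruction loss, and all function names. It shows only the order of the steps and how the losses from every language set are pooled; an actual training run would also update model parameters from these losses.

```python
from typing import Dict, List, Tuple

Vector = List[float]
DIM = 4  # toy embedding size (assumption)


def encode_text(text: str) -> Vector:
    """Toy text encoder: fold character codes into a DIM-sized vector."""
    vec = [0.0] * DIM
    for i, ch in enumerate(text):
        vec[i % DIM] += ord(ch) / 1000.0
    return vec


def encode_speech(frames: Vector) -> Vector:
    """Toy speech encoder: fold frame values into a DIM-sized vector."""
    vec = [0.0] * DIM
    for i, value in enumerate(frames):
        vec[i % DIM] += value
    return vec


def shared_encode(text_enc: Vector, speech_enc: Vector) -> Vector:
    """Toy shared encoder: average the two encodings (assumption)."""
    return [0.5 * (t + s) for t, s in zip(text_enc, speech_enc)]


def decode(shared: Vector, n_frames: int) -> Vector:
    """Toy decoder: tile the shared output to the reference length."""
    return [shared[i % DIM] for i in range(n_frames)]


def reconstruction_loss(predicted: Vector, reference: Vector) -> float:
    """Mean squared error between predicted and reference frames."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


def training_loss(data: Dict[str, List[Tuple[str, Vector]]]) -> float:
    """Average the per-utterance reconstruction loss over every language set."""
    losses = []
    for language, utterances in data.items():
        for text, reference in utterances:
            text_enc = encode_text(text)
            speech_enc = encode_speech(reference)
            shared = shared_encode(text_enc, speech_enc)
            predicted = decode(shared, len(reference))
            losses.append(reconstruction_loss(predicted, reference))
    return sum(losses) / len(losses)


# Hypothetical training data: one set of (text, reference frames) per language.
data = {
    "en": [("hello", [0.1, 0.2, 0.3])],
    "es": [("hola", [0.2, 0.1, 0.4, 0.3])],
}
loss = training_loss(data)
```

Keying the data by language mirrors the "plurality of sets ... each associated with a respective language" wording, while the single pooled average is just one possible way to combine the per-utterance losses.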
-
Publication No.: US20250078805A1
Publication date: 2025-03-06
Application No.: US18823661
Filing date: 2024-09-03
Applicant: Google LLC
Inventor: Andrew M Rosenberg, Takaaki Saeki, Francoise Beaufays, Bhuvana Ramabhadran
Abstract: A method includes receiving training data that includes a plurality of sets of training utterances each associated with a respective language. Each training utterance includes a corresponding reference speech representation paired with a corresponding input text sequence. For each training utterance, the method includes generating a corresponding encoded textual representation for the corresponding input text sequence, generating a corresponding speech encoding for the corresponding reference speech representation, generating a shared encoder output, and determining a text-to-speech (TTS) loss based on the corresponding encoded textual representation, the corresponding speech encoding, and the shared encoder output. The method also includes training a TTS model based on the TTS losses determined for the training utterances in each set of the training utterances to teach the TTS model to learn how to synthesize speech in each of the respective languages.
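The training-data layout described in this abstract (sets of utterances grouped by language, each utterance pairing a reference speech representation with an input text sequence) might be modeled as below. This is a hedged sketch: the class names, the `placeholder_tts_loss` function, and the sample values are all hypothetical. The abstract does not say how the TTS loss is computed from the encoded text, the speech encoding, and the shared encoder output, so the placeholder stands in for that computation purely to show how per-utterance losses could be aggregated per language and overall.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TrainingUtterance:
    input_text: str                # input text sequence
    reference_speech: List[float]  # reference speech representation (toy frames)


@dataclass
class LanguageSet:
    language: str
    utterances: List[TrainingUtterance]


def placeholder_tts_loss(utt: TrainingUtterance) -> float:
    """Hypothetical stand-in for the TTS loss; NOT the patent's loss."""
    return float(abs(len(utt.reference_speech) - len(utt.input_text)))


def mean_loss_per_language(
    sets: List[LanguageSet],
    loss_fn: Callable[[TrainingUtterance], float],
) -> Dict[str, float]:
    """Average the per-utterance loss within each language set."""
    return {
        s.language: sum(loss_fn(u) for u in s.utterances) / len(s.utterances)
        for s in sets
    }


def overall_loss(
    sets: List[LanguageSet],
    loss_fn: Callable[[TrainingUtterance], float],
) -> float:
    """Average the per-utterance loss over all utterances in all sets."""
    losses = [loss_fn(u) for s in sets for u in s.utterances]
    return sum(losses) / len(losses)


# Hypothetical sample data for two languages.
sets = [
    LanguageSet("en", [
        TrainingUtterance("hi", [0.1, 0.2, 0.3]),
        TrainingUtterance("hello", [0.1] * 5),
    ]),
    LanguageSet("ja", [TrainingUtterance("やあ", [0.0] * 4)]),
]
per_language = mean_loss_per_language(sets, placeholder_tts_loss)
overall = overall_loss(sets, placeholder_tts_loss)
```

Reporting the loss per language alongside the overall value is a design choice of this sketch; it makes it easy to see whether one language dominates the objective when the sets differ in size.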
-