MASSIVE MULTILINGUAL SPEECH-TEXT JOINT SEMI-SUPERVISED LEARNING FOR TEXT-TO-SPEECH

    公开(公告)号:US20240153484A1

    公开(公告)日:2024-05-09

    申请号:US18494324

    申请日:2023-10-25

    Applicant: Google LLC

    CPC classification number: G10L13/047 G10L15/063 G10L15/16

    Abstract: A method includes receiving training data that includes a plurality of sets of text-to-speech (TTS) spoken utterances each associated with a respective language and including TTS utterances of synthetic speech spoken that includes a corresponding reference speech representation paired with a corresponding input text sequence. For each TTS utterance in each set of the TTS spoken training utterances of the received training data, the method includes generating a corresponding TTS encoded textual representation for the corresponding input text sequence, generating a corresponding speech encoding for the corresponding TTS utterance of synthetic speech, generating a shared encoder output, generating a predicted speech representation for the corresponding TTS utterance of synthetic speech, and determining a reconstruction loss. The method also includes training a TTS model based on the reconstruction losses determined for the TTS utterances in each set of the TTS spoken training utterances.

    Scaling Multilingual Speech Synthesis with Zero Supervision of Found Data

    公开(公告)号:US20250078805A1

    公开(公告)日:2025-03-06

    申请号:US18823661

    申请日:2024-09-03

    Applicant: Google LLC

    Abstract: A method includes receiving training data that includes a plurality of sets of training utterances each associated with a respective language. Each training utterance includes a corresponding reference speech representation paired with a corresponding input text sequence. For each training utterance, the method includes generating a corresponding encoded textual representation for the corresponding input text sequence, generating a corresponding speech encoding for the corresponding reference speech representation, generating a shared encoder output, and determining a text-to-speech (TTS) loss based on the corresponding encoded textual representation, the corresponding speech encoding, and the shared encoder output. The method also includes training a TTS model based on the TTS losses determined for the training utterances in each set of the training utterances to teach the TTS model to learn how to synthesize speech in each of the respective languages.

Patent Agency Ranking