Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech

    Publication number: US12249315B2

    Publication date: 2025-03-11

    Application number: US18499031

    Filing date: 2023-10-31

    Applicant: Google LLC

    Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
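The upsampling step described in this abstract can be illustrated with a minimal sketch. It assumes the predicted durations are measured in frames and derives each phoneme's start and end boundary from their running sum. A fixed softmax weighting over the distance between each frame and each phoneme's interval centre stands in for the learned interval and auxiliary attention context representations, which the abstract does not specify. The names `token_boundaries`, `soft_upsample`, and `temperature` are hypothetical and are not taken from the patent.

```python
import math


def token_boundaries(durations):
    """Return (starts, ends) in frames as cumulative sums of the durations."""
    starts, ends = [], []
    t = 0.0
    for d in durations:
        starts.append(t)
        t += d
        ends.append(t)
    return starts, ends


def soft_upsample(vectors, durations, temperature=1.0):
    """Upsample per-phoneme vectors into per-frame vectors.

    Assumption: a fixed softmax over negative squared distance to each
    phoneme's interval centre replaces the learned weighting.
    """
    if len(vectors) != len(durations):
        raise ValueError("one duration is required per vector")
    if not vectors:
        return []
    starts, ends = token_boundaries(durations)
    num_frames = int(round(ends[-1]))  # total number of output frames
    dim = len(vectors[0])
    frames = []
    for t in range(num_frames):
        centre = t + 0.5  # midpoint of frame t
        scores = [
            -((centre - (s + e) / 2.0) ** 2) / temperature
            for s, e in zip(starts, ends)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        weights = [x / total for x in exps]
        frames.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return frames


frames = soft_upsample([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0])
```

With two phonemes lasting 2 and 3 frames, the call yields five frame vectors; early frames are dominated by the first phoneme and late frames by the second.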

    MULTILINGUAL SPEECH SYNTHESIS AND CROSS-LANGUAGE VOICE CLONING

    Publication number: US20240404506A1

    Publication date: 2024-12-05

    Application number: US18797760

    Filing date: 2024-08-08

    Applicant: Google LLC

    Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

    Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech

    Publication number: US11823656B2

    Publication date: 2023-11-21

    Application number: US17326542

    Filing date: 2021-05-21

    Applicant: Google LLC

    CPC classification number: G10L13/08 G10L13/04

    Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
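The final spectrogram loss mentioned in this abstract can be sketched as follows. The abstract does not say which distance is used or how several predicted sequences are combined, so this example assumes a mean absolute (L1) error per sequence, averaged over all predicted sequences. The function names are hypothetical, and spectrograms are represented as lists of frames, each a list of mel-bin values.

```python
def l1_spectrogram_loss(predicted, reference):
    """Mean absolute error between two spectrograms of identical shape."""
    if len(predicted) != len(reference):
        raise ValueError("spectrograms must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(predicted, reference):
        if len(pred_frame) != len(ref_frame):
            raise ValueError("frames must have the same number of mel bins")
        for p, r in zip(pred_frame, ref_frame):
            total += abs(p - r)
            count += 1
    return total / count if count else 0.0


def final_spectrogram_loss(predictions, reference):
    """Average the per-sequence L1 losses (an assumed way of combining them)."""
    if not predictions:
        raise ValueError("at least one predicted spectrogram is required")
    losses = [l1_spectrogram_loss(p, reference) for p in predictions]
    return sum(losses) / len(losses)


reference = [[0.0, 1.0], [2.0, 3.0]]
predictions = [
    [[0.0, 1.0], [2.0, 3.0]],  # matches the reference exactly
    [[1.0, 1.0], [2.0, 3.0]],  # one bin is off by 1.0
]
loss = final_spectrogram_loss(predictions, reference)
```

In a training loop this scalar would be the quantity minimised by the optimiser; any additional loss terms are outside what the abstract describes.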

    Text-to-speech using duration prediction

    Publication number: US12100382B2

    Publication date: 2024-09-24

    Application number: US17492543

    Filing date: 2021-10-01

    Applicant: Google LLC

    CPC classification number: G10L13/027 G10L13/04

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
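The upsampling step in this abstract can be illustrated with a small sketch. It assumes the simplest interpretation, in which each element's representation is repeated for a whole number of intermediate time steps equal to its rounded predicted duration. The abstract does not commit to this scheme, and the function name `upsample_by_duration` is hypothetical.

```python
def upsample_by_duration(elements, durations):
    """Repeat each element according to its predicted duration in time steps.

    Assumption: durations are rounded to the nearest integer and negative
    values are clamped to zero.
    """
    if len(elements) != len(durations):
        raise ValueError("one duration is required per element")
    intermediate = []
    for element, duration in zip(elements, durations):
        repeats = max(0, int(round(duration)))
        intermediate.extend([element] * repeats)
    return intermediate


intermediate = upsample_by_duration(["h", "i"], [2, 3])
```

Here the two-element input expands to five intermediate time steps, two for the first element and three for the second. A decoder would then map this intermediate sequence to the output audio sequence.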

    Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech

    Publication number: US20240062743A1

    Publication date: 2024-02-22

    Application number: US18499031

    Filing date: 2023-10-31

    Applicant: Google LLC

    CPC classification number: G10L13/08 G10L13/04

    Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
