Self-training WaveNet for text-to-speech

    Publication number: US11295725B2

    Publication date: 2022-04-05

    Application number: US16925230

    Filing date: 2020-07-09

    Applicant: Google LLC

    Abstract: A method of self-training WaveNet includes receiving a plurality of recorded speech samples and training a first autoregressive neural network using the plurality of recorded speech samples. The trained first autoregressive neural network is configured to output synthetic speech as an audible representation of a text input. The method further includes generating a plurality of synthetic speech samples using the trained first autoregressive neural network. The method additionally includes training a second autoregressive neural network using the plurality of synthetic speech samples from the trained first autoregressive neural network and distilling the trained second autoregressive neural network into a feedforward neural network.
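
The order of the four steps in this abstract can be sketched as a plain data-flow function. This is only an illustration of the sequencing, not the patented implementation: the callables `train_autoregressive`, `synthesize`, and `distill` are hypothetical placeholders supplied by the caller, and the list of text inputs is an assumption, since the abstract does not say where the texts come from.

```python
from typing import Callable, Sequence, TypeVar

Audio = TypeVar("Audio")
Model = TypeVar("Model")
Student = TypeVar("Student")


def self_training_pipeline(
    recorded: Sequence[Audio],
    texts: Sequence[str],
    train_autoregressive: Callable[[Sequence[Audio]], Model],  # hypothetical
    synthesize: Callable[[Model, str], Audio],                 # hypothetical
    distill: Callable[[Model], Student],                       # hypothetical
) -> Student:
    """Run the four steps named in the abstract, in order."""
    # Step 1: train the first autoregressive network on recorded speech.
    first = train_autoregressive(recorded)
    # Step 2: generate synthetic speech samples with the trained first network.
    synthetic = [synthesize(first, text) for text in texts]
    # Step 3: train the second autoregressive network on the synthetic samples.
    second = train_autoregressive(synthetic)
    # Step 4: distill the second network into a feedforward network.
    return distill(second)
```

Passing the training, synthesis, and distillation steps in as arguments keeps the sketch independent of any particular model library.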

    Self-Training WaveNet for Text-to-Speech

    Publication number: US20220013105A1

    Publication date: 2022-01-13

    Application number: US16925230

    Filing date: 2020-07-09

    Applicant: Google LLC

    Abstract: A method of self-training WaveNet includes receiving a plurality of recorded speech samples and training a first autoregressive neural network using the plurality of recorded speech samples. The trained first autoregressive neural network is configured to output synthetic speech as an audible representation of a text input. The method further includes generating a plurality of synthetic speech samples using the trained first autoregressive neural network. The method additionally includes training a second autoregressive neural network using the plurality of synthetic speech samples from the trained first autoregressive neural network and distilling the trained second autoregressive neural network into a feedforward neural network.

    Attention-based clockwork hierarchical variational encoder

    Publication number: US12080272B2

    Publication date: 2024-09-03

    Application number: US17756264

    Filing date: 2019-12-10

    Applicant: Google LLC

    CPC classification number: G10L13/10 G10L25/30 G10L2013/105

    Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.

    Key frame networks
    Granted invention patent

    Publication number: US12046227B2

    Publication date: 2024-07-23

    Application number: US17659840

    Filing date: 2022-04-19

    Applicant: Google LLC

    Abstract: A method for generating frame values using a key frame network includes receiving a text utterance having at least one phoneme, and for each respective phoneme of the at least one phoneme, predicting, using a predictive model, a fixed quantity of key frames. Each respective key frame of the fixed quantity of key frames includes a representation of a component of the respective phoneme. The method also includes generating, using the fixed quantity of key frames, a plurality of frame values. Here, each respective frame value of the plurality of frame values is representative of a fixed duration of audio.
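
One way to picture the step from a fixed quantity of key frames to frame values of fixed duration is the minimal sketch below. It rests on assumptions the abstract does not state: each key frame is reduced to a single number, the frame values are obtained by linear interpolation between key frames, the phoneme duration is known, and each frame covers 5 ms. The function name `frames_from_key_frames` and all parameters are hypothetical.

```python
def frames_from_key_frames(key_frames, duration_ms, frame_ms=5.0):
    """Expand a phoneme's key frames into one value per fixed-duration frame.

    Assumption: linear interpolation between evenly spaced key frames.
    The abstract does not say how frame values are derived.
    """
    if len(key_frames) < 2:
        raise ValueError("need at least two key frames to interpolate")
    # Number of fixed-duration frames that cover the phoneme (at least two).
    n_frames = max(2, round(duration_ms / frame_ms))
    last = len(key_frames) - 1
    frames = []
    for i in range(n_frames):
        # Position of this frame measured in key-frame index units.
        pos = i / (n_frames - 1) * last
        lo = min(int(pos), last - 1)
        frac = pos - lo
        frames.append(key_frames[lo] * (1 - frac) + key_frames[lo + 1] * frac)
    return frames
```

For instance, three key frames `[100.0, 120.0, 110.0]` spread over a 25 ms phoneme yield five frame values, whatever the phoneme's duration the number of key frames stays the same and only the number of interpolated frames changes.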

    Key Frame Networks
    Published invention application
    Status: under examination (published)

    Publication number: US20230335110A1

    Publication date: 2023-10-19

    Application number: US17659840

    Filing date: 2022-04-19

    Applicant: Google LLC

    Abstract: A method for generating frame values using a key frame network includes receiving a text utterance having at least one phoneme, and for each respective phoneme of the at least one phoneme, predicting, using a predictive model, a fixed quantity of key frames. Each respective key frame of the fixed quantity of key frames includes a representation of a component of the respective phoneme. The method also includes generating, using the fixed quantity of key frames, a plurality of frame values. Here, each respective frame value of the plurality of frame values is representative of a fixed duration of audio.

    Attention-Based Clockwork Hierarchical Variational Encoder

    Publication number: US20220415306A1

    Publication date: 2022-12-29

    Application number: US17756264

    Filing date: 2019-12-10

    Applicant: Google LLC

    Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.

    Attention-Based Clockwork Hierarchical Variational Encoder

    Publication number: US20240038214A1

    Publication date: 2024-02-01

    Application number: US18487227

    Filing date: 2023-10-16

    Applicant: Google LLC

    CPC classification number: G10L13/10 G10L25/30 G10L2013/105

    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.

    Attention-based clockwork hierarchical variational encoder

    Publication number: US12272349B2

    Publication date: 2025-04-08

    Application number: US18487227

    Filing date: 2023-10-16

    Applicant: Google LLC

    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.

    Clockwork Hierarchical Variational Encoder

    Publication number: US20210134266A1

    Publication date: 2021-05-06

    Application number: US17147548

    Filing date: 2021-01-13

    Applicant: Google LLC

    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
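
The last step of this abstract, turning a syllable's predicted duration and pitch contour into fixed-length pitch frames, can be sketched as follows. The sketch assumes a 5 ms frame length, represents the contour as a function of normalized time in [0, 1], and samples it at each frame's midpoint. None of these choices comes from the abstract, and `pitch_frames` is a hypothetical name.

```python
import math


def pitch_frames(duration_ms, contour, frame_ms=5.0):
    """Sample a syllable's pitch contour into fixed-length frames.

    duration_ms: predicted syllable duration (assumed to be given).
    contour: hypothetical callable mapping normalized time in [0, 1] to Hz.
    """
    # The predicted duration decides how many fixed-length frames are emitted.
    n_frames = max(1, math.ceil(duration_ms / frame_ms))
    # Each frame represents one part of the contour, sampled at its midpoint.
    return [contour((i + 0.5) / n_frames) for i in range(n_frames)]
```

With a linearly rising contour such as `lambda t: 100 + 40 * t` and a 20 ms syllable, the function returns four pitch values, one per 5 ms frame.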

    Clockwork hierarchical variational encoder

    Publication number: US10923107B2

    Publication date: 2021-02-16

    Application number: US16382722

    Filing date: 2019-04-12

    Applicant: Google LLC

    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
