-
Publication No.: US11295725B2
Publication date: 2022-04-05
Application No.: US16925230
Filing date: 2020-07-09
Applicant: Google LLC
Inventor: Manish Sharma, Tom Marius Kenter, Robert Clark
IPC: G10L13/047, G10L25/30
Abstract: A method of self-training WaveNet includes receiving a plurality of recorded speech samples and training a first autoregressive neural network using the plurality of recorded speech samples. The trained first autoregressive neural network is configured to output synthetic speech as an audible representation of a text input. The method further includes generating a plurality of synthetic speech samples using the trained first autoregressive neural network. The method additionally includes training a second autoregressive neural network using the plurality of synthetic speech samples from the trained first autoregressive neural network and distilling the trained second autoregressive neural network into a feedforward neural network.
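The order of the four steps in this abstract can be sketched as plain data flow. This is only an illustrative outline, not the patented implementation: the function `self_train` and its callable parameters (`train_autoregressive`, `synthesize`, `distill`) are hypothetical placeholders for whatever training, synthesis, and distillation routines a real system would supply.

```python
from typing import Callable, List, Sequence, TypeVar

Sample = TypeVar("Sample")    # a speech sample (recorded or synthetic)
Model = TypeVar("Model")      # an autoregressive network
Student = TypeVar("Student")  # the distilled feedforward network


def self_train(
    recorded: Sequence[Sample],
    texts: Sequence[str],
    train_autoregressive: Callable[[Sequence[Sample]], Model],
    synthesize: Callable[[Model, str], Sample],
    distill: Callable[[Model], Student],
) -> Student:
    """Hypothetical outline of the self-training steps named in the abstract."""
    # 1. Train the first autoregressive network on recorded speech samples.
    first = train_autoregressive(recorded)
    # 2. Generate synthetic speech samples with the trained first network.
    synthetic: List[Sample] = [synthesize(first, text) for text in texts]
    # 3. Train a second autoregressive network on the synthetic samples.
    second = train_autoregressive(synthetic)
    # 4. Distill the trained second network into a feedforward network.
    return distill(second)
```

Passing the routines in as parameters keeps the sketch independent of any particular framework; only the ordering of the steps is taken from the abstract.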
-
Publication No.: US20220013105A1
Publication date: 2022-01-13
Application No.: US16925230
Filing date: 2020-07-09
Applicant: Google LLC
Inventor: Manish Sharma, Tom Marius Kenter, Robert Clark
IPC: G10L13/047, G10L25/30
Abstract: A method of self-training WaveNet includes receiving a plurality of recorded speech samples and training a first autoregressive neural network using the plurality of recorded speech samples. The trained first autoregressive neural network is configured to output synthetic speech as an audible representation of a text input. The method further includes generating a plurality of synthetic speech samples using the trained first autoregressive neural network. The method additionally includes training a second autoregressive neural network using the plurality of synthetic speech samples from the trained first autoregressive neural network and distilling the trained second autoregressive neural network into a feedforward neural network.
-
Publication No.: US12080272B2
Publication date: 2024-09-03
Application No.: US17756264
Filing date: 2019-12-10
Applicant: Google LLC
Inventor: Robert Clark, Chun-an Chan, Vincent Wan
CPC classification number: G10L13/10, G10L25/30, G10L2013/105
Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
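As a small worked example of the last step, the number of fixed-length frames follows from the predicted syllable duration. The sketch below is an assumption-laden illustration: the 5 ms frame length, the round-up rule, and the name `frames_for_syllable` are all hypothetical and are not stated in the abstract.

```python
def frames_for_syllable(duration_ms: int, frame_ms: int = 5) -> int:
    """Number of fixed-length frames needed to cover a predicted duration.

    Assumption: a partial trailing frame is rounded up to a whole frame.
    """
    if duration_ms <= 0 or frame_ms <= 0:
        raise ValueError("duration_ms and frame_ms must be positive")
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-duration_ms // frame_ms)
```

For instance, under these assumptions a syllable predicted to last 23 ms would be covered by five 5 ms frames.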
-
Publication No.: US12046227B2
Publication date: 2024-07-23
Application No.: US17659840
Filing date: 2022-04-19
Applicant: Google LLC
Inventor: Tom Marius Kenter, Tobias Alexander Hawker, Robert Clark
IPC: G10L13/08, G10L15/02, G10L15/06, G10L15/187
CPC classification number: G10L13/08, G10L15/02, G10L15/063, G10L15/187, G10L2015/025
Abstract: A method for generating frame values using a key frame network includes receiving a text utterance having at least one phoneme, and for each respective phoneme of the at least one phoneme, predicting, using a predictive model, a fixed quantity of key frames. Each respective key frame of the fixed quantity of key frames includes a representation of a component of the respective phoneme. The method also includes generating, using the fixed quantity of key frames, a plurality of frame values. Here, each respective frame value of the plurality of frame values is representative of a fixed duration of audio.
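One way to picture the step from a fixed quantity of key frames to a longer sequence of frame values is linear interpolation. The abstract does not say how frame values are generated, so interpolation here is purely an illustrative assumption, as are the scalar key-frame values and the function name `frames_from_key_frames`.

```python
from typing import List, Sequence


def frames_from_key_frames(key_frames: Sequence[float], num_frames: int) -> List[float]:
    """Expand a fixed number of key frames into num_frames frame values.

    Assumption: values between key frames are linearly interpolated.
    """
    if not key_frames or num_frames < 1:
        raise ValueError("need at least one key frame and one output frame")
    if num_frames == 1 or len(key_frames) == 1:
        return [key_frames[0]] * num_frames
    last = len(key_frames) - 1
    frames = []
    for i in range(num_frames):
        # Position of output frame i on the key-frame axis [0, last].
        pos = i * last / (num_frames - 1)
        lo = min(int(pos), last - 1)
        frac = pos - lo
        frames.append(key_frames[lo] * (1 - frac) + key_frames[lo + 1] * frac)
    return frames
```

With three hypothetical key frames `[0.0, 1.0, 0.0]` and five output frames, this yields `[0.0, 0.5, 1.0, 0.5, 0.0]`.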
-
Publication No.: US20230335110A1
Publication date: 2023-10-19
Application No.: US17659840
Filing date: 2022-04-19
Applicant: Google LLC
Inventor: Tom Marius Kenter, Tobias Alexander Hawker, Robert Clark
IPC: G10L13/08, G10L15/02, G10L15/06, G10L15/187
CPC classification number: G10L13/08, G10L15/02, G10L15/063, G10L15/187, G10L2015/025
Abstract: A method for generating frame values using a key frame network includes receiving a text utterance having at least one phoneme, and for each respective phoneme of the at least one phoneme, predicting, using a predictive model, a fixed quantity of key frames. Each respective key frame of the fixed quantity of key frames includes a representation of a component of the respective phoneme. The method also includes generating, using the fixed quantity of key frames, a plurality of frame values. Here, each respective frame value of the plurality of frame values is representative of a fixed duration of audio.
-
Publication No.: US20220415306A1
Publication date: 2022-12-29
Application No.: US17756264
Filing date: 2019-12-10
Applicant: Google LLC
Inventor: Robert Clark, Chun-an Chan, Vincent Wan
Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
-
Publication No.: US20240038214A1
Publication date: 2024-02-01
Application No.: US18487227
Filing date: 2023-10-16
Applicant: Google LLC
Inventor: Robert Clark, Chun-an Chan, Vincent Wan
CPC classification number: G10L13/10, G10L25/30, G10L2013/105
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.
-
Publication No.: US12272349B2
Publication date: 2025-04-08
Application No.: US18487227
Filing date: 2023-10-16
Applicant: Google LLC
Inventor: Robert Clark, Chun-An Chan, Vincent Wan
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.
-
Publication No.: US20210134266A1
Publication date: 2021-05-06
Application No.: US17147548
Filing date: 2021-01-13
Applicant: Google LLC
Inventor: Robert Clark, Chun-an Chan, Vincent Wan
IPC: G10L13/10, G10L13/047
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
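The statement that each fixed-length pitch frame represents part of the predicted pitch contour can be illustrated by sampling a contour at frame midpoints. This is a minimal sketch under stated assumptions: the contour is modeled as a hypothetical function of time in milliseconds, the 5 ms frame length and midpoint sampling are illustrative choices, and `pitch_frames` is not a name from the patent.

```python
import math
from typing import Callable, List


def pitch_frames(
    contour: Callable[[float], float],
    duration_ms: float,
    frame_ms: float = 5.0,
) -> List[float]:
    """Sample a syllable's pitch contour into fixed-length pitch frames.

    Assumption: each frame takes the contour value at its midpoint,
    clamped to the predicted syllable duration.
    """
    if duration_ms <= 0 or frame_ms <= 0:
        raise ValueError("duration_ms and frame_ms must be positive")
    num_frames = max(1, math.ceil(duration_ms / frame_ms))
    return [
        contour(min((i + 0.5) * frame_ms, duration_ms))
        for i in range(num_frames)
    ]
```

The predicted duration fixes how many frames are produced, and the contour supplies the value carried by each frame.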
-
Publication No.: US10923107B2
Publication date: 2021-02-16
Application No.: US16382722
Filing date: 2019-04-12
Applicant: Google LLC
Inventor: Robert Clark, Chun-an Chan, Vincent Wan
IPC: G10L13/10, G10L13/047
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.