-
Publication number: US12100382B2
Publication date: 2024-09-24
Application number: US17492543
Filing date: 2021-10-01
Applicant: Google LLC
Inventor: Yu Zhang, Isaac Elias, Byungha Chun, Ye Jia, Yonghui Wu, Mike Chrzanowski, Jonathan Shen
IPC: G10L13/027, G10L13/04
CPC classification number: G10L13/027, G10L13/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes: processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
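Illustrative note: the upsampling step in the abstract can be sketched as below. This is a minimal sketch, not the claimed method. The abstract does not say how the upsampling is performed, so plain repetition of each text-element representation for its rounded predicted duration is an assumption, and the function name `upsample_by_duration` and the toy values are hypothetical.

```python
import numpy as np


def upsample_by_duration(representations, durations):
    """Repeat each input-step representation for its predicted number of frames.

    Assumption: durations are expressed in output frames and are rounded to
    the nearest integer; the abstract does not specify either detail.
    """
    representations = np.asarray(representations)
    durations = np.asarray(durations, dtype=float)
    if representations.shape[0] != durations.shape[0]:
        raise ValueError("need exactly one duration per input time step")
    # Round to whole frames and clamp negatives to zero.
    frames = np.maximum(np.rint(durations).astype(int), 0)
    # Row i is repeated frames[i] times along the time axis.
    return np.repeat(representations, frames, axis=0)


# Toy stand-ins for the outputs of the first and second neural networks.
modified_input = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
predicted_durations = [2.0, 0.0, 3.2]

intermediate = upsample_by_duration(modified_input, predicted_durations)
print(intermediate.shape)  # (5, 2)
```

The intermediate sequence has one row per intermediate time step (here 2 + 0 + 3 = 5), which a downstream decoder would then turn into the output audio sequence.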
-
Publication number: US20220108680A1
Publication date: 2022-04-07
Application number: US17492543
Filing date: 2021-10-01
Applicant: Google LLC
Inventor: Yu Zhang, Isaac Elias, Byungha Chun, Ye Jia, Yonghui Wu, Mike Chrzanowski, Jonathan Shen
IPC: G10L13/027, G10L13/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes: processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
-