-
Publication Number: US11694674B1
Publication Date: 2023-07-04
Application Number: US17331427
Filing Date: 2021-05-26
Inventors: Syed Ammar Abbas, Bajibabu Bollepalli, Alexis Pierre Moinet, Thomas Renaud Drugman, Arnaud Vincent Pierre Yves Joly, Panagiota Karanasou, Sri Vishnu Kumar Karlapati, Simon Slangen, Petr Makarov
Abstract: Techniques for performing text-to-speech are described. An exemplary method includes receiving a request to generate audio from input text; generating audio from the input text by: generating a first number of vectors from phoneme embeddings representing the input text, predicting one or more spectrograms having the first number of frames using multiple scales wherein a coarser scale influences a finer scale, concatenating the first number of vectors and the predicted one or more spectrograms, generating at least one mel spectrogram from the concatenated vectors and the predicted one or more spectrograms, and converting, with a vocoder, the at least one mel spectrogram to audio; and outputting the generated audio according to the request.
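The sketch below illustrates, in PyTorch, the multi-scale conditioning idea from this abstract: a coarse spectrogram prediction influences a finer one, and the frame vectors are concatenated with the predicted spectrograms before decoding to a mel spectrogram. Module names, layer sizes, the GRU encoder, and the fixed frames-per-phoneme upsampling are illustrative assumptions, not the patented design.

```python
# Minimal sketch of coarse-to-fine spectrogram prediction feeding a mel decoder.
# All architecture choices here are assumptions for illustration only.
import torch
import torch.nn as nn


class MultiScaleSpectrogramPredictor(nn.Module):
    """Predicts a coarse spectrogram first, then refines it at a finer scale."""

    def __init__(self, in_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.coarse = nn.Linear(in_dim, n_mels)          # coarse-scale prediction
        self.fine = nn.Linear(in_dim + n_mels, n_mels)   # finer scale sees the coarse output

    def forward(self, frame_vectors: torch.Tensor) -> torch.Tensor:
        # frame_vectors: (batch, frames, in_dim), derived from phoneme embeddings
        coarse = self.coarse(frame_vectors)
        # The coarser scale influences the finer scale via concatenation.
        return self.fine(torch.cat([frame_vectors, coarse], dim=-1))


class SketchTTS(nn.Module):
    """Phoneme embeddings -> frame vectors -> predicted spectrograms -> mel frames."""

    def __init__(self, n_phonemes: int = 100, emb_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, emb_dim)
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.spec_predictor = MultiScaleSpectrogramPredictor(emb_dim, n_mels)
        # The decoder consumes the concatenation of frame vectors and predicted spectrograms.
        self.decoder = nn.Linear(emb_dim + n_mels, n_mels)

    def forward(self, phoneme_ids: torch.Tensor, frames_per_phoneme: int = 5) -> torch.Tensor:
        x = self.embed(phoneme_ids)                       # (batch, phonemes, emb_dim)
        x, _ = self.encoder(x)
        # Upsample phoneme-rate vectors to frame rate (a stand-in for duration modelling).
        x = x.repeat_interleave(frames_per_phoneme, dim=1)
        predicted_spec = self.spec_predictor(x)
        mel = self.decoder(torch.cat([x, predicted_spec], dim=-1))
        return mel                                        # (batch, frames, n_mels)


if __name__ == "__main__":
    model = SketchTTS()
    phonemes = torch.randint(0, 100, (1, 12))             # dummy phoneme sequence
    mel = model(phonemes)
    print(mel.shape)                                       # torch.Size([1, 60, 80])
    # A neural vocoder (e.g. a pretrained HiFi-GAN) would then convert `mel` to audio.
```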
-
Publication Number: US11978431B1
Publication Date: 2024-05-07
Application Number: US17326886
Filing Date: 2021-05-21
Inventors: Arnaud Joly, Simon Slangen, Alexis Pierre Moinet, Thomas Renaud Drugman, Panagiota Karanasou, Syed Ammar Abbas, Sri Vishnu Kumar Karlapati
IPC Classes: G10L13/027, G10L13/06, G10L13/07, G10L13/08, G10L15/32
CPC Classes: G10L13/027, G10L13/06, G10L13/07, G10L13/08, G10L15/32
Abstract: A speech-processing system receives input data representing text. One or more encoders trained to predict audio properties corresponding to the text process the text to predict those properties. A speech decoder processes phoneme embeddings as well as the predicted properties to create data representing synthesized speech.
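A minimal PyTorch sketch of this abstract's structure: separate encoders predict audio properties from the text-derived representation, and a speech decoder consumes phoneme embeddings together with those predicted properties. The choice of pitch and energy as the properties, and all layer sizes, are assumptions made for illustration.

```python
# Sketch: property encoders + speech decoder over phoneme embeddings.
# Property choices (pitch, energy) and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class PropertyEncoder(nn.Module):
    """Predicts one scalar audio property per phoneme (e.g. pitch or energy)."""

    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, 1))

    def forward(self, phoneme_emb: torch.Tensor) -> torch.Tensor:
        return self.net(phoneme_emb)                      # (batch, phonemes, 1)


class SpeechDecoder(nn.Module):
    """Decodes phoneme embeddings plus predicted properties into mel frames."""

    def __init__(self, emb_dim: int = 256, n_props: int = 2, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRU(emb_dim + n_props, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, n_mels)

    def forward(self, phoneme_emb: torch.Tensor, properties: torch.Tensor) -> torch.Tensor:
        x, _ = self.rnn(torch.cat([phoneme_emb, properties], dim=-1))
        return self.out(x)


if __name__ == "__main__":
    embed = nn.Embedding(100, 256)
    pitch_enc, energy_enc = PropertyEncoder(), PropertyEncoder()
    decoder = SpeechDecoder()

    phonemes = torch.randint(0, 100, (1, 12))              # dummy phoneme sequence
    phoneme_emb = embed(phonemes)
    props = torch.cat([pitch_enc(phoneme_emb), energy_enc(phoneme_emb)], dim=-1)
    mel = decoder(phoneme_emb, props)
    print(mel.shape)                                        # torch.Size([1, 12, 80])
```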
-
Publication Number: US11830476B1
Publication Date: 2023-11-28
Application Number: US17342206
Filing Date: 2021-06-08
Inventors: Panagiota Karanasou, Sri Vishnu Kumar Karlapati, Alexis Pierre Moinet, Arnaud Vincent Pierre Yves Joly, Syed Ammar Abbas, Thomas Renaud Drugman, Jaime Lorenzo Trueba
CPC Classes: G10L13/10, G06N3/08, G10L13/07, G10L13/086, G10L25/30
Abstract: Devices and techniques are generally described for learned condition text-to-speech synthesis. In some examples, first data representing a selection of a type of prosodic expressivity may be received. In some further examples, a selection of content comprising text data may be received. First audio data may be determined that includes an audio representation of the text data. The first audio data may be generated based at least in part on sampling from a first latent distribution generated using a conditional primary variational autoencoder (VAE). The sampling from the first latent distribution may be conditioned on a first learned distribution associated with the type of prosodic expressivity. In various examples, the first audio data may be sent to a first computing device.
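The sketch below illustrates, in PyTorch, the core sampling step this abstract describes: drawing a latent vector from a distribution conditioned on a learned, per-style prior associated with a prosodic-expressivity type. The Gaussian parameterisation, the embedding-table priors, and all dimensions are assumptions for illustration, not the patented conditional VAE.

```python
# Sketch: sample a latent conditioned on a learned per-style Gaussian prior.
# Parameterisation and dimensions are illustrative assumptions only.
import torch
import torch.nn as nn


class ConditionalLatentSampler(nn.Module):
    """Holds a learned Gaussian prior per prosodic-expressivity type."""

    def __init__(self, n_styles: int = 4, latent_dim: int = 16):
        super().__init__()
        self.mu = nn.Embedding(n_styles, latent_dim)       # learned means per style
        self.log_var = nn.Embedding(n_styles, latent_dim)  # learned log-variances per style

    def forward(self, style_id: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.mu(style_id), self.log_var(style_id)
        std = torch.exp(0.5 * log_var)
        # Reparameterised sample from the style-conditioned latent distribution.
        return mu + std * torch.randn_like(std)


if __name__ == "__main__":
    sampler = ConditionalLatentSampler()
    style = torch.tensor([2])                              # e.g. a selected expressivity type
    z = sampler(style)                                     # (1, 16) latent vector
    print(z.shape)
    # A TTS decoder would consume `z` together with phoneme embeddings to
    # synthesize audio with the selected prosodic expressivity.
```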
-