-
Publication No.: US11443733B2
Publication date: 2022-09-13
Application No.: US16665886
Filing date: 2019-10-28
Inventors: Roberto Barra Chicote, Javier Latorre, Adam Franciszek Nadolski, Viacheslav Klimkov, Thomas Edward Merritt
IPC classification: G10L13/10, G10L13/033, G10L13/047
Abstract: A text-to-speech (TTS) system that is capable of considering characteristics of various portions of text data in order to create continuity between segments of synthesized speech. The system can analyze text portions of a work and create feature vectors including data corresponding to characteristics of the individual portions and/or the overall work. A TTS processing component can then consider feature vector(s) from other portions when performing TTS processing on text of a first portion, thus giving the TTS component some intelligence regarding other portions of the work, which can then result in more continuity between synthesized speech segments.
-
Publication No.: US20230043916A1
Publication date: 2023-02-09
Application No.: US17848831
Filing date: 2022-06-24
Inventors: Roberto Barra Chicote, Vatsal Aggarwal, Andrew Paul Breen, Javier Gonzalez Hernandez, Nishant Prateek
IPC classification: G10L13/10, G06F40/30, G10L13/033, G10L13/047
Abstract: During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characteristic of synthesized speech. The voice decoder may process the vocal characteristic data to determine configuration data, such as weights, for use by the speech decoder.
-
Publication No.: US10699695B1
Publication date: 2020-06-30
Application No.: US16023370
Filing date: 2018-06-29
Inventors: Adam Franciszek Nadolski, Daniel Korzekwa, Thomas Edward Merritt, Marco Nicolis, Bartosz Putrycz, Roberto Barra Chicote, Rafal Kuklinski, Wiktor Dolecki
IPC classification: G10L13/10, G10L13/06, G10L13/047
Abstract: During text-to-speech processing, audio data corresponding to a word part, word, or group of words is generated using a trained model and used by a unit selection engine to create output audio. The audio data is generated at least when an input word is unrecognized or when a cost of a unit selection is too high.
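
Illustrative sketch (not from the patent): the fallback rule in this abstract can be read as a simple decision between a stored unit and model-generated audio. The names `Candidate`, `choose_source`, `unit_db`, and `max_cost` are hypothetical, and the single scalar cost is an assumption; the abstract does not say how the cost is computed or where the threshold comes from.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """A stored speech unit with its selection cost (hypothetical structure)."""
    unit_id: str
    cost: float


def choose_source(
    word: str,
    unit_db: Dict[str, List[Candidate]],
    max_cost: float,
) -> Tuple[str, Optional[str]]:
    """Decide whether to use a stored unit or ask the trained model for audio.

    Returns ("unit", unit_id) when a stored unit is cheap enough, and
    ("model", None) when the word is unrecognized or every unit costs
    more than max_cost.
    """
    candidates = unit_db.get(word, [])
    if not candidates:
        # Unrecognized word: no stored unit, so fall back to the model.
        return ("model", None)
    best = min(candidates, key=lambda c: c.cost)
    if best.cost > max_cost:
        # Best available unit is still too costly: fall back to the model.
        return ("model", None)
    return ("unit", best.unit_id)


# Toy data, assumed purely for illustration.
unit_db = {
    "hello": [Candidate("hello_01", 0.9), Candidate("hello_02", 0.2)],
    "world": [Candidate("world_01", 3.5)],
}
```

With this toy data, `choose_source("hello", unit_db, 1.0)` picks the stored unit `hello_02`, while `"world"` (too costly) and an unknown word both fall back to the model.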
-
Publication No.: US10692484B1
Publication date: 2020-06-23
Application No.: US16007757
Filing date: 2018-06-13
Inventors: Thomas Edward Merritt, Adam Franciszek Nadolski, Nishant Prateek, Bartosz Putrycz, Roberto Barra Chicote, Vatsal Aggarwal, Andrew Paul Breen
IPC classification: G10L13/04, G10L13/08, G10L25/24, G10L25/60, G10L13/047
Abstract: A speech model is trained using multi-task learning. A first task may correspond to how well predicted audio matches training audio; a second task may correspond to a metric of perceived audio quality. The speech model may include, during training, layers related to the second task that are discarded at runtime.
-
Publication No.: US11373633B2
Publication date: 2022-06-28
Application No.: US16586007
Filing date: 2019-09-27
Inventors: Roberto Barra Chicote, Vatsal Aggarwal, Andrew Paul Breen, Javier Gonzalez Hernandez, Nishant Prateek
IPC classification: G10L13/033, G10L13/047, G10L15/18, G10L13/10, G06F40/30
Abstract: During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characteristic of synthesized speech. The voice decoder may process the vocal characteristic data to determine configuration data, such as weights, for use by the speech decoder.
-
Publication No.: US20210097976A1
Publication date: 2021-04-01
Application No.: US16586007
Filing date: 2019-09-27
Inventors: Roberto Barra Chicote, Vatsal Aggarwal, Andrew Paul Breen, Javier Gonzalez Hernandez, Nishant Prateek
IPC classification: G10L13/10, G10L13/047, G06F17/27, G10L13/033
Abstract: During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characteristic of synthesized speech. The voice decoder may process the vocal characteristic data to determine configuration data, such as weights, for use by the speech decoder.
-
Publication No.: US20200152169A1
Publication date: 2020-05-14
Application No.: US16665886
Filing date: 2019-10-28
Inventors: Roberto Barra Chicote, Javier Latorre, Adam Franciszek Nadolski, Viacheslav Klimkov, Thomas Edward Merritt
IPC classification: G10L13/10, G10L13/033, G10L13/047
Abstract: A text-to-speech (TTS) system that is capable of considering characteristics of various portions of text data in order to create continuity between segments of synthesized speech. The system can analyze text portions of a work and create feature vectors including data corresponding to characteristics of the individual portions and/or the overall work. A TTS processing component can then consider feature vector(s) from other portions when performing TTS processing on text of a first portion, thus giving the TTS component some intelligence regarding other portions of the work, which can then result in more continuity between synthesized speech segments.
-
Publication No.: US12100383B1
Publication date: 2024-09-24
Application No.: US17707203
Filing date: 2022-03-29
Inventors: Abdelhamid Ezzerg, Piotr Tadeusz Bilinski, Thomas Edward Merritt, Roberto Barra Chicote, Daniel Korzekwa, Kamil Pokora
IPC classification: G10L13/047, G06N3/045, G10L25/30
CPC classification: G10L13/047, G06N3/045, G10L25/30
Abstract: Voice customization is an application of voice synthesis that involves synthesizing speech having certain voice characteristics, and/or modifying the voice characteristics of human speech. Certain techniques for voice customization may be used in conjunction with compressing speech for storage and/or transmission. For example, speech may be received at a first device and transformed into a latent representation and/or compressed for storage and/or transmission to a second device. The system may use normalizing flows to transform the source audio to a latent representation having a desired variable distribution, and to transform the latent representation back into audio data. A flow model may be conditioned using first speech attributes when transforming the source audio, and an inverse flow model may use second speech attributes when transforming the latent representation back into audio data. The first and/or second speech attributes may be modified to alter voice characteristics of the transmitted speech.
-
Publication No.: US11017763B1
Publication date: 2021-05-25
Application No.: US16712466
Filing date: 2019-12-12
IPC classification: G10L15/22, G10L15/26, G10L13/08, G10L13/047, G10L13/033
Abstract: During text-to-speech processing, a sequence-to-sequence neural network model may process text data and determine corresponding spectrogram data. A normalizing flow component may then process this spectrogram data to predict corresponding phase data. An inverse Fourier transform may then be performed on the spectrogram and phase data to create an audio waveform that includes speech corresponding to the text.
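
Illustrative sketch (not from the patent): the last step of this abstract, combining magnitude and phase and applying an inverse Fourier transform, can be shown for a single frame in plain Python. The function names are hypothetical. In the sketch the phase comes from a forward transform of a known frame, standing in for the phase the abstract says a normalizing flow component would predict; windowing and overlap-add across frames are omitted.

```python
import cmath
import math
from typing import List


def dft(frame: List[float]) -> List[complex]:
    """Forward discrete Fourier transform of one frame (naive O(n^2))."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frame_from_magnitude_and_phase(
    magnitude: List[float], phase: List[float]
) -> List[float]:
    """Rebuild time-domain samples from per-bin magnitude and phase.

    Each bin becomes the complex value magnitude * exp(i * phase); the
    inverse DFT of those bins gives the waveform samples for the frame.
    """
    n = len(magnitude)
    spectrum = [cmath.rect(m, p) for m, p in zip(magnitude, phase)]
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


# Round trip on a toy frame: split a spectrum into magnitude and phase,
# then recombine and invert.
frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
spectrum = dft(frame)
magnitude = [abs(c) for c in spectrum]
phase = [cmath.phase(c) for c in spectrum]
rebuilt = frame_from_magnitude_and_phase(magnitude, phase)
```

The round trip recovers the original frame up to floating-point error, which is why accurate phase is needed alongside the magnitude spectrogram to obtain a usable waveform.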
-
Publication No.: US10706837B1
Publication date: 2020-07-07
Application No.: US16007811
Filing date: 2018-06-13
Inventors: Roberto Barra Chicote, Adam Franciszek Nadolski, Thomas Edward Merritt, Bartosz Putrycz, Andrew Paul Breen
IPC classification: G10L13/033, G10L13/04, G10L13/10
Abstract: A speech model includes a sub-model corresponding to a vocal attribute. The speech model generates an output waveform using a sample model, which receives text data, and a conditioning model, which receives text metadata and produces a prosody output for use by the sample model. If, during training or runtime, a different vocal attribute is desired or needed, the sub-model is re-trained or switched to a different sub-model corresponding to the different vocal attribute.