Real-time neural text-to-speech

Granted invention patent

US11705107B2 Real-time neural text-to-speech (in force)

Patent title: Real-time neural text-to-speech
Application number: US17061433

Filing date: 2020-10-01
Publication number: US11705107B2

Publication date: 2023-07-18
Inventors: Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, John Miller, Andrew Ng, Jonathan Raiman, Shubhahrata Sengupta, Mohammad Shoeybi
Applicant: Baidu USA, LLC
Applicant address: Sunnyvale, CA, US
Patentee: Baidu USA LLC
Current patentee: Baidu USA LLC
Current patentee address: Sunnyvale, CA, US
Agency: North Weber & Baugh LLP
Main classification: G10L13/08
IPC classifications: G10L13/08; G10L13/027; G10L25/30; G06N3/082; G06N3/044; G06N3/045; G06N3/02; G06F40/242; G06N3/047

Abstract:

Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
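
The data flow among the building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: every function below is a hypothetical toy stand-in for a neural network, the 16 kHz sample rate is an assumption, and the sketch assumes that the segmentation model is used to produce phoneme-boundary labels for training, so it does not appear in the synthesis path. The sketch also shows how "faster than real time" is commonly quantified, as a ratio of compute time to audio duration below 1.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

SAMPLE_RATE = 16_000  # assumed for illustration; not stated in the abstract


def toy_g2p(text: str) -> List[str]:
    """Hypothetical grapheme-to-phoneme stand-in: one symbol per letter."""
    return [ch.upper() for ch in text if ch.isalpha()]


def toy_durations(phonemes: Sequence[str]) -> List[float]:
    """Hypothetical duration model stand-in: 0.1 s per phoneme."""
    return [0.1 for _ in phonemes]


def toy_f0(phonemes: Sequence[str], durations: Sequence[float]) -> List[float]:
    """Hypothetical fundamental-frequency stand-in: a flat 120 Hz contour."""
    return [120.0 for _ in phonemes]


def toy_vocoder(phonemes: Sequence[str], durations: Sequence[float],
                f0: Sequence[float]) -> List[float]:
    """Hypothetical audio synthesis stand-in: one sine segment per phoneme."""
    audio: List[float] = []
    for seconds, hz in zip(durations, f0):
        n_samples = int(round(seconds * SAMPLE_RATE))
        audio.extend(math.sin(2.0 * math.pi * hz * i / SAMPLE_RATE)
                     for i in range(n_samples))
    return audio


@dataclass
class TTSPipeline:
    """Chains the four components assumed to run at synthesis time."""
    g2p: Callable[[str], List[str]]
    duration: Callable[[Sequence[str]], List[float]]
    f0: Callable[[Sequence[str], Sequence[float]], List[float]]
    vocoder: Callable[[Sequence[str], Sequence[float], Sequence[float]],
                      List[float]]

    def synthesize(self, text: str) -> List[float]:
        phonemes = self.g2p(text)                 # text -> phonemes
        durations = self.duration(phonemes)       # phonemes -> seconds each
        contour = self.f0(phonemes, durations)    # phonemes -> pitch in Hz
        return self.vocoder(phonemes, durations, contour)  # -> waveform


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio length; below 1.0 is faster than real time."""
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    pipeline = TTSPipeline(toy_g2p, toy_durations, toy_f0, toy_vocoder)
    samples = pipeline.synthesize("hi")
    print(len(samples) / SAMPLE_RATE, "seconds of audio")
```

Because each stage is just a callable with a fixed input and output, any one of them can be swapped for a trained model without touching the others, which mirrors the abstract's point about using one neural network per component.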

Publication/grant documents

US20210027762A1 REAL-TIME NEURAL TEXT-TO-SPEECH Publication/grant date: 2021-01-28

IPC classification:

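G	Physics
G10	Musical instruments; acoustics
G10L	Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
G10L13/00	Speech synthesis; text-to-speech systems
G10L13/08	.Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination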