Unsupervised Parallel Tacotron: Non-Autoregressive and Controllable Text-To-Speech
Abstract:
A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
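To make the described data flow concrete, below is a minimal PyTorch sketch of the training step outlined in the abstract: durations are predicted from the encoded text concatenated with a variational embedding, an interval representation is formed from cumulative durations, a soft attention-style weight matrix upsamples the phoneme sequence to a fixed number of frames, a decoder predicts a mel spectrogram, and a spectrogram loss is computed against a reference. The Gaussian-style upsampling, all module names, layer sizes, and the L1 loss are illustrative assumptions, not the claimed implementation.

# Hypothetical sketch of the training step described above (PyTorch).
# Names, dimensions, and the Gaussian-style soft alignment are assumptions
# for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DurationUpsampler(nn.Module):
    """Predicts per-phoneme durations and upsamples phoneme features to frames."""

    def __init__(self, dim: int):
        super().__init__()
        self.duration_predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Softplus()
        )
        # Predicts a per-phoneme spread (sigma) for soft interval boundaries.
        self.range_predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Softplus()
        )

    def forward(self, h: torch.Tensor, n_frames: int):
        # h: (batch, n_phonemes, dim) = encoded text sequence concatenated
        # (per phoneme) with a variational embedding.
        durations = self.duration_predictor(h).squeeze(-1)         # (B, K)
        sigma = self.range_predictor(h).squeeze(-1) + 1e-3         # (B, K)

        # Interval representation: cumulative end times and centers per phoneme.
        ends = torch.cumsum(durations, dim=-1)                     # (B, K)
        centers = ends - 0.5 * durations                           # (B, K)

        # Soft alignment (an "auxiliary attention context" stand-in): each
        # output frame t attends to phonemes via a Gaussian at each center.
        t = torch.arange(n_frames, device=h.device).float()        # (T,)
        logits = -((t[None, :, None] - centers[:, None, :]) ** 2) / (
            2.0 * sigma[:, None, :] ** 2
        )                                                          # (B, T, K)
        weights = F.softmax(logits, dim=-1)                        # (B, T, K)

        upsampled = torch.bmm(weights, h)                          # (B, T, dim)
        return upsampled, durations


class TinyNonARTTS(nn.Module):
    def __init__(self, dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.upsampler = DurationUpsampler(dim)
        self.decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_mels)
        )

    def forward(self, h: torch.Tensor, n_frames: int):
        upsampled, durations = self.upsampler(h, n_frames)
        mel = self.decoder(upsampled)                              # (B, T, n_mels)
        return mel, durations


if __name__ == "__main__":
    batch, n_phonemes, dim, n_frames, n_mels = 2, 12, 128, 100, 80
    model = TinyNonARTTS(dim, n_mels)
    h = torch.randn(batch, n_phonemes, dim)         # encoded text + VAE embedding
    mel_ref = torch.randn(batch, n_frames, n_mels)  # reference mel spectrogram

    mel_pred, durations = model(h, n_frames)
    # Final spectrogram loss: here a plain L1 between predicted and reference mels.
    spec_loss = F.l1_loss(mel_pred, mel_ref)
    spec_loss.backward()
    print(f"spectrogram loss: {spec_loss.item():.4f}")

Note that in this sketch the durations receive no direct supervision; they are trained only through the spectrogram loss, consistent with the unsupervised setting named in the title. In practice a real system would add further regularization and auxiliary losses, which are outside the scope of this illustration.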