-
Publication No.: US12243511B1
Publication Date: 2025-03-04
Application No.: US17709788
Application Date: 2022-03-31
Applicant: Amazon Technologies, Inc.
Inventor: Arnaud Vincent Pierre Yves Joly , Marco Nicolis , Elena Sergeevna Sokolova , Jedrzej Sobanski , Mateusz Aleksander Lajszczak , Arent van Korlaar , Ruizhe Li
IPC: G10L13/10 , G10L13/033 , G10L13/04 , G10L13/06 , G10L15/26
Abstract: A neural text-to-speech system may be configured to emphasize words. Applying emphasis where appropriate enables the TTS system to better reproduce prosodic characteristics of human speech. Emphasis may make the resulting synthesized speech more understandable and engaging than synthesized speech lacking emphasis. Emphasis annotations may be manually added to, and/or predicted from, a source text (e.g., a book). In some implementations, the system may use a generative model such as a variational autoencoder to generate word acoustic embeddings indicating how emphasis is to be reflected in the synthesized speech. A phoneme encoder of the TTS system may process phonemes to generate phoneme embeddings. A decoder may process the word acoustic embeddings and the phoneme embeddings to generate spectrogram data representing the synthesized speech.
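The abstract describes a three-part pipeline: a variational encoder producing per-word acoustic (emphasis) embeddings, a phoneme encoder producing phoneme embeddings, and a decoder combining the two into spectrogram frames. The PyTorch sketch below illustrates that data flow only; all module names, layer sizes, and the phoneme-to-word alignment scheme are illustrative assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn

class WordAcousticVAE(nn.Module):
    """Toy variational encoder: maps per-word features to a latent
    'word acoustic embedding' that can carry emphasis information."""
    def __init__(self, feat_dim=16, latent_dim=8):
        super().__init__()
        self.mu = nn.Linear(feat_dim, latent_dim)
        self.logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, word_feats):                    # (batch, words, feat_dim)
        mu, logvar = self.mu(word_feats), self.logvar(word_feats)
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)     # reparameterized sample

class EmphasisTTS(nn.Module):
    """Phoneme encoder + decoder conditioned on word acoustic embeddings."""
    def __init__(self, n_phonemes=64, phone_dim=32, latent_dim=8, n_mels=80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, phone_dim)
        self.phone_enc = nn.GRU(phone_dim, phone_dim, batch_first=True)
        self.decoder = nn.GRU(phone_dim + latent_dim, 128, batch_first=True)
        self.to_mel = nn.Linear(128, n_mels)

    def forward(self, phonemes, word_latents, phone_to_word):
        # phonemes: (batch, T) ids; word_latents: (batch, W, latent_dim)
        # phone_to_word: (batch, T) index of the word each phoneme belongs to
        h, _ = self.phone_enc(self.phone_emb(phonemes))
        # broadcast each word's acoustic embedding onto its phonemes
        idx = phone_to_word.unsqueeze(-1).expand(-1, -1, word_latents.size(-1))
        w = torch.gather(word_latents, 1, idx)
        out, _ = self.decoder(torch.cat([h, w], dim=-1))
        return self.to_mel(out)                       # (batch, T, n_mels) spectrogram

# Tiny smoke test with random data
vae, tts = WordAcousticVAE(), EmphasisTTS()
latents = vae(torch.randn(2, 3, 16))                  # 2 utterances, 3 words each
phonemes = torch.randint(0, 64, (2, 10))
phone_to_word = torch.randint(0, 3, (2, 10))
print(tts(phonemes, latents, phone_to_word).shape)    # torch.Size([2, 10, 80])
```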
-
Publication No.: US12254864B1
Publication Date: 2025-03-18
Application No.: US17854439
Application Date: 2022-06-30
Applicant: Amazon Technologies, Inc.
Inventor: Mateusz Aleksander Lajszczak , Adam Marek Gabrys , Arent van Korlaar , Ruizhe Li , Elena Sergeevna Sokolova , Jaime Lorenzo Trueba , Arnaud Vincent Pierre Yves Joly , Marco Nicolis , Ekaterina Petrova
Abstract: A target voice dataset may be augmented using speech prediction. Encoder and decoder models may be trained to encode audio data into encoded speech data, and convert it back to audio. The encoded speech data may include semantic information (e.g., phonemes and/or words) as well as feature data indicating prosody, timbre, speaker identity, speech style, emotion, etc. of speech. An acoustic/semantic language model (ASLM) may be configured to predict encoded speech data in a manner analogous to a language model predicting words; for example, based on preceding encoded speech data. The models may be used to generate synthesized speech samples having voice characteristics (e.g., feature data) similar to those of the target voice dataset. The augmented dataset may be used to train a text-to-speech (TTS) model to reproduce the target voice characteristics, and may improve performance of the TTS model over training with only the original target voice dataset.
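The core idea in this abstract is a language-model-style predictor over encoded speech units, conditioned on target-voice feature data, whose samples augment a small target voice dataset. The sketch below shows that autoregressive sampling loop under simplifying assumptions: discrete speech units, a single GRU-based model, and a placeholder feature vector; the actual unit-to-audio decoder and the encoders are not shown, and none of the names come from the patent.

```python
import torch
import torch.nn as nn

class SpeechUnitLM(nn.Module):
    """Toy 'acoustic/semantic language model': autoregressively predicts the
    next encoded speech unit, conditioned on a voice/style feature vector."""
    def __init__(self, n_units=256, dim=64, feat_dim=16):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.cond = nn.Linear(feat_dim, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_units)

    def forward(self, units, voice_feats):
        # units: (batch, T) discrete unit ids; voice_feats: (batch, feat_dim)
        x = self.unit_emb(units) + self.cond(voice_feats).unsqueeze(1)
        h, _ = self.rnn(x)
        return self.head(h)                        # logits over the next unit

    @torch.no_grad()
    def sample(self, prompt, voice_feats, steps=20):
        """Extend a prompt of encoded speech units, one unit at a time."""
        units = prompt
        for _ in range(steps):
            logits = self.forward(units, voice_feats)[:, -1]
            nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            units = torch.cat([units, nxt], dim=1)
        return units

# Augmentation loop sketch: sample new unit sequences that share the target
# voice's feature vector; a separate unit-to-audio decoder (not shown) would
# turn them into synthetic waveforms for TTS training.
aslm = SpeechUnitLM()
target_voice_feats = torch.randn(4, 16)            # stand-in for prosody/timbre features
prompt_units = torch.randint(0, 256, (4, 5))       # encoded units from the target dataset
augmented_units = aslm.sample(prompt_units, target_voice_feats)
print(augmented_units.shape)                       # torch.Size([4, 25])
```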
-