-
Publication No.: US20230260502A1
Publication Date: 2023-08-17
Application No.: US17671006
Filing Date: 2022-02-14
Applicant: Amazon Technologies, Inc.
Inventor: Adam Marek Gabrys, Jaime Lorenzo Trueba, Goeric Sydney Huybrechts
IPC: G10L13/047, G10L13/08, G10L13/027, G06N3/04, G10L19/16
CPC classification number: G10L13/047, G10L13/08, G10L13/027, G06N3/0454, G10L19/16
Abstract: A text-to-speech (TTS) system may be configured to imitate characteristics of a target voice based on a limited dataset. The TTS system may include a machine learning model pre-trained using a synthetic parallel dataset and fine-tuned using examples of the target voice. A TTS component trained using a large single-speaker dataset may be used to generate the synthetic parallel dataset based on a multi-speaker dataset. The synthetic parallel dataset may include target audio data representing speech in the multi-speaker dataset and predicted audio data generated by the TTS component based on transcripts of the speech. The machine learning model may be pre-trained using the synthetic parallel dataset and fine-tuned using audio data representing target voice speech and predicted audio generated by the TTS component based on transcripts of the target voice speech. The trained model may be used to modify synthetic speech to approximate the characteristics of the target speech.
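The data flow in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: `Utterance`, `ParallelPair`, `build_parallel_dataset`, and `toy_tts` are hypothetical names, audio is represented as a plain list of floats, and the stand-in TTS function only mimics the role of the single-speaker TTS component.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = List[float]  # assumption: a waveform is just a list of samples here


@dataclass
class Utterance:
    speaker_id: str
    transcript: str
    audio: Audio  # recorded speech


@dataclass
class ParallelPair:
    predicted: Audio  # synthesized by the single-speaker TTS from the transcript
    target: Audio     # original recording of the same transcript
    speaker_id: str


def build_parallel_dataset(
    utterances: Sequence[Utterance], tts: Callable[[str], Audio]
) -> List[ParallelPair]:
    """Pair each recording with TTS output generated from its transcript."""
    return [
        ParallelPair(predicted=tts(u.transcript), target=u.audio, speaker_id=u.speaker_id)
        for u in utterances
    ]


def toy_tts(text: str) -> Audio:
    # Hypothetical stand-in for the single-speaker TTS: one "sample" per character.
    return [float(ord(c) % 7) for c in text]


# Large multi-speaker corpus -> pairs used for pre-training.
multi_speaker = [
    Utterance("spk_a", "hello", [0.1, 0.2, 0.3]),
    Utterance("spk_b", "good morning", [0.4, 0.5]),
]
pretrain_pairs = build_parallel_dataset(multi_speaker, toy_tts)

# Small target-voice set -> pairs used for fine-tuning.
target_voice = [Utterance("target", "hi", [0.9, 0.8])]
finetune_pairs = build_parallel_dataset(target_voice, toy_tts)
```

The same pairing function serves both stages; only the source corpus changes. A model trained on these pairs would learn to map `predicted` audio toward `target` audio, which is the modification step the abstract describes.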
-
Publication No.: US12254864B1
Publication Date: 2025-03-18
Application No.: US17854439
Filing Date: 2022-06-30
Applicant: Amazon Technologies, Inc.
Inventor: Mateusz Aleksander Lajszczak, Adam Marek Gabrys, Arent van Korlaar, Ruizhe Li, Elena Sergeevna Sokolova, Jaime Lorenzo Trueba, Arnaud Vincent Pierre Yves Joly, Marco Nicolis, Ekaterina Petrova
Abstract: A target voice dataset may be augmented using speech prediction. Encoder and decoder models may be trained to encode audio data into encoded speech data, and convert it back to audio. The encoded units may include semantic information (e.g., phonemes and/or words) as well as feature data indicating prosody, timbre, speaker identity, speech style, emotion, etc. of speech. An acoustic/semantic language model (ASLM) may be configured to predict encoded speech data in a manner analogous to a language model predicting words; for example, based on preceding encoded speech data. The models may be used to generate synthesized speech samples having voice characteristics (e.g., feature data) similar to those of the target voice dataset. The augmented dataset may be used to train a text-to-speech (TTS) model to reproduce the target voice characteristics, and may improve performance of the TTS model over training with only the original target voice dataset.
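The language-model analogy in this abstract can be illustrated with a toy next-unit predictor. This is a sketch under strong assumptions, not the ASLM itself: it assumes the encoded speech data is a sequence of discrete integer unit IDs (the abstract does not say so), and it substitutes a simple bigram count model with greedy decoding for whatever model the patent uses. `fit_bigram` and `continue_units` are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def fit_bigram(sequences: Sequence[Sequence[int]]) -> Dict[int, Counter]:
    """Count how often each unit follows each preceding unit."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_units(counts: Dict[int, Counter], prompt: Sequence[int], n_new: int) -> List[int]:
    """Greedily extend a non-empty prompt with the most frequent next unit."""
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this unit
        out.append(followers.most_common(1)[0][0])
    return out


# Assumption: each inner list is one target-voice utterance after encoding.
target_voice_units = [[1, 2, 3, 4], [1, 2, 3, 5], [2, 3, 4, 1]]
model = fit_bigram(target_voice_units)

# Predict new encoded speech data from preceding encoded speech data.
augmented = continue_units(model, prompt=[1], n_new=3)
print(augmented)  # [1, 2, 3, 4]
```

In the pipeline the abstract describes, sequences generated this way would be passed through the decoder to obtain audio, and that audio would be added to the target voice dataset before training the TTS model.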
-
Publication No.: US11915683B2
Publication Date: 2024-02-27
Application No.: US17671006
Filing Date: 2022-02-14
Applicant: Amazon Technologies, Inc.
Inventor: Adam Marek Gabrys, Jaime Lorenzo Trueba, Goeric Sydney Huybrechts
IPC: G10L19/16, G10L13/08, G10L13/047, G06N3/045, G10L13/027
CPC classification number: G10L13/047, G06N3/045, G10L13/027, G10L13/08, G10L19/16
Abstract: A text-to-speech (TTS) system may be configured to imitate characteristics of a target voice based on a limited dataset. The TTS system may include a machine learning model pre-trained using a synthetic parallel dataset and fine-tuned using examples of the target voice. A TTS component trained using a large single-speaker dataset may be used to generate the synthetic parallel dataset based on a multi-speaker dataset. The synthetic parallel dataset may include target audio data representing speech in the multi-speaker dataset and predicted audio data generated by the TTS component based on transcripts of the speech. The machine learning model may be pre-trained using the synthetic parallel dataset and fine-tuned using audio data representing target voice speech and predicted audio generated by the TTS component based on transcripts of the target voice speech. The trained model may be used to modify synthetic speech to approximate the characteristics of the target speech.
-
-