CROSS-SPEAKER STYLE TRANSFER SPEECH SYNTHESIS

    Publication No.: US20230081659A1

    Publication Date: 2023-03-16

    Application No.: US17799031

    Filing Date: 2021-02-01

    Abstract: This disclosure provides methods and apparatuses for training an acoustic model which implements cross-speaker style transfer and comprises at least a style encoder. Training data may be obtained, comprising a text, a speaker ID, a style ID and acoustic features corresponding to a reference audio. A reference embedding vector may be generated, through the style encoder, based on the acoustic features. Adversarial training may be performed on the reference embedding vector with at least the style ID and the speaker ID, so as to remove speaker information and retain style information. A style embedding vector may be generated, through the style encoder, based at least on the reference embedding vector on which the adversarial training has been performed. Predicted acoustic features may be generated based at least on a state sequence corresponding to the text, a speaker embedding vector corresponding to the speaker ID, and the style embedding vector.
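
    A minimal sketch of the adversarial style-encoder idea described in the abstract, assuming a PyTorch setting: a gradient-reversal layer feeds a speaker classifier so that speaker information is pushed out of the reference embedding, while an ordinary style classifier keeps style information. The GRU encoder, layer sizes and classifier heads are illustrative assumptions, not the patent's architecture.

import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients on the way back."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class StyleEncoder(nn.Module):
    def __init__(self, n_mels=80, emb_dim=128, n_speakers=10, n_styles=4):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)
        # Adversarial speaker classifier: trained to predict the speaker, but the
        # reversed gradient drives the embedding to hide speaker identity.
        self.speaker_clf = nn.Linear(emb_dim, n_speakers)
        # Style classifier: ordinary gradient, so style information is retained.
        self.style_clf = nn.Linear(emb_dim, n_styles)

    def forward(self, mels, lam=1.0):
        _, h = self.rnn(mels)                  # h: (1, batch, emb_dim)
        ref_emb = h.squeeze(0)                 # reference embedding vector
        spk_logits = self.speaker_clf(GradReverse.apply(ref_emb, lam))
        sty_logits = self.style_clf(ref_emb)
        return ref_emb, spk_logits, sty_logits


if __name__ == "__main__":
    enc = StyleEncoder()
    mels = torch.randn(2, 120, 80)             # batch of reference mel-spectrograms
    ref_emb, spk_logits, sty_logits = enc(mels)
    print(ref_emb.shape, spk_logits.shape, sty_logits.shape)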

    Voice synthesis method, apparatus, device and storage medium

    Publication No.: US11600259B2

    Publication Date: 2023-03-07

    Application No.: US16565784

    Filing Date: 2019-09-10

    Inventor: Jie Yang

    Abstract: Provided are a voice synthesis method, an apparatus, a device, and a storage medium, involving: obtaining text information and determining characters in the text information and a text content of each of the characters; performing character recognition on the text content of each of the characters to determine character attribute information of each of the characters; obtaining speakers in one-to-one correspondence with the characters according to the character attribute information of each of the characters, where the speakers are pre-stored pronunciation objects having the character attribute information; and generating multi-character synthesized voices according to the text information and the speakers corresponding to the characters of the text information. This improves the pronunciation diversity of different characters in the synthesized voices, makes it easier for an audience to distinguish between different characters in the synthesized voices, and thereby improves the user experience.
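
    A hypothetical Python sketch of the character-to-speaker mapping step: character attribute information (here just gender and age group, which are assumptions) selects a pre-stored speaker, and each utterance is paired with its speaker. The synthesis backend itself is left out, since the abstract does not name one.

from dataclasses import dataclass


@dataclass
class CharacterAttributes:
    name: str
    gender: str
    age_group: str


# Pre-stored pronunciation objects (speakers), keyed by attribute combination.
SPEAKER_BANK = {
    ("female", "child"): "speaker_f_child",
    ("female", "adult"): "speaker_f_adult",
    ("male", "adult"): "speaker_m_adult",
    ("male", "elder"): "speaker_m_elder",
}


def pick_speaker(attrs: CharacterAttributes) -> str:
    """Return the pre-stored speaker matching the character's attributes."""
    return SPEAKER_BANK.get((attrs.gender, attrs.age_group), "speaker_default")


def synthesize_multi_character(lines: list[tuple[CharacterAttributes, str]]) -> list[tuple[str, str]]:
    """Map each (character, utterance) pair to (speaker, utterance);
    a real system would hand each pair to a TTS backend here."""
    return [(pick_speaker(attrs), text) for attrs, text in lines]


if __name__ == "__main__":
    script = [
        (CharacterAttributes("Alice", "female", "child"), "Where are we going?"),
        (CharacterAttributes("Grandpa", "male", "elder"), "To the old mill."),
    ]
    for speaker, text in synthesize_multi_character(script):
        print(f"{speaker}: {text}")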

    Audio file processing method, electronic device, and storage medium

    Publication No.: US11538456B2

    Publication Date: 2022-12-27

    Application No.: US16844283

    Filing Date: 2020-04-09

    Inventor: Chunjiang Lai

    Abstract: An audio file processing method is provided for an electronic device. The method includes extracting at least one audio segment from a first audio file, recognizing, from the at least one audio segment, at least one to-be-replaced audio segment representing a target role, and determining time frame information of each to-be-replaced audio segment in the first audio file. The method also includes obtaining to-be-dubbed audio data for each to-be-replaced audio segment and replacing data in the to-be-replaced audio segment with the to-be-dubbed audio data according to the time frame information, to obtain a second audio file. The at least one to-be-replaced audio segment is divided from the at least one audio segment based on the structure of, and the word count in, the sentence corresponding to each to-be-replaced audio segment.
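
    An illustrative sketch of the time-frame replacement step, not the patented implementation: segments attributed to the target role are cut out of the first audio file by their time frames, and the to-be-dubbed audio is spliced in at the same positions. Sample-index time frames and plain float sample lists are simplifying assumptions.

from dataclasses import dataclass


@dataclass
class TimeFrame:
    start: int   # start sample index in the first audio file
    end: int     # end sample index (exclusive)


def replace_segments(samples: list[float],
                     frames: list[TimeFrame],
                     dubbed: list[list[float]]) -> list[float]:
    """Build the second audio file by replacing each to-be-replaced segment,
    identified by its time frame, with the corresponding dubbed audio data."""
    out: list[float] = []
    cursor = 0
    for frame, dub in sorted(zip(frames, dubbed), key=lambda p: p[0].start):
        out.extend(samples[cursor:frame.start])   # keep audio before the segment
        out.extend(dub)                           # splice in the dubbed audio
        cursor = frame.end
    out.extend(samples[cursor:])                  # keep the remainder
    return out


if __name__ == "__main__":
    original = [0.0] * 10
    frames = [TimeFrame(2, 4), TimeFrame(7, 9)]
    dubbed = [[1.0, 1.0, 1.0], [2.0]]
    print(replace_segments(original, frames, dubbed))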

    System Providing Expressive and Emotive Text-to-Speech

    Publication No.: US20220392430A1

    Publication Date: 2022-12-08

    Application No.: US17880190

    Filing Date: 2022-08-03

    Abstract: A text-to-speech system includes a text and labels module that receives a text input and provides a text analysis and a label with a phonetic description of the text. A label buffer receives the label from the text and labels module. A parameter generation module accesses the label from the label buffer and generates a speech generation parameter. A parameter buffer receives the parameter from the parameter generation module. An audio generation module receives the text input, the label, and/or the parameter and generates a plurality of audio samples. A scheduler monitors and schedules the text and labels module, the parameter generation module, and/or the audio generation module. The parameter generation module is further configured to initialize a voice identifier with a Voice Style Sheet (VSS) parameter, receive an input indicating a modification to the VSS parameter, and modify the VSS parameter according to the modification.
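
    A minimal sketch, assuming the pipeline in the abstract can be modeled as three stages joined by in-memory buffers; the stage internals, the scheduler's behavior, and the Voice Style Sheet fields used here (rate, pitch, emotion) are illustrative assumptions rather than the patent's definitions.

from queue import Queue

label_buffer: Queue = Queue()
parameter_buffer: Queue = Queue()

# Voice Style Sheet (VSS) parameters initialised for a voice identifier and
# modified on request, as the abstract describes.
vss = {"voice_id": "narrator_1", "rate": 1.0, "pitch": 0.0, "emotion": "neutral"}


def text_and_labels_stage(text: str) -> None:
    """Analyse the text and emit a label with a phonetic description (placeholder)."""
    label = {"text": text, "phonemes": list(text.lower())}
    label_buffer.put(label)


def parameter_generation_stage() -> None:
    """Consume a label and emit speech-generation parameters shaped by the VSS."""
    label = label_buffer.get()
    params = {"label": label, "rate": vss["rate"], "pitch": vss["pitch"]}
    parameter_buffer.put(params)


def audio_generation_stage() -> list[float]:
    """Consume parameters and produce audio samples (silence as a stand-in)."""
    params = parameter_buffer.get()
    return [0.0] * (len(params["label"]["phonemes"]) * 100)


def scheduler(text: str) -> list[float]:
    """Run the stages in order; a real scheduler would also monitor buffer levels."""
    text_and_labels_stage(text)
    parameter_generation_stage()
    return audio_generation_stage()


if __name__ == "__main__":
    vss["rate"] = 1.2                        # modify a VSS parameter on request
    print(len(scheduler("Hello there")))     # number of generated samples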

    Method of embodying online media service having multiple voice systems

    Publication No.: US11521593B2

    Publication Date: 2022-12-06

    Application No.: US17076121

    Filing Date: 2020-10-21

    Applicant: Jong Yup Lee

    Inventor: Jong Yup Lee

    Abstract: A method of embodying an online media service having a multiple voice system includes: a first operation of collecting preset online articles and content from a specific media site and displaying the online articles and content on a screen of a personal terminal; a second operation of inputting a voice of a subscriber or setting a voice of a specific person from among voices pre-stored in a database; a third operation of recognizing and classifying the online articles and content; a fourth operation of converting the classified online articles and content into speech; and a fifth operation of outputting the online articles and content using the voice of the subscriber or the specific person set in the second operation.
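
    A hypothetical end-to-end sketch of the five operations in the abstract; the article collection, classification rule, and speech conversion are stubbed out, since the abstract does not specify them, and all function and site names are assumptions.

from dataclasses import dataclass


@dataclass
class Article:
    title: str
    body: str
    category: str = "uncategorized"


def collect_articles(site: str) -> list[Article]:
    """Operation 1: collect preset online articles/content from a media site (stub)."""
    return [Article("Local news", f"Placeholder body fetched from {site}")]


def set_voice(subscriber_voice: str | None, preset_voice: str = "announcer") -> str:
    """Operation 2: use the subscriber's voice if provided, else a pre-stored voice."""
    return subscriber_voice or preset_voice


def classify(articles: list[Article]) -> list[Article]:
    """Operation 3: recognize and classify the collected content (keyword stub)."""
    for a in articles:
        a.category = "news" if "news" in a.title.lower() else "general"
    return articles


def convert_and_output(article: Article, voice: str) -> str:
    """Operations 4-5: convert the classified article to speech and output it
    in the selected voice (represented here as a descriptive string)."""
    return f"[{voice}] reads '{article.title}' ({article.category})"


if __name__ == "__main__":
    voice = set_voice(None)
    for art in classify(collect_articles("example-media-site")):
        print(convert_and_output(art, voice))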

    Systems and methods for generating a volume-based response for multiple voice-operated user devices

    Publication No.: US11481187B2

    Publication Date: 2022-10-25

    Application No.: US16738815

    Filing Date: 2020-01-09

    Applicant: Rovi Guides, Inc.

    Abstract: Systems and methods are provided herein for responding to a voice command at a volume level based on the volume level of the voice command. For example, a media guidance application may detect, through a first voice-operated user device of a plurality of voice-operated user devices, a voice command spoken by a user. The media guidance application may determine a first volume level of the voice command. Based on the first volume level of the voice command, the media guidance application may determine that a second voice-operated user device of the plurality of voice-operated user devices is closer to the user than any of the other voice-operated user devices. The media guidance application may generate an audible response, through the second voice-operated user device, at a second volume level that is set based on the first volume level of the voice command.
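
    An illustrative sketch of the volume-matching idea: the device that detected the command at the highest volume is treated as closest to the user, and the response volume is derived from the command's first volume level. The device names, decibel values, and the fixed-offset rule are assumptions, not the patented method.

from dataclasses import dataclass


@dataclass
class VoiceDevice:
    name: str
    heard_db: float  # volume at which this device detected the command


def choose_responder(devices: list[VoiceDevice]) -> VoiceDevice:
    """Pick the device that detected the command at the highest volume,
    i.e. the one presumed closest to the user."""
    return max(devices, key=lambda d: d.heard_db)


def response_volume(command_db: float, offset_db: float = 3.0) -> float:
    """Set the response volume relative to the first volume level of the command."""
    return command_db + offset_db


if __name__ == "__main__":
    devices = [
        VoiceDevice("kitchen_speaker", heard_db=52.0),
        VoiceDevice("living_room_speaker", heard_db=61.5),
        VoiceDevice("bedroom_speaker", heard_db=40.2),
    ]
    responder = choose_responder(devices)
    print(f"{responder.name} responds at {response_volume(61.5):.1f} dB")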