Abstract:
An electronic apparatus, a terminal apparatus, and a controlling method thereof are provided. The electronic apparatus includes an input interface; and a processor including a prosody module configured to extract an acoustic feature and a vocoder module configured to generate a speech waveform, wherein the processor is configured to: receive a text input using the input interface; identify a first acoustic feature from the text input using the prosody module, wherein the first acoustic feature corresponds to a first sampling rate; generate a modified acoustic feature corresponding to a modified sampling rate different from the first sampling rate, based on the identified first acoustic feature; and generate a plurality of vocoder learning models by training the vocoder module based on the first acoustic feature and the modified acoustic feature.
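The abstract does not disclose the model internals, but the core idea of deriving a second training target at a different sampling rate can be sketched minimally in Python. The mel-style feature shape and the linear resampler below are assumptions, and train_vocoder is a hypothetical stand-in for the vocoder module's training routine:

    import numpy as np

    def resample_feature(feature, src_rate, dst_rate):
        # Stretch/compress the frame axis so the feature matches a new sampling rate.
        n_src = feature.shape[0]
        n_dst = int(round(n_src * dst_rate / src_rate))
        src_t = np.linspace(0.0, 1.0, n_src)
        dst_t = np.linspace(0.0, 1.0, n_dst)
        # Linear interpolation per mel bin (an assumption; any resampler would do).
        return np.stack([np.interp(dst_t, src_t, feature[:, b])
                         for b in range(feature.shape[1])], axis=1)

    first_feature = np.random.rand(200, 80)                 # toy 80-bin mel feature at 22.05 kHz
    modified_feature = resample_feature(first_feature, 22050, 16000)
    # One vocoder learning model per feature variant, as the abstract describes:
    # models = [train_vocoder(first_feature), train_vocoder(modified_feature)]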
Abstract:
An electronic apparatus, based on a text sentence being input, obtains prosody information of the text sentence, segments the text sentence into a plurality of sentence elements, obtains speech in which the prosody information is reflected in each of the plurality of sentence elements in parallel by inputting the plurality of sentence elements and the prosody information of the text sentence to a text to speech (TTS) module, and merges the speech obtained in parallel for the plurality of sentence elements to output speech for the text sentence.
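A minimal Python sketch of the parallel element-wise synthesis, assuming a tts(element, prosody) callable (faked here) in place of the unspecified TTS module:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def tts(element, prosody):
        # Placeholder: a real TTS module would condition on the sentence-level prosody.
        return np.zeros(100 * len(element), dtype=np.float32)

    sentence = "Long input sentence. It is split into elements. Then merged."
    prosody = {"speaking_rate": 1.0}                         # sentence-level prosody info
    elements = [s.strip() + "." for s in sentence.split(".") if s.strip()]

    with ThreadPoolExecutor() as pool:                       # synthesize elements in parallel
        parts = list(pool.map(lambda e: tts(e, prosody), elements))
    speech = np.concatenate(parts)                           # merge into one waveform

Sharing the sentence-level prosody across elements is what keeps the independently synthesized pieces sounding like one utterance after merging.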
Abstract:
A method, performed by an electronic device, of generating a speech signal corresponding to at least one text is provided. The method includes: obtaining, based on the at least one text, feature information with respect to a first sample included in the speech signal; obtaining, based on the feature information, condition information related to a condition under which a bunching operation (in which one or more sample values included in the speech signal are obtained) is performed; configuring one or more bunching blocks for performing the bunching operation, based on the condition information; obtaining the one or more sample values based on the feature information with respect to the first sample by using the one or more bunching blocks; and generating the speech signal based on the obtained one or more sample values.
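A minimal Python sketch of the bunching idea, under the assumption that the condition information simply selects how many samples a block emits per step; every name below is illustrative, not the claimed design:

    import numpy as np

    def condition_from(feature):
        # Assumption: smoother (low-variance) features allow larger bunches.
        return 4 if feature.std() < 0.5 else 1

    def bunching_block(feature, n):
        # Placeholder block: a real model would predict the n samples jointly.
        return np.tanh(feature[:n])

    features = np.random.randn(32)             # feature info for the first sample
    n = condition_from(features)               # condition information -> bunch size
    samples = bunching_block(features, n)      # one or more sample values at once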
Abstract:
A speech synthesis method, performed by an electronic apparatus to synthesize speech from text, includes: obtaining text input to the electronic apparatus; obtaining a text representation by encoding the text using a text encoder of the electronic apparatus; obtaining an audio representation of a first audio frame set from an audio encoder of the electronic apparatus, based on the text representation; obtaining an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set; obtaining an audio feature of the second audio frame set by decoding the audio representation of the second audio frame set; and synthesizing speech based on an audio feature of the first audio frame set and the audio feature of the second audio frame set.
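A minimal Python sketch of the frame-set-wise flow, with toy linear maps standing in for the unspecified audio encoder and decoder networks:

    import numpy as np

    rng = np.random.default_rng(0)
    W_audio = rng.standard_normal((16, 16)) * 0.1            # toy "audio encoder"
    W_dec = rng.standard_normal((16, 16)) * 0.1              # toy "decoder"

    text_repr = rng.standard_normal(16)                      # encoded text
    audio_repr_1 = np.tanh(W_audio @ text_repr)              # first frame set: text only
    audio_repr_2 = np.tanh(W_audio @ (text_repr + audio_repr_1))  # second set also sees the first
    feat_1 = W_dec @ audio_repr_1                            # decoded audio features
    feat_2 = W_dec @ audio_repr_2
    speech = np.concatenate([feat_1, feat_2])                # synthesized from both frame sets

The key dependency is that the second frame set conditions on the first set's representation, so synthesis proceeds a frame set at a time rather than a sample at a time.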
Abstract:
A method, medium, and system for decoding and/or encoding multiple channels. Accordingly, down-mixed multiple channels can be decoded/up-mixed to a left channel and a right channel during a first stage, thereby enabling high-quality sound output even in scalable channel decoding.
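A minimal Python sketch of a first-stage stereo up-mix, assuming a mid/side-style down-mix; real scalable decoders (e.g., MPEG Surround) use parametric spatial cues rather than an explicit side channel:

    import numpy as np

    downmix = np.random.randn(1024)        # m = (L + R) / 2, toy signal
    side = np.random.randn(1024)           # s = (L - R) / 2, carried as side info
    left = downmix + side                  # the first stage recovers L and R early,
    right = downmix - side                 # so stereo output is already high quality
    # Later stages could up-mix further (center, surrounds) for full multi-channel.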
Abstract:
An electronic device may include a communication interface; a memory configured to store a first neural network model; and a processor configured to: receive, from an external electronic device via the communication interface, compressed information related to an acoustic feature obtained based on a text; decompress the compressed information to obtain decompressed information; and obtain sound information corresponding to the text by inputting the decompressed information into the first neural network model. The first neural network model may be obtained by training on a relationship between a plurality of sample acoustic features and a plurality of sample sounds corresponding to the plurality of sample acoustic features.
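A minimal Python sketch of the receive-decompress-synthesize path, using zlib as a stand-in codec; the actual compression scheme and vocoder_model are not specified by the abstract:

    import zlib
    import numpy as np

    def decompress(blob, shape, dtype=np.float32):
        return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

    feature = np.random.rand(100, 80).astype(np.float32)     # acoustic feature from text
    blob = zlib.compress(feature.tobytes())                  # what the sender transmits
    restored = decompress(blob, feature.shape)               # receiver side
    # sound = vocoder_model(restored)   # hypothetical first neural network model

Transmitting the compressed acoustic feature instead of raw audio is what lets the heavier vocoder model run on the receiving device.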
Abstract:
A controlling method of a wearable electronic apparatus includes: receiving, by an IMU sensor, a bone conduction signal corresponding to vibration in the user's face while the wearable electronic apparatus is operated in an ANC mode; identifying a presence or an absence of the user's voice based on the bone conduction signal; based on identifying the presence of the user's voice, controlling an operation mode of the wearable electronic apparatus to be a different operation mode from the ANC mode; while the wearable electronic apparatus is operated in the different operation mode, identifying the presence or absence of the user's voice based on the bone conduction signal; and based on the absence of the user's voice being identified for a predetermined time while the wearable electronic apparatus is operated in the different operation mode, controlling the different operation mode to return to the ANC mode.
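A minimal Python sketch of the mode switching as a small state machine, with voice_present standing in for bone-conduction voice detection and a frame count standing in for the predetermined time; all names are illustrative:

    ANC, TALK = "anc", "ambient"            # mode names are illustrative, not the claims'
    RETURN_AFTER = 3                        # "predetermined time", in frames here

    def step(mode, voice_present, silent_frames):
        if mode == ANC and voice_present:
            return TALK, 0                  # voice detected: leave ANC mode
        if mode == TALK:
            silent_frames = 0 if voice_present else silent_frames + 1
            if silent_frames >= RETURN_AFTER:
                return ANC, 0               # silence long enough: back to ANC
        return mode, silent_frames

    mode, silence = ANC, 0
    for voiced in [False, True, False, False, False, True]:
        mode, silence = step(mode, voiced, silence)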
Abstract:
Surround audio decoding for selectively generating an audio signal from a multi-channel signal is provided. In the surround audio decoding, a down-mixed signal, e.g., as down-mixed by an encoding terminal, is selectively up-mixed to a stereo signal or a multi-channel signal, by generating spatial information for the stereo signal from the spatial information used to up-mix the down-mixed signal to the multi-channel signal.
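A minimal Python sketch of deriving stereo-rendering gains directly from the multi-channel spatial information instead of decoding all channels first; the gain values and the mixing rule are illustrative only:

    import numpy as np

    downmix = np.random.randn(1024)                                  # mono down-mix from encoder
    mc_gains = {"L": 0.8, "R": 0.7, "C": 0.5, "Ls": 0.3, "Rs": 0.2}  # multi-channel spatial info

    # Fold the multi-channel cues into stereo cues (center split equally, surrounds to sides).
    g_left = mc_gains["L"] + 0.707 * mc_gains["C"] + mc_gains["Ls"]
    g_right = mc_gains["R"] + 0.707 * mc_gains["C"] + mc_gains["Rs"]

    stereo = np.stack([g_left * downmix, g_right * downmix])         # selective stereo output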
Abstract:
The disclosure relates to an electronic device and a control method thereof. The electronic device includes a memory and a processor configured to: obtain first feature data for estimating a waveform by inputting acoustic data of a first quality to a first encoder model; and obtain waveform data of a second quality, higher than the first quality, by inputting the first feature data to a decoder model.
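A minimal Python sketch with toy linear layers in place of the unspecified first encoder model and decoder model; the higher "second quality" is represented simply by the decoder emitting more samples than the input resolution:

    import numpy as np

    rng = np.random.default_rng(1)
    enc = rng.standard_normal((64, 80)) * 0.05           # toy "first encoder model"
    dec = rng.standard_normal((160, 64)) * 0.05          # toy "decoder model"

    low_q = rng.standard_normal(80)                      # acoustic data of the first quality
    feature = np.tanh(enc @ low_q)                       # first feature data
    waveform = dec @ feature                             # second (higher) quality output:
    # the decoder emits 160 samples from an 80-sample-resolution input.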