TEXT-BASED SPEECH GENERATION
    Invention Publication

    Publication No.: US20240233706A1

    Publication Date: 2024-07-11

    Application No.: US18562962

    Filing Date: 2022-05-23

    Abstract: According to implementations of the subject matter described herein, a solution for text-to-speech is proposed. In this solution, an initial phoneme sequence corresponding to text is generated, the initial phoneme sequence comprising feature representations of a plurality of phonemes. A first phoneme sequence is generated by inserting a feature representation of an additional phoneme into the initial phoneme sequence, the additional phoneme being related to a characteristic of spontaneous speech. The duration of each phoneme among the plurality of phonemes and the additional phoneme is determined by using an expert model corresponding to that phoneme, and a second phoneme sequence is generated based on the first phoneme sequence. Spontaneous-style speech corresponding to the text is determined based on the second phoneme sequence. In this way, spontaneous-style speech with more varied rhythm can be generated based on spontaneous-style additional phonemes and multiple expert models.
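
    The two steps of this abstract, inserting a spontaneous-style phoneme representation and predicting each phoneme's duration with its own expert model, can be sketched roughly as below. The function names, the dict-of-callables "expert models", and the toy feature dimensions are all illustrative, not taken from the patent:

```python
import numpy as np

def insert_spontaneous_phoneme(phoneme_seq, position, filler_repr):
    # Insert the feature representation of an additional,
    # spontaneous-style phoneme (e.g. a filled pause) into the
    # initial phoneme sequence to form the first phoneme sequence.
    return np.insert(phoneme_seq, position, filler_repr, axis=0)

def predict_durations(phoneme_ids, phoneme_seq, experts):
    # Each phoneme's duration is produced by the expert model that
    # corresponds to that phoneme; a plain dict of phoneme id ->
    # callable stands in for the learned experts here.
    return [experts[pid](feat) for pid, feat in zip(phoneme_ids, phoneme_seq)]

# Toy run: three phonemes with 4-dimensional feature representations.
initial = np.ones((3, 4))
first = insert_spontaneous_phoneme(initial, 1, np.zeros(4))
experts = {0: lambda f: 3.0, 1: lambda f: 5.0}
durations = predict_durations([0, 1, 0, 1], first, experts)
```

    In this sketch the per-phoneme expert lookup is the key idea: duration modeling is not one shared predictor but a routing of each phoneme to its own model.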

    Wireless communication device using voice recognition and voice synthesis

    Publication No.: US11942072B2

    Publication Date: 2024-03-26

    Application No.: US17439197

    Filing Date: 2021-02-03

    Applicant: Sang Rae Park

    Inventor: Sang Rae Park

    Abstract: Disclosed is a wireless communication device including a voice recognition portion configured to convert a voice signal input through a microphone into a syllable information stream using voice recognition, an encoding portion configured to encode the syllable information stream to generate digital transmission data, a transmission portion configured to modulate the digital transmission data into a transmission signal and transmit the transmission signal through an antenna, a reception portion configured to demodulate a reception signal received through the antenna into digital reception data and output the digital reception data, a decoding portion configured to decode the digital reception data to generate the syllable information stream, and a voice synthesis portion configured to convert the syllable information stream into the voice signal using voice synthesis and output the voice signal through a speaker.
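
    The encode/decode legs of the pipeline above round-trip a syllable information stream through digital data. A minimal sketch, with UTF-8 bytes standing in for whatever transmission encoding the device actually uses (the patent does not specify one):

```python
def encode_syllables(syllable_stream):
    # Encode the syllable information stream into digital
    # transmission data; a space-joined UTF-8 byte string is an
    # illustrative stand-in for the device's real encoding.
    return " ".join(syllable_stream).encode("utf-8")

def decode_syllables(digital_data):
    # Decode digital reception data back into a syllable stream.
    return digital_data.decode("utf-8").split(" ")

sent = encode_syllables(["an", "nyeong"])
received = decode_syllables(sent)
```

    Transmitting syllables rather than sampled audio is what lets the device operate at a much lower bit rate than conventional voice links.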

    PAUSE ESTIMATION MODEL LEARNING APPARATUS, PAUSE ESTIMATION APPARATUS, METHODS AND PROGRAMS FOR THE SAME

    Publication No.: US20230005468A1

    Publication Date: 2023-01-05

    Application No.: US17779518

    Filing Date: 2019-11-26

    Abstract: A pause estimation model learning apparatus includes: a morphological analysis unit configured to perform morphological analysis on training text data to provide M types of information, M being an integer that is equal to or larger than 2; a feature selection unit configured to combine N pieces of information, among the M pieces of information, into an input feature when a predetermined certain condition is satisfied, and to select a predetermined one of the N pieces of information as the input feature when the certain condition is not satisfied, N being an integer that is equal to or larger than 2 and equal to or smaller than M; and a learning unit configured to learn a pause estimation model by using the input feature selected by the feature selection unit and a pause correct label.
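
    The feature selection unit's conditional logic can be sketched as below. The string concatenation, the feature names, and the default index are all hypothetical; the patent does not state what the "certain condition" is or how features are combined:

```python
def select_input_feature(info, condition_met, n=2, default_index=0):
    # info: the M pieces of information from morphological analysis.
    # When the condition holds, combine N of them into a single
    # input feature; otherwise fall back to one predetermined piece.
    if condition_met:
        return "+".join(info[:n])
    return info[default_index]

# Hypothetical morphological outputs per token.
features = ["surface", "pos", "reading"]
combined = select_input_feature(features, True)
single = select_input_feature(features, False)
```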

    Training method and apparatus for a speech synthesis model, and storage medium

    Publication No.: US11488577B2

    Publication Date: 2022-11-01

    Application No.: US16907006

    Filing Date: 2020-06-19

    Abstract: The present application discloses a training method and an apparatus for a speech synthesis model, an electronic device, and a storage medium. The method includes: taking a syllable input sequence, a phoneme input sequence and a Chinese character input sequence of a current sample as inputs of an encoder of a model to be trained, to obtain encoded representations of these three sequences at an output end of the encoder; fusing the encoded representations of these three sequences, to obtain a weighted combination of these three sequences; taking the weighted combination as an input of an attention module, to obtain a weighted average of the weighted combination at each moment at an output end of the attention module; taking the weighted average as an input of a decoder of the model to be trained, to obtain a speech Mel spectrum of the current sample at an output end of the decoder.
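
    The fusion step, forming a weighted combination of the three encoded sequences, can be sketched as follows. The scalar weights here are hypothetical stand-ins for whatever learned fusion the model uses, and the shapes are toy values:

```python
import numpy as np

def fuse_encodings(syl_enc, pho_enc, char_enc, weights):
    # Weighted combination of the syllable, phoneme, and Chinese
    # character encodings; the weights are normalized so that they
    # sum to one. All three encodings share shape (time, hidden).
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w[0] * syl_enc + w[1] * pho_enc + w[2] * char_enc

# Toy run: three encodings of shape (time, hidden) = (2, 3).
syl = np.full((2, 3), 1.0)
pho = np.full((2, 3), 2.0)
chars = np.full((2, 3), 3.0)
fused = fuse_encodings(syl, pho, chars, [1.0, 1.0, 2.0])
```

    The fused tensor then feeds the attention module, which produces the per-step weighted averages consumed by the decoder.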

    Speech processing device, speech processing method, and computer program product using compensation parameters

    Publication No.: US11348569B2

    Publication Date: 2022-05-31

    Application No.: US16841839

    Filing Date: 2020-04-07

    Abstract: A speech processing device includes a hardware processor configured to receive input speech and extract speech frames from the input speech. The hardware processor is configured to calculate a spectrum parameter for each of the speech frames, calculate a first phase spectrum for each of the speech frames, calculate a group delay spectrum from the first phase spectrum based on a frequency component of the first phase spectrum, calculate a band group delay parameter in a predetermined frequency band from the group delay spectrum, and calculate a band group delay compensation parameter to compensate a difference between a second phase spectrum reconstructed from the band group delay parameter and the first phase spectrum. The hardware processor is configured to generate a speech waveform based on the spectrum parameter, the band group delay parameter, and the band group delay compensation parameter.
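
    The core quantity in this abstract, the group delay spectrum, is the negative derivative of the (unwrapped) phase spectrum with respect to frequency. A minimal sketch using a finite difference in place of the derivative (the patent's actual computation is not specified here):

```python
import numpy as np

def group_delay_spectrum(phase_spectrum, delta_omega=1.0):
    # Unwrap the phase to remove 2*pi jumps, then take the negative
    # finite difference with respect to frequency as an estimate of
    # the group delay.
    unwrapped = np.unwrap(phase_spectrum)
    return -np.diff(unwrapped) / delta_omega

# A linear phase of -2*omega corresponds to a constant group delay of 2.
phase = -2.0 * np.arange(8)
gd = group_delay_spectrum(phase)
```

    The band group delay parameter of the abstract would then be derived from such a spectrum restricted to a predetermined frequency band, with the compensation parameter absorbing the phase reconstruction error.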

    Learnable speed control for speech synthesis

    Publication No.: US11302301B2

    Publication Date: 2022-04-12

    Application No.: US16807801

    Filing Date: 2020-03-03

    Inventors: Chengzhu Yu, Dong Yu

    Abstract: A method, computer program, and computer system are provided for synthesizing speech at one or more speeds. A context associated with one or more phonemes corresponding to a speaking voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a voice sample corresponding to the speaking voice is synthesized using the generated mel-spectrogram features.
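
    The speed control implied by this alignment step can be illustrated with a fixed-duration stand-in: scaling each phoneme's frame count by a speed factor changes how many target acoustic frames it occupies. The patent learns this alignment from the encoded context; everything below is illustrative:

```python
def align_phonemes_to_frames(durations, speed=1.0):
    # Map each phoneme index onto its target acoustic frames,
    # scaling the per-phoneme frame count by a speed factor
    # (higher speed -> fewer frames, hence faster speech).
    frames = []
    for idx, dur in enumerate(durations):
        n_frames = max(1, round(dur / speed))
        frames.extend([idx] * n_frames)
    return frames

normal = align_phonemes_to_frames([2, 4], speed=1.0)
fast = align_phonemes_to_frames([2, 4], speed=2.0)
```

    Mel-spectrogram frames would then be generated recursively over this frame-level alignment, so a single trained model can synthesize the same text at different speeds.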