Enhancing hybrid self-attention structure with relative-position-aware bias for speech synthesis

    Publication number: US11011154B2

    Publication date: 2021-05-18

    Application number: US16271154

    Application date: 2019-02-08

    Abstract: A method of performing speech synthesis includes encoding character embeddings using any one or any combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), applying a relative-position-aware self-attention function to each of the character embeddings and an input mel-scale spectrogram, and encoding the character embeddings to which the relative-position-aware self-attention function is applied. The method further includes concatenating the encoded character embeddings and the encoded character embeddings to which the relative-position-aware self-attention function is applied, to generate an encoder output; applying a multi-head attention function to the encoder output and the input mel-scale spectrogram to which the relative-position-aware self-attention function is applied; and predicting an output mel-scale spectrogram based on the encoder output and the input mel-scale spectrogram to which the multi-head attention function is applied.
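The relative-position-aware self-attention this abstract refers to can be sketched as below. This is a minimal single-head NumPy illustration, not the patent's implementation: the offset bucketing, `max_dist`, and the identity query/key/value projections are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_position_bias(seq_len, max_dist=4):
    # Compute pairwise offsets i - j, clip them to [-max_dist, max_dist],
    # and look each clipped offset up in a table of scalar biases
    # (randomly initialized here; learned in a real model).
    offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    offsets = np.clip(offsets, -max_dist, max_dist) + max_dist
    table = np.random.randn(2 * max_dist + 1) * 0.02
    return table[offsets]                              # (seq_len, seq_len)

def self_attention_with_rel_bias(x, max_dist=4):
    # x: (seq_len, d_model). Single head; identity projections for brevity.
    seq_len, d = x.shape
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d)                      # content term
    scores += relative_position_bias(seq_len, max_dist)  # position-aware bias
    return softmax(scores) @ v                         # (seq_len, d_model)
```

The position bias is added to the attention logits before the softmax, so relative distance influences how much each embedding attends to its neighbors regardless of absolute position.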

    Learning singing from speech
    Granted Patent

    Publication number: US11430431B2

    Publication date: 2022-08-30

    Application number: US16783807

    Application date: 2020-02-06

    Abstract: A method, computer program, and computer system are provided for converting a singing voice of a first person, associated with a first speaker, to a singing voice of a second person, using a speaking voice of the second person, associated with a second speaker. A context associated with one or more phonemes corresponding to the singing voice of the first person is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes, the target acoustic frames, and a sample of the speaking voice of the second person. A sample corresponding to the singing voice of the first person is converted to a sample corresponding to the singing voice of the second person using the generated mel-spectrogram features.
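The alignment and recursive mel-spectrogram generation steps can be sketched roughly as follows. The nearest-neighbour expansion and the toy autoregressive recurrence are stand-ins for the patent's attention-based alignment and decoder; all function names, shapes, and the recurrence itself are assumptions.

```python
import numpy as np

def align_phonemes_to_frames(phoneme_ctx, n_frames):
    # Expand the encoded phoneme context (n_phonemes, d) to the target
    # frame rate by nearest-neighbour indexing -- a simple stand-in for
    # the attention-based alignment described in the abstract.
    n = len(phoneme_ctx)
    idx = np.minimum((np.arange(n_frames) * n) // n_frames, n - 1)
    return phoneme_ctx[idx]                            # (n_frames, d)

def generate_mels(aligned, speaker_sample, n_mels=80):
    # Recursive (autoregressive) sketch: each mel frame depends on the
    # aligned phoneme features, a speaking-voice sample of the target
    # speaker, and the previously generated frame.
    prev = np.zeros(n_mels)
    frames = []
    for feat in aligned:
        prev = np.tanh(feat.mean() + speaker_sample.mean() + 0.5 * prev)
        frames.append(prev)
    return np.stack(frames)                            # (n_frames, n_mels)
```

The key idea the sketch preserves is the data flow: phoneme context is stretched to frame rate first, and the target speaker's speaking voice conditions every generated frame, so singing style comes from the source while timbre comes from the target.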

    Singing voice conversion
    Granted Patent

    Publication number: US11721318B2

    Publication date: 2023-08-08

    Application number: US17501182

    Application date: 2021-10-14

    Abstract: A method, computer program, and computer system are provided for converting a first singing voice, associated with a first speaker, to a second singing voice, associated with a second speaker. A context associated with one or more phonemes corresponding to the first singing voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a sample corresponding to the first singing voice is converted to a sample corresponding to the second singing voice using the generated mel-spectrogram features.

    Duration informed attention network for text-to-speech analysis

    Publication number: US11468879B2

    Publication date: 2022-10-11

    Application number: US16397349

    Application date: 2019-04-29

    Abstract: A method and apparatus include receiving a text input that includes a sequence of text components. Respective temporal durations of the text components are determined using a duration model. A first set of spectra is generated based on the sequence of text components. A second set of spectra is generated based on the first set of spectra and the respective temporal durations of the sequence of text components. A spectrogram frame is generated based on the second set of spectra. An audio waveform is generated based on the spectrogram frame. The audio waveform is provided as an output.
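The duration-informed step, turning the first (per-component) set of spectra into the second (frame-rate) set, amounts to repeating each component's spectrum for its predicted duration. The sketch below illustrates that expansion; `expand_by_duration` and the example durations are illustrative assumptions, not the patent's terms.

```python
import numpy as np

def expand_by_duration(component_spectra, durations):
    # component_spectra: (n_components, d) -- the "first set of spectra".
    # durations: predicted frame count per text component, from the
    # duration model. Repeating each row for its duration yields the
    # frame-rate "second set of spectra".
    return np.repeat(component_spectra, durations, axis=0)
```

Because durations are predicted up front, the decoder never has to learn alignment through attention at synthesis time; the text-to-frame mapping is fixed before spectrogram frames are generated.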

    Singing voice conversion
    Granted Patent

    Publication number: US11183168B2

    Publication date: 2021-11-23

    Application number: US16789674

    Application date: 2020-02-13

    Abstract: A method, computer program, and computer system are provided for converting a first singing voice, associated with a first speaker, to a second singing voice, associated with a second speaker. A context associated with one or more phonemes corresponding to the first singing voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a sample corresponding to the first singing voice is converted to a sample corresponding to the second singing voice using the generated mel-spectrogram features.

    LEARNING SINGING FROM SPEECH
    Patent Application

    Publication number: US20220343904A1

    Publication date: 2022-10-27

    Application number: US17861716

    Application date: 2022-07-11

    Abstract: A method, computer program, and computer system are provided for converting a singing voice of a first person, associated with a first speaker, to a singing voice of a second person, using a speaking voice of the second person, associated with a second speaker. A context associated with one or more phonemes corresponding to the singing voice of the first person is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes, the target acoustic frames, and a sample of the speaking voice of the second person. A sample corresponding to the singing voice of the first person is converted to a sample corresponding to the singing voice of the second person using the generated mel-spectrogram features.