Publication No.: US20230169953A1
Publication Date: 2023-06-01
Application No.: US17919982
Application Date: 2021-03-19
Applicant: Microsoft Technology Licensing, LLC
Inventor: Ran Zhang , Jian LUAN , Yahuan Cong
Abstract: The present disclosure provides methods and apparatuses for phrase-based end-to-end text-to-speech (TTS) synthesis.
A text may be obtained. A target phrase in the text may be identified. A phrase context of the target phrase may be determined. An acoustic feature corresponding to the target phrase may be generated based at least on the target phrase and the phrase context. A speech waveform corresponding to the target phrase may be generated based on the acoustic feature.
-
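The pipeline this abstract outlines (text → target phrase → phrase context → acoustic feature → waveform) can be sketched as a minimal toy in Python. Every function below is an illustrative stand-in, not the patent's implementation: a real system would use a learned phrase detector, an acoustic model emitting mel-spectrogram frames, and a neural vocoder.

```python
# Toy sketch of the phrase-based TTS flow; all helpers are hypothetical.

def identify_target_phrase(text: str) -> str:
    # Stand-in heuristic: pick the longest token as the target phrase.
    return max(text.split(), key=len)

def phrase_context(text: str, phrase: str) -> str:
    # Context here is simply the rest of the text around the phrase.
    return text.replace(phrase, "").strip()

def acoustic_feature(phrase: str, context: str) -> list:
    # Stand-in "feature": lengths, where a real model would emit
    # spectrogram frames conditioned on phrase + context.
    return [float(len(phrase)), float(len(context))]

def waveform(feature: list) -> list:
    # A vocoder would turn features into samples; here we just scale.
    return [f / 10.0 for f in feature]

text = "synthesize this long utterance"
phrase = identify_target_phrase(text)
audio = waveform(acoustic_feature(phrase, phrase_context(text, phrase)))
```

The point of the structure is that the target phrase and its context are separate inputs to the acoustic stage, matching the claim language.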
Publication No.: US20220122580A1
Publication Date: 2022-04-21
Application No.: US17561895
Application Date: 2021-12-24
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: Pei ZHAO , Kaisheng YAO , Max LEUNG , Bo YAN , Jian LUAN , Yu SHI , Malone MA , Mei-Yuh HWANG
Abstract: An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words. The instructions cause the processor to generate text results based on the speech input and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
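The key idea in this abstract is that the intent model consumes both the recognized text and acoustic-feature annotations, so prosody can change the interpretation. A minimal rule-based sketch (the rules stand in for the learned intent model; field names like `rising_pitch` are assumptions):

```python
# Hypothetical fusion of ASR text with acoustic annotations for intent.

def recognize_intent(text: str, annotations: dict) -> str:
    # An acoustic cue such as rising pitch can override a textual read:
    # the same words may be a statement or a question depending on prosody.
    if annotations.get("rising_pitch"):
        return "question"
    if "please" in text.lower():
        return "request"
    return "statement"

a = recognize_intent("play some music please", {"rising_pitch": False})
b = recognize_intent("you finished it", {"rising_pitch": True})
```

Here `b` comes out as a question purely from the acoustic annotation, which is exactly the behavior a text-only model would miss.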
-
Publication No.: US20220059122A1
Publication Date: 2022-02-24
Application No.: US17432476
Application Date: 2020-02-03
Applicant: Microsoft Technology Licensing, LLC
Abstract: A method for providing emotion management assistance is provided. Sound streams may be received. A speech conversation between a user and at least one conversation object may be detected from the sound streams. The identity of the conversation object may be identified at least according to speech of the conversation object in the speech conversation. An emotion state of at least one speech segment of the user in the speech conversation may be determined. An emotion record corresponding to the speech conversation may be generated, wherein the emotion record includes at least the identity of the conversation object, at least a portion of the content of the speech conversation, and the emotion state of the at least one speech segment of the user.
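The emotion record the abstract enumerates has three required parts: the conversation object's identity, some conversation content, and the user's per-segment emotion states. A minimal sketch of assembling such a record (the field names and flat-dict layout are assumptions, not the patent's schema):

```python
# Hypothetical emotion-record assembly; the upstream identification and
# emotion-classification steps are represented only by their outputs.

def build_emotion_record(object_identity, content, segment_emotions):
    # One record per detected conversation, as the abstract describes.
    return {
        "object_identity": object_identity,      # who the user spoke with
        "content": content,                      # portion of the conversation
        "segment_emotions": segment_emotions,    # user's state per segment
    }

record = build_emotion_record("Alice", "How was your day?", ["calm", "happy"])
```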
-
Publication No.: US20200058289A1
Publication Date: 2020-02-20
Application No.: US16342416
Application Date: 2016-11-21
Applicant: Microsoft Technology Licensing, LLC
Inventor: Henry GABRYJELSKI , Jian LUAN , Dapeng Li
Abstract: An automatic dubbing method is disclosed. The method comprises: extracting speeches of a voice from an audio portion of a media content (504); obtaining a voice print model for the extracted speeches of the voice (506); processing the extracted speeches by utilizing the voice print model to generate replacement speeches (508); and replacing the extracted speeches of the voice with the generated replacement speeches in the audio portion of the media content (510).
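The four steps (504)–(510) form a simple pipeline: extract speech segments, fit a voice-print model on them, synthesize replacements, and splice the replacements back into the audio. A toy sketch under obvious simplifications (segments are tagged strings, the voice print is a label rather than a speaker embedding):

```python
# Hypothetical dubbing flow; a real system operates on audio samples.

def extract_speeches(audio):
    # Pretend speech segments are the entries tagged "speech:".  (504)
    return [seg for seg in audio if seg.startswith("speech:")]

def voice_print(speeches):
    # A real voice print is a speaker model fit on the segments.  (506)
    return "voiceprint(%d segments)" % len(speeches)

def replace_speeches(audio, speeches, model):
    # Re-render each extracted segment with the modeled voice and
    # splice it back in place of the original.  (508)-(510)
    return [seg if seg not in speeches else "dubbed:" + seg[len("speech:"):]
            for seg in audio]

audio = ["music", "speech:hello", "speech:goodbye"]
sp = extract_speeches(audio)
out = replace_speeches(audio, sp, voice_print(sp))
```

Non-speech content ("music") passes through untouched, which is the property that makes this dubbing rather than full re-synthesis.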
-
Publication No.: US20200035209A1
Publication Date: 2020-01-30
Application No.: US16500995
Application Date: 2018-04-18
Applicant: MICROSOFT TECHNOLOGY LICENSING LLC
Inventor: Jian LUAN , Qinying LIAO , Zhen LIU , Nan YANG , Furu WEI
Abstract: In accordance with implementations of the subject matter described herein, there is provided a solution for supporting a machine to automatically generate a song. In this solution, an input from a user is used to determine a creation intention of the user with respect to a song to be generated. Lyrics of the song are generated based on the creation intention. Then, a template for the song is generated based at least in part on the lyrics. The template indicates a melody matching with the lyrics. In this way, it is feasible to automatically create the melody and lyrics which not only conform to the creation intention of the user but also match with each other.
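The two-stage flow here is: creation intention → lyrics, then lyrics → melody template, with the template constrained to match the lyrics. A minimal sketch (the keyword-to-line rule and the notes-per-line pairing are invented simplifications of "matching"):

```python
# Hypothetical song-generation stages from the abstract.

def generate_lyrics(intention):
    # Stand-in: one short lyric line per keyword in the intention.
    return ["a song about " + w for w in intention.split()]

def generate_template(lyrics):
    # The template pairs each lyric line with a note count equal to its
    # word count, so melody and lyrics stay aligned line by line.
    return {"lines": len(lyrics),
            "notes_per_line": [len(line.split()) for line in lyrics]}

lyrics = generate_lyrics("spring rain")
template = generate_template(lyrics)
```

The ordering matters: the template is derived from the lyrics, not generated independently, which is how the abstract guarantees the melody "matches" the words.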
-
Publication No.: US20230206899A1
Publication Date: 2023-06-29
Application No.: US17926994
Application Date: 2021-04-22
Applicant: Microsoft Technology Licensing, LLC
Inventor: Ran Zhang , Jian LUAN , Yahuan Cong
CPC classification number: G10L13/10 , G10L13/04 , G10L2013/105
Abstract: The present disclosure provides methods and apparatuses for spontaneous text-to-speech (TTS) synthesis. A target text may be obtained. A fluency reference factor may be determined based at least on the target text. An acoustic feature corresponding to the target text may be generated with the fluency reference factor. A speech waveform corresponding to the target text may be generated based on the acoustic feature.
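The distinctive step in this abstract is that acoustic-feature generation is conditioned on a fluency reference factor derived from the text. A toy sketch in which the factor is a made-up pause-insertion rate (both rules below are assumptions for illustration):

```python
# Hypothetical fluency-conditioned feature generation.

def fluency_factor(text):
    # Stand-in rule: longer sentences get a higher disfluency factor.
    return min(1.0, len(text.split()) / 20.0)

def acoustic_frames(text, factor):
    # Each word contributes base frames, plus extra frames for pauses
    # and hesitations scaled by the fluency factor, so the same text
    # can be rendered more or less "spontaneously".
    words = len(text.split())
    return words * 10 + int(words * factor * 5)

text = "well I was thinking maybe we could go out later"
f = fluency_factor(text)
n = acoustic_frames(text, f)
```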
-
Publication No.: US20230076258A1
Publication Date: 2023-03-09
Application No.: US17985016
Application Date: 2022-11-10
Applicant: Microsoft Technology Licensing, LLC
Inventor: Henry GABRYJELSKI , Jian LUAN , Dapeng LI
Abstract: A method and system for automatic dubbing is disclosed. Responsive to receiving a selection of media content for playback on a user device, the method processes extracted speeches of a first voice from the media content to generate replacement speeches using a set of phonemes of a second voice of the user of the user device, and replaces the extracted speeches of the first voice with the generated replacement speeches in the audio portion of the media content for playback on the user device.
-
Publication No.: US20210225357A1
Publication Date: 2021-07-22
Application No.: US16309399
Application Date: 2017-06-07
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: Pei ZHAO , Kaisheng YAO , Max LEUNG , Bo YAN , Jian LUAN , Yu SHI , Malone MA , Mei-Yuh HWANG
Abstract: An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words. The instructions cause the processor to generate text results based on the speech input and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
-
Publication No.: US20210082396A1
Publication Date: 2021-03-18
Application No.: US17050153
Application Date: 2019-05-13
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: Jian LUAN , Shihui LIU
IPC: G10L13/10 , G10L13/047
Abstract: The present disclosure provides a technical solution for highly empathetic TTS processing, which not only takes semantic and linguistic features into consideration but also assigns a sentence ID to each sentence in a training text to distinguish the sentences. These sentence IDs may be introduced as training features when training a machine learning model, enabling the model to learn how the acoustic codes of sentences change with sentence context. Performing TTS with the trained model can output speech whose rhythm and tone change naturally, making the TTS more empathetic. A highly empathetic audio book may be generated using the TTS processing provided herein, and an online system for generating such audio books may be established with this TTS processing as its core technology.
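The core mechanism here is simple to illustrate: every sentence of the training text gets an ID that becomes an extra training feature, so acoustic variation can be tied to where a sentence sits in its context. A minimal sketch (the feature layout, including the normalized `position` field, is an assumption):

```python
# Hypothetical per-sentence training features with sentence IDs.

def sentence_features(text):
    # Split the training text into sentences and attach an ID plus a
    # normalized position, so the model can condition acoustic codes
    # on where each sentence falls in the passage.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    last = max(1, len(sentences) - 1)
    return [{"sentence_id": i, "text": s, "position": i / last}
            for i, s in enumerate(sentences)]

feats = sentence_features("First line. Second line. Third line.")
```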
-