Abstract:
A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
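The consistency term described above can be instantiated in many ways; the abstract does not fix a formula. A minimal sketch, assuming a symmetric KL divergence between the recognizer's output distributions for the native and cross-lingual syntheses (all names here are illustrative, not from the patent):

```python
import numpy as np

def softmax(logits):
    """Convert a vector of logits to a probability distribution."""
    e = np.exp(np.asarray(logits, dtype=float) - np.max(logits))
    return e / e.sum()

def consistency_loss(logits_native, logits_cross):
    """Symmetric KL divergence between the speech recognizer's output
    distributions for the native and cross-lingual synthesized speech.
    One common way to build a consistency loss term; the patent does
    not specify this particular form."""
    p = softmax(logits_native)
    q = softmax(logits_cross)
    kl_pq = float(np.sum(p * np.log(p / q)))
    kl_qp = float(np.sum(q * np.log(q / p)))
    return 0.5 * (kl_pq + kl_qp)
```

Identical predictions for the two syntheses yield a zero loss, while divergent predictions produce a positive penalty that gradient updates can reduce.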
Abstract:
A natural language processing system may use system response configuration data to determine customized output data forms when outputting data for a user. The system response configuration data may represent various output attributes the system may use when creating output data. The system may maintain a number of predefined profiles, where each profile is associated with particular settings for the system response configuration data/attributes. The system may also use various data such as context data, sentiment data, or the like to customize system response configuration data during a dialog. Other components, such as natural language generation (NLG), text-to-speech (TTS), or the like, may use the customized system response configuration data to determine the form, timing, etc. of output data to be presented to a user.
Abstract:
The method for generating captions, subtitles and dubbing for audiovisual media uses a machine learning-based approach for automatically generating captions from the audio portion of audiovisual media, and further translates the captions to produce both subtitles and dubbing. A speech component of an audio portion of audiovisual media is converted into at least one text string which includes at least one word. Temporal start and end points for the at least one word are determined, and the at least one word is visually inserted into the video portion of the audiovisual media. The temporal start and end points for the at least one word are synchronized with corresponding temporal start and end points of the speech component of the audio portion of the audiovisual media. A latency period may be selectively inserted into the broadcast of the audiovisual media such that the synchronization may be selectively adjusted during the latency period.
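The adjustable-latency synchronization above can be sketched as a timestamp shift bounded by the available latency budget. A minimal illustration with hypothetical helper names (not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class TimedWord:
    """A caption word with temporal start and end points, in seconds
    from the start of the media."""
    text: str
    start: float
    end: float

def resynchronize(words, offset, latency):
    """Shift caption timings by `offset` seconds, clamped to the
    selectively inserted latency budget, so captions can be realigned
    with the speech component during broadcast."""
    offset = max(-latency, min(latency, offset))
    return [TimedWord(w.text, w.start + offset, w.end + offset)
            for w in words]
```

A correction within the budget is applied as-is; a larger requested shift is clamped so playback never outruns the inserted latency period.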
Abstract:
A natural language processing system may select a synthesized speech quality using user profile data. The system may receive a natural language input and determine responsive output data. The system may, based at least in part on user profile data associated with the input, determine response configuration data corresponding to a quality of synthesized speech. The system may then determine further output data for presentation using the responsive output data and response configuration data.
Abstract:
A speech translation method using a multilingual text-to-speech synthesis model includes receiving input speech data of a first language and an articulatory feature of a speaker regarding the first language, converting the input speech data of the first language into a text of the first language, converting the text of the first language into a text of a second language, and generating output speech data for the text of the second language that simulates the speaker's speech by inputting the text of the second language and the articulatory feature of the speaker to a single artificial neural network text-to-speech synthesis model.
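The pipeline above is a three-stage composition: recognition, translation, then voice-preserving synthesis. A minimal wiring sketch, where the `asr`, `mt`, and `tts` callables stand in for the actual models (all names are assumptions for illustration):

```python
def translate_speech(audio, speaker_feature, asr, mt, tts):
    """Wire the described pipeline: recognize first-language speech,
    translate the resulting text into the second language, then
    synthesize second-language speech conditioned on the speaker's
    articulatory feature so the output simulates the speaker's voice."""
    source_text = asr(audio)                  # first-language text
    target_text = mt(source_text)             # second-language text
    return tts(target_text, speaker_feature)  # voice-preserving speech
```

The key point the abstract makes is that the final stage takes both the translated text and the speaker's articulatory feature as inputs to a single TTS model, rather than using a generic target-language voice.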
Abstract:
A method for synthesizing speech from a textual input includes receiving the textual input, the textual input including native words in a native language and foreign words in a foreign language, and processing the textual input to determine a phonetic representation of the textual input. The processing includes determining a native phonetic representation of the native words, and determining a nativized phonetic representation of the foreign words. Determining the nativized phonetic representation includes forming a foreign phonetic representation of the foreign words using a foreign phoneme set, and mapping the foreign phonetic representation to the nativized phonetic representation according to a model of a native speaker's pronunciation of foreign words.
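The final mapping step can be pictured as a substitution from foreign phonemes onto the native phoneme set. A minimal sketch assuming a hand-written lookup table; in practice such a mapping would be learned from data on how native speakers actually pronounce foreign words:

```python
# Hypothetical foreign-to-native phoneme mapping, for illustration only.
FOREIGN_TO_NATIVE = {
    "R": "r",   # e.g. a uvular /R/ rendered as a native /r/
    "y": "i",   # a front rounded vowel a native speaker may unround
}

def nativize(foreign_phonemes):
    """Map a foreign phonetic representation onto the native phoneme
    set, leaving phonemes shared by both languages untouched."""
    return [FOREIGN_TO_NATIVE.get(p, p) for p in foreign_phonemes]
```

Phonemes absent from the table pass through unchanged, so only the sounds that do not exist in the native inventory are replaced.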
Abstract:
The computer-implemented method provides for a digital virtual assistant (DVA) receiving input spoken in a first language by a user. The DVA determines a context of a current situation based on the language and identity of individuals within a proximity of the DVA. The DVA determines whether the context of the current situation includes providing a response using a second language. In response to determining that the context of the current situation calls for providing the response in the second language, the DVA determines the second language based on the context and responds to the input spoken in the first language by the user. The response includes a dynamic selection of the second language and is based on an interaction context of the user and the DVA, with reference to a corpus of interaction-context usage of the second language in historically similar situations.
Abstract:
An animation display system is provided. The animation display system includes a display; a storage configured to store a language model database, a phonetic-symbol lip-motion matching database and a lip motion synthesis database; and a processor electronically connected to the storage and the display. The processor includes a speech conversion module, a phonetic-symbol lip-motion matching module, and a lip motion synthesis module. A lip animation display method is also provided.
Abstract:
An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
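The "pronunciation sequence difference" between the speech-derived and text-derived sequences can be quantified in several ways; a conventional choice is the Levenshtein distance over phonemes. A minimal sketch under that assumption (the abstract does not name a specific metric):

```python
def pronunciation_difference(seq_a, seq_b):
    """Levenshtein (edit) distance between two phoneme sequences:
    the minimum number of insertions, deletions, and substitutions
    needed to turn seq_a into seq_b."""
    m, n = len(seq_a), len(seq_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```

A conversion model trained against this difference would learn to map text-derived pronunciation sequences toward what the speech input actually exhibits.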
Abstract:
A system and method configured for use in a text-to-speech (TTS) system are provided. Embodiments may include identifying, using one or more processors, a word or phrase as a named entity and identifying a language of origin associated with the named entity. Embodiments may further include transliterating the named entity to a script associated with the language of origin. If the TTS system is operating in the language of origin, embodiments may include passing the transliterated script to the TTS system. If the TTS system is not operating in the language of origin, embodiments may include generating a phoneme sequence in the language of origin using a grapheme-to-phoneme (G2P) converter.
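The branching logic above reduces to a simple routing decision after transliteration. A minimal sketch where the `transliterate` and `g2p` callables are stand-ins for the actual converters (names are assumptions, not from the patent):

```python
def prepare_named_entity(entity, origin_language, tts_language,
                         transliterate, g2p):
    """Route a named entity as described: transliterate it into the
    origin-language script, then either hand that script straight to
    the TTS system (when the TTS system operates in that language) or
    derive an origin-language phoneme sequence via a G2P converter."""
    script = transliterate(entity, origin_language)
    if tts_language == origin_language:
        return ("script", script)
    return ("phonemes", g2p(script, origin_language))
```

Keeping the G2P step in the language of origin preserves the entity's native pronunciation even when the surrounding synthesis runs in a different language.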