Abstract:
A method for synchronizing sound data and text data, said text data being obtained by manual transcription of said sound data during playback of the latter. The proposed method comprises the steps of repeatedly querying said sound data and said text data to obtain a current time position corresponding to a currently played sound datum and a currently transcribed text datum, respectively, correcting said current time position by applying a time correction value in accordance with a transcription delay, and generating at least one association datum indicative of a synchronization association between said corrected time position and said currently transcribed text datum. Thus, the proposed method achieves cost-effective synchronization of sound and text in connection with the manual transcription of sound data.
Abstract:
To modify a facial feature region in a video bitstream, the video bitstream is received and a feature region is extracted from the video bitstream. An audio characteristic, such as frequency, rhythm, or tempo is retrieved from an audio bitstream, and the feature region is modified according to the audio characteristic to generate a modified image. The modified image is outputted.
Abstract:
Dynamic texture mapping is used to create a photorealistic three dimensional animation of an individual with facial features synchronized with desired speech. Audiovisual data of an individual reading a known script is obtained and stored in an audio library and an image library. The audiovisual data is processed to extract feature vectors used to train a statistical model. An input audio feature vector corresponding to desired speech with which the animation will be synchronized is provided. The statistical model is used to generate a trajectory of visual feature vectors that corresponds to the input audio feature vector. These visual feature vectors are used to identify a matching image sequence from the image library. The resulting sequence of images, concatenated from the image library, provides a photorealistic image sequence with facial features, such as lip movements, synchronized with the desired speech. This image sequence is applied to the three-dimensional model.
Abstract:
A system and method of controlling the movement of a virtual agent while the agent is speaking to a human user during a conversation is disclosed. The method comprises receiving speech data to be spoken by the virtual agent, performing a prosodic analysis of the speech data, selecting matching prosody patterns from a speaking database and controlling the virtual agent movement according to the selected prosody patterns.
Abstract:
A method for generating animated sequences of talking heads in text-to-speech applications wherein a processor samples a plurality of frames comprising image samples. The processor reads first data comprising one or more parameters associated with noise-producing orifice images of sequences of at least three concatenated phonemes which correspond to an input stimulus. The processor reads, based on the first data, second data comprising images of a noise-producing entity. The processor generates an animated sequence of the noise-producing entity.
Abstract:
Exemplary methods, systems, and products are disclosed for synchronizing visual and speech events in a multimodal application, including receiving from a user speech; determining a semantic interpretation of the speech; calling a global application update handler; identifying, by the global application update handler, an additional processing function in dependence upon the semantic interpretation; and executing the additional function. Typical embodiments may include updating a visual element after executing the additional function. Typical embodiments may include updating a voice form after executing the additional function. Typical embodiments also may include updating a state table after updating the voice form. Typical embodiments also may include restarting the voice form after executing the additional function.
Abstract:
The present invention provides a text-to-speech conversion system (TTS) for interlocking synchronizing with multimedia and a method for organizing input data of the TTS which can enhance the natural naturalness of synthesized speech and accomplish the synchronization of multimedia with TTS by defining additional prosody information, the information required to interlock synchronize TTS with multimedia, and interface between these this information and TTS for use in the production of the synthesized speech.
Abstract:
A synchronization system and method. Text data is received by a computing device. The text data is associated with audio/video data. The audio/video data is generated during a related performance. The audio/video data and the text data are discrete data. The text data is synchronized to correspond with the audio/video data during the performance. The synchronized text data is displayed by the computing device during the performance.
Abstract:
An information processing apparatus is provided which includes a metadata extraction unit for analyzing an audio signal in which a plurality of instrument sounds are present in a mixed manner and for extracting, as a feature quantity of the audio signal, metadata changing along with passing of a playing time, and a player parameter determination unit for determining, based on the metadata extracted by the metadata extraction unit, a player parameter for controlling a movement of a player object corresponding to each instrument sound.
Abstract:
A method for generating animated sequences of talking heads in text-to-speech applications wherein a processor samples a plurality of frames comprising image samples. The processor reads first data comprising one or more parameters associated with noise-producing orifice images of sequences of at least three concatenated phonemes which correspond to an input stimulus. The processor reads, based on the first data. second data comprising images of a noise-producing entity. The processor generates an animated sequence of the noise-producing entity.