Abstract:
According to MPEG-4's TTS architecture, facial animation can be driven by two streams simultaneously: text and Facial Animation Parameters. In this architecture, text input is sent to a Text-To-Speech converter at a decoder that drives the mouth shapes of the face. Facial Animation Parameters are sent from an encoder to the face over the communication channel. The present invention includes codes (known as bookmarks) in the text string transmitted to the Text-To-Speech converter; the bookmarks are placed between words as well as inside them. According to the present invention, the bookmarks carry an encoder time stamp. Due to the nature of text-to-speech conversion, the encoder time stamp does not relate to real-world time and should be interpreted as a counter. In addition, the Facial Animation Parameter stream carries the same encoder time stamp found in the bookmark of the text. The system of the present invention reads the bookmark and provides the encoder time stamp as well as a real-time time stamp to the facial animation system. Finally, the facial animation system associates the correct facial animation parameter with the real-time time stamp, using the encoder time stamp of the bookmark as a reference.
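A minimal sketch of the bookmark flow described above, assuming a hypothetical bookmark syntax "\bm{<counter>}" (the actual MPEG-4 escape sequence differs) and illustrative function and variable names:

```python
import re
import time

# Hypothetical bookmark syntax "\bm{<counter>}"; the real MPEG-4 escape
# sequence differs. The counter is the encoder time stamp.
BOOKMARK = re.compile(r"\\bm\{(\d+)\}")

def feed_tts(text, timestamp_pairs):
    """Strip bookmarks from the text before Text-To-Speech conversion,
    pairing each encoder time stamp (a counter, not real time) with a
    real-time time stamp taken as the bookmark is read."""
    for match in BOOKMARK.finditer(text):
        encoder_ts = int(match.group(1))
        timestamp_pairs[encoder_ts] = time.monotonic()  # real-time stamp
    return BOOKMARK.sub("", text)  # plain text for the TTS converter

def schedule_faps(fap_stream, timestamp_pairs):
    """Associate each Facial Animation Parameter with real time, using the
    encoder time stamp of the matching bookmark as a reference."""
    return [(timestamp_pairs[ets], fap)
            for ets, fap in fap_stream if ets in timestamp_pairs]

pairs = {}
speech_text = feed_tts(r"Hello \bm{1}wor\bm{2}ld", pairs)
faps = [(1, "open_mouth"), (2, "close_mouth")]
print(speech_text)                 # -> 'Hello world'
print(schedule_faps(faps, pairs))  # FAPs paired with real-time stamps
```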
Abstract:
A bio-signal related to the impedance between two points 34, 36 on a speaker's skin 32 is monitored while a speech recognition system is trained to recognize a word or utterance. An utterance is identified for retraining when the bio-signal is above an upper threshold or below a lower threshold while the recognition system is being trained to recognize the utterance. The recognition system is retrained to recognize the utterance when the bio-signal is between the upper and lower thresholds.
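A minimal sketch of the retraining loop just described; the recognizer and the capture callables are placeholders for whatever training and bio-signal hardware interfaces the system provides:

```python
def within_band(bio_samples, lower, upper):
    """True when every bio-signal sample lies between the thresholds."""
    return all(lower <= s <= upper for s in bio_samples)

def train_utterance(recognizer, word, capture_audio, capture_bio,
                    lower, upper):
    """Train on an utterance; if the monitored bio-signal (e.g. skin
    impedance) left the threshold band during training, the utterance is
    identified for retraining and is retrained once the signal is back
    between the thresholds."""
    audio, bio = capture_audio(), capture_bio()
    recognizer.train(word, audio)
    if within_band(bio, lower, upper):
        return
    while True:  # flagged for retraining: wait for an in-band capture
        audio, bio = capture_audio(), capture_bio()
        if within_band(bio, lower, upper):
            recognizer.train(word, audio)
            return
```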
Abstract:
At least some of a sequence of spoken phonemes are indicated by analysing detected sounds to determine a group of phonemes to which a phoneme belongs, optically detecting the lip shape of the speaker, and correlating the respective signals by a computer.
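A sketch of how the two signals might be correlated, with tiny made-up phoneme groupings purely for illustration: acoustic analysis narrows a sound to a group of confusable phonemes, the detected lip shape narrows it to a viseme class, and the intersection indicates the phoneme.

```python
# Illustrative groupings only; real systems use learned classes.
ACOUSTIC_GROUPS = {
    "nasal": {"m", "n"},
    "plosive": {"p", "b", "t", "d"},
}
VISEME_PHONEMES = {
    "closed_lips": {"m", "p", "b"},
    "open_lips": {"n", "t", "d"},
}

def identify(acoustic_group, viseme):
    """Intersect the acoustic group with the optically detected viseme
    class; a singleton intersection identifies the phoneme."""
    candidates = ACOUSTIC_GROUPS[acoustic_group] & VISEME_PHONEMES[viseme]
    return candidates.pop() if len(candidates) == 1 else candidates

print(identify("nasal", "closed_lips"))  # -> 'm'
print(identify("plosive", "open_lips"))  # -> still ambiguous: {'t', 'd'}
```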
Abstract:
A viewpoint of a user is detected in a viewpoint detecting process, and how long the detected viewpoint has stayed in an area is determined. The obtained viewpoint and its trace are displayed on a display unit. In a recognition information controlling process, the relationship between the viewpoint (in an area) and/or its movement and recognition information (words, sentences, grammar, etc.) is obtained as a weight P(). When the user pronounces a word (or sentence), the speech is input and A/D converted via a speech input unit. Next, in a speech recognition process, a speech recognition probability PS() is obtained. Finally, speech recognition is performed on the basis of the product of the weight P() and the speech recognition probability PS(). Accordingly, classes of the recognition information are controlled in accordance with the movement of the user's viewpoint, thereby improving the speech recognition probability and the speed of recognition.
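A minimal sketch of the final combination step, with hypothetical numbers: the candidate maximizing the product P() * PS() wins, so words related to where the user has been looking are favored even when the acoustic score alone would pick something else.

```python
def recognize(candidates, gaze_weight, acoustic_prob):
    """Pick the candidate maximizing P(word) * PS(word), where P comes
    from where (and how long) the user's viewpoint stayed on screen and
    PS from the acoustic speech recognizer."""
    return max(candidates, key=lambda w: gaze_weight(w) * acoustic_prob(w))

# Hypothetical numbers: the user has been looking at a map area, so
# place names get a higher gaze-derived weight P().
P  = {"station": 0.6, "stadium": 0.3, "nation": 0.1}.get
PS = {"station": 0.4, "stadium": 0.2, "nation": 0.5}.get
print(recognize(["station", "stadium", "nation"], P, PS))  # -> 'station'
```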
Abstract:
In a computerized method, speech signals are analyzed using statistical trajectory modeling to produce time-aligned acoustic-phonetic units. There is one acoustic-phonetic unit for each portion of the speech signal determined to be phonetically distinct. The acoustic-phonetic units are translated to corresponding time-aligned image units representative of the acoustic-phonetic units. An image including the time-aligned image units is displayed. The display of the time-aligned image units is synchronized to the replaying of the digitized natural speech signal.
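A sketch of the synchronization step, assuming hypothetical time-aligned units as (start time, phoneme) pairs and an illustrative phoneme-to-image map; the lookup returns the image unit active at any point in the replayed signal.

```python
import bisect

# Hypothetical time-aligned units produced by the trajectory analysis,
# plus an illustrative phoneme-to-image-unit map.
ALIGNED_UNITS = [(0.00, "h"), (0.12, "eh"), (0.25, "l"), (0.38, "ow")]
IMAGE_FOR = {"h": "img_h", "eh": "img_eh", "l": "img_l", "ow": "img_ow"}

def image_at(playback_time):
    """Return the image unit for the acoustic-phonetic unit active at the
    given point in the replayed speech signal, keeping the display in
    sync with the audio."""
    starts = [t for t, _ in ALIGNED_UNITS]
    i = bisect.bisect_right(starts, playback_time) - 1
    return IMAGE_FOR[ALIGNED_UNITS[max(i, 0)][1]]

print(image_at(0.30))  # -> 'img_l'
```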
Abstract:
A game apparatus of the invention includes: a voice input section for inputting at least one voice set including a voice uttered by an operator, for converting the voice set into a first electric signal, and for outputting the first electric signal; a voice recognition section for recognizing the voice set on the basis of the first electric signal output from the voice input section; an image input section for optically detecting a movement of the lips of the operator, for converting the detected movement of the lips into a second electric signal, and for outputting the second electric signal; a speech period detection section for receiving the second electric signal, and for obtaining a period in which the voice is uttered by the operator on the basis of the received second electric signal; an overall judgment section for extracting the voice uttered by the operator from the input voice set, on the basis of the voice set recognized by the voice recognition section and the period obtained by the speech period detection section; and a control section for controlling an object on the basis of the voice extracted by the overall judgment section.
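A sketch of the overall judgment step, with hypothetical recognizer output as (start, end, word) tuples: only recognized words whose timing falls inside the lip-motion-derived speech period are attributed to the operator.

```python
def extract_operator_voice(recognized, speech_period):
    """Keep only recognized words whose timing lies inside the period in
    which the operator's lips were moving, discarding voices from other
    speakers or from the game itself."""
    start, end = speech_period
    return [word for (t0, t1, word) in recognized
            if t0 >= start and t1 <= end]

# Hypothetical recognizer output for every voice heard, and a speech
# period obtained from the image input section.
heard = [(0.1, 0.4, "jump"), (0.9, 1.3, "fire"), (2.0, 2.4, "run")]
print(extract_operator_voice(heard, speech_period=(0.8, 1.5)))  # -> ['fire']
```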
Abstract:
This invention is a method for translating an animal's intentions. The method first receives one or both of two informational signals: the voice uttered by an animal such as a baby, pet, or domestic animal, and the animal's actions. The received informational signal is then compared with data analysed beforehand according to animal behavioural science, and the matching data are selected. The received informational signal is then expressed, in words or letters that people are able to understand, as what the animal is trying to convey. With this invention, people are able to communicate correctly with the animal.
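A minimal sketch of the compare-and-select step under stated assumptions: the behaviour data is a hypothetical lookup table keyed by voice and action features, and the feature names and phrases are invented for illustration.

```python
# Hypothetical table pairing (voice feature, action feature) keys,
# analysed beforehand from animal-behaviour data, with readable text.
BEHAVIOUR_DATA = {
    ("high_pitch_bark", "tail_wag"): "I want to play!",
    ("low_growl", "ears_back"): "I feel threatened.",
    ("meow", "rubbing"): "Feed me, please.",
}

def translate(voice=None, action=None):
    """Select the stored entry matching one or both received signals and
    return what the animal conveys in words people can understand."""
    for (v, a), phrase in BEHAVIOUR_DATA.items():
        if (voice is None or v == voice) and (action is None or a == action):
            return phrase
    return "(no matching behaviour data)"

print(translate(voice="meow"))  # -> 'Feed me, please.'
```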