Abstract:
A speech information processing apparatus includes a statistical processing unit and a pitch pattern forming unit. The statistical processing unit extracts features by statistically processing two files: a feature file, formed by extracting features of speech (such as the fundamental frequency and its variations, and the power and its variations) from a speech file, and a label file that takes into account the phoneme environment, comprising the accent type, the number of moras, the mora position, phonemes and the like. The pitch pattern forming unit forms a pitch pattern that takes the phoneme environment into account, based on the result of the statistical processing.
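As a rough illustration of the idea above, the sketch below groups fundamental-frequency samples by phoneme environment and averages them, then reads a pitch pattern back out of the resulting table. The tuple layout, the use of a plain mean as the "statistical processing", and the function names `mean_f0_by_environment` and `form_pitch_pattern` are assumptions made for this example; the abstract does not specify them.

```python
from collections import defaultdict
from statistics import mean


def mean_f0_by_environment(samples):
    """Average F0 per phoneme environment.

    samples: iterable of (accent_type, mora_count, mora_position, f0_hz).
    The environment key used here is a simplification chosen for this sketch.
    """
    groups = defaultdict(list)
    for accent_type, mora_count, mora_position, f0_hz in samples:
        groups[(accent_type, mora_count, mora_position)].append(f0_hz)
    return {env: mean(values) for env, values in groups.items()}


def form_pitch_pattern(stats, accent_type, mora_count):
    """Return one mean F0 value per mora position (None where no data was seen)."""
    return [
        stats.get((accent_type, mora_count, position))
        for position in range(1, mora_count + 1)
    ]


samples = [
    (1, 2, 1, 200.0),
    (1, 2, 1, 220.0),
    (1, 2, 2, 150.0),
]
stats = mean_f0_by_environment(samples)
pattern = form_pitch_pattern(stats, accent_type=1, mora_count=2)
```

A real system would likely model variations and power as well, and would smooth or interpolate across environments that were never observed.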
Abstract:
A document inputting apparatus or speech outputting apparatus inputs and displays document data, and specifies accent information, pronunciation information and syllable-length information for words or characters of the document data. The apparatus displays the document data in accordance with the specified information so that information such as the accent positions or accent intensities can be recognized. The document data thus formed is stored in a memory together with the accent information, the pronunciation information or the syllable-length information. When the document data is read from the memory and output as speech, the specified information is referred to during speech synthesis, so that the output speech has the correct pronunciation.
Abstract:
A speech synthesis method and a speech synthesis apparatus include a system for synthesis by rule that prevents the quality of synthesized speech from deteriorating and reduces the number of calculations required to generate a speech waveform. The speech synthesis apparatus includes a character series input section for inputting a character series as phonetic text; a pitch waveform generator for generating a pitch waveform by calculating a product of a matrix, which has been acquired for each pitch, and the character series input by the character series input section; and a device for connecting the pitch waveforms generated by the pitch waveform generator and providing a speech waveform. This calculation method greatly reduces the number of calculations required to generate a pitch waveform. In addition, in the calculation for generating a pitch waveform, a function that determines a frequency response is employed to convert a spectral envelope obtained from a parameter, so that the timbre of synthesized speech can be changed without parameter operations.
Abstract:
A data processing apparatus for synchronized audiovisual output assigns synchronizing signal bits to bits of each item of sound data, which is represented by a 16-bit PCM code. Among the assigned bits, a predetermined bit having the least influence on human hearing is extracted as a synchronizing signal bit for synchronizing the image data output with the sound output.
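The following is a minimal sketch of carrying a synchronizing bit inside a 16-bit PCM sample. It assumes that the bit with the least audible influence is the least significant bit (bit 0); the abstract only says "a predetermined bit", so this choice and the function names `embed_sync_bit` and `extract_sync_bit` are illustrative assumptions.

```python
def embed_sync_bit(sample, sync):
    """Store a sync flag in bit 0 of a signed 16-bit PCM sample.

    Assumption for this sketch: the least significant bit is the one with
    the least influence on perceived sound.
    """
    raw = sample & 0xFFFF                 # view the sample as unsigned 16-bit
    raw = (raw & 0xFFFE) | (sync & 1)     # clear bit 0, then write the flag
    return raw - 0x10000 if raw & 0x8000 else raw  # back to signed range


def extract_sync_bit(sample):
    """Read the sync flag back from bit 0 of a signed 16-bit PCM sample."""
    return sample & 1


marked = embed_sync_bit(1000, 1)
flag = extract_sync_bit(marked)
```

Overwriting bit 0 changes the sample value by at most one quantization step, which is why the lowest bit is a common choice for this kind of side channel.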
Abstract:
A speech synthesis method and apparatus synthesize speech from a character series comprising text and pitch information. The apparatus includes a parameter generator that, in accordance with the input character series, generates power spectrum envelopes as parameters of the speech waveform to be synthesized for the input text. The apparatus also includes a pitch waveform generator for generating pitch waveforms whose period equals the pitch specified by the pitch information. The pitch waveform generator generates the pitch waveforms from the input pitch information and the power spectrum envelopes generated by the parameter generator. Also provided is a speech waveform output device for outputting the speech waveform obtained by connecting the generated pitch waveforms.
Abstract:
In a speech synthesizer, each frame for generating a speech waveform has an expansion degree to which the frame is expanded or compressed in accordance with the production speed of synthetic speech. In accordance with the set speech production speed, the time interval between beat synchronization points is determined on the basis of the speed of the speech to be produced, and the time length of each frame present between the beat synchronization points is determined on the basis of the expansion degree of the frame. Parameters for producing a speech waveform in each frame are properly generated by the time length determined for the frame. In the speech synthesizer for outputting a speech signal by coupling phonemes constituted by one or a plurality of frames having phoneme vowel-consonant combination parameters (VcV, cV, or V) of the speech waveform, the number of frames can be held constant regardless of a change in the speech production speed. This prevents degradation in the tone quality or a variation in the processing quantity resulting from a change in the speech production speed.
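One way to picture the frame-length computation is sketched below. The abstract does not give a formula, so the rule used here is purely hypothetical: each frame has a base length, and the difference between the beat interval and the sum of the base lengths is shared among the frames in proportion to their expansion degrees. The name `frame_lengths` and all parameter names are assumptions for this example.

```python
def frame_lengths(base_lengths, expansion_degrees, beat_interval):
    """Stretch or compress frames so that they fill one beat interval.

    Hypothetical rule for this sketch: the slack (beat_interval minus the
    sum of base lengths) is distributed in proportion to each frame's
    expansion degree. The number of frames never changes.
    """
    if len(base_lengths) != len(expansion_degrees):
        raise ValueError("one expansion degree is needed per frame")
    total_degree = sum(expansion_degrees)
    if total_degree == 0:
        raise ValueError("at least one frame must be expandable")
    slack = beat_interval - sum(base_lengths)
    return [
        base + slack * degree / total_degree
        for base, degree in zip(base_lengths, expansion_degrees)
    ]


# Three frames; the first (for example a consonant) is not allowed to stretch.
slow = frame_lengths([10, 20, 10], [0, 1, 1], beat_interval=60)
fast = frame_lengths([10, 20, 10], [0, 1, 1], beat_interval=30)
```

In both calls the result has three frames, which mirrors the abstract's point that the frame count stays constant while the speech production speed changes.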
Abstract:
A speech recognition method uses continuous mixture Hidden Markov Models (HMMs) for probability processing, including a first type of HMM having a small number of mixtures and a second type of HMM having a larger number of mixtures. First output probabilities are formed for the input speech using the HMM with the small number of mixtures, and second output probabilities are formed for the input speech using the HMM with the large number of mixtures, for selected states corresponding to the highest output probabilities of the first type of HMM. The input speech is recognized from both the first and second output probabilities.
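The two-stage scoring described above can be sketched as follows. This is a simplified illustration, not the patented procedure: it uses one-dimensional Gaussian mixtures, scores a single observation, and selects a fixed number of top states. The names `gmm_log_likelihood`, `two_pass_output_probs` and `top_k` are assumptions for this example.

```python
import math


def gmm_log_likelihood(x, mixture):
    """Log output probability of x under a 1-D Gaussian mixture.

    mixture: list of (weight, mean, variance) tuples.
    """
    total = 0.0
    for weight, mean, variance in mixture:
        density = math.exp(-((x - mean) ** 2) / (2 * variance))
        density /= math.sqrt(2 * math.pi * variance)
        total += weight * density
    return math.log(total) if total > 0 else float("-inf")


def two_pass_output_probs(x, coarse, fine, top_k):
    """Score all states cheaply, then rescore only the best ones precisely.

    coarse / fine: dicts mapping state name -> mixture with few / many
    components. Returns (scores, selected_states).
    """
    scores = {state: gmm_log_likelihood(x, mix) for state, mix in coarse.items()}
    selected = sorted(scores, key=scores.get, reverse=True)[:top_k]
    for state in selected:
        scores[state] = gmm_log_likelihood(x, fine[state])
    return scores, selected


coarse = {
    "a": [(1.0, 0.0, 1.0)],
    "b": [(1.0, 5.0, 1.0)],
    "c": [(1.0, 10.0, 1.0)],
}
fine = {
    "a": [(0.5, -0.5, 1.0), (0.5, 0.5, 1.0)],
    "b": [(0.5, 4.5, 1.0), (0.5, 5.5, 1.0)],
    "c": [(0.5, 9.5, 1.0), (0.5, 10.5, 1.0)],
}
scores, selected = two_pass_output_probs(0.2, coarse, fine, top_k=1)
```

The saving comes from evaluating the expensive many-component mixtures only for the few states that the cheap pass ranks highest; the remaining states keep their coarse scores.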
Abstract:
A voice communication method includes the steps of inputting speech into an apparatus, recognizing the input speech using a first dictionary, predicting the category of an unrecognized word included in the input speech based on the recognition of the input speech in the recognition step, outputting a question to be asked to an operator requesting the operator to input a word which is included in the first dictionary and which can specify a second dictionary for recognizing the unrecognized word, based on the predicted category, and re-recognizing the unrecognized word with the second dictionary specified in response to the word inputted by the operator. The invention also relates to an apparatus performing these functions and to a computer program product instructing a computer to perform these functions.
Abstract:
In detecting an unknown word in input speech data, the search space and the memory capacity required for the unknown word are reduced. For this purpose, an HMM data memory stores data describing a state transition model for the unknown word, defined by a number of states and the transition probabilities between the states. An output probability calculation unit acquires, at each time of the speech data, the state of maximum likelihood among the plural states of the state transition model for a known word that is used in the speech recognition of the known word. The obtained result is applied to the state transition model for the unknown word, stored in the HMM data memory, to obtain a state transition model of the unknown word. A different output probability calculation unit determines the likelihood of the state transition model for the known word. A language search unit then performs the language search process, using the likelihoods determined by the two output probability calculation units, in portions where the dictionary permits the presence of the unknown word.
Abstract:
A viewpoint of a user is detected in a viewpoint detecting process, and how long the detected viewpoint has stayed in an area is determined. The obtained viewpoint and its trace are displayed on a display unit. In a recognition information controlling process, the relationship between the viewpoint (in an area) and/or its movement and the recognition information (words, sentences, grammar, etc.) is obtained as a weight P(). When the user pronounces a word (or sentence), the speech is input and A/D converted via a speech input unit. Next, in a speech recognition process, a speech recognition probability PS() is obtained. Finally, speech recognition is performed on the basis of the product of the weight P() and the speech recognition probability PS(). Accordingly, classes of the recognition information are controlled in accordance with the movement of the user's viewpoint, thereby improving the speech recognition probability and the speed of recognition.
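The final scoring step, the product of the weight P() and the speech recognition probability PS(), can be sketched as below. How the weights are derived from gaze position and dwell time is not given in the abstract, so the dictionaries here are made-up example values, and the function name `recognize` is an assumption for this sketch.

```python
def recognize(gaze_weight, speech_prob):
    """Pick the candidate with the largest P(w) * PS(w).

    gaze_weight: dict mapping candidate -> weight P() derived from the viewpoint.
    speech_prob: dict mapping candidate -> speech recognition probability PS().
    Candidates with no gaze weight are treated as weight 0.0 in this sketch.
    """
    scores = {
        word: gaze_weight.get(word, 0.0) * prob
        for word, prob in speech_prob.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# The user is looking at the area associated with "open", so it gets more weight.
gaze_weight = {"open": 0.8, "close": 0.2}
speech_prob = {"open": 0.4, "close": 0.6}
best, scores = recognize(gaze_weight, speech_prob)
```

In this example the acoustic score alone would favor "close", but the gaze weight shifts the decision to "open", which is the effect the abstract describes.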