摘要:
Speech is synthesized for a given text by determining a sequence of phonetic components based on the text, determining a sequence of target phonetic elements associated phonetic components, determining a sequence of target event types associated with the phonetic components and determining a sequence of speech units from a plurality of stored speech unit candidates by use of a cost function. The cost function comprises a unit cost, a concatenation cost, and an event type cost for each speech unit in the sequence of speech units. The unit cost of a speech unit is determined with respect to the corresponding target phonetic element, while the concatenation cost of a speech unit is determined with respect to adjacent speech units and the event type cost of each speech unit is determined with respect to the corresponding target event type.
摘要:
A speech synthesis system receives symbolic input describing an utterance to be synthesized. In one embodiment, different portions of the utterance are constructed from different sources, one of which is a speech corpus recorded from a human speaker whose voice is to be modeled. The other sources may include other human speech corpora or speech produced using Rule-Based Speech Synthesis (RBSS). At least some portions of the utterance may be constructed by modifying prototype speech units to produce adapted speech units that are contextually appropriate for the utterance. The system concatenates the adapted speech units with the other speech units to produce a speech waveform. In another embodiment, a speech unit of a speech corpus recorded from a human speaker lacks transitions at one or both of its edges. A transition is synthesized using RBSS and concatenated with the speech unit in producing a speech waveform for the utterance.
摘要:
A voice synthesizing apparatus comprises: means for storing phoneme pieces having a plurality of different pitches for each phoneme represented by a same phoneme symbol; means for reading a phoneme piece by using a pitch as an index; and a voice synthesizer that synthesizes a voice in accordance with the read phoneme piece.
摘要:
A plurality of voice segments, each including one or more phonemes are acquired in a time-serial manner, in correspondence with desired singing or speaking words. As necessary, a boundary is designated between start and end points of a vowel phoneme included in any one of the acquired voice segments. Voice is synthesized for a region of the vowel phoneme that precedes the designated boundary vowel phoneme, or a region of the vowel phoneme that succeeds the designated boundary in the vowel phoneme. By synthesizing a voice for the region preceding the designated boundary, it is possible to synthesize a voice imitative of a vowel sound that is uttered by a person and then stopped to sound with his or her mouth kept opened. Further, by synthesizing a voice for the region succeeding the designated boundary, it is possible to synthesize a voice imitative of a vowel sound that is started to sound with the mouth opened.
摘要:
In a speech synthesis process, micro-segments are cut from acquired waveform data and a window function. The obtained micro-segments are re-arranged to implement a desired prosody, and superposed data is generated by superposing the re-arranged micro-segments, so as to obtain synthetic speech waveform data. A spectrum correction filter is formed based on the acquired waveform data. At least one of the waveform data, micro-segments, and superposed data is corrected using the spectrum correction filter. In this way, "blur" of a speech spectrum due to the window function applied to obtain micro-segments is reduced, and speech synthesis with high sound quality is realized.
摘要:
Systems and methods for automatically segmenting speech inventories. A set of Hidden Markov Models (HMMs) are initialized using bootstrap data. The HMMs are next re-estimated and aligned to produce phone labels. The phone boundaries of the phone labels are then corrected using spectral boundary correction. Optionally, this process of using the spectral-boundary-corrected phone labels as input instead of the bootstrap data is performed iteratively in order to further reduce mismatches between manual labels and phone labels assigned by the HMM approach.