Abstract:
A method of selecting units for speech synthesis includes receiving, by one or more computers of a text-to-speech system, data indicating text for speech synthesis; determining, by the one or more computers, a sequence of text units that each represent a respective portion of the text, the sequence of text units including at least a first text unit followed by a second text unit; determining, by the one or more computers, multiple paths of speech units that each represent the sequence of text units, wherein determining the multiple paths of speech units includes: selecting, from a speech unit corpus, a first speech unit that includes speech synthesis data representing the first text unit; selecting, from the speech unit corpus, multiple second speech units including speech synthesis data representing the second text unit, each of the multiple second speech units being determined based on (i) a join cost to concatenate the second speech unit with a first speech unit and (ii) a target cost indicating the degree to which the second speech unit corresponds to the second text unit; and defining paths from the selected first speech unit to each of the multiple second speech units to include in the multiple paths of speech units; and providing, by the one or more computers of the text-to-speech system, synthesized speech data according to a path selected from among the multiple paths.
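The selection step described above can be illustrated with a minimal sketch. It assumes exactly two text units, and the `SpeechUnit` fields, the pitch-based `join_cost` and `target_cost`, and the function name `build_paths` are all hypothetical choices made for illustration; the abstract does not specify how either cost is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechUnit:
    name: str           # hypothetical identifier within the corpus
    label: str          # text unit that this recording represents
    start_pitch: float  # assumed feature at the unit's left edge (Hz)
    end_pitch: float    # assumed feature at the unit's right edge (Hz)


def join_cost(left, right):
    """Assumed join cost: pitch mismatch at the concatenation point."""
    return abs(left.end_pitch - right.start_pitch)


def target_cost(unit, desired_pitch):
    """Assumed target cost: distance of the unit's mean pitch from the target."""
    return abs((unit.start_pitch + unit.end_pitch) / 2 - desired_pitch)


def build_paths(corpus, text_units, k=2):
    """Return up to k (path, cost) pairs for a two-unit text sequence.

    text_units is [(label, desired_pitch), (label, desired_pitch)].
    """
    (first_label, first_pitch), (second_label, second_pitch) = text_units

    # Select one first speech unit that represents the first text unit.
    firsts = [u for u in corpus if u.label == first_label]
    first = min(firsts, key=lambda u: target_cost(u, first_pitch))

    # Score every candidate second unit by join cost plus target cost.
    def extension_cost(unit):
        return join_cost(first, unit) + target_cost(unit, second_pitch)

    candidates = [u for u in corpus if u.label == second_label]
    seconds = sorted(candidates, key=extension_cost)[:k]

    # Define one path from the first unit to each selected second unit.
    base = target_cost(first, first_pitch)
    return [([first, u], base + extension_cost(u)) for u in seconds]
```

Under these assumptions, the path used for synthesis would be chosen afterwards, for example with `min(paths, key=lambda p: p[1])`.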
Abstract:
An apparatus is designed for synthesizing a voice signal using a plurality of phonetic piece data each indicating a phonetic piece which contains at least two phoneme sections corresponding to different phonemes. In the apparatus, a phonetic piece adjustor forms a target section from a first phonetic piece and a second phonetic piece so as to connect the first phonetic piece and the second phonetic piece to each other such that the target section is formed of a rear phoneme section of the first phonetic piece and a front phoneme section of the second phonetic piece, and expands the target section by a target time length to form an adjustment section such that a central part of the target section is expanded at an expansion rate higher than that of a front part and a rear part of the target section, to thereby create synthesized phonetic piece data of the adjustment section having the target time length. A voice synthesizer creates a voice signal from the synthesized phonetic piece data created by the phonetic piece adjustor.
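The non-uniform expansion described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual procedure: it assumes the sections are lists of frames, stretches them by repeating frames, and uses a triangular weighting controlled by a hypothetical `center_bias` parameter so that central frames are repeated more often than frames near the front and rear.

```python
def expand_target_section(rear_section, front_section, target_len, center_bias=3.0):
    """Stretch the joined sections to target_len frames, favoring the center.

    rear_section:  frames of the rear phoneme section of the first piece
    front_section: frames of the front phoneme section of the second piece
    """
    frames = list(rear_section) + list(front_section)  # the target section
    n = len(frames)
    if n == 0 or target_len < n:
        raise ValueError("target_len must be at least the number of input frames")
    if n == 1:
        return frames * target_len

    # Triangular weights: 1.0 at both ends, 1.0 + center_bias in the middle.
    weights = [1.0 + center_bias * (1.0 - abs(2.0 * i / (n - 1) - 1.0))
               for i in range(n)]
    total = sum(weights)

    # Cumulative share of the output duration that ends at each input frame.
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w
        bounds.append(acc / total)

    # Walk through the output frames and pick the input frame covering each one.
    out = []
    idx = 0
    for j in range(target_len):
        pos = (j + 0.5) / target_len
        while idx < n - 1 and pos > bounds[idx]:
            idx += 1
        out.append(frames[idx])
    return out
```

With five input frames and a target length of 20, the weights above give the central frame roughly four times the duration of each edge frame. A real system would more likely interpolate spectral frames than repeat them; repetition is used here only to keep the sketch short.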
Abstract:
A signal coupling method and a signal coupling apparatus are capable of creating naturally combined speech with reduced noise. The signal coupling method (or apparatus) couples a plurality of waveform signals to create a combined waveform signal by a step (or means) for deciding the upper limit frequency of each frequency spectrum of the plurality of waveform signals and a step (or means) for filtering at least a coupled portion of each waveform signal by a predetermined cut-off frequency characteristic based on the decided upper limit frequency. Here, the filtering cut-off frequency is set to an upper limit frequency of a waveform signal preceding or following the coupled portion of the waveform signal having a higher upper limit frequency. Accordingly, a higher harmonic component generated by discontinuous change of the coupled portion of the waveform signals is effectively removed, thereby significantly reducing the noise of the combined waveform signal.
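The two steps above (deciding an upper limit frequency, then low-pass filtering around the joint) can be sketched in plain Python. Several things here are assumptions: the upper limit is estimated with a naive DFT and a relative magnitude `threshold`, the filter is a Hamming-windowed sinc FIR, and the cut-off is taken as the lower of the two upper limit frequencies, which is only one possible reading of the cut-off rule stated in the abstract. All function names are hypothetical.

```python
import cmath
import math


def upper_limit_frequency(signal, sample_rate, threshold=0.01):
    """Highest DFT bin (in Hz) whose magnitude reaches threshold * peak."""
    n = len(signal)
    mags = [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]
    peak = max(mags)
    if peak == 0.0:
        return 0.0
    top = max(k for k, m in enumerate(mags) if m >= threshold * peak)
    return top * sample_rate / n


def lowpass_taps(cutoff_hz, sample_rate, num_taps=31):
    """Hamming-windowed sinc low-pass filter, normalized to unit DC gain."""
    if cutoff_hz <= 0:
        raise ValueError("cutoff must be positive")
    fc = cutoff_hz / sample_rate
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        x = i - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(sinc * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def couple(first, second, sample_rate, half_width=16):
    """Concatenate two signals and filter only the samples around the joint."""
    # Assumed reading of the cut-off rule: use the lower upper limit frequency.
    cutoff = min(upper_limit_frequency(first, sample_rate),
                 upper_limit_frequency(second, sample_rate))
    taps = lowpass_taps(cutoff, sample_rate)
    raw = list(first) + list(second)
    out = list(raw)
    joint = len(first)
    mid = len(taps) // 2
    for n in range(max(0, joint - half_width), min(len(raw), joint + half_width)):
        acc = 0.0
        for i, tap in enumerate(taps):
            j = min(max(n + i - mid, 0), len(raw) - 1)  # clamp at the edges
            acc += tap * raw[j]
        out[n] = acc
    return out, cutoff
```

Replacing `min` with `max` in `couple` gives the other reading of the rule. In either case, samples outside the window around the joint are returned unchanged.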
Abstract:
A speech synthesizer is provided that concatenates stored samples of speech units without modifying the prosody of the samples. The present invention is able to achieve a high level of naturalness in synthesized speech with a carefully designed training speech corpus by storing samples based on the prosodic and phonetic context in which they occur. In particular, some embodiments of the present invention limit the training text to those sentences that will produce the most frequent sets of prosodic contexts for each speech unit. Further embodiments of the present invention also provide a multi-tier selection mechanism for selecting a set of samples that will produce the most natural sounding speech.
Abstract:
An input linguistic description is converted into a speech waveform by deriving at least one target unit sequence corresponding to the linguistic description, selecting from a waveform unit database, for the target unit sequences, a plurality of alternative unit sequences approximating the target unit sequences, concatenating the alternative unit sequences into alternative speech waveforms, and having an operator choose one of the alternative speech waveforms. There are no iterative cycles of manual modification and automatic selection, which makes the work fast. The operator does not need knowledge of units, targets, and costs, but chooses from a set of given alternatives. The fine-tuning of TTS prompts therefore becomes accessible to non-experts.
Abstract:
A synthesis method for concatenative speech synthesis is provided for efficiently concatenating waveform segments in the time-domain. A digital waveform provider produces an input sequence of digital waveform segments. A waveform concatenator concatenates the input segments by using waveform blending within a concatenation zone to synchronize, weight, and overlap-add selected portions of the input segments to produce a single digital waveform. The synchronizing includes determining a minimum weighted energy anchor in the selected portion of each input segment and aligning synchronization peaks in a local vicinity of each anchor.
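The blending steps above can be sketched as follows, under stated assumptions: the energy weighting is a raised-cosine window, the synchronization peak is taken to be the largest sample within `radius` of the anchor, and the overlap-add is a linear crossfade centered on the aligned peaks. The function names and the `zone`, `radius`, and `fade` parameters are hypothetical.

```python
import math


def min_energy_anchor(segment, start, stop, half=2):
    """Index in [start, stop) with the smallest window-weighted local energy."""
    best_idx, best_energy = start, math.inf
    for center in range(start, stop):
        energy = 0.0
        for off in range(-half, half + 1):
            i = center + off
            if 0 <= i < len(segment):
                # Raised-cosine weight: 1.0 at the center, smaller at the edges.
                weight = 0.5 + 0.5 * math.cos(math.pi * off / (half + 1))
                energy += weight * segment[i] ** 2
        if energy < best_energy:
            best_idx, best_energy = center, energy
    return best_idx


def sync_peak(segment, anchor, radius=4):
    """Index of the largest sample in the local vicinity of the anchor."""
    lo = max(0, anchor - radius)
    hi = min(len(segment), anchor + radius + 1)
    return max(range(lo, hi), key=lambda i: segment[i])


def blend(left, right, zone=16, radius=4, fade=8):
    """Join two segments by aligning their peaks and crossfading around them."""
    # Anchors are searched at the end of `left` and the start of `right`.
    left_anchor = min_energy_anchor(left, max(0, len(left) - zone), len(left))
    right_anchor = min_energy_anchor(right, 0, min(zone, len(right)))
    lp = sync_peak(left, left_anchor, radius)
    rp = sync_peak(right, right_anchor, radius)

    # Largest crossfade half-width that stays inside both segments.
    h = min(fade, lp, len(left) - 1 - lp, rp, len(right) - 1 - rp)

    out = list(left[:lp - h])
    for j in range(-h, h + 1):
        w = (j + h + 1) / (2 * h + 2)  # weight of `right` rises across the zone
        out.append((1 - w) * left[lp + j] + w * right[rp + j])
    out.extend(right[rp + h + 1:])
    return out
```

Because the two peaks are mapped to the same output position, the result is shorter than the plain concatenation of the two segments by the number of samples discarded after the left peak and before the right peak.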