Abstract:
A neuron device network is provided with a speech input layer, a context layer, a hidden layer, a speech output layer and a hypothesis layer. A phoneme to be learned is spectrally analyzed by an FFT unit, and the vector row at a time point t is input to the speech input layer. Also, the vector state of the hidden layer at time t-1 is input to the context layer, the vector row at time t+1 is input to the speech output layer as an instructor signal, and a code row for hypothesizing the phoneme is input to the hypothesis layer. The time series relation of the vector rows and the phoneme are hypothetically learned. Alternatively, a spectrum, a cepstrum or a speech vector row based on outputs from the hidden layer of an auto-associative neural network is input to the speech input layer, and the code row is output from the hypothesis layer, taking into account the time series relation. The speech is recognized when a CPU reads the stored output values of the hidden layer and the connection weights between the hidden layer and the hypothesis layer from a memory of the neuron device network and calculates the output values of the respective neuron devices of the hypothesis layer from those output values and connection weights. The corresponding phoneme is determined by collating the output values of the respective neuron devices of the hypothesis layer with the code rows in an instructor signal table.
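The recognition step described above (computing hypothesis-layer outputs from stored hidden-layer outputs and connection weights, then collating them with an instructor signal table) might be sketched as follows. This is only an illustration: the sigmoid activation, the squared-distance collation rule, the function names and all numeric values are assumptions that do not come from the abstract.

```python
import math

def hypothesis_outputs(hidden, weights):
    """Output of each hypothesis-layer neuron.

    Assumption: each neuron applies a sigmoid to the weighted sum of the
    hidden-layer outputs; the abstract does not name the activation.
    """
    return [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in weights
    ]

def collate(outputs, table):
    """Return the phoneme whose code row is closest to the outputs.

    Assumption: closeness is measured by squared Euclidean distance.
    """
    return min(
        table,
        key=lambda p: sum((o - c) ** 2 for o, c in zip(outputs, table[p])),
    )

# Hypothetical stored values: 3 hidden neurons, 2 hypothesis neurons.
hidden = [0.9, 0.1, 0.8]
weights = [[4.0, -4.0, 2.0], [-3.0, 5.0, -3.0]]
table = {"a": [1, 0], "i": [0, 1]}  # hypothetical instructor signal table

outputs = hypothesis_outputs(hidden, weights)
phoneme = collate(outputs, table)
print(phoneme)
```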
Abstract:
An envelope-invariant method for audio signal synthesis from elementary audio waveforms stored in a dictionary, wherein: the waveforms are perfectly periodic and are stored as one of their periods; synthesis is obtained by overlap-adding the waveforms obtained from time-domain repetition of the periodic waveforms with a weighting window whose size is approximately two times the period of the signals to weight, and whose relative position inside the period is fixed to any value identical for all the periods; each stored period is extracted from a reharmonized, and thus periodic, waveform obtained by modifying, without changing the spectral envelope, the frequencies and amplitudes of the harmonics in the spectrum of a frame of the original continuous speech waveform; whereby the time shift between two successive waveforms obtained by weighting the original signals is set to the period corresponding to the imposed fundamental frequency of the signal to synthesize.
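The overlap-add step can be illustrated with a minimal sketch. It assumes a Hann weighting window exactly two periods long, positioned at the start of the stored period, and a hop expressed in samples; the function name `overlap_add` and its parameters are hypothetical, and the reharmonization that produces the stored period is not shown.

```python
import math

def overlap_add(period_samples, target_period, n_frames):
    """Overlap-add windowed repetitions of one stored period.

    period_samples: one period of a perfectly periodic waveform.
    target_period:  hop between successive windowed waveforms, in samples
                    (the period of the imposed fundamental frequency).
    n_frames:       number of windowed waveforms to add.
    """
    p = len(period_samples)
    win_len = 2 * p  # window spans two periods of the stored waveform
    # Assumption: a periodic Hann window is used for weighting.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    # Time-domain repetition of the stored period, then weighting.
    grain = [period_samples[n % p] * window[n] for n in range(win_len)]
    out = [0.0] * (target_period * (n_frames - 1) + win_len)
    for k in range(n_frames):
        start = k * target_period
        for n, v in enumerate(grain):
            out[start + n] += v
    return out

period = [math.sin(2 * math.pi * n / 8) for n in range(8)]
same_pitch = overlap_add(period, 8, 4)   # hop equals the stored period
lower_pitch = overlap_add(period, 10, 4)  # longer hop, lower fundamental
```

When the hop equals the stored period, the two overlapping Hann windows sum to one, so the interior of the output reproduces the stored waveform; a different hop changes the fundamental while reusing the same windowed waveform.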
Abstract:
The aim is to synthesize clear, highly natural speech in a Japanese-language speech synthesis system by improving not only phoneme information but also rhythm information. In Japanese, independent word speech and adjunct word speech differ remarkably in their speech characteristics. The difference is observed particularly clearly in rhythmical elements such as the intensity, speed, and pitch of speech. Based on this fact, a new rule synthesis method is provided which uses as a speech synthesis unit an adjunct word chain unit comprising a chain of one or more adjunct words, and which is capable of synthesizing highly natural speech. The portion other than the adjunct word portion, i.e., the independent word portion, is constituted in CV/VC units.
Abstract:
The invention relates to a method and an arrangement for speech synthesis and provides an automatic mechanism for simulating human speech. The method provides a number of control parameters for controlling a speech synthesis device. The invention solves the problem of coarticulation by using an interpolation mechanism. The control parameters are stored in a matrix or a sequence list for each polyphone. The behaviour of each parameter over time is defined around each phoneme boundary, and polyphones are joined by forming a weighted mean value of the curves defined by their two associated matrices or sequence lists. The invention also provides an arrangement for carrying out the method.
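Joining two polyphones by a weighted mean of their parameter curves might look like the sketch below. The linear weighting, the function name `join_curves`, and the sample values are assumptions chosen for illustration; the abstract does not state which weights are used.

```python
def join_curves(left, right):
    """Blend two equally long parameter curves around a shared phoneme boundary.

    Assumption: the weight moves linearly from the left polyphone's curve
    to the right polyphone's curve across the overlap region.
    """
    n = len(left)
    if n != len(right):
        raise ValueError("curves must cover the same time points")
    if n == 1:
        return [(left[0] + right[0]) / 2]
    return [
        ((n - 1 - i) * l + i * r) / (n - 1)
        for i, (l, r) in enumerate(zip(left, right))
    ]

# Hypothetical control parameter (e.g. a formant frequency in Hz)
# as defined by two polyphones over the same three time points.
blended = join_curves([500, 500, 500], [700, 700, 700])
print(blended)
```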
Abstract:
A system for pattern-based speech recognition is provided, e.g., a hidden Markov model (HMM) based speech recognizer using Viterbi scoring. The principle of minimum recognition error rate is applied using discriminative training. Various issues related to the special structure of HMMs are presented. Parameter update expressions for HMMs are provided.
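For readers unfamiliar with Viterbi scoring, a minimal log-domain sketch follows. It shows only how a single HMM scores an observation sequence along its best state path; the discriminative training and the parameter update expressions mentioned in the abstract are not reproduced here, and the function name and probability values are hypothetical.

```python
import math

def viterbi_score(log_init, log_trans, log_emit):
    """Log-probability of the best state path through an HMM.

    log_init[j]:     log initial probability of state j.
    log_trans[i][j]: log transition probability from state i to state j.
    log_emit[t][j]:  log-likelihood of observation frame t in state j.
    """
    n = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n)]
    for frame in log_emit[1:]:
        score = [
            max(score[i] + log_trans[i][j] for i in range(n)) + frame[j]
            for j in range(n)
        ]
    return max(score)

def to_log(values):
    return [math.log(v) for v in values]

# Hypothetical two-state model and two observation frames.
log_init = to_log([0.6, 0.4])
log_trans = [to_log([0.7, 0.3]), to_log([0.4, 0.6])]
log_emit = [to_log([0.5, 0.1]), to_log([0.4, 0.3])]

best = viterbi_score(log_init, log_trans, log_emit)
```

A recognizer of this kind would typically compute such a score for each candidate model and choose the highest one.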
Abstract:
A system synchronously segments a speech waveform using the pitch period and a center of the pitch waveform. The pitch waveform center is determined by finding a local minimum of a centroid histogram waveform of the low-pass filtered speech waveform for one pitch period. The speech waveform can then be represented by one or more of such pitch waveforms or segments during speech compression, reconstruction or synthesis. The pitch waveform can be modified by frequency enhancement/filtering or by waveform stretching/shrinking for speech synthesis or speech disguise. The utterance rate can also be controlled to speed up or slow down the speech.
Abstract:
A karaoke apparatus produces a karaoke accompaniment which accompanies a singing voice of an actual player, and concurrently creates a harmony voice originating from a virtual player. In the karaoke apparatus, a memory device stores voice information of the virtual player. An input device collects the singing voice of the actual player. An analyzing device analyzes an audio frequency of the collected singing voice. A synthesizing device processes the stored voice information based on the analyzed audio frequency to synthesize the harmony voice having another audio frequency which is set in harmony with the analyzed audio frequency. An output device mixes the collected singing voice and the synthesized harmony voice with each other, and outputs the mixed singing and harmony voices along with the karaoke accompaniment. In one preferred embodiment, the memory device stores the voice information in the form of a sequence of phonetic elements that are successively sampled syllable by syllable from a singing voice of the virtual player.
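One simple way to derive a frequency "in harmony with" the analyzed frequency is to shift it by a fixed musical interval. The sketch below assumes equal temperament and a default interval of a major third (four semitones); neither choice is stated in the abstract, and the function name is hypothetical.

```python
def harmony_frequency(analyzed_hz, semitones=4):
    """Frequency a given number of equal-tempered semitones above the input.

    Assumption: the harmony interval is fixed; a real system might pick
    the interval from the key and chord of the karaoke accompaniment.
    """
    return analyzed_hz * 2 ** (semitones / 12)

third_above = harmony_frequency(440.0)
octave_above = harmony_frequency(440.0, 12)
```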
Abstract:
A system and method are provided for automatically computing local pitch contours from textual input to produce pitch contours that closely mimic those found in natural speech. The methodology of the invention incorporates parameterized equations whose parameters can be estimated directly from natural speech recordings. That methodology incorporates a model based on the premise that pitch contours instantiating a particular pitch contour class can be described as distortions in the temporal and frequency domains of a single, underlying contour. After the nature of the pitch contour for different pitch contour classes has been established, a pitch contour can be predicted that closely models a natural speech contour for a synthetic speech utterance by adding the individual contours of the different intonational classes and adjusting the boundaries of these to match the boundaries of the adjacent intonation curves.
Abstract:
In a waveform compilation (waveform concatenation or synthesis-by-rule) type speech synthesis method and speech synthesizer, phoneme waveform segments in natural speech waveforms are clustered, and the phoneme waveform segment whose parameter is nearest the centroid of the LPC parameters of all the phoneme waveforms in each cluster is selected and stored as a representative phoneme waveform in a waveform information memory. When synthesizing a speech waveform, representative phoneme waveforms of the same phonemes, whose context is most similar to that of each phoneme of a phoneme string of the speech to be synthesized, are selectively read out of the waveform information memory, and the representative phoneme waveforms thus read out are sequentially concatenated for output as a continuous synthesized speech waveform.
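Selecting the representative segment of a cluster can be sketched as below. The sketch assumes each segment is summarized by a fixed-length parameter vector and that "nearest" means smallest squared Euclidean distance; the segment identifiers and vectors are made up for illustration.

```python
def representative(cluster):
    """Return the id of the segment nearest the cluster centroid.

    cluster: list of (segment_id, parameter_vector) pairs, where each
    vector stands in for the LPC parameters of one phoneme segment.
    Assumption: distance is squared Euclidean distance.
    """
    dim = len(cluster[0][1])
    centroid = [sum(vec[d] for _, vec in cluster) / len(cluster) for d in range(dim)]
    nearest = min(
        cluster,
        key=lambda item: sum((x - c) ** 2 for x, c in zip(item[1], centroid)),
    )
    return nearest[0]

# Hypothetical cluster of three segments with 2-dimensional parameters.
cluster = [("seg1", [0.0, 0.0]), ("seg2", [1.0, 1.0]), ("seg3", [4.0, 4.0])]
chosen = representative(cluster)
print(chosen)
```

Note that the stored representative is an actual segment from the cluster, not the centroid itself, so a natural waveform is always available for concatenation.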
Abstract:
A verification system determines whether unknown input speech contains a recognized keyword or consists of speech or other sounds that do not contain any of the keywords. The verification system is designed to operate on the subword level, so that the verification process is advantageously vocabulary independent. Such a vocabulary-independent verifier is achieved by a two-stage verification process comprising subword level verification followed by string level verification. The subword level verification stage verifies each subword segment in the input speech, as determined by a Hidden Markov Model (HMM) recognizer, to determine whether that segment consists of the sound corresponding to the subword that the HMM recognizer assigned to that segment. The string level verification stage combines the results of the subword level verification to make the rejection decision for the whole keyword. Advantageously, the training of this two-stage verifier is independent of the specific vocabulary set, implying that when the vocabulary set is updated or changed the verifier need not be retrained and can still reliably verify the new set of keywords.
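The two-stage decision can be sketched as follows. The sketch assumes each subword segment has already received a verification score from the first stage, that the string-level stage combines them by a plain average, and that a fixed threshold drives the rejection decision; all three choices, along with the function name and the scores, are illustrative assumptions.

```python
def verify_keyword(subword_scores, threshold=0.0):
    """String-level verification from subword-level verification scores.

    subword_scores: one score per subword segment of the hypothesized
    keyword (higher means the segment better matches its assigned subword).
    Assumption: scores are averaged and compared with a fixed threshold.
    Returns True to accept the keyword and False to reject it.
    """
    if not subword_scores:
        raise ValueError("at least one subword score is required")
    string_score = sum(subword_scores) / len(subword_scores)
    return string_score >= threshold

accepted = verify_keyword([1.2, 0.4, 0.8])
rejected = verify_keyword([0.5, -2.0, 0.1])
```

Because the function sees only per-subword scores, nothing in it depends on which keywords are in the vocabulary, which mirrors the vocabulary independence described above.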