Abstract:
A speech synthesizing method that synthesizes natural-sounding speech is disclosed. A standardized frame power value of the n-th frame is calculated with the frame power values at the head and tail frames of a phoneme standardized. The average of the power values sampled at a predetermined frequency interval from the power frequency characteristics of the n-th frame is set as a mean frame power value. The sum of squares of the signal levels in one frame of a frequency signal from a sound source is calculated as a frame power correction value. A speech envelope signal is calculated as a function whose variables are the standardized frame power value, the frame power correction value and the mean frame power value. The amplitude level of a speech waveform signal supplied from a vocal tract filter is adjusted according to the level of the speech envelope signal.
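As a minimal sketch of two of the quantities named above, the following Python computes a mean frame power value and a frame power correction value. The function names, the bin step, and the numbers are assumptions made for illustration; the abstract does not give the form of the envelope function itself, so that step is left out.

```python
def mean_frame_power(power_spectrum, step):
    """Average the power values sampled every `step` frequency bins."""
    samples = power_spectrum[::step]
    return sum(samples) / len(samples)


def frame_power_correction(source_frame):
    """Sum of squares of the signal levels in one frame of the source signal."""
    return sum(x * x for x in source_frame)


# Made-up illustrative data, not taken from the abstract.
power_spectrum = [4.0, 1.0, 2.0, 3.0, 6.0, 5.0]
source_frame = [1.0, -2.0, 2.0]

mean_power = mean_frame_power(power_spectrum, step=2)   # averages 4.0, 2.0, 6.0
correction = frame_power_correction(source_frame)       # 1.0 + 4.0 + 4.0
```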
Abstract:
A digital signal processing device outputs data held in an output register of a DSP in synchronism with a second clock pulse whose frequency is lower than that of a first clock pulse used for arithmetic processing in the DSP. Accordingly, data output from the DSP can be read directly by a microcomputer, and the contents of, for example, a coefficient memory and a delay time memory can be updated in accordance with the read data. Further, the digital signal processing device can be applied to audio apparatus such as a loudness controller and a spectrum indicating apparatus.
Abstract:
Before speech recognition is executed, a composite acoustic model adapted to noise is generated by composing a noise adaptive representative acoustic model, generated by noise adaptation of each representative acoustic model, with the difference models stored in advance in a storing section. A noise and speaker adaptive acoustic model is then generated by applying speaker adaptation to the composite acoustic model using the feature vector series of uttered speech. A renewal difference model is generated from the difference between the noise and speaker adaptive acoustic model and the noise adaptive representative acoustic model, and replaces the difference model stored in the storing section. Speech recognition is performed by comparing the feature vector series of the uttered speech to be recognized with the composite acoustic model adapted to noise and speaker, which is generated by composing the noise adaptive representative acoustic model with the renewal difference model.
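The difference-model update can be illustrated with plain mean vectors. This is a rough sketch under strong assumptions: models are reduced to single mean vectors, composition is treated as vector addition, and the speaker-adapted vector is simply supplied rather than estimated. All names and numbers are hypothetical.

```python
def compose(representative, difference):
    """Composite model = noise-adapted representative + stored difference."""
    return [r + d for r, d in zip(representative, difference)]


def renewal_difference(speaker_adapted, representative):
    """New difference = noise/speaker-adapted model - noise-adapted representative."""
    return [s - r for s, r in zip(speaker_adapted, representative)]


noise_adapted_rep = [1.0, 2.0]        # hypothetical noise-adapted representative
stored_difference = [0.5, -0.5]       # hypothetical difference model in storage
composite = compose(noise_adapted_rep, stored_difference)

# Assumed result of speaker adaptation on the composite model (given, not computed).
speaker_adapted = [1.75, 1.25]
stored_difference = renewal_difference(speaker_adapted, noise_adapted_rep)

# Under a different noise condition, the renewed difference is reused.
new_noise_adapted_rep = [2.0, 3.0]
recognition_model = compose(new_noise_adapted_rep, stored_difference)
```

Storing only the difference keeps the speaker-specific part separate from the noise-specific part, which is why the same difference can be recombined with a representative model adapted to new noise.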
Abstract:
A trained vector creating part 15 creates a characteristic of an unvoiced sound in advance as a trained vector V. Meanwhile, a threshold value THD for distinguishing a voice from a background sound is created based on a predictive residual power ε of a sound that is generated during a non-voice period. When a voice is actually uttered, an inner product computation part 18 calculates the inner product of a feature vector A of an input signal Sa and the trained vector V. A first threshold value judging part 19 judges that a frame belongs to a voice section when the inner product is equal to or larger than a predetermined value θ, while a second threshold value judging part 21 judges that it belongs to a voice section when the predictive residual power ε of the input signal Sa is larger than the threshold value THD. When at least one of the first threshold value judging part 19 and the second threshold value judging part 21 judges that it is a voice section, a voice section determining part 300 finally judges that it is a voice section and cuts out the input signal Saf, which is in units of frames and corresponds to this voice section, as a voice Svc to be recognized.
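The two-threshold decision described above can be sketched as a single predicate. The function name, the argument names, and the sample numbers are assumptions for illustration; how the feature vector, trained vector and residual power are obtained is outside this sketch.

```python
def is_voice_frame(feature, trained, residual_power, theta, thd):
    """Flag a frame as voice if either threshold test passes."""
    inner = sum(a * v for a, v in zip(feature, trained))
    by_inner_product = inner >= theta           # first threshold judgment
    by_residual_power = residual_power > thd    # second threshold judgment
    return by_inner_product or by_residual_power


# Made-up frames: (feature vector A, predictive residual power).
frames = [
    ([0.9, 0.1], 0.2),   # passes the inner-product test only
    ([0.1, 0.0], 3.0),   # passes the residual-power test only
    ([0.1, 0.0], 0.2),   # passes neither test
]
trained = [1.0, 0.0]     # hypothetical trained vector V

flags = [is_voice_frame(f, trained, p, theta=0.5, thd=1.0) for f, p in frames]
```

Combining the tests with a logical OR means a frame missed by one criterion can still be caught by the other.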
Abstract:
A sound echo machine serving as an acoustic signal processing unit of the present invention comprises an adder to which an input signal is fed and a delay circuit that delays the signal from the adder for a certain time and repeatedly feeds it back to the adder to generate an echo sound. It further comprises an input signal level detector that detects the level of the input signal and sends it to a frequency oscillator, whose oscillating frequency varies in accordance with the detected level; the oscillator output is fed to the delay circuit so as to modulate the delay time at a predetermined cycle. The machine can thereby create an acoustic field in which a listener feels as if reflected sounds of various levels are arriving from various directions. Separately, a sound effecter serving as an acoustic signal processing unit comprises a plurality of acoustic signal processing sections, a plurality of attenuators each connected to one of these sections, and an adder that sums the signals from the attenuators. It further comprises a signal mixing ratio control section that monitors the input acoustic signal level and determines the mixing ratio among the output signals of the acoustic signal processing sections in accordance with the monitored level, so that even a simple structure can provide a specific sound effect.
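The adder-plus-delay feedback loop at the core of the echo machine can be sketched as below. This covers only the basic loop with a fixed delay; the level-dependent modulation of the delay time is omitted, and the feedback gain is an assumption added here so that the echoes decay (the abstract does not mention one).

```python
def feedback_echo(signal, delay, gain):
    """Adder with a delayed feedback path: y[n] = x[n] + gain * y[n - delay]."""
    out = []
    for n, x in enumerate(signal):
        fed_back = gain * out[n - delay] if n >= delay else 0.0
        out.append(x + fed_back)
    return out


# A unit impulse makes the repeated, decaying echoes easy to see.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
echoed = feedback_echo(impulse, delay=2, gain=0.5)
```

With these values the impulse reappears every two samples at half the previous amplitude.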
Abstract:
An operator recognition device is provided that refuses to register data, such as HMM data, whose characteristic amount is prone to recognition errors, thereby reducing the possibility of recognition errors and providing stable recognition performance. When registering HMM data to be used in recognition processing, a speaker recognition device 100 rejects HMM data of a password whose spoken voice component has a characteristic amount similar to one indicated by HMM data that is already registered, and so does not allow the registration of HMM data that is estimated to cause recognition errors easily during the recognition process.
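The registration check can be sketched as a similarity gate in front of a registry. This is a simplified stand-in: each entry is a plain feature vector rather than HMM data, similarity is measured with Euclidean distance, and the threshold and names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def try_register(registry, name, features, min_distance):
    """Register `features` unless they are too close to an existing entry."""
    for existing in registry.values():
        if euclidean(existing, features) < min_distance:
            return False          # too similar: likely to be confused later
    registry[name] = features
    return True


registry = {}
results = [
    try_register(registry, "alpha", [0.0, 0.0], min_distance=1.0),
    try_register(registry, "beta", [0.3, 0.4], min_distance=1.0),   # 0.5 from alpha
    try_register(registry, "gamma", [3.0, 4.0], min_distance=1.0),  # 5.0 from alpha
]
```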
Abstract:
There is provided a voice recognition device and a voice recognition method that enhance the function of noise adaptation processing in voice recognition processing and reduce the capacity of a memory being used. Acoustic models are subjected to clustering processing to calculate the centroid of each cluster and the differential vector between the centroid and each model, model composition between each kind of assumed noise model and the calculated centroid is carried out, and the centroid of each composition model and the differential vector are stored in a memory. In the actual recognition processing, the centroid optimal to the environment estimated by the utterance environmental estimation is extracted from the memory, model restoration is carried out on the extracted centroid by using the differential vector stored in the memory, and noise adaptation processing is executed on the basis of the restored model.
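The centroid-plus-differential storage scheme can be illustrated on plain vectors. In this sketch each acoustic model is reduced to a mean vector, and model composition with a noise model is replaced by a simple additive offset on the centroid; both are assumptions made to keep the example small, and all names and values are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of the vectors in one cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def differentials(vectors, center):
    """Differential vector between the centroid and each model."""
    return [[x - c for x, c in zip(v, center)] for v in vectors]


def restore(adapted_center, diffs):
    """Rebuild each model from an adapted centroid and its stored differential."""
    return [[c + d for c, d in zip(adapted_center, diff)] for diff in diffs]


cluster = [[1.0, 2.0], [3.0, 4.0]]             # made-up model means
center = centroid(cluster)
diffs = differentials(cluster, center)

# Stand-in for composing the centroid with an assumed noise model.
noise_offset = [0.5, 0.5]
adapted_center = [c + n for c, n in zip(center, noise_offset)]

restored = restore(adapted_center, diffs)
```

Only one composed centroid per noise type plus one set of differentials needs to be kept, rather than a full set of composed models per noise type, which is where the memory saving comes from.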
Abstract:
At the time of speaker adaptation, first feature vector generation sections (7, 8, 9) generate a feature vector series [Ci, m] from which the additive noise and multiplicative noise are removed. A second feature vector generation section (12) generates a feature vector series [Si, m] that includes the features of the additive noise and multiplicative noise. A path search section (10) conducts a path search by comparing the feature vector series [Ci, m] with the standard vector [an, m] of the standard voice HMM (300). A speaker adaptation section (11) then performs a correlation operation on an average feature vector [S^n, m] of the standard vector [an, m] corresponding to the path search result Dv and the feature vector series [Si, m], thereby generating an adaptive vector [xn, m]. The adaptive vector [xn, m] updates the feature vector of the speaker adaptive acoustic model (400) used for speech recognition.
Abstract:
A multiplicative distortion Hm(cep) is subtracted from a voice HMM 5, and a multiplicative distortion Ha(cep) of the uttered voice is subtracted from a noise HMM 6. The subtraction results Sm(cep) and {Nm(cep)−Ha(cep)} are combined to form a combined HMM 18 in the cepstrum domain. A cepstrum R^a(cep), obtained by subtracting the multiplicative distortion Ha(cep) from the cepstrum Ra(cep) of the uttered voice, is compared with the distribution R^m(cep) of the combined HMM 18 in the cepstrum domain, and the combined HMM with the maximum likelihood is output as the voice recognition result.
Abstract:
A pitch control apparatus suppresses the occurrence of a tremolo tone when interval control is performed. Input audio signal data is written to a memory position at a designated writing address in a memory in a predetermined order for every sampling cycle. A plurality of reading addresses of the memory are designated for every sampling cycle and are set in an order different from the predetermined order for each cycle that is a multiple of the sampling cycle by a predetermined multiplier. Data is read from the memory positions at the designated reading addresses, a coefficient is set in accordance with the address interval between the writing address and each of the designated reading addresses, the data read out at the reading addresses are multiplied by the associated coefficients, and the results are added together as output data. The maximum value of the interval between the reading addresses, $D_{\max}$, is set as $D_{\max} = T_{d\max} / \{(1 - 1/J_n) \cdot T_0\}$ when the pitch is to be raised, and as $D_{\max} = T_{d\max} / \{(1 + 1/J_n) \cdot T_0\}$ when the pitch is to be lowered, where $T_0$ denotes the sampling cycle of the input audio signal data, $J_n$ denotes how many times longer than the sampling cycle $T_0$ the cycle for skipping sampling data or reading sampling data twice should be, and $T_{d\max}$ denotes an allowable time for a time-dependent data shift between the reading addresses. The allowable time is set to 45 to 80 msec, a range in which the reverberation phenomenon is not noticeably disturbing.
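As a worked example of the two expressions for Dmax, the following sketch evaluates them for one set of values. The function name and the chosen numbers (a 48 kHz sampling rate, Jn = 16, and an allowable time of 60 msec, which lies inside the stated 45 to 80 msec range) are assumptions for illustration only.

```python
def max_read_address_interval(td_max, jn, t0, raise_pitch):
    """Dmax = Tdmax / ((1 -/+ 1/Jn) * T0); minus when raising, plus when lowering."""
    if raise_pitch and jn <= 1:
        raise ValueError("jn must exceed 1 when raising the pitch")
    sign = -1.0 if raise_pitch else 1.0
    return td_max / ((1.0 + sign / jn) * t0)


t0 = 1.0 / 48000.0    # assumed sampling cycle in seconds
td_max = 0.060        # assumed allowable time shift in seconds
jn = 16               # assumed skip/repeat cycle, in sampling cycles

d_max_up = max_read_address_interval(td_max, jn, t0, raise_pitch=True)
d_max_down = max_read_address_interval(td_max, jn, t0, raise_pitch=False)
```

For these values the interval comes out at about 3072 addresses when raising the pitch and about 2711 when lowering it.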