摘要:
An operator recognition device is provided that eliminates the registration of data such as HMM data having a characteristic amount for which error in recognition occurs easily when recognizing an operator, and thus reduces the possibility of errors in recognition, and has stable recognition performance. When registering HMM data that is used when performing recognition processing, a speaker recognition device 100 eliminates the registration of HMM data of a password having a characteristic amount of the spoken voice component that is similar to a characteristic amount that is indicated by HMM data that is already registered, and does not allow the registration of HMM data for which it is estimated that error in recognition will occur easily during the recognition process.
摘要:
An operator recognition device is provided that eliminates the registration of data such as HMM data having a characteristic amount for which error in recognition occurs easily when recognizing an operator, and thus reduces the possibility of errors in recognition, and has stable recognition performance. When registering HMM data that is used when performing recognition processing, a speaker recognition device 100 eliminates the registration of HMM data of a password having a characteristic amount of the spoken voice component that is similar to a characteristic amount that is indicated by HMM data that is already registered, and does not allow the registration of HMM data for which it is estimated that error in recognition will occur easily during the recognition process.
摘要:
A trained vector creating part 15 creates a characteristic of an unvoiced sound in advance as a trained vector V. Meanwhile, a threshold value THD for distinguishing a voice from a background sound is created based on a predictive residual power ε of a sound which is created during a non-voice period. As a voice is actually uttered, an inner product computation part 18 calculates an inner product of a feature vector A of an input signal Sa and a trained vector V, and a first threshold value judging part 19 judges that it is a voice section when the inner product has a value which is equal to or larger than a predetermined value θ while a second threshold value judging part 21 judges that it is a voice section when the predictive residual power ε of the input signal Sa is larger than a threshold value THD. As at least one of the first threshold value judging part 19 and the second threshold value judging part 21 judges that it is a voice section, a voice section determining part 300 finally judges that it is a voice section and cuts out an input signal Saf which are in units of frames and corresponds to this voice section as a voice Svc which is to be recognized.
摘要:
A multiplicative distortion Hm(cep) is subtracted from a voice HMM 5, a multiplicative distortion Ha(cep) of the uttered voice is subtracted from a noise HMM 6 formed by HMM, and the subtraction results Sm(cep) and {Nm(cep)−Ha (cep)} are combined with each other to thereby form a combined HMM 18 in the cepstrum domain. A cepstrum R^a(cep) obtained by subtracting the multiplicative distortion Ha (cep) from the cepstrum Ra (cep) of the uttered voice is compared with the distribution R^m(cep) of the combined HMM 18 in the cepstrum domain, and the combined HMM with the maximum likelihood is output as the voice recognition result.
摘要:
An initial combination HMM 16 is generated from a voice HMM 10 having multiplicative distortions and an initial noise HMM of additive noise, and at the same time, a Jacobian matrix J is calculated by a Jacobian matrix calculating section 19. Noise variation Namh (cep), in which an estimated value Ha^(cep) of the multiplicative distortions that are obtained from voice that is actually uttered, additive noise Na(cep) that is obtained in a non-utterance period, and additive noise Nm(cep) of the initial noise HMM 17 are combined, is multiplied by a Jacobian matrix, wherein the result of the multiplication and initial combination HMM 16 are combined, and an adaptive HMM 26 is generated. Thereby, an adaptive HMM 26 that is matched to the observation value series RNah(cep) generated from actual utterance voice can be generated in advance. When performing voice recognition by collating the observation value series RNah(cep) with adaptive HMM 26, influences due to the multiplicative distortions and additive distortions are counterbalanced, wherein an effect that is equivalent to a case where voice recognition is carried out with clean voice can be obtained, and a robust voice recognition system can be achieved.
摘要:
An acoustic model registration apparatus, an talker recognition apparatus, an acoustic model registration method and an acoustic model registration processing program, each of which prevents certainly an acoustic model having a low recognition capability for talker from being registered certainly, are provided.When a talker utters for the N utterances and the utterance sounds of the N utterances are input through the microphone 1, the sound feature quantity extraction part 4 extracts sound feature quantities which indicate the acoustic features of the input utterance sounds, wherein each sound feature quantity has one-to-one correspondence to each utterance, the talker model generation part 5 generates a talker model based on the extracted sound feature quantities for the N utterances, the collation part 6 calculates the degree of individual similarity between the each sound feature quantity of the N utterances and the talker model generated above, and only in the case that all the calculated degrees of similarities of the N utterances are equal to or more than the threshold value, the similarity verifying part 9 directs to register the generated talker model in the talker models' database as a talker model for the talker recognition.
摘要:
Before executing a speech recognition, a composite acoustic model adapted to noise is generated by composition of a noise adaptive representative acoustic model generated by noise-adaptation of each representative acoustic model and difference models stored in advance in a storing section, respectively. Then, the noise and speaker adaptive acoustic model is generated by executing speaker-adaptation to the composite acoustic model with the feature vector series of uttered speech. The renewal difference model is generated by the difference between the noise and speaker adaptive acoustic model and the noise adaptive representative acoustic model, to replace the difference model stored in the storing section therewith. The speech recognition is performed by comparing the feature vector series of the uttered speech to be recognized with the composite acoustic model adapted to noise and speaker generated by the composition of the noise adaptive representative acoustic model and the renewal difference model.
摘要:
A sound echo machine as an acoustic signal processing unit of the present invention comprising an adder to which an input signal is fed, and a delay circuit for delaying the signal fed from the adder for a certain time to repeatedly feed back to the adder to generate an echo sound further comprises an input signal level detector for detecting the level of the input signal and sending it to a frequency oscillator to vary the oscillating frequency in accordance with the thus detected signal level for feeding it later to the delay circuit so as to modulate the time to be delayed at a predetermined cycle, whereby it can create an acoustic field in which a listener can feel as if various level of reflected sounds are coming towards him from various directions. On the other hand, a sound effecter as an acoustic signal processing unit comprising a plurality of acoustic signal processing sections, a plurality of attenuators each connected to these acoustic signal processing sections, and an adder for summing up all the signals from these attenuators further comprises a signal mixing ratio control section for monitoring the input acoustic signal level, and determining a signal mixing ratio among the respective output signals from the plurality of acoustic signal processing sections in accordance with the thus monitored level of the input acoustic signal, whereby even a simple structure can provide a specific sound effect.
摘要:
There is provided a voice recognition device and a voice recognition method that enhance the function of noise adaptation processing in voice recognition processing and reduce the capacity of a memory being used. Acoustic models are subjected to clustering processing to calculate the centroid of each cluster and the differential vector between the centroid and each model, model composition between each kind of assumed noise model and the calculated centroid is carried out, and the centroid of each composition model and the differential vector are stored in a memory. In the actual recognition processing, the centroid optimal to the environment estimated by the utterance environmental estimation is extracted from the memory, model restoration is carried out on the extracted centroid by using the differential vector stored in the memory, and noise adaptation processing is executed on the basis of the restored model.
摘要:
At the time of the speaker adaptation, first feature vector generation sections (7, 8, 9) generate a feature vector series [Ci, M] from which the additive noise and multiplicative noise are removed. A second feature vector generation section (12) generates a feature vector series [Si, M] including the features of the additive noise and multiplicative noise. A path search section (10) conducts a path search by comparing the feature vector series [Ci, m] to the standard vector [an, m] of the standard voice HMM (300). When the speaker adaptation section (11) conducts correlation operation on an average feature vector [S^n, m] of the standard vector [an, m] corresponding to the path search result Dv and the feature vector series [Si, m], the adaptive vector [xn, m] is generated. The adaptive vector [xn, m] updates the feature vector of the speaker adaptive acoustic model (400) used for the speech recognition.