摘要:
A system and a method is disclosed for verifying a voice of a user conducting a telephone transaction. The system and method includes a mechanism for prompting the user to speak in a limited vocabulary. A feature extractor converts the limited vocabulary into a plurality of speech frames. A pre-processor is coupled to the feature extractor for processing the plurality of speech frames to produce a plurality of processed frames. The processing includes frame selection, which eliminates each of the plurality of speech frames having an absence of words. A Viterbi decoder is also coupled to said feature extractor for assigning a frame label to each of the plurality of speech frames to produce a plurality of frame labels. The processed frames and frame labels are then combined to produce a voice model, which includes each of the plurality of frame labels that correspond to the number of plurality of processed frames. A mechanism is also provided for comparing the voice model with the claimant's voice model, derived during a previous enrollment session. The voice model also is compared with an alternate voice model set, derived during previous enrollment sessions. The identity claimed is accepted if the voice model matches the claimant's voice model better than the alternative voice model set.
摘要:
A speech verification process involves comparison of enrollment and test speech data and an improved method of comparing the data is disclosed, wherein segmented frames of speech are analyzed jointly, rather than independently. The enrollment and test speech are both subjected to a feature extraction process to derive fixed-length feature vectors, and the feature vectors are compared, using a linear discriminant analysis and having no dependence upon the order of the words spoken or the speaking rate. The discriminant analysis is made possible, despite a relatively high dimensionality of the feature vectors, by a mathematical procedure provided for finding an eigenvector to simultaneously diagonalize the between-speaker and between-channel covariances of the enrollment and test data.
摘要:
A method for performing noise suppression and channel equalization of a noisy voice signal comprising the steps of sampling the noisy voice signal at a predetermined sampling rate fs; segmenting the sampled voice signal into a plurality of frames having a predetermined number of samples per frame, over a predetermined temporal window; generating an N-point spectral sample representation of each of the sample signal frames; determining the magnitude of each of the N-point spectral samples and generating a histogram of the energy associated with each of the N-point spectral samples at a particular frequency; detecting a peak amplitude of the histogram which corresponds to a noise threshold Nf associated with the particular frequency; determining a channel frequency response Cf associated with the particular frequency by determining a geometric mean over all the spectral samples having magnitude exceeding the noise threshold Nf; subtracting from each of the magnitudes of the N point spectral samples the noise threshold Nf to provide a noise suppressed sample sequence; applying blind deconvolution to the noise suppressed samples; transforming the deconvolved noise suppressed sampled sequence to a temporal representation; shifting the temporal sample sequence in time by a predetermined amount; and adding the time shifted temporal samples over a period corresponding to the predetermined temporal window to provide a suppressed noise voice signal.