Abstract:
Speaker authentication is performed by determining a similarity score between a test utterance and a stored training utterance. Computing the similarity score involves determining the sum of a group of functions, where each function includes the product of a posterior probability of a mixture component and a difference between an adapted mean and a background mean. The adapted mean is formed based on the background mean and the test utterance. The speech content provided by the speaker for authentication can be text-independent (i.e., any content the speaker chooses to say) or text-dependent (i.e., a particular phrase used for training).
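The scoring described above can be sketched as follows. This is a simplified illustration, not the patent's implementation: one-dimensional Gaussian components with equal weights are assumed, and the relevance factor of 16 in the MAP adaptation is a conventional choice not taken from the abstract.

```python
import math

def posteriors(x, means, var=1.0):
    # posterior probability of each mixture component for frame x
    # (1-D Gaussians with equal weights, an illustrative assumption)
    logs = [-(x - m) ** 2 / (2 * var) for m in means]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def adapt_means(frames, ubm_means, var=1.0, relevance=16.0):
    # form each adapted mean from the background (UBM) mean and the test
    # utterance via MAP adaptation; relevance factor is a conventional choice
    k = len(ubm_means)
    n = [0.0] * k
    s = [0.0] * k
    for x in frames:
        for c, p in enumerate(posteriors(x, ubm_means, var)):
            n[c] += p
            s[c] += p * x
    adapted = []
    for c in range(k):
        if n[c] > 0:
            alpha = n[c] / (n[c] + relevance)
            adapted.append(alpha * (s[c] / n[c]) + (1 - alpha) * ubm_means[c])
        else:
            adapted.append(ubm_means[c])
    return adapted

def similarity_score(frames, ubm_means, adapted_means, var=1.0):
    # sum, over frames and components, of the posterior probability times the
    # difference between the adapted mean and the background mean
    score = 0.0
    for x in frames:
        for p, ma, mb in zip(posteriors(x, ubm_means, var), adapted_means, ubm_means):
            score += p * (ma - mb)
    return score
```

A test utterance whose frames sit near one background component pulls that component's adapted mean toward the utterance, yielding a positive score; an utterance far from the training data leaves the adapted means near the background means and the score near zero.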
Abstract:
Speaker recognition (identification and/or verification) methods and systems in which speech models for enrolled speakers consist of sets of feature vectors representing the smoothed frequency spectrum of each of a plurality of frames, a clustering algorithm is applied to the feature vectors of the frames to obtain a reduced data set representing the original speech sample, and adjacent frames are overlapped by at least 80%. Speech models of this type model the static components of the speech sample and exhibit temporal independence. An identifier strategy is employed in which modelling and classification processes are selected to give a false rejection rate substantially equal to zero. Each enrolled speaker is associated with a cohort of a predetermined number of other enrolled speakers, and a test sample is always matched with either the claimed identity or one of its associated cohort. This makes the overall error rate of the system dependent only on the false acceptance rate, which is determined by the cohort size. The false acceptance rate is further reduced by use of multiple parallel modelling and/or classification processes. Speech models are normalised prior to classification using a normalisation model derived from either the test speech sample or one of the enrolled speaker samples (most preferably from the claimed identity enrolment sample).
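The framing and clustering steps above can be sketched as follows. The 80% overlap comes from the abstract; the use of Lloyd's k-means as the clustering algorithm is an illustrative assumption, since the abstract does not name a specific algorithm.

```python
def frame_signal(samples, frame_len, overlap=0.8):
    # split a sample sequence into frames whose adjacent frames overlap
    # by at least 80% (overlap=0.8 gives a hop of 20% of the frame length)
    hop = max(1, round(frame_len * (1 - overlap)))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

def kmeans(vectors, k, iters=10):
    # Lloyd's algorithm (an assumed choice of clustering algorithm): reduce
    # the frame feature vectors to k centroids, a reduced data set
    # representing the original speech sample
    cents = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(v, cents[j])))
            groups[nearest].append(v)
        for j, g in enumerate(groups):
            if g:
                cents[j] = [sum(col) / len(g) for col in zip(*g)]
    return cents
```

With 80% overlap, each sample contributes to roughly five frames, which is what gives the model its smoothed, temporally independent character.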
Abstract:
The voice print system of the present invention concerns an automatic speaker verification (ASV) system that is subword-based and text-dependent, with no constraints on the choice of vocabulary words or language. One component of the preferred ASV system is a channel estimation and normalization component that is able to remove the characteristics of the test channel component (150) and/or enrollment channel component (90) to increase accuracy. The preferred methods and systems of the present invention, termed Curve-Fitting (62, 64, 66) and Clean Speech (82, 86, 88, 90, 92), separately, together, and in combination with Pole filtering (42, 44, 46), significantly improve the existing methods of channel estimation and normalization. Unlike Cepstral Mean Subtraction, both the Curve-Fitting (62, 64, 66) and Clean Speech (82, 86, 88, 90, 92) methods and systems extract only the channel-related information from the cepstral mean and not any speech information.
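For context, the baseline technique the abstract contrasts against, Cepstral Mean Subtraction, can be sketched in a few lines. Note this is the baseline only, not the patented Curve-Fitting or Clean Speech methods; subtracting the full cepstral mean removes a stationary channel but also discards speech information, which is the drawback those methods aim to avoid.

```python
def cepstral_mean_subtraction(cepstra):
    # baseline CMS: subtract the per-dimension mean across all frames;
    # this removes a stationary channel estimate from every frame, but the
    # subtracted mean also contains speech information (the noted drawback)
    n = len(cepstra)
    dims = len(cepstra[0])
    means = [sum(frame[d] for frame in cepstra) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in cepstra]
```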
Abstract:
The present invention relates to a pattern recognition system (Fig. 1) which uses data fusion to combine data from a plurality of extracted features (60, 61, 62) and a plurality of classifiers (70, 71, 72). Speaker patterns can be accurately verified with the combination of discriminant-based and distortion-based classifiers. A novel approach using "leave one out" training data can be used for training the system with a reduced data set (Figs. 7A, 7B, 7C). Extracted features can be improved with a pole filtered method for reducing channel effects (Fig. 11B) and an affine transformation for improving the correlation between training and testing data (Fig. 14).
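The "leave one out" splitting and the fusion of classifier outputs can be sketched as follows. The linear weighting used for fusion here is an illustrative assumption; the abstract does not specify the fusion rule.

```python
def leave_one_out(samples):
    # generate (training set, held-out sample) splits: each sample is held
    # out exactly once, so a reduced data set can still be used for training
    for i in range(len(samples)):
        yield samples[:i] + samples[i + 1:], samples[i]

def fuse_scores(scores, weights):
    # data fusion of multiple classifier outputs; a weighted linear
    # combination is an illustrative assumption, not the patent's rule
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```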
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for a dynamic threshold for speaker verification are disclosed. In one aspect, a method includes the actions of receiving, for each of multiple utterances of a hotword, a data set including at least a speaker verification confidence score and environmental context data. The actions further include selecting, from among the data sets, a subset of the data sets that are associated with a particular environmental context. The actions further include selecting a particular data set from among the subset of data sets based on one or more selection criteria. The actions further include selecting, as a speaker verification threshold for the particular environmental context, the speaker verification confidence score of the particular data set. The actions further include providing the speaker verification threshold for use in performing speaker verification of utterances that are associated with the particular environmental context.
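The selection steps above can be sketched as a small function. The choice of `min` as the selection criterion (i.e., the lowest confidence score observed in that context) is an illustrative assumption; the abstract only says "one or more selection criteria".

```python
def select_threshold(data_sets, context, criterion=min):
    # data_sets: (speaker_verification_confidence_score, environmental_context)
    # pairs. Keep the subset matching the given context, then pick one score
    # by a selection criterion; min is an assumed, illustrative criterion.
    subset = [score for score, ctx in data_sets if ctx == context]
    return criterion(subset) if subset else None
```

A threshold chosen this way adapts per context: a noisy environment, where confidence scores run lower, gets a lower threshold than a quiet one.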
Abstract:
A method and apparatus for adjusting a trigger parameter related to voice recognition processing include receiving, at the device, an acoustic signal comprising both a speech signal, which is provided to a voice recognition module, and noise. The method further includes determining a noise profile for the acoustic signal, wherein the noise profile identifies a noise level for the noise and identifies a noise type for the noise based on a frequency spectrum for the noise, and adjusting the voice recognition module based on the noise profile by adjusting a trigger parameter related to voice recognition processing.
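A noise profile of the kind described, a level plus a spectrum-derived type, can be sketched as below. The RMS level, the naive DFT, the spectral-centroid split into two noise types, and the margin constants in the trigger adjustment are all illustrative assumptions, not the patent's definitions.

```python
import math

def noise_profile(noise):
    # noise level as RMS energy; noise type from the spectral centroid of a
    # naive DFT of the noise (the two-way type split is an assumption)
    n = len(noise)
    level = math.sqrt(sum(s * s for s in noise) / n)
    mags = []
    for k in range(n // 2):
        re = sum(noise[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(noise[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    centroid = sum(k * m for k, m in enumerate(mags)) / (sum(mags) or 1.0)
    noise_type = "broadband" if centroid > n / 8 else "low-frequency"
    return level, noise_type

def adjust_trigger(base_threshold, level, noise_type):
    # raise the trigger threshold in louder noise, with an extra margin for
    # broadband noise; the constants here are illustrative assumptions
    return base_threshold + 0.5 * level + (0.1 if noise_type == "broadband" else 0.0)
```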
Abstract:
The present invention relates to a method for speech watermarking in speaker verification, comprising the steps of: embedding watermark data into a speech signal at a transmitter; and extracting the watermark data from the speech signal at a receiver; characterised by the steps of: selecting frames having the least speaker-specific information from the speech signal to carry the watermark data; detecting voice activity to detect the presence or absence of the speaker's voice in the speech signal; and embedding the watermark data into the selected frames of the speech signal according to the presence or absence of the speaker's voice.
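The embed/extract flow can be sketched as follows. This is a heavily simplified illustration: an energy threshold stands in for the voice activity detector, voice-inactive frames stand in for "frames having the least speaker-specific information", and embedding a bit as a small DC offset is an assumed, illustrative scheme.

```python
def voice_active(frame, energy_threshold=0.01):
    # crude energy-based voice activity detector (stand-in for the VAD step)
    return sum(s * s for s in frame) / len(frame) > energy_threshold

def embed_watermark(frames, bits):
    # transmitter: embed one bit per voice-inactive frame by nudging the
    # frame's DC offset; active (speech-bearing) frames are left untouched
    out, i = [], 0
    for frame in frames:
        if not voice_active(frame) and i < len(bits):
            delta = 0.005 if bits[i] else -0.005
            frame = [s + delta for s in frame]
            i += 1
        out.append(frame)
    return out

def extract_watermark(frames, nbits):
    # receiver: re-run the VAD and read each bit back from the sign of the
    # inactive frame's offset
    bits = []
    for frame in frames:
        if not voice_active(frame) and len(bits) < nbits:
            bits.append(1 if sum(frame) > 0 else 0)
    return bits
```

Because the offset is tiny, marked frames stay below the VAD energy threshold, so the receiver's VAD selects the same frames the transmitter did.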
Abstract:
A speaker verification method is proposed that first builds a general model of user utterances using a set of general training speech data. The user also trains the system by providing a training utterance, such as a passphrase or other spoken utterance. Then, in a test phase, the user provides a test utterance which includes some background noise as well as a test voice sample. The background noise is used to bring the condition of the training data closer to that of the test voice sample by modifying the training data and a reduced set of the general data, before creating adapted training and general models. Match scores are generated based on the comparison between the adapted models and the test voice sample, with a final match score calculated based on the difference between the match scores. This final match score gives a measure of the degree of matching between the test voice sample and the training utterance. It is based on the degree of matching between the speech characteristics of the extracted feature vectors that make up the respective speech signals, not on a direct comparison of the raw signals themselves. Thus, the method can be used to verify a speaker without necessarily requiring the speaker to provide a test phrase identical to the phrase provided in the training sample.
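The noise-matched scoring described above can be sketched as follows. The models here are reduced to scalar means, and the Gaussian log-likelihood match score and the way noise is mixed into the models are illustrative assumptions; only the overall shape (adapt both models with the captured background noise, score against each, take the difference) follows the abstract.

```python
def match_score(sample, model_mean):
    # crude per-sample Gaussian log-likelihood as a stand-in match score
    return -sum((s - model_mean) ** 2 for s in sample) / (2 * len(sample))

def final_match_score(test, training, general, background_noise):
    # bring the training data and the (reduced) general data closer to test
    # conditions using the captured background noise, then score the test
    # voice sample against each adapted model and take the difference
    noise_mean = sum(background_noise) / len(background_noise)
    train_model = sum(training) / len(training) + noise_mean
    general_model = sum(general) / len(general) + noise_mean
    return match_score(test, train_model) - match_score(test, general_model)
```

A positive final score means the test sample matches the adapted training model better than the adapted general model, i.e., the claimed speaker.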
Abstract:
A method and apparatus are provided for establishing a normalizing model suitable for use with a speaker model to normalize the speaker model, the speaker model modelling the voice characteristics of a specific individual, and the speaker model and the normalizing model being for use in recognizing the identity of a speaker. A normalizer module (231) within a scoring module (215) uses the normalizing score (229) to normalize the speaker score (225), thereby obtaining a normalized speaker score (217). Based on the normalized speaker score (217), a decision module (219) makes a decision (221) as to whether to believe that the test speaker (203), whose utterance was the source of the speech data (213), is the reference speaker (403).
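The normalizer and decision stages can be sketched minimally. Treating normalization as a subtraction of log-domain scores (a log-likelihood-ratio form) and using a fixed threshold are illustrative assumptions; the abstract does not specify either.

```python
def normalized_speaker_score(speaker_score, normalizing_score):
    # normalizer module (231): normalize the speaker score by the normalizing
    # score; subtraction in the log domain is an assumed, illustrative form
    return speaker_score - normalizing_score

def decide(normalized_score, threshold=0.0):
    # decision module (219): believe the test speaker is the reference
    # speaker when the normalized score clears the threshold (assumed fixed)
    return normalized_score >= threshold
```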