摘要:
A method of performing a search of a database of speakers, includes: receiving a query speech sample spoken by a query speaker; deriving a query utterance from the query speech sample; extracting query utterance statistics from the query utterance; performing Kernelized Locality-Sensitive Hashing (KLSH) using a kernel function, the KLSH using as input the query utterance statistics and utterance statistics extracted from a plurality of utterances included in a database of speakers in order to select a subset of the plurality of utterances; and comparing, using an utterance comparison equation, the query utterance statistics to the utterance statistics for each utterance in the subset to generate a list of speakers from the database of utterances having a highest similarity to the query speaker.
摘要:
A voice sample characterization front-end suitable for use in a distributed speech recognition context. A digitized voice sample 31 is split between a low frequency path 32 and a high frequency path 33. Both paths are used to determine spectral content suitable for use when determining speech recognition parameters (such as cepstral coefficients) that characterize the speech sample for recognition purposes. The low frequency path 32 has a thorough noise reduction capability. In one embodiment, the results of this noise reduction are used by the high frequency path 33 to aid in de-noising without requiring the same level of resource capacity as used by the low frequency path 32.
摘要:
A system for enhancing the signal-to-noise ratio of a speech signal is avoided. A plurality of local energy maximums associated with a speech signal are determined. Presumably, each of these local energy maximums defines a speech pitch period. Typically, human pitch periods are approximately 100-400 Hz depending on the sex and age of the speaker. Because human speech typically includes more energy near the beginning of a pitch period than at the end of the pitch period, and background noise tends to remain relatively constant throughout the pitch period, the speech signal may be enhanced by increasing the energy associated with the beginning of the pitch period and/or by decreasing the energy associated with the end of the pitch period. Preferably, the amount of energy increase in the earlier portion of the pitch period is approximately equal to the amount of energy reduction in the later portion of the pitch period. In this manner, the total energy remains the constant.
摘要:
A method and apparatus for robust speech activity detection is disclosed. The method may include calculating autocorrelations by filtering input signals using order statistic filtering, averaging the autocorrelations over a time period, obtaining a voiced speech feature from the averaged autocorrelations, classifying the input signal as one of speech and non-speech based on the obtained voiced speech feature, and outputting only the classified speech signals or the input signals along with the speech/non-speech classification information, to an automated speech recognizer.