Abstract:
A fast, vocabulary-independent method for spotting words in speech uses a preprocessing step and a coarse-to-detailed search strategy to spot a word or phone sequence. The preprocessing includes a Viterbi-beam phone-level decoding using a tree-based phone language model. The coarse search matches phone n-grams to identify regions of speech as putative word hits, and the detailed search performs an acoustic match at the putative hits with a model of the given word included in the vocabulary of the recognizer.
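The coarse stage can be illustrated with a minimal Python sketch. It assumes the phone-level decoding has already produced a flat list of phone labels and that the query word is supplied as a phone sequence. The function names, the sliding-window scheme, and the `min_overlap` threshold are illustrative assumptions, not details taken from the abstract, and the detailed acoustic match is not modeled.

```python
from collections import Counter


def phone_ngrams(phones, n=3):
    """Return the n-grams (as tuples) found in a phone sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def coarse_search(decoded, query, n=3, min_overlap=0.5):
    """Slide a query-length window over the decoded phone stream.

    Returns (start, end, score) for every window whose share of the
    query's n-grams is at least min_overlap. The windowing and the
    threshold are assumptions made for this sketch.
    """
    query_grams = Counter(phone_ngrams(query, n))
    total = sum(query_grams.values())
    if total == 0:
        return []  # query shorter than n: nothing to match
    width = len(query)
    hits = []
    for start in range(len(decoded) - width + 1):
        window = Counter(phone_ngrams(decoded[start:start + width], n))
        # Multiset intersection counts the n-grams common to both.
        shared = sum((window & query_grams).values())
        score = shared / total
        if score >= min_overlap:
            hits.append((start, start + width, score))
    return hits
```

Each returned region would then be handed to the detailed search, which rescores it acoustically against a model of the word.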
Abstract:
A method for generating a frequency warping function comprises preparing the training speech of a source speaker and a target speaker; performing frame alignment on the training speech of the speakers; selecting aligned frames from the frame-aligned training speech of the speakers; extracting corresponding sets of formant parameters from the selected aligned frames; and generating a frequency warping function based on the corresponding sets of formant parameters. The step of selecting aligned frames preferably selects a pair of aligned frames in the middle of the same or similar frame-aligned phonemes with the same or similar contexts in the speech of the source speaker and the target speaker. The step of generating a frequency warping function preferably uses the pairs of corresponding formant parameters in the corresponding sets as key positions in a piecewise linear frequency warping function.
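The final step, using formant pairs as key positions of a piecewise linear function, can be sketched in Python as follows. The sketch assumes the formant frequencies have already been extracted from one selected pair of aligned frames. The name `make_warp`, the `max_freq` parameter, and the choice to pin the endpoints at 0 Hz and `max_freq` are assumptions introduced for illustration.

```python
def make_warp(source_formants, target_formants, max_freq=8000.0):
    """Build a piecewise linear map from source to target frequency.

    Each (source, target) formant pair is a key position. Pinning the
    curve at (0, 0) and (max_freq, max_freq) is an assumption of this
    sketch, not something stated in the abstract.
    """
    if len(source_formants) != len(target_formants):
        raise ValueError("formant lists must have the same length")
    pairs = sorted(zip(source_formants, target_formants))
    xs = [0.0] + [float(s) for s, _ in pairs] + [max_freq]
    ys = [0.0] + [float(t) for _, t in pairs] + [max_freq]

    def warp(freq):
        # Clamp to the supported range, then interpolate linearly
        # within the segment that contains freq.
        freq = min(max(freq, 0.0), max_freq)
        for i in range(len(xs) - 1):
            if xs[i] <= freq <= xs[i + 1]:
                span = xs[i + 1] - xs[i]
                if span == 0:
                    return ys[i]
                return ys[i] + (freq - xs[i]) * (ys[i + 1] - ys[i]) / span
        return max_freq

    return warp
```

For example, `make_warp([500, 1500, 2500], [600, 1700, 2600])` returns a function that maps each source formant onto its target counterpart and interpolates linearly between them.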
Abstract:
A method (and system) of determining confusable list items and resolving this confusion in a spoken dialog system includes receiving user input, processing the user input and determining if a list of items needs to be played back to the user, retrieving the list to be played back to the user, identifying acoustic confusions between items on the list, changing the items on the list as necessary to remove the acoustic confusions, and playing unambiguous list items back to the user.
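One way to picture the confusion check is the following Python sketch, which flags list items whose phone sequences are close in edit distance and then rewrites the flagged items. The abstract does not say how acoustic confusion is measured or how items are changed, so the normalized edit distance, the `threshold` value, and the caller-supplied `qualifiers` are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def confusable_pairs(pronunciations, threshold=0.34):
    """Return item pairs whose normalized phone distance is small.

    pronunciations maps each list item to its phone sequence. The
    distance measure and threshold are assumptions of this sketch.
    """
    names = list(pronunciations)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pa, pb = pronunciations[a], pronunciations[b]
            dist = edit_distance(pa, pb) / max(len(pa), len(pb), 1)
            if dist <= threshold:
                pairs.append((a, b))
    return pairs


def disambiguate(items, pairs, qualifiers):
    """Append a caller-supplied qualifier to every flagged item."""
    flagged = {name for pair in pairs for name in pair}
    return [f"{item}, {qualifiers[item]}"
            if item in flagged and item in qualifiers else item
            for item in items]
```

The rewritten list is what would be played back to the user in place of the original items.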
Abstract:
A speech synthesis system is disclosed that uses a modified pitch contour to produce more natural-sounding speech. The present invention modifies the predicted pitch, b(t), for synthesized speech using a low frequency energy booster. The low frequency energy booster interpolates the discrete pitch values, if necessary, and increases the amount of energy of the pitch contour associated with low frequency values, such as all frequency values below 10 Hertz. The amount of energy of the pitch contour associated with low frequency values can be increased, for example, by adding band-limited noise (a carrier signal) to the pitch contour, b(t), or by filtering the pitch values with an impulse response filter having a pole at the desired low frequency value. The present invention serves to add vibrato to the original pitch contour, b(t), and thereby improves the naturalness of the synthetic waveform.
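The band-limited noise variant can be sketched in Python as shown below. The sketch assumes the pitch contour b(t) is already uniformly sampled at `frame_rate` frames per second, so no interpolation is needed, and it approximates band-limited noise as a sum of random-phase sinusoids below `max_hz`. The function names, the 0.5 Hz lower bound, the component count, and the `depth_hz` scaling are illustrative assumptions. The pole-based filtering alternative is not shown.

```python
import math
import random


def band_limited_noise(n, frame_rate, max_hz=10.0, n_components=8, seed=0):
    """Approximate noise confined to frequencies below max_hz.

    Built as an average of random-phase sinusoids, so every sample
    lies in [-1, 1]. frame_rate should exceed 2 * max_hz.
    """
    rng = random.Random(seed)
    components = [(rng.uniform(0.5, max_hz), rng.uniform(0.0, 2 * math.pi))
                  for _ in range(n_components)]
    return [sum(math.sin(2 * math.pi * f * t / frame_rate + phase)
                for f, phase in components) / n_components
            for t in range(n)]


def boost_low_frequency(pitch, frame_rate, depth_hz=2.0, max_hz=10.0, seed=0):
    """Add slow, band-limited variation (vibrato) to a pitch contour.

    pitch holds one value in Hz per frame; depth_hz bounds how far
    any frame can move from its predicted value.
    """
    noise = band_limited_noise(len(pitch), frame_rate, max_hz, seed=seed)
    return [p + depth_hz * x for p, x in zip(pitch, noise)]
```

With a flat 120 Hz contour sampled at 100 frames per second, the output wanders slowly around 120 Hz and never departs from it by more than `depth_hz`.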
Abstract:
A characteristic-specific digitization method and apparatus are disclosed that reduce the error rate in converting input information into a computer-readable format. The input information is analyzed and subsets of the input information are classified according to whether the input information exhibits a specific physical parameter affecting recognition accuracy. If the input information exhibits the specific physical parameter affecting recognition accuracy, the characteristic-specific digitization system recognizes the input information using a characteristic-specific recognizer that demonstrates improved performance for the given physical parameter. If the input information does not exhibit the specific physical parameter affecting recognition accuracy, the characteristic-specific digitization system recognizes the input information using a general recognizer that performs well for typical input information. In one implementation, input speech having very low recognition accuracy as a result of a physical speech characteristic is automatically identified and recognized using a characteristic-specific speech recognizer.
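The routing logic described above can be sketched in a few lines of Python. The classifier and both recognizers are passed in as plain callables. The name `digitize` and its parameters are hypothetical stand-ins, not components named in the abstract.

```python
def digitize(segments, exhibits_characteristic,
             specific_recognizer, general_recognizer):
    """Route each input segment to the recognizer suited to it.

    exhibits_characteristic(segment) returns True when the segment
    shows the physical parameter that hurts general recognition.
    All three callables are assumptions supplied by the caller.
    """
    results = []
    for segment in segments:
        if exhibits_characteristic(segment):
            recognizer = specific_recognizer
        else:
            recognizer = general_recognizer
        results.append(recognizer(segment))
    return results
```

In the speech implementation mentioned above, the classifier might test for a property such as an unusually fast speaking rate, though the abstract does not name the characteristic.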