摘要:
Systems and methods for speech synthesis and, in particular, text-to-speech systems and methods for converting a text input to a synthetic waveform by processing prosodic and phonetic content of a spoken example of the text input to accurately mimic the input speech style and pronunciation. Systems and methods provide an interface to a TTS system to allow a user to input a text string and a spoken utterance of the text string, extract prosodic parameters from the spoken input, and process the prosodic parameters to derive corresponding markup for the text input to enable a more natural sounding synthesized speech.
摘要:
A speech coding apparatus and method uses a hierarchy of prototype sets to code an utterance while consuming fewer computing resources. The value of at least one feature of an utterance is measured during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values. A plurality of level subsets of prototype vector signals is computed, wherein each prototype vector signal in a higher level subset is associated with at least one prototype vector signal in a lower level subset. Each level subset contains a plurality of prototype vector signals, with lower level subsets containing more prototypes than higher level subsets. The closeness of the feature value of the first feature vector signal is compared to the parameter values of prototype vector signals in the first level subset of prototype vector signals to obtain a ranked list of prototype match scores for the first feature vector signal and each prototype vector signal in the first level subset. The closeness of the feature value of the first feature vector signal is compared to the parameter values of each prototype vector signal in a second (lower) level subset that is associated with the highest ranking prototype vectors in the first level subset, to obtain a second ranked list of prototype match scores. The identification value of the prototype vector signal in the second ranked list having the best prototype match score is output as a coded utterance representation signal of the first feature vector signal.
摘要:
A method for generating a frequency warping function comprising preparing the training speech of a source and a target speaker; performing frame alignment on the training speech of the speakers; selecting aligned frames from the frame-aligned training speech of the speakers; extracting corresponding sets of formant parameters from the selected aligned frames; and generating a frequency warping function based on the corresponding sets of formant parameters. The step of selecting aligned frames preferably selects a pair of aligned frames in the middle of the same or similar frame-aligned phonemes with the same or similar contexts in the speech of the source speaker and target speaker. The step of generating a frequency warping function preferably uses the various pairs of corresponding formant parameters in the corresponding sets of formant parameters as key positions in a piecewise linear frequency warping function to generate the frequency warping function.
摘要:
A driving directions system loads into memory a limited subset of prerecorded, spoken utterances of geographic names from a mass media storage. The subset of spoken utterances may be limited, for example, to the geographic names within a predetermined radius (e.g., a few miles) of the driver's present location. The present location of the driver may be manually entered into the driving directions system by the driver, or automatically determined using a global positioning system (“GPS”) receiver. As the vehicle moves from its present location, the driving directions system loads into memory new names from the mass media storage and overwrites, if necessary, those which are now geographically out of range. Based on the current location of the driving, the driving directions system can audibly output geographic names from the run-time memory.
摘要:
A speech synthesis system is disclosed that utilizes a pitch contour resulting in a more natural-sounding speech. The present invention modifies the predicted pitch, b(t), for synthesized speech using a low frequency energy booster. The low frequency energy booster interpolates the discrete pitch values, if necessary, and increase the amount of energy of the pitch contour associated with low frequency values, such as all frequency values below 10 Hertz. The amount of energy of the pitch contour associated with low frequency values can be increased, for example, by adding band-limited noise (a carrier signal) to the pitch contour, b(t), or by filtering the pitch values with an impulse response filter having a pole at the desired low frequency value. The present invention serves to add vibrato to the to the original pitch contour, b(t), and thereby improves the naturalness of the synthetic waveform.
摘要:
A multi-lingual translation system that provides multiple output sentences for a given word or phrase. Each output sentence for a given word or phrase reflects, for example, a different emotional emphasis, dialect, accents, loudness or rates of speech. A given output sentence could be selected automatically, or manually as desired, to create a desired effect. For example, the same output sentence for a given word or phrase can be recorded three times, to selectively reflect excitement, sadness or fear. The multi-lingual translation system includes a phrase-spotting mechanism, a translation mechanism, a speech output mechanism and optionally, a language understanding mechanism or an event measuring mechanism or both. The phrase-spotting mechanism identifies a spoken phrase from a restricted domain of phrases. The language understanding mechanism, if present, maps the identified phrase onto a small set of formal phrases. The translation mechanism maps the formal phrase onto a well-formed phrase in one or more target languages. The speech output mechanism produces high-quality output speech. The speech output may be time synchronized to the spoken phrase using the output of the event measuring mechanism.
摘要:
A method for generating a frequency warping function comprising preparing the training speech of a source and a target speaker; performing frame alignment on the training speech of the speakers; selecting aligned frames from the frame-aligned training speech of the speakers; extracting corresponding sets of formant parameters from the selected aligned frames; and generating a frequency warping function based on the corresponding sets of formant parameters. The step of selecting aligned frames preferably selects a pair of aligned frames in the middle of the same or similar frame-aligned phonemes with the same or similar contexts in the speech of the source speaker and target speaker. The step of generating a frequency warping function preferably uses the various pairs of corresponding formant parameters in the corresponding sets of formant parameters as key positions in a piecewise linear frequency warping function to generate the frequency warping function.
摘要:
TTS synthesis systems are provided which implement computationally fast and efficient pitch contour smoothing methods for determining smooth pitch contours for non-smooth pitch contours, which closely track the non-smooth pitch contours. For example, a TTS method includes generating a sequence of phonetic units representative of a target utterance, determining a pitch contour for the target utterance, the pitch contour comprising a plurality of linear pitch contour segments, wherein each linear pitch contour segment has start and end times at anchor points of the pitch contour, filtering the pitch contour to determine pitch values of a smooth pitch contour at the anchor points, and determining the smooth pitch contour between adjacent anchor points by linearly interpolating between the pitch values of the smooth pitch contour at the anchor points.
摘要:
Examples of paralinguistic events (e.g., breaths, coughs, sighs, etc.) are recorded. A text-to-speech (“TTS”) engine may insert the examples into a stream of synthetic speech using, for example, markup. The synthetic speech may include a combination of normal text and paralinguistic text.
摘要:
Methods and apparatus for providing multiple output channels in a microphone. More particularly, provision is made for an arrangement wherein a single microphone is adapted to produce one or more different audio outputs depending upon characteristics of a speaker or user of the microphone while facilitating a high degree of accuracy in the recognition of the user or speaker by the arrangement. The microphone is adapted to produce one or more different audio streams or outputs depending upon the speaker presently using the microphone. In effect, this can be readily implemented by a main user or speaker, such as an interviewer on a radio or TV talk show, or any speaker in a conference room, intending to control the audio output streams by suitably activating a button or switch.