Abstract:
A conversion rule and a rule selection parameter are stored. The conversion rule converts a spectral parameter of a source speaker to a spectral parameter of a target speaker. The rule selection parameter represents the spectral parameter of the source speaker. A first conversion rule for the start timing and a second conversion rule for the end timing of a speech unit of the source speaker are selected based on the spectral parameters at the start timing and the end timing. An interpolation coefficient corresponding to the spectral parameter at each timing in the speech unit is calculated from the first conversion rule and the second conversion rule. A third conversion rule corresponding to the spectral parameter at each timing in the speech unit is calculated by interpolating the first conversion rule and the second conversion rule with the interpolation coefficient. The spectral parameter at each timing is converted to a spectral parameter of the target speaker by the third conversion rule. A spectrum acquired from the spectral parameter of the target speaker is compensated by a spectral compensation quantity, and a speech waveform is generated from the compensated spectrum.
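The interpolation scheme above can be sketched as follows. This is a minimal illustration, not the patented implementation: the (scale, offset) form of a conversion rule and the linear-in-time interpolation coefficient are assumptions made for clarity (the abstract derives the coefficient from the spectral parameter at each timing).

```python
def interpolate_rule(rule_start, rule_end, alpha):
    """Third conversion rule: per-coefficient linear blend of the
    start-timing and end-timing rules, each a (scales, offsets) pair."""
    scales = [(1 - alpha) * a + alpha * b for a, b in zip(rule_start[0], rule_end[0])]
    offsets = [(1 - alpha) * a + alpha * b for a, b in zip(rule_start[1], rule_end[1])]
    return scales, offsets

def convert_unit(frames, rule_start, rule_end):
    """Convert each frame of a speech unit; here the interpolation
    coefficient alpha moves linearly from 0 (start timing) to 1 (end timing)."""
    n = len(frames)
    converted = []
    for t, frame in enumerate(frames):
        alpha = t / (n - 1) if n > 1 else 0.0
        scales, offsets = interpolate_rule(rule_start, rule_end, alpha)
        converted.append([s * x + o for s, x, o in zip(scales, frame, offsets)])
    return converted
```

Frames at the start timing are converted purely by the first rule, frames at the end timing purely by the second, and interior frames by the interpolated third rule, which keeps the converted trajectory continuous across the unit.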
Abstract:
An improved system and method for enabling and implementing codebook-based voice conversion that both significantly reduces the memory footprint and improves the continuity of the output. In various embodiments, the paired source-target codebook is implemented as a multi-stage vector quantizer. During conversion, the N best candidates from a tree search are taken as the output of the quantizer. The N candidates for each vector to be converted are then used in a dynamic-programming-based approach that finds a smooth but accurate output sequence.
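The dynamic-programming stage can be sketched as a Viterbi-style search over the N best candidates per frame. This is a toy illustration under assumptions not stated in the abstract: candidates are scalar target values paired with quantization distortions, and the smoothness term is a weighted absolute jump between consecutive chosen values.

```python
def smooth_candidate_path(candidates, weight=1.0):
    """Pick one of the N best quantizer candidates per frame so that the
    total quantization distortion plus weighted frame-to-frame jump is
    minimal. candidates[t] is a list of (target_value, distortion) pairs."""
    T = len(candidates)
    cost = [d for _, d in candidates[0]]
    back = [[0] * len(candidates[0])]  # backpointers; entry for t=0 is unused
    for t in range(1, T):
        prev = candidates[t - 1]
        new_cost, ptr = [], []
        for v, d in candidates[t]:
            j_best = min(range(len(prev)),
                         key=lambda j: cost[j] + weight * abs(v - prev[j][0]))
            new_cost.append(cost[j_best] + weight * abs(v - prev[j_best][0]) + d)
            ptr.append(j_best)
        cost = new_cost
        back.append(ptr)
    # trace back the smooth-but-accurate output sequence
    j = min(range(len(cost)), key=cost.__getitem__)
    path = []
    for t in range(T - 1, -1, -1):
        path.append(candidates[t][j][0])
        j = back[t][j]
    return path[::-1]
```

With the smoothness weight the search can prefer a candidate with slightly higher distortion when it avoids a large discontinuity, which is the stated point of keeping N candidates rather than the single best.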
Abstract:
In one embodiment, the methods and apparatuses detect an original audio signal; detect a sound model, wherein the sound model includes a sound parameter; transform the original audio signal based on the sound parameter, thereby forming a transformed audio signal; and compare the transformed audio signal with the original audio signal.
Abstract:
An apparatus for providing efficient evaluation of feature transformation includes a training module and a transformation module. The training module is configured to train a Gaussian mixture model (GMM) using training source data and training target data. The transformation module is in communication with the training module. The transformation module is configured to produce a conversion function in response to the training of the GMM. The training module is further configured to determine a quality of the conversion function prior to use of the conversion function by calculating a trace measurement of the GMM.
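One plausible reading of the trace measurement is sketched below: summing the traces of the component covariance matrices of the trained GMM, weighted by the mixture weights, as a proxy for how much target variance the conversion function would leave unexplained. The specific formula is an assumption; the abstract only states that a trace measurement of the GMM is used to judge quality before use.

```python
def gmm_trace_quality(covariances, weights):
    """Weighted sum of the traces of the GMM component covariance
    matrices (plain nested lists). A smaller value suggests tighter
    components and hence a more reliable conversion function."""
    total = 0.0
    for w, cov in zip(weights, covariances):
        total += w * sum(cov[i][i] for i in range(len(cov)))
    return total
```

Because the measure needs only the diagonal of each covariance matrix, it can be evaluated cheaply at training time, before any conversion is run.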
Abstract:
A method for converting a voice signal from a source speaker into a converted voice signal with acoustic characteristics similar to those of a target speaker includes the steps of determining (1) at least one function for transforming source speaker acoustic characteristics into acoustic characteristics similar to those of the target speaker using target and source speaker voice samples; and transforming acoustic characteristics of the source speaker voice signal to be converted by applying the transformation function(s). The method is characterized in that the transformation (2) includes the step (44) of applying only a predetermined portion of at least one transformation function to said signal to be converted.
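Applying only a predetermined portion of a transformation function can be sketched as moving each coefficient a fraction of the way toward its fully converted value. The blending formula and the `portion` parameter are illustrative assumptions, not the claimed method.

```python
def partial_transform(frame, full_transform, portion=0.5):
    """Apply only a predetermined portion of the source-to-target
    transformation: each coefficient moves `portion` of the way toward
    the fully converted value, retaining some source characteristics."""
    converted = full_transform(frame)
    return [x + portion * (y - x) for x, y in zip(frame, converted)]
```

With `portion=1.0` this reduces to the full conversion, and with `portion=0.0` the source signal passes through unchanged, so the parameter trades conversion strength against naturalness.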
Abstract:
A method and apparatus for reducing noise in a speech signal. A handset or remote unit provides users having a hearing deficiency with a first mode of operation in which noise-suppression/speech-enhancement algorithms are used during any auditory-related service. There is also provided, in a related mode of operation, speech filtering for reducing noise in a speech signal received through the microphone and outputting the filtered sound to the speaker. The handset includes a microphone for receiving an auditory sound, a receiver for receiving an auditory signal, and a speech filter for suppressing noise in the auditory signal and sound. The speech filter may also be configured to shift the frequency and/or alter the intensity of the auditory signal and sound. The speaker amplifies and outputs the enhanced speech component as an audible sound.
Abstract:
A method for analyzing fundamental frequency information contained in voice samples includes at least one analysis step (2) for the voice samples, which are grouped together in frames, in order to obtain information relating to the spectrum and information relating to the fundamental frequency for each sample frame; a step (20) for the determination of a model representing the common characteristics of the spectrum and fundamental frequency of all samples; and a step (30) for the determination of a fundamental frequency prediction function exclusively according to spectrum-related information, on the basis of the model and the voice samples.
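Predicting fundamental frequency exclusively from spectral information can be sketched with a joint model of (spectrum, F0) observations. The nearest-neighbour lookup below is a deliberately simple stand-in for the statistical model the abstract describes; the function names and the distance measure are assumptions.

```python
def train_joint_model(frames):
    """frames: list of (spectrum_vector, f0) pairs. Here the 'model' is
    just the stored joint observations; a real system would fit a
    statistical model over the joint space."""
    return list(frames)

def predict_f0(model, spectrum):
    """Predict the fundamental frequency from spectral information only:
    return the F0 of the nearest spectral neighbour in the joint model."""
    def dist(s):
        return sum((a - b) ** 2 for a, b in zip(s, spectrum))
    best = min(model, key=lambda pair: dist(pair[0]))
    return best[1]
```

The key property the sketch preserves is that prediction time uses only the spectrum: the F0 values enter solely through the model built from the voice samples.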
Abstract:
A method and apparatus are provided for adjusting a content of an oral presentation provided by an agent of an organization and perceived by a human target of the organization based upon an objective of the organization. The method includes the steps of detecting a content of the oral presentation provided by the agent and modifying the oral presentation provided by the agent to produce the oral presentation perceived by the human target based upon the detected content and the organizational objective.
Abstract:
A method for differentiated digital voice and music processing, noise filtering, and the creation of special effects. The method makes the most of digital audio technologies by performing a pre-encoding audio signal analysis, under the assumption that any sound signal during one frame interval is the sum of sines having fixed amplitudes and frequencies that are linearly modulated as a function of time, the sum being temporally modulated by the signal envelope and the noise being added to the signal prior to the sum.
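The stated signal model can be sketched as a frame synthesizer: a sum of fixed-amplitude linear chirps, scaled by the envelope, with a noise term added. The chirp phase formula and the exact placement of the noise term are illustrative assumptions.

```python
import math

def synthesize_frame(partials, envelope, noise, sr=16000):
    """One frame of the model: sum of fixed-amplitude sines whose
    frequency moves linearly from f_start to f_end over the frame,
    multiplied by the envelope, with noise added.
    partials: list of (amplitude, f_start, f_end) in Hz."""
    n = len(envelope)
    T = n / sr  # frame duration in seconds
    out = []
    for i in range(n):
        t = i / sr
        s = 0.0
        for amp, f0, f1 in partials:
            # instantaneous phase of a linear chirp:
            # 2*pi * (f0*t + (f1 - f0) * t^2 / (2*T))
            phase = 2 * math.pi * (f0 * t + (f1 - f0) * t * t / (2 * T))
            s += amp * math.sin(phase)
        out.append(envelope[i] * s + noise[i])
    return out
```

Fitting this model per frame is what lets the method separate the deterministic sinusoidal part from the noise part before encoding or effect processing.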
Abstract:
An acoustic analysis unit acoustically analyzes a first utterance by a user. A pattern-by-characteristic selection unit selects the trained pattern optimal for the user's utterance from a plurality of trained patterns that are previously classified and stored for each characteristic. A speaker adaptation processor determines a spectral frequency distortion coefficient for correcting the difference between spectral frequencies, which is caused by the difference in vocal tract length between the training speaker and the input speaker. Using this determination in the recognition of subsequent utterances improves recognition performance for the subsequent speech sounds.
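The spectral frequency distortion coefficient can be sketched as a frequency-warping factor applied to the spectrum, in the spirit of vocal tract length normalization. The piecewise-linear warp below is a common stand-in and an assumption; the abstract does not specify the warping form.

```python
def warp_spectrum(spectrum, alpha):
    """Warp a magnitude spectrum (list of bins) by distortion
    coefficient alpha: bin i of the warped spectrum is read, with
    linear interpolation, from position i/alpha of the original."""
    n = len(spectrum)
    warped = []
    for i in range(n):
        pos = min(i / alpha, n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped
```

With `alpha = 1.0` the spectrum is unchanged; values above or below one stretch or compress the frequency axis, compensating for the vocal-tract-length mismatch between the training and input speakers.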