摘要:
This invention relates to a framework for converting a source speech signal associated with a source voice into a target speech signal that is a representation of the source speech signal associated with a target voice. The source speech signal is encoded into samples of encoding parameters, wherein the encoding comprises the step of segmenting the source speech signal into segments based on characteristics of the source speech signal. The samples of the encoding parameters, or a converted representation of the samples of the encoding parameters are then decoded to obtain the target speech signal. Therein, in the encoding, the decoding or in a separate step, samples of parameters related to the source speech signal are converted into samples of parameters related to the target speech signal. Therein, at least one of the encoding and the converting depends on the segments of the source speech signal.
摘要:
Systems and methods are provided for performing soft alignment in Gaussian mixture model (GMM) based and other vector transformations. Soft alignment may assign alignment probabilities to source and target feature vector pairs. The vector pairs and associated probabilities may then be used calculate a conversion function, for example, by computing GMM training parameters from the joint vectors and alignment probabilities to create a voice conversion function for converting speech sounds from a source speaker to a target speaker.
摘要:
In the concatenative text-to-speech system, high compression rate of duration data in the prosodic template is achieved by extracting statistical parameters describing behavior of actual duration values of instances of each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, and storing only the extracted statistical parameters, instead of the original duration values. Entries of each given basic unit in the prosodic template is sorted and indexed in the order of increasing duration value. Consequently, the amount of duration data can be significantly reduced, while keeping the error statistically under acceptable range.
摘要:
Systems and methods are provided for performing soft alignment in Gaussian mixture model (GMM) based and other vector transformations. Soft alignment may assign alignment probabilities to source and target feature vector pairs. The vector pairs and associated probabilities may then be used calculate a conversion function, for example, by computing GMM training parameters from the joint vectors and alignment probabilities to create a voice conversion function for converting speech sounds from a source speaker to a target speaker.
摘要:
An apparatus for providing data clustering and mode selection includes a training element and a transformation element. The training element is configured to receive a first training data set, a second training data set and auxiliary data extracted from the same material as the first training data set. The training element is also configured to train a classifier to group the first training data set into M clusters based on the auxiliary data and the first training data set and train M processing schemes corresponding to the M clusters for transforming the first training data set into the second training data set. The transformation element is in communication with the training element and is configured to cluster the second training data set into M clusters based on features associated with the second training data set.
摘要:
A system and method of extracting information from a lexicon and using the information with a computer software program. Lexicon data is arranged for a particular language using Unicode values or other uniquely defined code values for each character of word of the language. A location array is then created for the lexicon data arranged by Unicode value or other uniquely defined code value. Upon a request to search for a word, words that have the same initial character as the searched-for word are identified using the location array. The identified words are then searched for an identified word that matches the searched-for word. Therefore, the amount of data loaded into run-time memory is minimized, and searches for a given word are completely more quickly than in conventional systems.
摘要:
An improved system method for enabling and implementing codebook-based voice conversion that both significantly reduces the memory footprint and improves the continuity of the output. In various embodiments, the paired source-target codebook is implemented as a multi-stage vector quantizer. During the conversion, N best candidates in a tree search are taken as the output from the quantizer. The N candidates for each vector to be converted are used in a dynamic programming-based approach that finds a smooth but accurate output sequence.
摘要:
An apparatus for providing efficient evaluation of feature transformation includes a training module and a transformation module. The training module is configured to train a Gaussian mixture model (GMM) using training source data and training target data. The transformation module is in communication with the training module. The transformation module is configured to produce a conversion function in response to the training of the GMM. The training module is further configured to determine a quality of the conversion function prior to use of the conversion function by calculating a trace measurement of the GMM.
摘要:
This invention relates to a method, a device and a software application product for correcting a pronunciation of a speech object. The speech object is synthetically generated from a text object in dependence on a segmented representation of the text object. It is determined if an initial pronunciation of the speech object, which initial pronunciation is associated with an initial segmented representation of the text object, is incorrect. Furthermore, in case it is determined that the initial pronunciation of the speech object is incorrect, a new segmented representation of the text object is determined, which new segmented representation of the text object is associated with a new pronunciation of the speech object.
摘要:
A method of providing content dependent media content mixing includes automatically determining an emotional property of a first media content input, determining a specification for a second media content in response to the determined emotional property, and producing the second media content in accordance with the specification.