Abstract:
A method of converting a voice signal spoken by a source speaker into a converted voice signal having acoustic characteristics that resemble those of a target speaker. The method includes the following steps: determining (1) at least one function for transforming the acoustic characteristics of the source speaker into acoustic characteristics similar to those of the target speaker; and transforming the acoustic characteristics of the voice signal to be converted using the at least one transformation function. The method is characterized in that: (i) the transformation-function-determining step (1) determines a single function for the joint transformation of characteristics relating to the spectral envelope and characteristics relating to the fundamental frequency of the source speaker; and (ii) the transformation step applies this joint transformation function.
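A minimal sketch of the joint-transformation idea, assuming (this is an illustrative assumption, not the patent's actual procedure) that each frame stacks spectral-envelope coefficients and log-F0 into one feature vector and that the joint function is a single affine map fitted by least squares over paired source/target frames:

```python
import numpy as np

def train_joint_transform(src_feats, tgt_feats):
    """Fit y ~ A x + b by least squares over paired frames.
    Each row of src_feats/tgt_feats stacks spectral-envelope
    coefficients with log-F0, so envelope and pitch are
    converted jointly by one function rather than separately."""
    X = np.hstack([src_feats, np.ones((len(src_feats), 1))])
    W, *_ = np.linalg.lstsq(X, tgt_feats, rcond=None)
    return W  # shape (dim + 1, dim); the last row is the bias b

def apply_joint_transform(W, frame):
    """Convert one stacked envelope+log-F0 frame."""
    return np.append(frame, 1.0) @ W
```

Because envelope and F0 sit in the same vector, the learned map can capture their correlation, which is the point of a joint rather than independent transformation.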
Abstract:
A method for synthesizing speech from text includes receiving one or more waveforms characteristic of a voice of a person selected by a user, generating a personalized voice font based on the one or more waveforms, and delivering the personalized voice font to the user's computer, whereby speech in the voice of the selected person can be synthesized from text using the personalized voice font. A system includes a text-to-speech (TTS) application operable to generate a voice font based on speech waveforms transmitted from a client computer remotely accessing the TTS application.
Abstract:
An apparatus is constructed for converting an input voice signal into an output voice signal according to a target voice signal. In the apparatus, an input device provides the input voice signal composed of original sinusoidal components and original residual components other than the original sinusoidal components. An extracting device extracts original attribute data from at least the sinusoidal components of the input voice signal; the original attribute data is characteristic of the input voice signal. A synthesizing device synthesizes new attribute data based on both the original attribute data derived from the input voice signal and target attribute data characteristic of the target voice signal, which is composed of target sinusoidal components and target residual components other than the target sinusoidal components. The target attribute data is derived from at least the target sinusoidal components. An output device operates on the new attribute data and either the original residual components or the target residual components to produce the output voice signal.
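A minimal sketch of this sinusoids-plus-residual scheme, under the simplifying assumption that the "attribute data" are vectors of sinusoidal-component amplitudes and that the new attributes are a linear blend of original and target; the apparatus's actual attributes and synthesis rule may differ:

```python
import numpy as np

def synthesize_attributes(orig_attr, tgt_attr, blend):
    """Blend original and target attribute data:
    blend=0 keeps the input voice's attributes,
    blend=1 matches the target voice's attributes."""
    return (1.0 - blend) * np.asarray(orig_attr) + blend * np.asarray(tgt_attr)

def render(freqs_hz, amps, residual, sr=16000):
    """Resynthesize the output voice: sum of sinusoids at the new
    attribute amplitudes, plus a residual taken from either the
    input or the target signal."""
    t = np.arange(len(residual)) / sr
    voiced = np.zeros_like(t)
    for f, a in zip(freqs_hz, amps):
        voiced += a * np.sin(2 * np.pi * f * t)
    return voiced + residual
```

The residual carries the noise-like part of the voice that sinusoids model poorly, which is why the output device may draw it from either signal.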
Abstract:
2^M sets of model data strings (M is a positive integer with M ≥ 2) are polymorphed. The model data strings are acquired by defining at least 2^M coordinates being morphed in an M-dimensional model-data mapping space and making the defined model data strings correspond to the coordinates being morphed, respectively. A unit cell is set in the space. The unit cell consists of a hyper-rectangular parallelepiped having 2^M vertices, each located at one of the coordinates being morphed. A desired coordinate is set, as a morphing-destination coordinate, within the unit cell. The 2^M sets of model data strings corresponding, set by set, to the coordinates being morphed are polymorphed using weighting factors depending on the distances from the respective coordinates being morphed to the morphing-destination coordinate in the unit cell. Accordingly, a string of synthesized data corresponding to the morphing-destination coordinate is produced. The string of synthesized data is output using an outputting device.
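The distance-dependent weighting over the 2^M vertices of a unit cell can be sketched as multilinear interpolation; the product-of-coordinates weighting below is one standard choice consistent with the description, not necessarily the patent's exact factors:

```python
import numpy as np
from itertools import product

def polymorph(vertex_data, coord):
    """Blend 2**M model data strings placed at the vertices of an
    M-dimensional unit cell.
    vertex_data: dict mapping an M-tuple of 0/1 (a vertex) to an array.
    coord: morphing-destination point, each component in [0, 1].
    The weight of vertex v is prod_i (c_i if v_i == 1 else 1 - c_i),
    so weights shrink as the destination moves away from a vertex."""
    M = len(coord)
    out = None
    for v in product((0, 1), repeat=M):
        w = np.prod([coord[i] if v[i] else 1.0 - coord[i]
                     for i in range(M)])
        term = w * np.asarray(vertex_data[v], dtype=float)
        out = term if out is None else out + term
    return out
```

At a vertex the output equals that vertex's data string; at the cell center all 2^M strings contribute equally.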
Abstract:
A normalizer (100, 300) of the accent of accented speech modifies (210, 410) the characteristics of input signals that represent the speech spoken in an individual voice with an accent to form output signals that represent the speech spoken in the same voice but with less or no accent.
Abstract:
A voice conversion device includes a strained-rough-voice conversion unit (10) that can generate the “strained rough” voice produced in parts of a speech when speaking forcefully with excitement, nervousness, anger, or emphasis, and can thereby enrich vocal expression, such as anger, excitement, or an animated or lively way of speaking, through voice quality change. The strained-rough-voice conversion unit (10) includes: a strained phoneme position designation unit (11) that designates a phoneme to be uttered as a “strained rough” voice in a speech; and an amplitude modulation unit (14) that performs modulation including periodic amplitude fluctuation on a speech waveform. According to the designation of the strained phoneme position designation unit (11), the amplitude modulation unit (14) generates the “strained rough” voice by applying this modulation to the part to be uttered as the “strained rough” voice, producing speech with realistic and rich expression uttered forcefully with excitement, nervousness, anger, or emphasis.
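A minimal sketch of the periodic amplitude-fluctuation step, assuming a sinusoidal modulator applied only over the designated phoneme span; the modulation frequency and depth below are invented illustrative values, not figures from the device:

```python
import numpy as np

def strain_segment(wave, start, end, sr=16000, mod_freq=80.0, depth=0.5):
    """Multiply the designated span wave[start:end] by a periodic
    amplitude envelope to produce the "strained rough" quality;
    samples outside the span are left untouched."""
    out = np.asarray(wave, dtype=float).copy()
    t = np.arange(end - start) / sr
    out[start:end] *= 1.0 + depth * np.sin(2 * np.pi * mod_freq * t)
    return out
```

Restricting the modulation to the designated phoneme is what lets the rest of the utterance keep its ordinary voice quality.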
Abstract:
A voice synthesis method comprising a step of choosing a synthetic voice from among a set of voices having predetermined spectral signatures, and a step of recording the natural voice of a first person. The method comprises a step of transforming the recorded natural voice so as to conform to the spectral signature of the chosen synthetic voice, the transformed natural voice then being recorded. The method further comprises a step of determining at least one situation parameter for a first character from among a set of predefined parameters, each predefined parameter being associated with a spectral alteration of the emitted voice; the determined situation parameter characterizes, in particular, the environment or the physical or psychological state of the character. Finally, the method comprises a step of spectrally altering the transformed natural voice so as to conform to the spectral alteration associated with the character's situation parameter.
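An illustrative sketch only: here both the chosen voice's "spectral signature" and each situation's "spectral alteration" are modeled as per-band gain curves applied in sequence; the band layout and the situation names are assumptions, not taken from the method:

```python
import numpy as np

# Hypothetical situation parameters, each mapped to a per-band gain
# curve (4 bands, low to high) describing its spectral alteration.
SITUATION_ALTERATIONS = {
    "underwater": np.array([1.0, 0.6, 0.2, 0.05]),  # muffle high bands
    "frightened": np.array([0.8, 1.0, 1.3, 1.5]),   # emphasize high bands
}

def transform_voice(band_energies, signature_gains, situation):
    """First conform the recorded natural voice to the chosen
    synthetic voice's spectral signature, then apply the spectral
    alteration associated with the character's situation parameter."""
    shaped = np.asarray(band_energies) * np.asarray(signature_gains)
    return shaped * SITUATION_ALTERATIONS[situation]
```

Keeping the two steps separate mirrors the method: the signature fixes the character's identity, while the situation parameter varies with the scene.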
Abstract:
A device suitable for use in various applications, including, for example, sound production applications and video game applications. In one non-limiting embodiment, the device comprises a sound capturing unit for generating a first signal indicative of vocal sound produced by a user and an image capturing unit for generating a second signal indicative of images of a mouth region of the user. The device also comprises a processing unit communicatively coupled to the sound capturing unit and the image capturing unit for processing the first signal and the second signal. In an example in which the device is used for sound production, the processing unit is operative for processing the first signal and the second signal to cause a sound production unit to emit sound audibly perceivable as being a modified version of the vocal sound produced by the user. In an example in which the device is used for playing a video game, the processing unit is operative for processing the second signal to generate a video game feature control signal for controlling a feature associated with the video game. The feature associated with the video game may be a virtual character of the video game. The processing unit is further operative for processing the first signal for causing a sound production unit to emit sound associated with the video game.
Abstract:
A personality-based theme may be provided. An application program may query a personality resource file for a prompt corresponding to a personality. Then the prompt may be received at a speech synthesis engine. Next, the speech synthesis engine may query a personality voice font database for a voice font corresponding to the personality. Then the speech synthesis engine may apply the voice font to the prompt. The voice font applied prompt may then be produced at an output device.
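A hedged sketch of the described query flow; the personality name, prompt string, and font parameters below are invented for illustration, and the two dictionaries stand in for the personality resource file and the voice font database:

```python
# Stand-ins for the personality resource file and voice font database.
PERSONALITY_PROMPTS = {"pirate": "Ahoy! Ye have new mail."}
PERSONALITY_FONTS = {"pirate": {"pitch_scale": 0.8, "rate": 0.9}}

def render_prompt(personality):
    """Mimic the flow: the application fetches the personality's
    prompt, the synthesis engine fetches the matching voice font,
    and the font is "applied" by attaching its synthesis parameters
    to the prompt. A real engine would drive a synthesizer with
    these parameters before the output device plays the result."""
    prompt = PERSONALITY_PROMPTS[personality]
    font = PERSONALITY_FONTS[personality]
    return {"text": prompt, **font}
```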
Abstract:
A method of modifying acoustic characteristics of an original audio signal as a function of modification instructions relating at least to the fundamental frequency and the spectral envelope of the original signal. The method comprises a first modification operation applied to the original signal to deliver an intermediate audio signal, the first modification operation being intended to deform the spectral envelope of the original signal in application of said spectral envelope modification instruction; and a second modification operation applied to the intermediate signal to deliver a final audio signal, the second modification operation being intended to modify at least the fundamental frequency of the intermediate signal, in application of a modification factor that is determined so as to take account of the effects of the first modification operation on the fundamental frequency of the original audio signal, so that the fundamental frequency obtained for the final signal conforms to said instruction relating to fundamental frequency.
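The compensation logic of the second operation can be sketched as follows, under the assumption (mine, for illustration) that F0 changes are expressed as multiplicative ratios; the method itself only requires that the second operation account for the first operation's side effect on F0:

```python
def f0_compensation_factor(requested_ratio, induced_ratio):
    """The envelope-deforming first operation shifts F0 by
    induced_ratio as a side effect, so the second operation must
    apply requested_ratio / induced_ratio for the final signal's
    F0 to conform to the instruction."""
    return requested_ratio / induced_ratio
```

For example, if the instruction asks for a 1.5x F0 change and the envelope deformation alone already induced a 1.2x shift, the second operation applies 1.25x, since 1.2 × 1.25 = 1.5.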