Abstract:
Voice synthesis with improved expressivity is obtained in a source-filter voice synthesiser by using a library of source sound categories in the source module. Each source sound category corresponds to a particular morphological category and is derived from real vocal sounds by inverse filtering, which subtracts the effect of the vocal tract. The library may be parametric: the stored data are then not the inverse-filtered sounds themselves but coefficients (amplitude spectra and frequency trajectories) for resynthesising those sounds with an additive sinusoidal technique. The coefficients are derived by short-time Fourier transform (STFT) analysis.
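The additive sinusoidal resynthesis mentioned above can be illustrated with a minimal sketch. It assumes that each analysis frame stores one (frequency, amplitude) pair per partial and that values are linearly interpolated between frames. The function name `additive_resynth`, the frame layout, and the interpolation scheme are assumptions made for illustration; the abstract does not specify them.

```python
import math

def additive_resynth(frames, hop, sample_rate):
    """Resynthesise a signal as a sum of sinusoids (hypothetical helper).

    frames: list of frames; each frame is a list of (freq_hz, amp) tuples,
            one tuple per partial, with the same partial count in every frame.
    hop: number of output samples between consecutive frames.
    sample_rate: output sample rate in Hz.
    """
    n_partials = len(frames[0])
    phases = [0.0] * n_partials  # running phase per partial keeps it continuous
    out = []
    for i in range(len(frames) - 1):
        cur, nxt = frames[i], frames[i + 1]
        for n in range(hop):
            t = n / hop  # interpolation position between the two frames
            sample = 0.0
            for k in range(n_partials):
                # Assumed: linear interpolation of frequency and amplitude.
                freq = cur[k][0] + t * (nxt[k][0] - cur[k][0])
                amp = cur[k][1] + t * (nxt[k][1] - cur[k][1])
                sample += amp * math.sin(phases[k])
                phases[k] += 2.0 * math.pi * freq / sample_rate
            out.append(sample)
    return out
```

Accumulating the phase sample by sample, rather than recomputing it from the frame frequency, avoids discontinuities when a frequency trajectory changes between frames.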
Abstract:
A speech synthesizing system using a redundancy-reduced waveform database is disclosed. Each waveform in a sample set of voice segments necessary and sufficient for speech synthesis is divided into pitch waveforms, which are classified into groups of closely similar pitch waveforms. One pitch waveform of each group is selected as the representative of the group and is given a pitch waveform ID. The waveform database comprises at least a pitch waveform pointer table and a pitch waveform table. Each record of the pointer table contains the voice segment ID of a voice segment and the pitch waveform IDs whose pitch waveforms, when combined in the listed order, constitute the waveform identified by that voice segment ID. The pitch waveform table maps pitch waveform IDs to the corresponding pitch waveforms. This reduces the size of the waveform database. For each pitch waveform that the database lacks, the pitch waveform of one of the IDs adjacent to the lacking pitch waveform ID in the pointer table is used instead, without deforming it.
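The two-table layout described above can be sketched as follows. This is only an illustration under stated assumptions: pitch waveforms are tuples of floats, "closely similar" is approximated by a maximum per-sample difference within `tolerance`, and the first waveform seen in a group acts as its representative. The names `build_database` and `reconstruct`, and the similarity test itself, are hypothetical and not taken from the abstract.

```python
def build_database(segments, tolerance):
    """Build the pointer table and the pitch waveform table (hypothetical helper).

    segments: dict mapping voice segment ID -> list of pitch waveforms,
              each pitch waveform being a tuple of floats.
    tolerance: assumed similarity bound on the per-sample difference.
    """
    pitch_table = {}    # pitch waveform ID -> representative pitch waveform
    pointer_table = {}  # voice segment ID -> ordered list of pitch waveform IDs
    for seg_id, waveforms in segments.items():
        ids = []
        for w in waveforms:
            match = None
            for pid, rep in pitch_table.items():
                same_length = len(rep) == len(w)
                if same_length and all(abs(a - b) <= tolerance
                                       for a, b in zip(rep, w)):
                    match = pid  # reuse the stored representative
                    break
            if match is None:
                match = len(pitch_table)  # assign a new pitch waveform ID
                pitch_table[match] = tuple(w)
            ids.append(match)
        pointer_table[seg_id] = ids
    return pointer_table, pitch_table

def reconstruct(seg_id, pointer_table, pitch_table):
    """Concatenate the representatives listed for one voice segment."""
    out = []
    for pid in pointer_table[seg_id]:
        out.extend(pitch_table[pid])
    return out
```

The database shrinks because near-duplicate pitch waveforms are stored once and referenced by ID from every voice segment that uses them.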
Abstract:
The invention concerns a digital speech-synthesis process in which utterances in a language are recorded and divided into speech segments, which are stored so that they can be allocated to specific phonemes. A text to be output as speech is converted to a phoneme chain, and the stored segments are output in the sequence defined by the phoneme chain. An analysis of the text provides information which completes the phoneme chain and modifies the timing sequence signal for the speech segments that are to be strung together for output as speech. The invention is characterised by the use of microsegments as speech segments, consisting of: segments for vowel halves and semi-vowel halves, vowels standing between consonants being split into two microsegments, a first vowel half beginning shortly before the start of the vowel and extending as far as the vowel middle, and a second vowel half from the vowel middle to just before the vowel end; segments for quasi-stationary vowel components cut from the middle of a vowel; consonant segments beginning shortly before the front phoneme boundary and ending shortly before the rear phoneme boundary; and segments for vowel-vowel sequences cut from the middle of a vowel-vowel transition.
Abstract:
A speech synthesizing apparatus deforms and connects speech pieces to synthesize speech. It has a speech waveform database that stores, for speech pieces of a word or a syllable uttered with type-0 accent and type-1 accent, data of the accent type of the speech piece, data of its phonemic transcription, and data of a position at which the speech piece can be segmented. An input buffer stores a character string of phonemic transcription and prosody of the speech to be synthesized. A synthesis unit selecting unit retrieves candidate speech pieces from the speech waveform database on the basis of the character string of phonemic transcription in the input buffer. A used speech piece selecting unit determines which of the retrieved candidates is actually used, according to the accent type of the speech to be synthesized and the position in the speech at which the speech piece is used. This prevents degradation of sound quality when the speech piece is processed.
Abstract:
A system (87) for generating high quality speech uses coarticulated speech segment data extracted from spoken carrier syllables and digitally compressed for storage using adaptive differential pulse code modulation (ADPCM). The system includes a programmed digital microprocessor (89) with an associated read only memory (ROM) (91) containing the compressed coarticulated speech segment library, a random access memory (RAM) (93) containing system variables and the sequence of coarticulated speech segments required to generate a desired spoken message, and a text-to-speech chip (95) which provides the sequence of coarticulated speech segments to the RAM (93). The microprocessor (89) operates in accordance with a program stored in ROM (91) to recover the compressed coarticulated speech segment data stored in ROM (91) in the sequence called for by the text-to-speech chip (95), to reconstruct or "blow back" the stored ADPCM data to PCM data, and to concatenate the PCM data into a real-time digital speech waveform. The digital speech waveform is converted to an analog signal by a digital-to-analog converter (97), amplified in an amplifier (99) and applied to an audio speaker (101), which generates a high quality spoken message. In the preferred embodiment of the invention, the coarticulated speech segments are diphones.
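The decode-and-concatenate step can be illustrated with a toy sketch. The abstract does not say which ADPCM variant is used, so the scheme below is an assumption invented for illustration: each code is a small signed integer, the decoder adds `code * step` to a running sample, and the step size doubles after large codes and halves after small ones. The function names, the code range, and the adaptation rule are all hypothetical and do not correspond to any standard codec.

```python
def toy_adpcm_decode(codes, initial_step=4):
    """Expand toy ADPCM codes to PCM samples (illustrative, not a standard codec).

    codes: iterable of signed integers, assumed to lie in -4..3.
    initial_step: assumed starting step size for every segment.
    """
    sample = 0
    step = initial_step
    pcm = []
    for code in codes:
        sample += code * step       # differential update of the running sample
        pcm.append(sample)
        if abs(code) >= 3:          # large code: signal moves fast, widen step
            step *= 2
        elif abs(code) <= 1:        # small code: signal moves slowly, narrow step
            step = max(1, step // 2)
    return pcm

def synthesize(sequence, library):
    """Decode each requested segment and concatenate the PCM data.

    sequence: ordered list of segment names (for example, diphone labels).
    library: dict mapping segment name -> list of toy ADPCM codes.
    """
    waveform = []
    for name in sequence:
        waveform.extend(toy_adpcm_decode(library[name]))
    return waveform
```

In this sketch the decoder state is reset for every segment, so each stored segment can be decoded independently of the segments played before it.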
Abstract:
A text-to-speech (TTS) system includes components capable of supporting the generation of speech output in any of multiple styles, and may switch seamlessly from producing speech output in one style to producing speech output in another style. For example, a concatenative TTS system may include a speech base storing speech units associated with multiple speech styles, and a linguistic analysis component to generate a phonetic transcription specifying speech output in any of multiple styles. Text input may include a style indication associated with a particular segment of the input text. The linguistic analysis component may invoke encoded rules and/or components based upon the style indication, and generate a phonetic transcription specifying a speech style, which may be processed to generate output speech.
Abstract:
A decoder for generating an audio output signal having one or more audio output channels from a downmix signal having one or more downmix channels is provided. The downmix signal encodes one or more audio object signals. The decoder has a threshold determiner for determining a threshold value depending on a signal energy and/or a noise energy of at least one of the one or more audio object signals and/or depending on a signal energy and/or a noise energy of at least one of the one or more downmix channels. Moreover, the decoder has a processing unit for generating the one or more audio output channels from the one or more downmix channels depending on the threshold value.
Abstract:
A voice synthesizing apparatus includes a manipulation determiner configured to determine a manipulation position which is moved according to a manipulation of a user, and a voice synthesizer configured to generate, in response to an instruction to generate a voice in which a second phoneme follows a first phoneme, a voice signal so that vocalization of the first phoneme starts before the manipulation position reaches a reference position and that vocalization from the first phoneme to the second phoneme is made when the manipulation position reaches the reference position.
Abstract:
A sound synthesizing apparatus includes a waveform storing section which stores a plurality of unit waveforms extracted from different positions, on a time axis, of a sound waveform indicating a voiced sound, and a waveform generating section which generates a synthesized waveform by arranging the plurality of unit waveforms on the time axis.