A speech engine for producing synthetic speech from an input in conventional orthography. The speech engine analyses the input data into small elements which are used to produce the synthetic speech. The analysis is carried out with the aid of a skeletal database 11 and a plurality of symbolic processors 12-16, each of which is adapted to perform one linguistic task. Processor 12 obtains its data from an input buffer 10, while each of processors 13-16 obtains its data from the database 11. Each processor returns its results to the database 11. The database 11 is organised in accordance with the linguistic structures, so that not only are the final and intermediate results stored but the linguistic relationships between them are also available. Preferably the database 11 is formed of a plurality of storage modules (1/1-5/7), each of which has an address. Each module has a register 100 which holds an item of data, being either an intermediate or a final result. In addition, each module contains the addresses of related modules 101, 102, 103, whereby the linguistic structure of the sentence is defined.
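The module arrangement described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Module` and `Database` classes, the `parent`/`children`/`next_sibling` link fields (the abstract says only that each module holds addresses of related modules), the "level/index" address strings, and the two toy processors. It is not the implementation of the described engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Module:
    """One storage module: a data register plus addresses of related modules."""
    address: str
    register: Optional[str] = None       # intermediate or final result
    parent: Optional[str] = None         # assumed link: enclosing unit
    children: List[str] = field(default_factory=list)  # assumed link: sub-units
    next_sibling: Optional[str] = None   # assumed link: following unit


class Database:
    """Hypothetical skeletal database: modules looked up by address."""

    def __init__(self) -> None:
        self.modules: Dict[str, Module] = {}

    def add(self, address: str, register: str,
            parent: Optional[str] = None) -> Module:
        module = Module(address=address, register=register, parent=parent)
        if parent is not None:
            siblings = self.modules[parent].children
            if siblings:
                # Chain the previous sibling to the new module.
                self.modules[siblings[-1]].next_sibling = address
            siblings.append(address)
        self.modules[address] = module
        return module


def word_processor(input_buffer: str, db: Database) -> None:
    """First processor: reads the input buffer, stores one module per word."""
    db.add("1/1", input_buffer)
    for i, word in enumerate(input_buffer.split(), start=1):
        db.add(f"2/{i}", word, parent="1/1")


def lowercase_processor(db: Database) -> None:
    """Later processor: reads word modules from the database and writes
    a normalised form back as a child module of each word."""
    for i, addr in enumerate(list(db.modules["1/1"].children), start=1):
        db.add(f"3/{i}", db.modules[addr].register.lower(), parent=addr)


db = Database()
word_processor("Hello World", db)
lowercase_processor(db)
```

Because each result is stored together with the addresses of its neighbours, a later processor can walk from the sentence module to its words and on to their derived forms without re-analysing the input.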
Techniques for predicting prosody in speech synthesis may make use of a data set of example text fragments with corresponding aligned spoken audio. To predict prosody for synthesizing an input text, the input text may be compared with the data set of example text fragments to select a best matching sequence of one or more example text fragments, each example text fragment in the sequence being paired with a portion of the input text. The selected example text fragment sequence may be aligned with the input text, e.g., at the word level, such that prosody may be extracted from the audio aligned with the example text fragments, and the extracted prosody may be applied to the synthesis of the input text using the alignment between the input text and the example text fragments.
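A rough sketch of the matching and transfer steps follows. It makes several simplifying assumptions that are not in the abstract: prosody is reduced to a (pitch, duration) pair per word, the data set `EXAMPLES` is a hypothetical in-memory dictionary, and "best matching" is approximated by a greedy longest exact match rather than any particular scoring method. Input words covered by no fragment are left without a prediction.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation: (pitch in Hz, duration in seconds) per word.
Prosody = Tuple[float, float]

# Hypothetical data set: example text fragments with word-aligned prosody,
# standing in for prosody extracted from the aligned spoken audio.
EXAMPLES: Dict[Tuple[str, ...], List[Prosody]] = {
    ("good", "morning"): [(180.0, 0.25), (150.0, 0.40)],
    ("how", "are", "you"): [(200.0, 0.15), (170.0, 0.12), (220.0, 0.30)],
    ("morning",): [(160.0, 0.35)],
}


def predict_prosody(text: str) -> List[Optional[Prosody]]:
    """Pair portions of the input with example fragments and copy the
    fragments' per-word prosody across the word-level alignment."""
    words = text.lower().split()
    result: List[Optional[Prosody]] = []
    i = 0
    while i < len(words):
        # Greedy choice: the longest fragment that matches starting at word i.
        best: Optional[Tuple[str, ...]] = None
        for fragment in EXAMPLES:
            window = tuple(words[i:i + len(fragment)])
            if window == fragment and (best is None or len(fragment) > len(best)):
                best = fragment
        if best is None:
            result.append(None)  # no example covers this word
            i += 1
        else:
            result.extend(EXAMPLES[best])  # one prosody entry per aligned word
            i += len(best)
    return result


prediction = predict_prosody("Good morning Sam how are you")
```

Here the input is paired with the fragment sequence "good morning" and "how are you", and the word "Sam" has no matching example, so its entry in `prediction` is `None`.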