Abstract:
A method of generating a message having an invariable portion (U1) and a variable portion (V) is provided. Most of the invariable portion (U1) is provided in the form of recorded speech (A) whereas the variable portion (V) is provided in the form of synthesised speech (B). The synthesised speech (B) also extends by half a phoneme into the invariable portion (U1) of the message. The synthesised speech (B) and the recorded speech (A) are then concatenated, with a transition signal being formed on the basis of a boundary portion of each of the recorded (A) and synthesised (B) signals about any join (8). In forming the transition signal, a set of transition signal pitchmarks is created and an overlap-add technique is used to copy the waveform within the boundary portions of the speech signals (A, B) around the transition signal pitchmarks. The signal around the penultimate pitchmark in the leading boundary portion is copied to the trailing half of the transition signal and the signal around the second pitchmark in the trailing boundary portion is copied to the leading half of the transition signal. In this way, the characteristics of the generated message around the join (8) change gradually between the characteristics of the recorded speech (A) and the characteristics of the synthesised speech (B).
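The idea of a gradual change across the join can be illustrated with a much simpler stand-in than the pitch-synchronous overlap-add described above. The sketch below is an assumption-laden simplification: it ignores pitchmarks entirely and blends the end of one sample list into the start of the next with linear weights. The function name `crossfade_join` and the `overlap` parameter are hypothetical and do not come from the abstract.

```python
def crossfade_join(lead, trail, overlap):
    """Join two sample lists with a linear cross-fade (illustrative only).

    The last `overlap` samples of `lead` are blended with the first
    `overlap` samples of `trail`. This is a plain cross-fade, not the
    pitchmark-based overlap-add technique named in the abstract.
    """
    if overlap <= 0 or overlap > min(len(lead), len(trail)):
        raise ValueError("overlap must be between 1 and the shorter signal length")

    head = lead[:len(lead) - overlap]   # untouched part of the leading signal
    tail = trail[overlap:]              # untouched part of the trailing signal

    blended = []
    for i in range(overlap):
        # Weight of the trailing signal rises steadily across the overlap.
        w = (i + 1) / (overlap + 1)
        blended.append((1 - w) * lead[len(lead) - overlap + i] + w * trail[i])

    return head + blended + tail


if __name__ == "__main__":
    recorded = [1.0, 1.0, 1.0, 1.0]      # stands in for recorded speech (A)
    synthesised = [0.0, 0.0, 0.0, 0.0]   # stands in for synthesised speech (B)
    print(crossfade_join(recorded, synthesised, 3))
```

With the toy inputs shown, the output steps down evenly from the level of the first signal to the level of the second, which is the gradual change in characteristics that the transition signal is meant to provide.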
Abstract:
The mixed decision tree includes a network of yes-no questions about adjacent letters in a spelled word sequence and also about adjacent phonemes in the phoneme sequence corresponding to the spelled word sequence. Leaf nodes of the mixed decision tree provide information about which phonetic transcriptions are most probable. Using the mixed trees, scores are developed for each of a plurality of possible pronunciations, and these scores can be used to select the best pronunciation as well as to rank pronunciations in order of probability. The pronunciations generated by the system can be used in speech synthesis and speech recognition applications as well as lexicography applications.
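As a rough illustration of scoring and ranking candidate pronunciations with yes-no questions over both letters and phonemes, consider the toy sketch below. Everything in it is assumed for illustration: the tuple-based tree layout, the one-phoneme-per-letter alignment, the context keys, the probabilities in `TOY_TREE`, and the log-probability score are not taken from the abstract.

```python
import math

# An internal node is a tuple (question, yes_subtree, no_subtree), where
# question(ctx) returns a bool. A leaf is a dict {phoneme: probability}.
# The tree below is a made-up toy, not a trained model.
TOY_TREE = (
    lambda c: c["letter"] == "c",                      # question about a letter
    (
        lambda c: c["next_letter"] in ("e", "i"),      # question about an adjacent letter
        {"s": 0.9, "k": 0.1},
        {"k": 0.9, "s": 0.1},
    ),
    (
        lambda c: c["prev_phoneme"] == "s",            # question about an adjacent phoneme
        {"i": 0.8, "e": 0.2},
        {"i": 0.5, "e": 0.5},
    ),
)


def leaf_for(tree, ctx):
    """Walk the yes-no questions until a leaf distribution is reached."""
    while isinstance(tree, tuple):
        question, yes, no = tree
        tree = yes if question(ctx) else no
    return tree


def score(tree, word, phonemes):
    """Sum of log-probabilities, assuming one phoneme per letter."""
    total = 0.0
    for i, (letter, ph) in enumerate(zip(word, phonemes)):
        ctx = {
            "letter": letter,
            "next_letter": word[i + 1] if i + 1 < len(word) else "",
            "prev_phoneme": phonemes[i - 1] if i > 0 else "",
        }
        # Unseen phonemes get a small floor probability (an assumption).
        total += math.log(leaf_for(tree, ctx).get(ph, 1e-6))
    return total


def rank(tree, word, candidates):
    """Order candidate pronunciations from most to least probable."""
    return sorted(candidates, key=lambda c: score(tree, word, c), reverse=True)


if __name__ == "__main__":
    print(rank(TOY_TREE, "ce", [["k", "i"], ["s", "i"]]))
```

The first element of the ranked list plays the role of the selected best pronunciation, and the full ordering corresponds to ranking by probability.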
Abstract:
Script recognition by speech recognition, for use in editing video or film clips, preferably uses a grammar-based speech recognition engine. A script file and an audio dialog file are input to a speech recognition system, and the script file is processed to generate a grammar file, which in turn is reduced to a binary context file compatible with a specific speech recognition engine. The script file and audio file are used to define variable parameters for the speech recognition engine. The audio file is broken up into utterances which are processed by the speech recognition engine according to the variable parameters and the context file. The best "guess" from the speech recognition engine is fitted to the script file to determine a match. Mismatched utterances are fed back to the utterance determining step to determine a new search point. On a match, the audio file is marked with the corresponding location in the script, or the script file is time marked with the corresponding video clip time code. Video or film clips may then be accessed for editing by indicating a place in the script or the dialog.
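The step of fitting the recognizer's best guess to the script can be sketched as a sliding-window comparison over script words. This is only one plausible reading: the function `locate_utterance`, the `search_from` and `min_ratio` parameters, and the use of Python's standard `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, not details from the abstract.

```python
from difflib import SequenceMatcher


def locate_utterance(script_words, recognized_words, search_from=0, min_ratio=0.6):
    """Return the script word index where the recognized words fit best.

    Returns None when no window reaches `min_ratio`, which would correspond
    to a mismatch that is fed back to choose a new search point.
    """
    n = len(recognized_words)
    best_index, best_ratio = None, 0.0
    for i in range(search_from, len(script_words) - n + 1):
        window = script_words[i:i + n]
        ratio = SequenceMatcher(None, window, recognized_words).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index if best_ratio >= min_ratio else None


if __name__ == "__main__":
    script = "to be or not to be that is the question".split()
    guess = "not to bee that".split()   # one misrecognized word
    print(locate_utterance(script, guess))
```

A returned index could then be paired with the utterance's time code to mark the script, and `search_from` could be advanced after each match so that later utterances are searched for further along in the script.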
Abstract:
A system for providing a primarily audio environment for world wide web access includes a system for rendering structured documents using audio, an interface for information exchange to users, a non-keyword based WWW search system and a few miscellaneous features. The system for rendering structured documents using audio includes a pre-rendering system which converts an HTML document into an intermediate document and a rendering system which actually generates an audio output. The interface includes a non-visual browsing system and an interface to users for visual browsing environments.
Abstract:
The invention concerns a digital speech-synthesis process whereby utterances in a language are recorded, the recorded utterances are divided into speech segments which are stored so as to allow their allocation to specific phonemes; a text which is to be output as speech is converted to a phoneme chain and the stored segments are output in a sequence defined by the phoneme chain; an analysis of the text to be output as speech is carried out and thus provides information which completes the phoneme chain and modifies the timing sequence signal for the speech segments which are to be strung together for output as speech. The invention is characterised by the use of, as speech segments, microsegments consisting of: segments for vowel halves and semi-vowel halves, vowels standing between consonants being split into two microsegments, a first vowel half beginning shortly before the start of the vowel and extending as far as the vowel middle, and a second vowel half from the vowel middle to just before the vowel end; segments for quasi-stationary vowel components cut from the middle of a vowel; consonant segments beginning shortly before the front phoneme boundary and ending shortly before the rear phoneme boundary; and segments for vowel-vowel sequences cut from the middle of a vowel-vowel transition.
Abstract:
A pattern recognition system and method for optimally reducing the redundancy and size of a weighted and labeled graph involves receiving speech signals, converting the speech signals into word sequences, interpreting the word sequences in a graph, where the graph is labeled with word sequences and weighted with probabilities, and determinizing the graph by removing redundant word sequences. The size of the graph can also be minimized by collapsing some nodes of the graph in a reverse determinizing manner. The graph can further be tested for determinizability to determine if the graph can be determinized. The resulting word sequence in the graph may be shown in a display device so that recognition of speech signals can be demonstrated.
Abstract:
A system and method are provided for playing back a recorded voice message, and, in particular, for automatically playing back a spoken numeric portion of the message at a rate that is slower than the rate for playing back the remaining portion of the recorded voice message. A voice messaging system receives and analyzes the voice message. Specifically, the messaging system determines whether the voice message includes spoken numeric information and, if so, determines the relative position of the spoken numeric information within the message. The computer system stores both the voice message and the positional information in a storage device. Upon playback of the message, the messaging system retrieves the stored voice message and positional information from the storage device. As the voice message is played back, the messaging system processes the positional information. When the positional information indicates that a particular portion of a voice message includes spoken numeric information, that particular portion is played back at a decreased speed.
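How stored positional information might drive playback speed can be sketched as a simple schedule builder. The representation of positions as `(start, end)` times in seconds, the function name `playback_plan`, and the default slow rate of 0.5 are assumptions for illustration; the abstract does not specify any of them. The sketch also assumes the numeric spans do not overlap and lie within the message.

```python
def playback_plan(duration, numeric_spans, slow_rate=0.5):
    """Build a list of (start, end, rate) segments covering the message.

    `numeric_spans` holds (start, end) times of spoken numeric information.
    Those spans are scheduled at `slow_rate`; everything else plays at 1.0.
    """
    plan = []
    cursor = 0.0
    for start, end in sorted(numeric_spans):
        if start > cursor:
            plan.append((cursor, start, 1.0))   # ordinary speech before the numbers
        plan.append((start, end, slow_rate))    # numeric portion, slowed down
        cursor = end
    if cursor < duration:
        plan.append((cursor, duration, 1.0))    # ordinary speech after the last span
    return plan


if __name__ == "__main__":
    # A 10-second message with a phone number spoken between 4 s and 6 s.
    for segment in playback_plan(10.0, [(4.0, 6.0)]):
        print(segment)
```

A playback component would walk this plan in order and apply each segment's rate, so that only the numeric portion is slowed.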