Abstract:
Speaker-independent word recognition is performed, based on a small acoustically distinct vocabulary, with minimal hardware requirements. After a simple preconditioning filter, the zero crossing intervals of the input speech are measured and sorted by duration, to provide a rough measure of the frequency distribution within each input frame. The distribution of zero crossing intervals is transformed into a binary feature vector, which is compared with each reference template using a modified Hamming distance measure. A dynamic time warping algorithm is used to accommodate varying speaking rates, and to economize on the reference template storage requirements. A mask vector for each reference template is used to ignore insignificant (or speaker-dependent) features of the words detected.
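As an illustration only, the feature extraction and masked comparison described above might be sketched as follows. The bin edges, the one-count threshold, and all function names are assumptions made for this sketch, and the masked bit count is just one plausible reading of the "modified Hamming distance", not the measure the abstract actually specifies.

```python
def zero_crossing_intervals(samples):
    """Return the spacing, in samples, between successive sign changes."""
    intervals = []
    last = None
    for i in range(1, len(samples)):
        if (samples[i - 1] < 0) != (samples[i] < 0):
            if last is not None:
                intervals.append(i - last)
            last = i
    return intervals


def binary_features(intervals, bin_edges, min_count=1):
    """Sort intervals into duration bins, then reduce each bin to one bit.

    bin_edges and min_count are hypothetical tuning values.
    """
    counts = [0] * (len(bin_edges) - 1)
    for d in intervals:
        for b in range(len(counts)):
            if bin_edges[b] <= d < bin_edges[b + 1]:
                counts[b] += 1
                break
    return [1 if c >= min_count else 0 for c in counts]


def masked_hamming(features, template, mask):
    """Count differing bits, ignoring positions where the mask bit is 0."""
    return sum(m & (f ^ t) for f, t, m in zip(features, template, mask))
```

Short intervals correspond to high-frequency content and long intervals to low-frequency content, which is why the binned counts serve as a rough spectrum. The mask lets each template declare which bits matter for that word.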
Abstract:
Energy normalization in speech synthesis systems is achieved by a look-ahead adaptive normalization procedure, wherein energy is adaptively tracked, and the adaptive energy-tracking value is used to normalize a much earlier frame's energy value.
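A minimal sketch of the look-ahead idea, under stated assumptions: the tracker here is a simple peak follower with exponential decay, and the `delay`, `decay`, and `floor` parameters are hypothetical. The abstract does not say how the energy is tracked or how far back the normalized frame lies.

```python
def lookahead_normalize(energies, delay=3, decay=0.9, floor=1e-9):
    """Normalize frame t - delay by the tracker value reached at frame t."""
    tracker = floor
    out = []
    for t, e in enumerate(energies):
        # Assumed tracker: jump up to a new peak at once, decay slowly otherwise.
        tracker = max(e, decay * tracker)
        if t >= delay:
            out.append(energies[t - delay] / max(tracker, floor))
    # Frames with no look-ahead left are normalized by the final tracker value.
    for k in range(max(len(energies) - delay, 0), len(energies)):
        out.append(energies[k] / max(tracker, floor))
    return out
```

Because the tracker has already seen `delay` later frames when an earlier frame is scaled, a quiet word onset is normalized against the level the utterance is about to reach rather than the silence before it.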
Abstract:
This voice messaging system provides an LPC analyzer in combination with a pitch extractor, wherein LPC parameters and a residual signal organized in a sequence of speech data frames are provided by the LPC analyzer as an output representative of an analog speech signal. The pitch extractor is operably associated with the LPC analyzer and produces a plurality of pitch candidates for each of the speech data frames in the sequence. Dynamic programming is performed on the plurality of pitch candidates for each speech data frame, and also on the voiced/unvoiced decision for each frame, by tracking both pitch and voicing from frame to frame to provide an optimal pitch value and an optimal voicing decision. During dynamic programming, a cumulative penalty for a sequence of frame pitch/voicing decisions is accumulated by defining a transition error between each pitch candidate of the current speech data frame and each pitch candidate of the preceding frame, and defining a cumulative error for each pitch candidate of the current frame equal to the transition error for that candidate plus the cumulative error of the optimally identified pitch candidate in the preceding frame. The track with the lowest cumulative penalty provides the optimal pitch and voicing decisions. An encoder then encodes the LPC parameters generated by the LPC analyzer and the optimal pitch and voicing decisions for each speech data frame, for subsequent use in providing an audible synthesized speech output substantially identical to the original speech input.
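The dynamic programming step can be sketched roughly as below. Several details are assumptions of this sketch and not taken from the abstract: an unvoiced decision is represented as a candidate with pitch 0, the transition error is the absolute log pitch ratio, a fixed `voicing_penalty` is charged for a voiced/unvoiced switch, and each candidate carries its own local error.

```python
import math


def transition_error(prev_pitch, cur_pitch, voicing_penalty=1.0):
    """Assumed cost of moving between candidates; pitch 0 means unvoiced."""
    if prev_pitch == 0 and cur_pitch == 0:
        return 0.0
    if prev_pitch == 0 or cur_pitch == 0:
        return voicing_penalty
    return abs(math.log(cur_pitch / prev_pitch))


def track_pitch(frames, voicing_penalty=1.0):
    """frames: one list of (pitch, local_error) candidates per frame."""
    cum = [[err for _, err in frames[0]]]
    back = [[0] * len(frames[0])]
    for t in range(1, len(frames)):
        row, ptr = [], []
        for pitch, err in frames[t]:
            # Cumulative error via every candidate of the preceding frame.
            costs = [cum[t - 1][j] + transition_error(p, pitch, voicing_penalty)
                     for j, (p, _) in enumerate(frames[t - 1])]
            best = min(range(len(costs)), key=costs.__getitem__)
            row.append(err + costs[best])
            ptr.append(best)
        cum.append(row)
        back.append(ptr)
    # Trace back from the candidate with the lowest cumulative penalty.
    i = min(range(len(cum[-1])), key=cum[-1].__getitem__)
    path = []
    for t in range(len(frames) - 1, -1, -1):
        path.append(frames[t][i][0])
        i = back[t][i]
    return path[::-1]
```

With candidates such as `[(100, 0.1), (200, 0.0)]` in the first frame, a pitch-doubling candidate that looks best locally can still lose to the smoother track once transition errors are accumulated.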
Abstract:
A method and apparatus are provided for identifying one or more boundaries of a speech pattern within an input utterance. One or more anchor patterns are defined, and an input utterance is received. An anchor section of the input utterance is identified as corresponding to at least one of the anchor patterns. A boundary of the speech pattern is defined based upon the anchor section. Also provided are a method and apparatus for identifying a speech pattern within an input utterance. One or more segment patterns are defined, and an input utterance is received. Portions of the input utterance which correspond to the segment patterns are identified. One or more of the segments of the input utterance are defined responsive to the identified portions.
Abstract:
An efficient pruning method reduces central processing unit (CPU) loading during real time speech recognition by instructing the CPU to compare a current state's previously calculated probability score against a predetermined threshold value and to discard hypotheses containing states with probability scores below such threshold. After determining that the current state should be kept, the CPU is directed to locate an available slot in the scoring buffer where information about the current state is then stored. The CPU locates an available slot by comparing the current time-index with the time-index associated with each scoring buffer slot. When they are equal, the slot is considered not available; when the current time-index is greater, the slot is considered available. After the information about the current state is stored, the CPU then sets the current state's backpointer to point at the start state of the current best path if the current state represents a completed model. Regardless of the current state's status, the CPU then associates the current time-index with the time-indices of all the slots along the best path to the current state. The CPU then proceeds to calculate the probability score of the next current state and the method repeats until all states have been completed.
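The slot bookkeeping can be illustrated with a simplified sketch. The parallel lists, the function names, and the use of a single predecessor slot as the backpointer are assumptions. As a further simplification of this sketch, the predecessor path is stamped with the current time-index before a free slot is chosen, so that a slot on that path is never overwritten; the abstract describes storing first and stamping afterwards, and its special handling of completed models is omitted here.

```python
def find_free_slot(time_index, now):
    """A slot is free when its time-index is older than the current one."""
    for i, t in enumerate(time_index):
        if t < now:
            return i
    return None


def mark_best_path(time_index, backpointer, slot, now):
    """Stamp every slot on the path with the current time-index."""
    while slot is not None and time_index[slot] != now:
        time_index[slot] = now
        slot = backpointer[slot]


def store_state(time_index, backpointer, scores, now, score, prev_slot,
                threshold):
    """Prune on the threshold, otherwise store the state; return its slot."""
    if score < threshold:
        return None  # hypothesis discarded
    mark_best_path(time_index, backpointer, prev_slot, now)
    slot = find_free_slot(time_index, now)
    if slot is None:
        raise RuntimeError("scoring buffer is full")
    time_index[slot] = now
    backpointer[slot] = prev_slot
    scores[slot] = score
    return slot
```

Stamping the path is what makes the time-index comparison work as a cheap garbage collector: any slot not reachable from a surviving state keeps a stale time-index and becomes reusable on the next frame.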
Abstract:
Recognition of sound units is improved by comparing frame-pair feature vectors, which helps compensate for context variations in the pronunciation of sound units. A plurality of reference frames of reference feature vectors representing reference words are stored. A linear predictive coder (10) generates a plurality of spectral feature vectors for each frame of the speech signals. A filter bank system (12) transforms the spectral feature vectors to filter bank representations. A principal feature vector transformer (14) transforms the filter bank representations to an identity matrix of transformed input feature vectors. A concatenate frame system (16) concatenates the input feature vectors of adjacent frames to form the feature vector of a frame-pair. A transformer (18) and a comparator (20) compute the likelihood that each input feature vector for a frame-pair was produced by each reference frame. This computation is performed individually and independently for each reference frame-pair. A dynamic time warper (22) constructs an optimum time path through the input speech signals for each of the computed likelihoods. A high level decision logic (24) recognizes the input speech signals as one of the reference words in response to the computed likelihoods and the optimum time paths.
Abstract:
A method for generating connected word templates begins with generating isolated word templates of selected words. The isolated word templates are used to extract a continuous word template from a segment of continuous speech containing the selected words. Both the isolated word templates and the connected word templates can be used to generate speech to determine the quality of the generated templates through aural judgment.
Abstract:
Silence suppression in speech synthesis systems is achieved by detecting and processing only segments of voice activity. A segment is classified as "speech" if the energy of the signal is greater than an adaptively adjusted threshold. The adaptively adjusted threshold is preferably defined as the maximum of scaled values of two separate envelope parameters, which both track the variation in energy over the sequence of frames of speech data. One parameter is a slow-rising fast-falling value, which is updated only during unvoiced speech frames, and therefore tracks a lower envelope of the energy contour. This parameter in effect tracks an ambient noise level. The other parameter is a fast-rising slow-falling parameter, which is updated only during voiced speech frames, and thus tracks an upper envelope of the energy contour. (This in effect tracks the average speech level.) A nonsilent energy tracker and a silent energy tracker adjust corresponding energy values representing the energy contours.
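One possible reading of the two-envelope threshold is sketched below. The scale factors and the rise and fall rates are hypothetical, and the sketch gates each tracker on the frame's own speech/silence decision, an assumption standing in for the voiced/unvoiced gating the abstract mentions.

```python
def detect_speech(energies, noise_scale=4.0, speech_scale=0.1,
                  slow_rise=1.05, slow_fall=0.95):
    """Return one speech/silence flag per frame energy."""
    lower = upper = energies[0]  # assumed initialization
    flags = []
    for e in energies:
        threshold = max(noise_scale * lower, speech_scale * upper)
        is_speech = e > threshold
        if is_speech:
            # Upper envelope: rises at once, falls slowly.
            upper = e if e > upper else max(e, upper * slow_fall)
        else:
            # Lower envelope: falls at once, rises slowly.
            lower = e if e < lower else min(e, lower * slow_rise)
        flags.append(is_speech)
    return flags
```

Taking the maximum of the two scaled envelopes means the threshold follows the noise floor in quiet conditions and follows a fraction of the speech level when the talker is loud, so low-level tails after loud speech are not passed as voice activity.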
Abstract:
Pole encoding of a linear predictive all-pole model of speech is accomplished by first finding poles up to the number required for good prediction (e.g., ten). These poles are extracted from the LPC predictor polynomial, using, e.g., a slightly modified Bairstow method. Those poles having a sufficiently narrow bandwidth (i.e., those sufficiently near the unit circle) are separately encoded, since these poles generally correspond to perceptually important formants. The remaining poles are lumped together to form a residual polynomial. The residual polynomial is then transformed to produce reflection coefficients, and all reflection coefficients above the first two are discarded. This provides an efficient spectral-shaping polynomial of a reduced degree. Thus, pole encoding is made possible using a reduced and adaptively varied bit rate.
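The steps after root extraction can be sketched as follows, assuming the poles have already been found (the root finder itself, Bairstow or otherwise, is not shown). The bandwidth formula B = -(fs/π)·ln|z| is the standard relation between pole radius and bandwidth; the `max_bw_hz` threshold and the function names are assumptions, and the sign convention of the reflection coefficients is that of the step-down recursion on A(z) = 1 + a1·z^-1 + ..., which may differ from the convention used in the patent.

```python
import math


def pole_bandwidth_hz(pole, fs):
    """Bandwidth implied by the pole radius; narrower near the unit circle."""
    return -fs / math.pi * math.log(abs(pole))


def split_poles(poles, fs, max_bw_hz):
    """Separate narrow-bandwidth (formant-like) poles from the rest."""
    narrow = [p for p in poles if pole_bandwidth_hz(p, fs) < max_bw_hz]
    rest = [p for p in poles if pole_bandwidth_hz(p, fs) >= max_bw_hz]
    return narrow, rest


def poly_from_roots(roots):
    """Coefficients [1, a1, ..., ap] of the product of (1 - r * z^-1)."""
    coeffs = [1.0 + 0j]
    for r in roots:
        coeffs = [c - r * prev for c, prev in zip(coeffs + [0], [0] + coeffs)]
    return [c.real for c in coeffs]


def reflection_coefficients(a):
    """Step-down recursion from predictor polynomial to [k1, ..., kp]."""
    a = list(a)
    ks = []
    for m in range(len(a) - 1, 0, -1):
        k = a[m]
        ks.append(k)
        denom = 1.0 - k * k
        a = [1.0] + [(a[i] - k * a[m - i]) / denom for i in range(1, m)]
    return ks[::-1]


def encode_residual(rest_poles, keep=2):
    """Keep only the first `keep` reflection coefficients of the residual."""
    return reflection_coefficients(poly_from_roots(rest_poles))[:keep]
```

The residual poles must come in conjugate pairs (or be real) for the polynomial to have real coefficients. Since the number of narrow poles varies from frame to frame, the bit rate varies with it, which matches the adaptively varied rate the abstract describes.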
Abstract:
Data converter for a speech synthesizer system wherein encoded formant parameters as stored in a memory are decoded and transformed or converted to reflection coefficients in real time by means of a circuit implementing a Taylor series type approximation. The reflection coefficients are then quantized and input to a speech synthesizer which utilizes quantized reflection coefficients to synthesize speech. The use of the coded formant frequency speech data which inherently contains more speech intelligence than reflection coefficient speech data enables a speech synthesizer system which utilizes quantized reflection coefficients to operate at a significantly lower bit rate than would otherwise be possible where reflection coefficients are employed as the speech data stored in the memory.