摘要:
A Vector-Sum Excited Linear Predictive Coding (VSELP) speech coder provides improved quality and reduced complexity over a typical speech coder. VSELP uses a codebook which has a predefined structure such that the computations required for the codebook search process can be significantly reduced. This VSELP speech coder uses single or multi-segment vector quantizer of the reflection coefficients based on a Fixed-Point-Lattice-Technique (FLAT). Additionally, this speech coder uses a pre-quantizer to reduce the vector codebook search complexity and a high-resolution scalar quantizer to reduce the amount of memory needed to store the reflection coefficient vector codebooks. Resulting in a high quality speech coder with reduced computations and storage requirements.
摘要:
A digital speech coding method uses an Rth-order filter to model the frequency response of multiple filters, thereby, providing a filter which offers the control of multiple filters without the complexity of multiple filters. The Rth-order filter can be used as a spectral noise weighting filter or a combination of a short-term predictor filter and a spectral noise weighting filter, referred to as the spectrally noise weighted synthesis filter, depending on which embodiment is employed. In general, the method models the frequency response of L Pth-order filters by a single Rth-order filter, where the order R
摘要:
Word spotting in a speech recognition system without predetermining the endpoints of the input speech. The invention is intended to be implemented in a system which has word templates stored in template memory, with the system being capable of accumulating distance measures for states within each word template. The following steps are used to generate a measure of similarity between a subset of the input frames and a word template. The steps are: a) recording a beginning input frame number for each state to identify the potential beginning of the word; b) accumulating distance measures for at least one state for each input frame; c) normalizing the distance measures by substracting a normalization amount from each distance measure; d) recording normalization information corresponding to the normalization amount for each input frame; and e) determining a similarity measure between the word template and a subset of input frames after a given input frame has been processed. The subset is identified from the beginning input frame number corresponding to an end state of the template, through the given input frame number. The similarity measure is based on the normalized distance measure recorded for the end state. and the normalization information.
摘要:
A speech coder and decoder methodology wherein pitch excitation and codebook excitation source energies are represented by parameters that are readily transmissible with minimal transmission capacity requirements. The parameters are the long term energy value, a short term correction factor which is applied to the long term energy value to match the short term energy, and proportionality factor(s) that specify the relative energy contribution of the excitation sources to the short term energy value.
摘要:
A digital speech coder includes a long-term filter (124) having an improved sub-sample resolution long-term predictor (FIG. 5 ) which allows for subsample resolution for the lag parameter L. A frame of N samples of input speech vector s(n) is applied to an adder (510). The output of the adder (510) produces the output vector b(n) for the long term filter (124). The output vector b(n) is fed back to a delayed vector generator block (530) of the long-term predictor. The nominal long-term predictor lag parameter L is also input to the delayed vector generator block (530). The long-term predictor lag parameter L can take on non-integer values, which may be multiples of one half, one third, one fourth or any other rational fraction. The delayed vector generator (530) includes a memory which holds past samples of b(n). In addition, interpolated samples of b(n) are also calculated by the delayed vector generator (530) and stored in its memory, at least one interpolated sample being calculated and stored between each past sample of b(n). The delayed vector generator (530) provides output vector q(n) to the long-term multiplier block (520), which scales the long-term predictor response by the long-term predictor coefficient .beta.. The scaled output .beta.q(n) is then applied to the adder (510) to complete the feedback loop of the recursive filter (124).
摘要:
In a Viterbi Algorithm decoder (204) as used to decode convolutionally encoded information, reliability information is developed for various path discard decisions made within the Viterbi Algorithm. These decisions are made for discard opportunities that impact one or more error detection windows (601). Based upon these metrics, a reliability factor sequence can be provided and compared against a fixed (or varying) threshold. When unreliability appears, appropriate action can be taken. For example, all of the information can be discarded, or only certain portions of the information can be discarded, as appropriate to the particular application.
摘要:
A continuous speech recognition system employs a grammar tree of alternative potentially recognized word paths. A technique of tracing back through the grammar tree is utilized in determining which partial word path is common to all potential word paths. The common partial word path is deleted and words corresponding to the deleted partial word path are output as recognized words.
摘要:
An improved excitation vector generation and search technique (FIG. 1) is described for a code-excited linear prediction (CELP) speech coder (100) using a codebook memory of excitation code vectors. A set of M basis vectors v.sub.m (n) are used along with the excitation signal codewords (i) to generate the codebook of excitation vectors u.sub.i (n) according to a "vector sum" technique (120) of converting stored selector codewords into a plurality of interim data signals, multiplying the set of M basis vectors by the interim data signals, and summing the resultant vectors to produce the set of 2.sup.M codebook vectors. Only M basis vectors need to be stored in memory (114), as opposed to all 2.sup.M code vectors.
摘要:
A user-interactive speech recognition control system is disclosed for recognizing a complete sequence of keywords (e.g., a telephone number such as 123-4567) via entering, verifying, and editing variable-length utterance strings (e.g., 1-2-3; 4-5; 6-7) separated by the user-defined placement of pauses. The device controller (120) utilizes timers (124) to monitor the pause time between partial-sequence digit strings recognized by the speech recognizer (110). When a string of digits is followed by a predetermined pause time interval, the recognized digits will be replied via the speech synthesizer (130). An additional string of digits can then be entered, and only the subsequent string will be replied after the next pause. Furthermore, the user has the flexibility to correct only the last digit string entered, or the entire sequence. Hence, if there is an error in only one digit, the erroneous digit string can be corrected without having to re-enter the entire digit sequence. The invention is well-suited to be used in a hands-free voice command dialing system for a mobile radiotelephone, wherein vehicular background noise may affect recognition accuracy.
摘要:
A start of an input speech signal is detected during presentation of an output audio signal and an input start time, relative to the output audio signal, is determined. The input start time is then provided for use in responding to the input speech signal. In another embodiment, the output audio signal has a corresponding identification. When the input speech signal is detected during presentation of the output audio signal, the identification of the output audio signal is provided for use in responding to the input speech signal. Information signals comprising data and/or control signals are provided in response to at least the contextual information provided, i.e., the input start time and/or the identification of the output audio signal. In this manner, the present invention accurately establishes a context of an input speech signal relative to an output audio signal regardless of the delay characteristics of the underlying communication system.