Abstract:
Instability inherent in analysis-by-synthesis speech/audio codecs, caused in particular by channel errors during transmission of highly periodic signals such as high-frequency sine waves, is removed. Analysis-by-synthesis techniques involve production, in response to the speech/audio signal and at regular time intervals called frames, of (a) a set of spectral parameters used to drive a synthesis filter that synthesizes the speech/audio signal, and (b) a pitch gain for constructing a past-excitation-signal component supplied to the synthesis filter. In accordance with the instability eradication method, the first step consists of detecting a set of conditions including (i) a resonance condition assessed from the spectral parameters, (ii) a duration condition detected when the resonance condition has prevailed for at least the M most recent frames, M being an integer greater than 1, and (iii) a gain condition which evidences consistently high values of the pitch gain in the N most recent frames, N being an integer greater than 1. To eradicate the occasional instability, the pitch gain is reduced to a value lower than a given threshold whenever these three conditions are detected.
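The three-condition test described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `PitchGainLimiter`, the default values of `m`, `n`, `gain_high` and `gain_cap`, and the assumption that the resonance condition arrives as a precomputed boolean per frame are all hypothetical.

```python
from collections import deque


class PitchGainLimiter:
    """Hypothetical per-frame limiter for the pitch gain (illustrative only)."""

    def __init__(self, m=3, n=3, gain_high=0.9, gain_cap=0.8):
        # m, n: number of most recent frames inspected (both > 1 in the abstract).
        # gain_high: assumed threshold above which a pitch gain counts as "high".
        # gain_cap: assumed replacement value, chosen below gain_high.
        self.m = m
        self.n = n
        self.gain_high = gain_high
        self.gain_cap = gain_cap
        self.resonance_history = deque(maxlen=m)
        self.gain_history = deque(maxlen=n)

    def process(self, resonant, pitch_gain):
        """Return the pitch gain to use for the current frame."""
        self.resonance_history.append(bool(resonant))
        self.gain_history.append(pitch_gain)

        # (i) + (ii): resonance detected in each of the M most recent frames,
        # which includes the current frame.
        duration_ok = (len(self.resonance_history) == self.m
                       and all(self.resonance_history))
        # (iii): pitch gain consistently high over the N most recent frames.
        gain_ok = (len(self.gain_history) == self.n
                   and all(g > self.gain_high for g in self.gain_history))

        if duration_ok and gain_ok:
            # Reduce the gain to a value below the threshold.
            return min(pitch_gain, self.gain_cap)
        return pitch_gain
```

In a real codec the `resonant` flag would be derived from the spectral parameters of each frame; how that is done is not specified in the abstract, so it is left to the caller here.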
Abstract:
A modular system and method are provided for low bit rate encoding and decoding of speech signals using voicing probability determination. The continuous input speech is divided into time segments of a predetermined length. For each segment, the encoder of the system computes a model signal and subtracts the model signal from the original signal in the segment to obtain a residual excitation signal. Using the excitation signal, the system computes the signal pitch and a parameter related to the relative content of voiced and unvoiced portions in the spectrum of the excitation signal, which is expressed as a ratio Pv, defined as a voicing probability. The voiced and the unvoiced portions of the excitation spectrum, as determined by the parameter Pv, are encoded using one or more parameters related to the energy of the excitation signal in a predetermined set of frequency bands. In the decoder, speech is synthesized from the transmitted parameters representing the model speech, the signal pitch, voicing probability and excitation levels in a reverse order. Boundary conditions between voiced and unvoiced segments are established to ensure amplitude and phase continuity for improved output speech quality. Perceptually smooth transition between frames is ensured by using an overlap and add method of synthesis. LPC interpolation and post-filtering are used to obtain output speech with improved perceptual quality.
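The overlap and add step mentioned above can be illustrated with a small sketch. The abstract does not specify the window, so the linear cross-fade used here is an assumption, as are the function name `overlap_add` and the `hop` parameter (the spacing between frame starts). The sketch also assumes every frame has the same length and that the overlap is no longer than `hop`.

```python
def overlap_add(frames, hop):
    """Cross-fade equal-length frames whose starts are `hop` samples apart.

    Assumes linear fades and overlap = frame_len - hop <= hop.
    """
    frame_len = len(frames[0])
    overlap = frame_len - hop
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    last = len(frames) - 1

    for k, frame in enumerate(frames):
        for n, sample in enumerate(frame):
            weight = 1.0
            if k > 0 and n < overlap:
                # Fade in over the region shared with the previous frame.
                weight = (n + 1) / (overlap + 1)
            if k < last and n >= hop:
                # Fade out over the region shared with the next frame.
                weight = (frame_len - n) / (overlap + 1)
            out[k * hop + n] += weight * sample
    return out
```

The fade-in and fade-out weights sum to one at every overlapped sample, so a constant input is reproduced unchanged, which is the property that keeps frame transitions smooth.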
Abstract:
Simplified methods of searching a codebook table are provided. These methods perform a codebook search for a plurality of pulses, one pulse at a time, in order of increasing to decreasing pulse significance, wherein pulse significance is defined as the relative contribution a given pulse provides to minimizing the mean-squared error between the source signal and the quantized sequence of pulses.
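A one-pulse-at-a-time search of this kind might look like the sketch below. It is deliberately simplified and is not the claimed method: it assumes no synthesis filter (the pulses are compared to the target directly), a single fixed pulse amplitude, and it allows several pulses to land on the same position. The function name `sequential_pulse_search` and its parameters are hypothetical.

```python
def sequential_pulse_search(target, num_pulses, amplitude=1.0):
    """Greedily place signed pulses, one at a time, to approximate `target`.

    Returns the list of (position, sign) pairs in the order found and the
    remaining squared error.
    """
    residual = list(target)
    pulses = []
    for _ in range(num_pulses):
        # Placing a pulse of fixed amplitude at position i lowers the squared
        # error by 2*amplitude*|r_i| - amplitude**2, so the best position is
        # the one with the largest residual magnitude.
        pos = max(range(len(residual)), key=lambda i: abs(residual[i]))
        sign = 1.0 if residual[pos] >= 0 else -1.0
        residual[pos] -= sign * amplitude
        pulses.append((pos, sign))
    error = sum(r * r for r in residual)
    return pulses, error
```

Because each pulse is chosen against the residual left by the earlier ones, the search cost grows linearly with the number of pulses rather than combinatorially, at the price of a possibly suboptimal joint solution.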
Abstract:
A method and apparatus (100) for pitch-epoch-synchronous source-filter speech encoding by means of error component modeling methods (310) which capture fundamental orthogonal (uncorrelated) basis elements of an excitation source waveform. A periodic waveform model (318) along with four orthogonal error waveforms, desirably including phase error (319), ensemble error (321), standard deviation error (323), and mean error (324) waveforms, are incorporated together to form a complete description of the excitation. These error waveforms (319, 321, 323, 324) represent those portions of the excitation that are not represented by the purely periodic model. By thus orthogonalizing the error components, the perceptual effect of each element is isolated from the composite set, and can thus be encoded separately. In addition to high-quality, fixed-rate operation, the identity-system capability and low complexity of the speech encoding method and apparatus make them applicable to variable-rate applications without changing underlying modeling methods.
Abstract:
In a speech decoding apparatus, a conversion unit converts a received encoded signal into a parameter in units of frames. A memory repeatedly updates and stores the parameter representing a pause state and output from the conversion unit for the pause interval of the speech signal. A synthesis filter coefficient generation unit generates a synthesis filter coefficient on the basis of the parameter read out from the memory. A smoothed filter coefficient generation unit generates a smoothed filter coefficient on the basis of the synthesis filter coefficient output from the synthesis filter coefficient generation unit. The smoothed filter coefficient generation unit generates the smoothed filter coefficient which is smoothed such that the synthesis filter coefficient changes in accordance with a count value of the frames during the predetermined period. A background noise generation unit generates background noise on the basis of the parameter read out from the memory for the pause interval of the speech signal. A smoothing filter performs filtering processing of the background noise output from the background noise generation unit by using the smoothed filter coefficient output from the smoothed filter coefficient unit and outputs smoothed background noise.
Abstract:
Multiple speech bit-stream frame buffers are used between the controller and the speech decoder. Whenever excessive or missing speech packets are detected, the speech decoder switches to a special corrective mode. If there are too many packets, the buffered frames are played out quickly; if there are too few, the buffered frames are played out slowly. For fast play, some speech information has to be discarded, while for slow play some speech-like information has to be synthesized. The speech may be handled in sub-frame units, which may be 52 samples at a time. Low-energy, silent or unvoiced sub-frames, which also indicate non-periodicity, are detected and manipulated. Moreover, the decoded signal is manipulated at the excitation phase, before the final LPC synthesis filter, resulting in a transparent perceptual effect on the manipulated speech quality. Additionally, the buffers are enlarged such that the problem caused by controller asynchronicity is eliminated. Further, for bulk delay caused by multiplexing data and speech transmissions, the buffers maintain the smallest number of speech packets necessary to prevent buffer underflow during a data packet transmission while minimizing speech delay and preserving data transmission efficiency.
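The corrective-mode decision and the choice of which sub-frame to manipulate could be sketched roughly as below. The watermark values, the mode names, and both function names are assumptions made for illustration; only the 52-sample sub-frame size comes from the text, and energy is used here as the sole criterion although the abstract also mentions silent and unvoiced sub-frames.

```python
def playout_mode(buffered_frames, low_mark=1, high_mark=3):
    """Pick a playout mode from buffer occupancy (hypothetical watermarks)."""
    if buffered_frames > high_mark:
        return "fast"    # too many packets: discard some speech information
    if buffered_frames < low_mark:
        return "slow"    # too few packets: synthesize speech-like information
    return "normal"


def lowest_energy_subframe(excitation, subframe_len=52):
    """Return the index of the complete sub-frame with the least energy.

    Such a sub-frame is a candidate for dropping (fast play) or repeating
    (slow play) at the excitation stage, before LPC synthesis.
    """
    count = len(excitation) // subframe_len
    energies = [
        sum(s * s for s in excitation[i * subframe_len:(i + 1) * subframe_len])
        for i in range(count)
    ]
    return min(range(count), key=lambda i: energies[i])
```

Operating on the excitation rather than on the output speech means the LPC synthesis filter smooths over the edit, which is the reason the abstract gives for the manipulation being perceptually transparent.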
Abstract:
An electronic music system which imitates acoustic instruments addresses the problem wherein the audio spectrum of a recorded note is entirely shifted in pitch by transposition. The consequence of this is that unnatural formant shifts occur, resulting in the phenomenon known in the industry as "munchkinization." The present invention eliminates munchkinization, thus allowing a substantially wider transposition range for a single recording. Also, the present invention allows even shorter recordings to be used for still further memory improvements. An analysis stage separates and stores the formant and excitation components of sounds from an instrument. On playback, either the formant component or the excitation component may be manipulated.
Abstract:
On encoding with a smallest possible number of bits LPC parameters produced by an LPC analyzer from at least one of subframe signals of each frame signal of an input speech signal, a divider divides the LPC parameters into several parameter regions. Using vector code books loaded for each parameter region with code vectors, a vector quantizer quantizes the LPC parameters into, for use as quantized codes, indexes of selected vectors which are selected from the code vectors and of which a linear combination minimizes a quantization distortion.
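The idea of dividing the parameters into regions and quantizing each region with its own code book can be sketched as follows. This is a simplification: it picks a single nearest code vector per region under a squared-error distortion, whereas the abstract describes selecting vectors whose linear combination minimizes the distortion. The function name `split_vq` and the toy code books in the usage example are hypothetical.

```python
def split_vq(params, region_sizes, codebooks):
    """Quantize each parameter region with its own code book.

    params: flat list of LPC-derived parameters.
    region_sizes: number of parameters in each region, in order.
    codebooks: one list of code vectors per region.
    Returns one code-vector index per region.
    """
    indexes = []
    start = 0
    for size, codebook in zip(region_sizes, codebooks):
        region = params[start:start + size]
        # Choose the code vector with the smallest squared-error distortion.
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(region, codebook[k])),
        )
        indexes.append(best)
        start += size
    return indexes


# Usage with made-up code books: two regions of two parameters each.
example_indexes = split_vq(
    [0.1, 0.2, 0.8, 0.9],
    region_sizes=[2, 2],
    codebooks=[
        [[0.0, 0.0], [0.1, 0.25]],
        [[0.8, 1.0], [0.5, 0.5]],
    ],
)
```

Splitting keeps each code book small: two regions with 2^b entries each cost 2b bits in total but need far fewer stored vectors and comparisons than a single code book of 2^(2b) entries.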
Abstract:
A code-excited linear-predictive (CELP) coder for speech or audio transmission at compressed (e.g., 16 kb/s) data rates is adapted for low-delay (e.g., less than five ms. per vector) coding by performing spectral analysis of at least a portion of a previous frame of simulated decoded speech to determine a synthesis filter of a much higher order than conventionally used for decoding synthesis and then transmitting only the index for the vector which produces the lowest internal error signal. Modified perceptual weighting parameters and a novel use of postfiltering greatly improve tandeming of a number of encodings and decodings while retaining high quality reproduction.
Abstract:
A digital audio signal processing apparatus is provided having a predictive error generator for generating predictive error data by processing input digital data to acquire a plurality of different frequency characteristics. A selector selects one of the plural predictive error data. A requantizer requantizes the selected predictive error data. A corrector processes with a predetermined frequency characteristic, the requantization error induced during the operation of the requantizer, thereby correcting the requantization error caused in the requantizer. A frequency characteristic control selects at least two of the predictive error data obtained with the plural frequency characteristics, then calculates the selected predictive error data and controls the frequency characteristic in the corrector in accordance with the result of such calculation. In this apparatus, the ratio or the difference between at least two predictive error data obtained with a plurality of frequency characteristics is calculated and then is compared with a predetermined reference value. The frequency characteristic in the corrector is controlled in conformity with the numerical relation between the calculated value and the reference value. Therefore, two or more frequency characteristics in the corrector are selectively rendered conformable with one frequency characteristic in the predictive error generator, hence achieving an enhanced effect of further improving the signal-to-noise ratio.