Abstract:
A normalization-parameter generating unit generates normalization parameters by calculating the mean values and the standard deviations of an initial prosody pattern and of a prosody pattern of a training sentence in a speech corpus. A prosody-pattern normalizing unit then normalizes the variance range or variance width of the initial prosody pattern in accordance with the normalization parameters. As a result, a prosody pattern that resembles human speech and sounds more natural can be generated with a small amount of calculation.
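The normalization step can be illustrated with a minimal sketch. It assumes a prosody pattern is a plain list of pitch values and that normalization means shifting and scaling the initial pattern so that its mean and standard deviation match those of the training-sentence pattern. The abstract does not give the exact mapping, so the function names and the formula below are illustrative assumptions only.

```python
from statistics import mean, pstdev


def normalization_params(initial, reference):
    """Return ((mean, std) of initial, (mean, std) of reference)."""
    return (mean(initial), pstdev(initial)), (mean(reference), pstdev(reference))


def normalize_pattern(initial, reference):
    """Shift and scale `initial` to the mean and spread of `reference`.

    Assumption: a simple z-score style mapping; the patented method may differ.
    """
    (mu_i, sd_i), (mu_r, sd_r) = normalization_params(initial, reference)
    if sd_i == 0:
        # A flat initial pattern has no spread to rescale; only shift it.
        return [mu_r for _ in initial]
    scale = sd_r / sd_i
    return [mu_r + (x - mu_i) * scale for x in initial]


# Hypothetical pitch values in Hz, for illustration only.
initial_pattern = [100.0, 110.0, 120.0]
corpus_pattern = [180.0, 200.0, 220.0, 240.0]
normalized = normalize_pattern(initial_pattern, corpus_pattern)
```

The cost is two passes over each pattern, which is consistent with the small amount of calculation the abstract claims.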
Abstract:
A method to generate a pitch contour for speech synthesis is proposed. The method is based on finding the pitch contour that maximizes a total likelihood function created by combining all the statistical models of the pitch contour segments of an utterance, at one or multiple linguistic levels. These statistical models are trained from a database of spoken speech by means of a decision tree that, for each linguistic level, clusters the parametric representation of the pitch segments extracted from the speech data using features obtained from the text associated with that speech data. The pitch segments are parameterized in such a way that the likelihood function of any linguistic level can be expressed in terms of the parameters of one of the levels, which allows the maximization to be calculated with respect to the parameters of that level. Moreover, the parameterization of that main level has to be invertible so that the final pitch contour is obtained from the parameters of that level by means of an inverse transformation.
Abstract:
A speech recognition device includes an extracting unit that analyzes an input signal and extracts a feature to be used for speech recognition from the input signal; a storing unit configured to store therein an acoustic model that is a stochastic model for estimating what type of phoneme is included in the feature; a speech-recognition unit that performs speech recognition on the input signal based on the feature and determines a word having maximum likelihood from the acoustic model; and an optimizing unit that dynamically self-optimizes parameters of the feature and the acoustic model depending on at least one of the input signal and a state of the speech recognition performed by the speech-recognition unit.
Abstract:
A packet communication system or ATM communication system in which a sequence of signals such as speech signals is divided into a plurality of band areas and the power of each band area is determined. Based on the power of each band area, coding signals are allocated for each band, frame by frame. At a receiving side, the signal-to-noise ratio (SNR) of the decoded signal is predicted by changing the total number of encoding bits for each band area based on the power of each band area signal. The bit rate is controlled so as to make the SNR constant. The bit rate is changed in accordance with a Fourier transform of the input signal.
Abstract:
In a transmitter of the present invention, an input signal is input to a QMF bank 102, where it is divided into a plurality of frequency bands to form corresponding band signals. A distributed bit calculating unit 109 calculates, based on the respective power values of the band signals, the respective bit rates at which the corresponding band signals are encoded. Quantizers 104-1, 104-2, . . . , 104-n encode the respective band signals at the corresponding bit rates and input the resulting band codes to a multiplexer unit 111, which incorporates the respective band codes into a cell as an information unit and sends the cell. In a receiver, a cell is decomposed to obtain the respective band codes, which are then dequantized to form the corresponding band signals. These band signals are synthesized to form a signal for the entire band, which is output as a decoded signal.
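The power-based bit distribution can be sketched as follows. The abstract does not state the rule used by the distributed bit calculating unit, so this example substitutes a well-known greedy scheme: each bit goes to the band whose estimated quantization noise is currently largest, under the textbook assumption that one extra bit reduces noise power by about a factor of four (roughly 6 dB). The function name `allocate_bits` and the sample powers are hypothetical.

```python
def allocate_bits(band_powers, total_bits):
    """Greedily distribute `total_bits` across bands according to band power.

    Assumption: each added bit divides a band's quantization noise power by 4.
    This is a generic textbook scheme, not necessarily the patented rule.
    """
    if not band_powers:
        return []
    bits = [0] * len(band_powers)
    noise = [float(p) for p in band_powers]  # noise estimate with zero bits
    for _ in range(total_bits):
        # Pick the band that currently contributes the most noise.
        k = max(range(len(noise)), key=lambda i: noise[i])
        bits[k] += 1
        noise[k] /= 4.0
    return bits


# Hypothetical per-band power values for one frame.
frame_bits = allocate_bits([16.0, 4.0, 1.0], total_bits=6)
```

A greedy loop keeps the total exactly equal to the bit budget, which matters when the codes for all bands must fit into a fixed-size cell.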
Abstract:
A transformation-parameter calculating unit calculates a first model parameter, which is a parameter of a speaker model that maximizes a first likelihood for a clean feature, and calculates a transformation parameter that maximizes the first likelihood. The transformation parameter transforms, for each of the speakers, a distribution of the clean feature corresponding to the identification information of the speaker to a distribution represented by the speaker model of the first model parameter. A model-parameter calculating unit transforms a noisy feature corresponding to the identification information of each of the speakers by using the transformation parameter, and calculates a second model parameter, which is a parameter of the speaker model that maximizes a second likelihood for the transformed noisy feature.
Abstract:
A speech synthesis method that generates a speech pitch wave from a reference speech signal by subjecting the reference speech signal to one of Fourier transform and Fourier series expansion to produce a discrete spectrum, that interpolates the discrete spectrum to generate a consecutive spectrum, and that subjects the consecutive spectrum to inverse Fourier transform. A linear prediction coefficient is generated by subjecting the reference speech signal to a linear prediction analysis. The speech pitch wave is subjected to inverse-filtering based on the linear prediction coefficient to produce a residual pitch wave. Information regarding the residual pitch wave is stored as information of a speech synthesis unit in a voice period. A speech is then synthesized using the information of the speech synthesis unit.
Abstract:
A speech synthesis method subjects a reference speech signal to windowing to extract an aperiodic speech pitch wave from the reference speech signal. A linear prediction coefficient is generated by subjecting the reference speech signal to a linear prediction analysis. The aperiodic speech pitch wave is subjected to inverse-filtering based on the linear prediction coefficient to produce a residual pitch wave. Information regarding the residual pitch wave is stored in storage as information of a speech synthesis unit and a voiced period. Speech is then synthesized using the information of the speech synthesis unit.
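The linear prediction analysis and inverse-filtering steps can be sketched in plain Python. The example assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain linear prediction coefficients; the abstract does not say which analysis method is used, and the windowing and storage steps are left out. All function names are illustrative.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the finite sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion.

    Returns a[0..order] with a[0] = 1, so that the prediction residual is
    e[n] = sum_k a[k] * x[n - k].
    """
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # signal is fully predicted (or silent); stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a


def inverse_filter(x, a):
    """Apply the FIR inverse filter A(z) to x and return the residual."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


# Hypothetical test signal: the impulse response of a one-pole filter.
signal = [0.9 ** n for n in range(200)]
coeffs = lpc(signal, order=1)
residual = inverse_filter(signal, coeffs)
```

For this test signal the residual collapses to a single impulse at the first sample, which shows the inverse filter removing the spectral envelope and leaving only the excitation.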
Abstract:
A speech encoding method and apparatus including analyzing, using a codebook expressing speech parameters within a predetermined search range, an input speech signal in an audibility weighting filter corresponding to a pitch period longer than the search range of the codebook, and searching, from the codebook, on the basis of the analysis result, a combination of speech parameters by which the distortion of the input speech signal is minimized, and encoding the combination. The apparatus uses an adaptive codebook of pitch and a noise codebook. The codebooks search a group formed by extracting vectors of predetermined length from one original code vector, while sequentially shifting position so that the vectors overlap each other. The search group is further restricted and another preselection is made before the final search. Search is based on inversely convoluted, orthogonally transformed vectors.
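The overlapped codebook structure described above can be illustrated with a short sketch. It assumes the codebook entries are slices of a single base vector taken at successive shifted positions, and it searches them by plain squared error. The perceptual weighting filter, gain terms, preselection stages, and orthogonal transformation mentioned in the abstract are deliberately omitted, and the function names are hypothetical.

```python
def overlapped_codebook(base_vector, length, shift=1):
    """Extract overlapping vectors of `length` samples from one base vector."""
    if length <= 0 or shift <= 0:
        raise ValueError("length and shift must be positive")
    last_start = len(base_vector) - length
    return [base_vector[s:s + length] for s in range(0, last_start + 1, shift)]


def search_codebook(target, codebook):
    """Return (index, distortion) of the entry with the least squared error."""
    def distortion(vec):
        return sum((t - c) ** 2 for t, c in zip(target, vec))

    best = min(range(len(codebook)), key=lambda i: distortion(codebook[i]))
    return best, distortion(codebook[best])


# Hypothetical base code vector and target, for illustration only.
base = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
book = overlapped_codebook(base, length=3, shift=1)
index, error = search_codebook([2.1, 2.9, 4.0], book)
```

Storing one base vector instead of every entry keeps memory small, and adjacent entries share most of their samples, which is what makes restricted searches and preselection cheap.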
Abstract:
A speech communication apparatus of the present invention includes, in addition to an echo canceller for canceling an acoustic echo generated in a hands-free speech space, a chirp signal generating unit and a training control unit. The chirp signal generating unit generates a chirp signal suitable for initial training of the echo canceller. When a predetermined condition for starting hands-free speaking is satisfied, the training control unit causes the chirp signal generating unit to generate a chirp signal and causes a chirp tone corresponding to the chirp signal to be output as a volume-amplified tone from the hands-free speaker. The echo canceller performs its initial training based on the chirp tone.
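A chirp signal of the kind used for initial training can be sketched as follows. The abstract does not specify the sweep shape, frequency range, or duration, so this example assumes a linear frequency sweep, and every parameter value shown is hypothetical.

```python
import math


def linear_chirp(f_start, f_end, duration, sample_rate, amplitude=1.0):
    """Generate a linear chirp sweeping from f_start to f_end (both in Hz).

    Assumption: a linear sweep; the apparatus may use a different chirp shape.
    """
    if duration <= 0 or sample_rate <= 0:
        raise ValueError("duration and sample_rate must be positive")
    n_samples = round(duration * sample_rate)
    sweep_rate = (f_end - f_start) / duration  # Hz per second
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        # The phase is the integral of the instantaneous frequency.
        phase = 2.0 * math.pi * (f_start * t + 0.5 * sweep_rate * t * t)
        samples.append(amplitude * math.sin(phase))
    return samples


# Hypothetical training tone: 300 Hz to 3400 Hz over half a second at 8 kHz.
training_tone = linear_chirp(300.0, 3400.0, duration=0.5, sample_rate=8000)
```

A sweep excites the whole band of interest within a short interval, which is why a chirp is a reasonable stimulus for estimating the echo path before a call starts.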