Abstract:
A low-cost speech recognition system generates frames of received speech having binary feature components. The received speech frames are compared with reference templates, and error values representing the difference between the received speech and the reference templates are generated. At the end of an utterance, if one template resulted in a sufficiently small error value, the word represented by that template is selected as the recognized word.
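The matching step described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the error value is a Hamming distance summed over frames, that the utterance and each template have the same number of frames, and that "sufficiently small" means at or below a fixed threshold. The names `hamming`, `utterance_error`, `recognize`, and `max_error` are hypothetical.

```python
from typing import Dict, Optional, Sequence

Frame = Sequence[int]  # one frame of binary feature components (0 or 1)


def hamming(a: Frame, b: Frame) -> int:
    """Count the feature components on which two binary frames differ."""
    return sum(x != y for x, y in zip(a, b))


def utterance_error(frames: Sequence[Frame], template: Sequence[Frame]) -> int:
    """Sum the per-frame differences (assumes equal frame counts, no time warping)."""
    return sum(hamming(f, t) for f, t in zip(frames, template))


def recognize(
    frames: Sequence[Frame],
    templates: Dict[str, Sequence[Frame]],
    max_error: int,
) -> Optional[str]:
    """Return the word whose template has the smallest error, or None if
    even the best template exceeds the assumed threshold max_error."""
    best_word, best_err = None, None
    for word, template in templates.items():
        err = utterance_error(frames, template)
        if best_err is None or err < best_err:
            best_word, best_err = word, err
    if best_err is not None and best_err <= max_error:
        return best_word
    return None


templates = {
    "yes": [[1, 0, 1], [1, 1, 0]],
    "no": [[0, 0, 0], [0, 1, 1]],
}
print(recognize([[1, 0, 1], [1, 1, 1]], templates, max_error=1))
```

Returning `None` when no template is close enough mirrors the abstract's condition that a word is selected only if its error value is sufficiently small.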
Abstract:
A method for generating connected word templates begins with generating isolated word templates of selected words. The isolated word templates are used to extract a continuous word template from a segment of continuous speech containing the selected words. Both the isolated word templates and the connected word templates can be used to generate speech to determine the quality of the generated templates through aural judgment.
Abstract:
Memory chips with data memory (202), embedded logic (206) and broadcast memory (204) for two modes of operation are disclosed. A first mode of operation is the usual memory mode expected of a data RAM. The second mode of operation allows localized computation and/or processing of the data in data memory (202) by the embedded logic (206) with minimal handshaking with a remote CPU. In a functioning system, the memory chips are organized in a hierarchical manner and include address-associative memory systems.
Abstract:
A memory system 10 is provided including a processor 12 and an active memory device 14 coupled to processor 12. Active memory 14 includes a first memory 20 for storing a plurality of possible addresses and a second memory 22 for storing an actual address received from processor 12. Circuitry 26 is provided for identifying at least one active address from ones of the possible addresses stored in first memory 20 as a function of the actual address stored in second memory 22.
Abstract:
A speech system recognizes words from a spoken phrase that conform to checksum constraints. Grammar rules are applied to hypothesize words according to the checksum constraints. The checksum associated with the phrase is thus inherent in the grammar. Sentences which do not meet a predetermined checksum constraint are not valid under the grammar rules and are therefore inherently rejected. The checksum constraints result in increased recognition accuracy.
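The effect of a checksum constraint on recognition can be illustrated with a simplified stand-in. The abstract builds the constraint into the grammar itself; the sketch below instead filters a scored list of digit-string hypotheses after the fact, which is easier to show but is not the same mechanism. The checksum rule (last digit equals the sum of the preceding digits modulo 10), the scores, and the names `checksum_ok` and `best_valid` are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def checksum_ok(digits: Sequence[int], modulus: int = 10) -> bool:
    """Assumed rule: the final digit is the sum of the others modulo `modulus`."""
    *payload, check = digits
    return sum(payload) % modulus == check


def best_valid(
    hypotheses: List[Tuple[Sequence[int], float]],
) -> Optional[Sequence[int]]:
    """Return the highest-scoring hypothesis that satisfies the checksum,
    or None if every hypothesis violates it."""
    valid = [
        (digits, score)
        for digits, score in hypotheses
        if len(digits) >= 2 and checksum_ok(digits)
    ]
    if not valid:
        return None
    return max(valid, key=lambda item: item[1])[0]


# Hypothetical recognizer output: (digit string, acoustic score).
hypotheses = [
    ((1, 2, 4), 0.9),  # best acoustic score, but 1 + 2 != 4
    ((1, 2, 3), 0.8),  # satisfies the checksum
    ((5, 5, 0), 0.3),  # satisfies the checksum, lower score
]
print(best_valid(hypotheses))
```

Here the acoustically best hypothesis is rejected because it violates the checksum, which is the sense in which the constraint can raise recognition accuracy.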
Abstract:
A speaker voice verification system uses temporal decorrelation linear transformation and includes a collector for receiving speech inputs from an unknown speaker claiming a specific identity, a word-level speech features calculator operable to use a temporal decorrelation linear transformation for generating word-level speech feature vectors from such speech inputs, word-level speech feature storage for storing word-level speech feature vectors known to belong to a speaker with the specific identity, a word-level vector scorer operable to calculate a similarity score between the word-level speech feature vectors received from the unknown speaker and those received from the word-level speech feature storage, and speaker verification decision circuitry for determining, based on the similarity score, whether the unknown speaker's identity is the same as that claimed. The word-level vector scorer further includes concatenation circuitry as well as a word-specific orthogonalizing linear transformer. Other systems and methods are also disclosed.
Abstract:
A cost-effective word recognizer. Each frame of spoken input is compared to a set of reference frames. The comparison is equivalent to embodying the reference frame as an LPC inverse filter, and is preferably done in the autocorrelation domain. To avoid the instability and computational difficulties which can be caused by a high-gain LPC inverse filter, a noise floor is introduced into each reference frame sample. Thus, for each input speech frame, a scalar measures its similarity to each of the vocabulary of reference frames. To achieve connected word recognition based on this similarity measurement, a dynamic programming algorithm is used in which time warping to match a sample to a reference is in effect permitted, and in which matching is performed with unconstrained endpoints. Thus, the word boundary decisions are made on the basis of a local maximum in similarity, and, since no separate word division decision is required, the error which can be introduced by even the best preliminary decision as to word boundaries is avoided.
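The autocorrelation-domain comparison mentioned in this abstract can be sketched with a standard identity: the energy of an input frame after passing through an inverse filter with coefficients a_0..a_p equals r_a(0) r_x(0) + 2 Σ r_a(k) r_x(k) for k = 1..p, where r_a and r_x are the autocorrelations of the filter coefficients and of the frame. The code below is an illustration of that identity only. It omits the noise floor and the dynamic programming stage, and the names `autocorr` and `residual_energy` are hypothetical.

```python
from typing import List, Sequence


def autocorr(x: Sequence[float], max_lag: int) -> List[float]:
    """Autocorrelation of x for lags 0..max_lag (zero outside the sequence)."""
    n = len(x)
    return [
        sum(x[i] * x[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def residual_energy(inverse_filter: Sequence[float], frame: Sequence[float]) -> float:
    """Energy of `frame` filtered by the reference frame's inverse filter,
    computed from autocorrelations instead of filtering sample by sample.
    A smaller value means the frame is better predicted by the reference."""
    p = len(inverse_filter) - 1
    r_a = autocorr(inverse_filter, p)
    r_x = autocorr(frame, p)
    return r_a[0] * r_x[0] + 2.0 * sum(r_a[k] * r_x[k] for k in range(1, p + 1))


# Hypothetical first-order inverse filter A(z) = 1 - 0.5 z^-1 and a short frame.
print(residual_energy([1.0, -0.5], [1.0, 2.0, 3.0]))
```

Because the reference side reduces to a handful of precomputed r_a values, each comparison costs only p + 1 multiply-adds per reference frame once the input autocorrelation is known.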
Abstract:
A method and system are provided for time aligning speech. Speech data is input representing speech signals from a speaker. An orthographic transcription is input including a plurality of words transcribed from the speech signals. A sentence model is generated indicating a selected order of the words in response to the orthographic transcription. In response to the orthographic transcription, word models are generated associated with respective ones of the words. The orthographic transcription is aligned with the speech data in response to the sentence model, to the word models and to the speech data.
Abstract:
A speech encoder is disclosed that quantizes speech information with respect to energy, voicing and pitch parameters to provide a fixed number of bits per block of frames. Coding of the parameters takes place for each N frames, which comprise a block, irrespective of phonemic boundaries. Certain frames of speech information are discarded during transmission, if such information is substantially duplicated in an adjacent frame. A very low data rate transmission system is thus provided which exhibits a high degree of fidelity and throughput.
Abstract:
A voice messaging system, wherein linear predictive coding (LPC) parameters, pitch, and preferably other excitation information are derived from a human voice input, encoded, and transmitted and/or stored, to be called up later to provide a speech output which is nearly identical to the original speech input. The invention features adaptive filtering of the residual signal. The residual signal derived from LPC estimation is adaptively filtered, and then is used as the input to a conventional pitch estimation procedure. The adaptive filtering step uses the first reflection coefficient ($k_1$) to realize a simple filter (e.g., $A(z) = (1 - k_1 z^{-1})^{-1}$). This filter removes high frequency noise from the residual signal during voiced periods, but does not remove the high frequency energy which contains important information during the unvoiced periods of speech. Preferably the above preprocessing technique is also combined with a postprocessing technique, wherein dynamic programming is used to optimally track pitch and voicing information through successive frames.
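The example filter in this abstract, A(z) = 1 / (1 - k1 z^-1), is a one-pole recursive filter with the difference equation y[n] = x[n] + k1 * y[n-1]. The sketch below applies it to one frame of residual samples. It assumes a single k1 value per frame, zero initial state, and the sign convention in which k1 is positive for voiced speech; the function name `adaptive_filter` is hypothetical, and the pitch estimation that would follow is not shown.

```python
from typing import List, Sequence


def adaptive_filter(residual: Sequence[float], k1: float) -> List[float]:
    """Apply y[n] = x[n] + k1 * y[n-1] to one frame of LPC residual.

    With k1 near +1 (assumed typical of voiced frames) the filter is a
    low-pass that attenuates high-frequency noise. With k1 near zero or
    negative (assumed typical of unvoiced frames) the high-frequency
    energy is left largely intact.
    """
    out: List[float] = []
    prev = 0.0  # assumed zero initial state at the start of the frame
    for x in residual:
        prev = x + k1 * prev
        out.append(prev)
    return out


# Impulse response for a hypothetical k1 of 0.5.
print(adaptive_filter([1.0, 0.0, 0.0], 0.5))
```

Because the filter coefficient is the frame's own first reflection coefficient, the amount of smoothing follows the signal without a separate voicing decision.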