摘要:
In methods and apparatus for at least partially automating a telephone directory assistance function, directory assistance callers are prompted to speak locality or called entity names associated with desired directory listings. A speech recognition algorithm is applied to speech signals received in response to prompting to determine spoken locality or called entity names. Desired telephone numbers are released to callers, and released telephone numbers are used to confirm or correct at least some of the recognized locality or called entity names. Speech signal representations labelled with the confirmed or corrected names are used as labelled speech tokens to refine prior training of the speech recognition algorithm. The training refinement automatically adjusts for deficiencies in prior training of the speech recognition algorithm and to long term changes in the speech patterns of directory assistance callers served by a particular directory assistance installation. The methods can be generalized to other speech recognition applications.
摘要:
In a speech recognizer, for recognizing unknown utterances in isolated-word speech or continuous speech, improved recognition accuracy is obtained by augmenting the usual spectral representation of the unknown utterance with a dynamic component. A corresponding dynamic component is provided in the templates with which the spectral representation of the utterance is compared. In preferred embodiments, the representation is mel-based cepstral and the dynamic components comprise vector differences between pairs of primary cepstra. Preferably the time interval between each pair is about 50 milliseconds. It is also preferable to compute a dynamic perceptual loudness component along with the dynamic parameters.
摘要:
A method and apparatus are provided for performing prosody based endpoint detection of speech in a speech recognition system. Input speech represents an utterance, which has an intonation pattern. An end-of-utterance condition is identified based on prosodic parameters of the utterance, such as the intonation pattern and the duration of the final syllable of the utterance, as well as non-prosodic parameters, such as the log energy of the speech.
摘要:
A system and method for efficiently distributing voice call data received from speech recognition servers over a telephone network having a shared processing resource is disclosed. Incoming calls are received from phone lines and assigned grammar types by speech recognition servers. A request for processing the voice call data is sent to a resource manager which monitors the shared processing resource and identifies a preferred processor within the shared resource. The resource manager sends an instruction to the speech recognition server to send the voice call data to a preferred processor for processing. The preferred processor is determined by known processor efficiencies for voice call data having the assigned grammar type of the incoming voice call data and a measure of processor loads. While the system is operating, the resource manger develops and updates a history of each processor. The histories include processing efficiency values for all grammar types received. The processing efficiencies are stored, tabulated and assigned usage number values for each processor. When incoming voice call data is receive, the resource manages evaluates the total sum of the usage numbers for processing requests assigned to each processor and the usage number for the grammar type of the incoming data as applied to each processor. The incoming data is distributed to the processor with the lowest sum of total of usage numbers for assigned requests and the usage number assigned to the incoming data for that processor.
摘要:
A speech recognizer, for recognizing unknown utterances in isolated-word small-vocabulary speech has improved rejection of out of vocabulary utterances. Both a usual spectral representation including a dynamic component and an equalized representation are used to match unknown utterances to templates for in-vocabulary words. In a preferred embodiment, the representations are mel-based cepstral with dynamic components being signed vector differences between pairs of primary cepstra. The equalized representation being the signed difference of each cepstral coefficient less an average value of the coefficients. Factors are generated from the ordered lists of templates to determine the probability of the top choice being a correct acceptance, with different methods being applied when the usual and equalized representations yield a different match. For additional enhancement, the rejection method may use templates corresponding to non-vocabulary utterances or decoys. If the top choice corresponds to a decoy, the input is rejected.
摘要:
A system and method for spoken language proficiency assessment by a computer is described. A user provides a spoken response to a constructed response question. A speech recognition system processes the spoken response into a sequence of linguistic units. At training time, features matching a linguistic template are extracted by identifying matches between a training sequence of linguistic units and pre-selected templates. Additionally, a generalized count of the extracted features is computed. At runtime, linguistic features are detected by comparing a runtime sequence of linguistic units to the feature set extracted at training time. This comparison results in a generalized count of linguistic features. The generalized count is then used to compute a score.
摘要:
A speech-enabled distributed processing system forming a Voice Web includes a gateway, one or more voice content sites coupled to the gateway over a wide area network, and a browser coupled to the gateway over a network, which may or may not be the wide area network. The gateway receives telephone calls from one or more users over telephony connections and performs endpointing of speech of each user. The browser provides the gateway with information enabling the gateway to selectively direct the endpointed speech to a voice content site via the wide area network. The gateway outputs the endpointed speech in the form of application protocol requests onto the wide area network to the appropriate site, as specified by the browser, or to the browser. The gateway receives prompts in the form of application protocol responses from the browser or a voice content site and plays the prompts to the appropriate user over the telephony connection. While accessing a selected voice content site, the gateway reroutes the endpointed speech to the browser if the endpointing result represents a hotword candidate.
摘要:
A method of recognizing speech comprises searching a vocabulary of words for a match to an unknown utterance. Words in the vocabulary are represented by concatenated allophone models and the vocabulary is represented as a network. On a first pass of the search, a one-state duration constrained model is used to search the vocabulary network. The one-state model has as its transition probability the maximum observed transitional probability (model distance) of the unknown utterance for the corresponding allophone model. Words having top scores are chosen from the first pass search and, in a second pass of the search, rescored using a full Viterbi trellis with the complete allophone models and model distances. The rescores are sorted to provide a few top choices. Using a second set of speech parameters these few top choices are again rescored. Comparison of the scores using each set of speech parameters determines a recognition choice. Post processing is also possible to further enhance recognition accuracy. Test results indicate that the two-pass search provides approximately the same recognition accuracy as a full Viterbi search of the vocabulary network.
摘要:
In methods and apparatus for at least partially automating a telephone directory assistance function, directory assistance callers are prompted to speak locality or called entity names associated with desired directory listings. A speech recognition algorithm is applied to speech signals received in response to prompting to determine spoken locality or called entity names. Desired telephone numbers are released to callers, and released telephone numbers are used to confirm or correct at least some of the recognized locality or called entity names. Speech signal representations labelled with the confirmed or corrected names are used as labelled speech tokens to refine prior training of the speech recognition algorithm. The training refinement automatically adjusts for deficiencies in prior training of the speech recognition algorithm and to long term changes in the speech patterns of directory assistance callers served by a particular directory assistance installation. The methods can be generalized to other speech recognition applications.
摘要:
In a telecommunications system, automatic directory assistance uses a voice processing unit comprising a lexicon of lexemes and data representing a predetermined relationship between each lexeme and calling numbers in a locality served by the automated directory assistance apparatus. The voice processing unit issues messages to a caller making a directory assistance call to prompt the caller to utter a required one of said lexemes. The unit detects the calling number originating a directory assistance call and, responsive to the calling number and the relationship data computes a probability index representing the likelihood of a lexeme being the subject of the directory assistance call. The unit employs a speech recognizer to recognize, on the basis of the acoustics of the caller's utterance and the probability index, a lexeme corresponding to that uttered by the caller.