摘要:
A system and method include an initial noise model produced based on pre-estimated noise of a service environment and an initial synthesized model of a voice containing noise. The system and method produce an utterance environment noise model from background noise of the service environment upon speech recognition as well as a sequence of feature vectors from noise-superimposed speech including an uttered voice and the background noise. The system and method also produce an adaptive model by adapting the initial synthesized model using the utterance environment noise model, the initial noise model, and a compensation model, so that the adaptive model is checked against the sequence of feature vectors to perform speech recognition. Upon performing the speech recognition, a compensation model is created upon which the signal to noise ratio between the background noise present at the time of actual utterance of a voice and the uttered voice is reflected.
摘要:
According to the present invention, a method for integrating processes with a multi-faceted human centered interface is provided. The interface is facilitated to implement a hands free, voice driven environment to control processes and applications. A natural language model is used to parse voice initiated commands and data, and to route those voice initiated inputs to the required applications or processes. The use of an intelligent context based parser allows the system to intelligently determine what processes are required to complete a task which is initiated using natural language. A single window environment provides an interface which is comfortable to the user by preventing the occurrence of distracting windows from appearing. The single window has a plurality of facets which allow distinct viewing areas. Each facet has an independent process routing its outputs thereto. As other processes are activated, each facet can reshape itself to bring a new process into one of the viewing areas. All activated processes are executed simultaneously to provide true multitasking.
摘要:
A method of speech recognition obtains acoustic data from a plurality of conversations. A plurality of pairs of utterances are selected from the plurality of conversations. At least one portion of the first utterance of the pair of utterances is dynamically aligned with at least one portion of the second utterance of the pair of utterance, and an acoustic similarity is computed. At least one pair that includes a first portion from a first utterance and a second portion from a second utterance is chosen, based on a criterion of acoustic similarity. A common pattern template is created from the first portion and the second portion.
摘要:
A computer-based system generates alternative phonetic transcriptions for a target word or phrase corresponding to specific phonological processes that replace individual phonemes or clusters of two or more phonemes with replacement phonemes. The system compares a user's speech with a list of possible transcriptions that includes the base (i.e., correct) transcription of the test target as well as the different alternative transcriptions, to identify the transcription that best matches the user's. In a speech therapy application, the system identifies the phonological process(es), if any, associated with the user's speech and generates statistics over multiple test targets that can be used to diagnose the user's specific phonological disorders. The system can also be implemented in other contexts such as foreign language instruction and automated attendant applications to cover a wide variety and range of accents and/or phonological disorders.
摘要:
A system and method for voice recognition is disclosed. The system enrolls speakers using an enrollment voice samples and identification information. An extraction module characterizes enrollment voice samples with high-dimensional feature vectors or speaker data points. A data structuring module organizes data points into a high-dimensional data structure, such as a kd-tree, in which similarity between data points dictates a distance, such as a Euclidean distance, a Minkowski distance, or a Manhattan distance. The system recognizes a speaker using an unidentified voice sample. A data querying module searches the data structure to generate a subset of approximate nearest neighbors based on an extracted high-dimensional feature vector. A data modeling module uses Parzen windows to estimate a probability density function representing how closely characteristics of the unidentified speaker match enrolled speakers, in real-time, without extensive training data or parametric assumptions about data distribution. A smoothing parameter controls the relative contributions of close and far speaker data points to the estimated density.
摘要:
A voice recognition/response system comprising an utterance recognition unit, a dialog control processing unit, an utterance feature analyzing unit and a response voice generating unit. The utterance recognition unit recognizes utterance content of a user through a voice input therefrom and outputs recognition results. The dialog control processing unit controls progress of dialog with the user based on the recognition results so as to determine response content to the user. The utterance feature analyzing unit analyzes utterance features of the user to generate utterance feature information. The response voice generating unit generates response voice to the user based on the response content and the utterance feature information.
摘要:
A method for verifying and identifying users, and for verifying users' identity, by means of an authentication device capable of transmitting, receiving and recording audio or ultrasonic signals, and capable of converting the signals into digital data, and performing digital signal processing. Voice pattern(s) and user(s) information of one or more authorized user(s) are recorded and stored on the authentication device. User(s) identification is verified by inputting to the authentication device a vocal identification signal from a user, and comparing the voice pattern of the vocal identification signal with the recorded voice pattern(s) of the authorized user(s), and if a match is detected issuing an indication that the user is identified as an authorized user.
摘要:
A method is developed which includes 1) defining a switching state space model for a continuous valued hidden production-related parameter and the observed speech acoustics, and 2) approximating a posterior probability that provides the likelihood of a sequence of the hidden production-related parameters and a sequence of speech units based on a sequence of observed input values. In approximating the posterior probability, the boundaries of the speech units are not fixed but are optimally determined. Under one embodiment, a mixture of Gaussian approximation is used. In another embodiment, an HMM posterior approximation is used.
摘要:
Automated decision making techniques are provided. For example, a technique for generating a decision associated with an individual or an entity includes the following steps. First, two or more data streams associated with the individual or the entity are captured. Then, at least one time-varying measure is computed in accordance with the two or more data streams. Lastly, a decision is computed based on the at least one time-varying measure. One form of the time-varying measure may include a measure of the coverage of a model associated with previously-obtained training data by at least a portion of the captured data. Another form of the time-varying measure may include a measure of the stability of at least a portion of the captured data. While either measure may be employed alone to compute a decision, preferably both the coverage and stability measures are employed. The technique may be used to authenticate a speaker.
摘要:
Systems and methods for analyzing speech containing at least one lexical item, said analysis to determine an attitude of a speaker towards an entity, comprising determining at least one actual valence for the at least one lexical item by analyzing the at least one lexical item in context; determining the attitude based on the at least one actual valence; associating the speaker, the entity and the attitude; and wherein the at least one lexical item encodes attitude information about the entity.