Abstract:
Disclosed herein are systems, methods, and computer-readable storage media for improving speech recognition accuracy using textual context. The method includes retrieving a recorded utterance, capturing text from a device display associated with the recorded utterance and viewed by one party to the recorded utterance, and identifying words in the captured text that are relevant to the recorded utterance. The method further includes adding the identified words to a dynamic language model, and recognizing the recorded utterance using the dynamic language model. The recorded utterance can be a spoken dialog. A time stamp can be assigned to each identified word. The method can include adding identified words to and/or removing identified words from the dynamic language model based on their respective time stamps. A screen scraper can capture text from the device display associated with the recorded utterance. The device display can contain customer service data.
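The time-stamp bookkeeping described above can be sketched roughly as follows. This is a minimal illustration and not the disclosed implementation: the class name `DynamicVocabulary`, the `ttl_seconds` expiry window, and the idea of tracking only a word list (rather than a full language model) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicVocabulary:
    """Hypothetical stand-in for the word list of a dynamic language model."""

    ttl_seconds: float = 60.0  # assumed expiry window, not taken from the abstract
    _stamps: dict = field(default_factory=dict, init=False)  # word -> latest time stamp

    def add_words(self, words, timestamp):
        # Record (or refresh) the time stamp of each word captured from the display.
        for word in words:
            self._stamps[word.lower()] = timestamp

    def expire(self, now):
        # Remove words whose time stamp is older than the assumed window.
        stale = [w for w, t in self._stamps.items() if now - t > self.ttl_seconds]
        for word in stale:
            del self._stamps[word]

    def active_words(self):
        # Words currently eligible to bias recognition, in sorted order.
        return sorted(self._stamps)
```

In use, words scraped from the screen would be added with the capture time, and `expire` would be called with the time of the utterance segment being recognized, so that only recently displayed words influence recognition.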
Abstract:
A system and method are disclosed for processing received data associated with a grammar. The method comprises receiving input data that cannot be assigned an interpretation by a grammar, translating the input data into translated input data, and submitting the translated input data to the grammar. A transducer coerces the set of strings encoded in a lattice resulting from recognition (such as speech recognition) to the closest strings in the grammar that can be assigned an interpretation.
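To make the idea of coercing input to the "closest strings in the grammar" concrete, here is a simplified sketch. It swaps the transducer and lattice for something much smaller: the grammar is assumed to be a finite list of strings, the lattice is flattened to a list of hypothesis strings, and closeness is measured by word-level Levenshtein distance. The function names `edit_distance` and `coerce_to_grammar` are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def coerce_to_grammar(hypotheses, grammar_strings):
    """Return (grammar_string, distance) for the closest grammar string.

    Ties are resolved in favor of the first pair encountered.
    """
    best = None
    for hyp in hypotheses:
        for target in grammar_strings:
            d = edit_distance(hyp.split(), target.split())
            if best is None or d < best[1]:
                best = (target, d)
    return best
```

A finite-state formulation would compose the lattice with an edit transducer and the grammar instead of enumerating pairs; the brute-force loop here only illustrates the selection criterion.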
Abstract:
A system and method of exchanging medical information between a user and a computer device are disclosed. The computer device can receive user input in one of a plurality of types of user input comprising speech, pen, gesture and a combination of speech, pen and gesture. The method comprises receiving information from the user associated with a medical condition and a bodily location of the medical condition on a patient in one of a plurality of types of user input, presenting in one of a plurality of types of system output an indication of the received medical condition and the bodily location of the medical condition, and presenting to the user an indication that the computer device is ready to receive further information. The invention enables a more flexible multi-modal interactive environment for entering medical information into a computer device. The medical device also generates multi-modal output for presenting a patient's medical condition in an efficient manner.
Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for approximating relevant responses to a user query with voice-enabled search. A system practicing the method receives a word lattice generated by an automatic speech recognizer based on user speech and a prosodic analysis of the user speech, generates a reweighted word lattice based on the word lattice and the prosodic analysis, approximates one or more relevant responses to the query based on the reweighted word lattice, and presents the responses to the user. The prosodic analysis examines metalinguistic information of the user speech and can identify the most salient subject matter of the speech, assess how confident a speaker is in the content of his or her speech, and identify the attitude, mood, emotion, sentiment, etc. of the speaker. Other information not described in the content of the speech can also be used.
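The reweighting step might look something like the sketch below. It makes several assumptions that are not in the abstract: the lattice is approximated by an n-best list of `(words, cost)` pairs where lower cost is better, prosodic analysis is reduced to a per-word prominence score, and the weight `alpha` and the function name `reweight_nbest` are invented for illustration.

```python
def reweight_nbest(nbest, prominence, alpha=1.0):
    """Lower the cost of hypotheses that contain prosodically prominent words.

    nbest      -- list of (words, cost) pairs; lower cost is better (assumed)
    prominence -- dict mapping a word to an assumed salience score
    alpha      -- assumed weight of the prosodic evidence
    """
    reweighted = []
    for words, cost in nbest:
        bonus = sum(prominence.get(word, 0.0) for word in words)
        reweighted.append((words, cost - alpha * bonus))
    # Best (lowest-cost) hypothesis first; sorted() is stable for ties.
    return sorted(reweighted, key=lambda item: item[1])
```

A full lattice version would adjust arc weights rather than whole-hypothesis scores, but the effect is the same in spirit: hypotheses that agree with the prosodic evidence move up the ranking before responses are retrieved.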
Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for approximating responses to a user speech query in voice-enabled search based on metadata that include demographic features of the speaker. A system practicing the method recognizes received speech from a speaker to generate recognized speech, identifies metadata about the speaker from the received speech, and feeds the recognized speech and the metadata to a question-answering engine. Identifying the metadata about the speaker is based on voice characteristics of the received speech. The demographic features can include age, gender, socio-economic group, nationality, and/or region. The metadata identified about the speaker from the received speech can be combined with or override self-reported speaker demographic information.
Abstract:
Disclosed herein are systems, computer-implemented methods, and tangible computer-readable media for multimodal interaction. The method includes receiving a plurality of multimodal inputs associated with a query, the plurality of multimodal inputs including at least one gesture input, and editing the at least one gesture input with a gesture edit machine. The method further includes responding to the query based on the edited gesture input and remaining multimodal inputs. The gesture inputs can be from a stylus, finger, mouse, or other pointing/gesture device. The gesture input can be unexpected or erroneous. The gesture edit machine can perform actions such as deletion, substitution, insertion, and aggregation. The gesture edit machine can be modeled as a finite-state transducer. In one aspect, the method further includes generating a lattice for each input, generating an integrated lattice of combined meaning of the generated lattices, and responding to the query further based on the integrated lattice.
Abstract:
An Internet Protocol television system includes a user profile agent, a keyword detection agent, and an information search agent. The user profile agent is in communication with a multimedia device, and generates a user profile based on information received from the multimedia device. The keyword detection agent is in communication with the user profile agent, and searches text associated with a multimedia video stream transmitted to the multimedia device for keywords associated with the user profile. The information search agent is in communication with the keyword detection agent, and connects to an information source associated with the keywords detected by the keyword detection agent, and provides additional information associated with the keywords to the multimedia device.
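The keyword detection step can be illustrated with a small sketch. It assumes the text associated with the video stream is available as a plain caption string and that profile keywords are single words; the function name `detect_keywords` and the tokenization rule are hypothetical choices for this example.

```python
import re


def detect_keywords(caption_text, profile_keywords):
    """Return the profile keywords that occur as whole words in the caption text.

    Matching is case-insensitive; results keep the order and spelling of
    profile_keywords. Multi-word keywords are not handled in this sketch.
    """
    tokens = set(re.findall(r"[a-z0-9']+", caption_text.lower()))
    return [kw for kw in profile_keywords if kw.lower() in tokens]
```

The detected keywords would then be handed to the information search agent, which looks up additional material for each one.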