Abstract:
A system and method are presented for performing dual-mode speech recognition, employing a local recognition module on a mobile device and a remote recognition engine on a server device. The system accepts a spoken query from a user, and both the local recognition module and the remote recognition engine perform speech recognition on the query, each returning a transcription and a confidence score, subject to a latency cutoff time. If both sources successfully transcribe the query, the system accepts the result with the higher confidence score. If only one source succeeds, that result is accepted. In either case, if the remote recognition engine succeeds in transcribing the query, the client vocabulary is updated with any information in the remote result that is not already present in the client vocabulary.
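The arbitration logic described above can be sketched as follows. This is an illustrative outline, not the patented implementation; the names `RecognitionResult`, `select_result`, and `update_vocabulary` are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class RecognitionResult:
    transcription: str
    confidence: float

def select_result(local: Optional[RecognitionResult],
                  remote: Optional[RecognitionResult]) -> Optional[RecognitionResult]:
    """Both sources succeeded: take the higher confidence score.
    One succeeded: take that one. Neither: no result."""
    if local and remote:
        return remote if remote.confidence > local.confidence else local
    return local or remote

def update_vocabulary(vocabulary: Set[str],
                      remote: Optional[RecognitionResult]) -> Set[str]:
    """If the remote engine succeeded, add any words from its result
    that are not already in the client vocabulary."""
    if remote is not None:
        vocabulary |= set(remote.transcription.lower().split())
    return vocabulary
```

In this sketch a result of `None` stands for a source that failed to transcribe within the latency cutoff.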
Abstract:
In one implementation, a method is described for retrying the matching of an audio query against audio references. The method includes receiving a follow-up query that requests a retry at matching a previously submitted audio query. In some implementations, this follow-up query is received without any recognition hint suggesting how to retry the matching. The follow-up query includes the audio query, or a reference to the audio query, to be used in the retry. The method further includes retrying the matching of the audio query using retry matching resources that include an expanded group of audio references, identifying at least one match, and transmitting a report of the match. Optionally, the method includes storing data that correlates the follow-up query, the audio query or the reference to it, and the match found after retrying.
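A minimal sketch of the retry step might look like the following. Set-overlap scoring is an illustrative stand-in for a real fingerprint matcher, and the function names and the 0.5 threshold are assumptions.

```python
def retry_match(query_fp, expanded_refs, min_overlap=0.5):
    """Match the query's fingerprint codes against an expanded group of
    references; return (ref_id, score) pairs sorted by descending score."""
    q = set(query_fp)
    matches = []
    for ref_id, ref_fp in expanded_refs.items():
        overlap = len(q & set(ref_fp)) / max(len(q), 1)
        if overlap >= min_overlap:
            matches.append((ref_id, overlap))
    return sorted(matches, key=lambda m: -m[1])

def record_retry(log, follow_up_id, query_ref, matches):
    """Optionally correlate the follow-up query, the audio query
    reference, and the match found after retrying."""
    if matches:
        log.append((follow_up_id, query_ref, matches[0][0]))
    return log
```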
Abstract:
A client, such as a mobile phone, receives an audio signal from a microphone; the sound comes from a broadcast source such as a radio or television program. The client sends a segment of audio data from the broadcast program to a detection system, such as a server. A broadcast monitoring system receives many broadcast audio signals and encodes their fingerprints in a database for matching. The detection system compares fingerprints of the client's audio data to the content fingerprints to identify which broadcast station broadcast the signal containing the sampled content. This information enables the client to resume the experience of the broadcast from one of a number of possible media sources.
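The station-identification step can be sketched as a fingerprint lookup. This is an assumed, simplified scheme: real systems match robust audio fingerprints, not plain integer codes.

```python
def identify_station(sample_fps, station_db):
    """station_db maps a broadcast station to the set of content
    fingerprints the monitoring system has encoded for it. Return the
    station whose fingerprints best overlap the client's sample,
    or None if nothing matches."""
    sample = set(sample_fps)
    best, best_score = None, 0
    for station, fps in station_db.items():
        score = len(sample & fps)
        if score > best_score:
            best, best_score = station, score
    return best
```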
Abstract:
The present invention relates to the continuous monitoring of an audio signal and identification of audio items within an audio signal. The technology disclosed utilizes predictive caching of fingerprints to improve efficiency. Fingerprints are cached for tracking an audio signal with known alignment and for watching an audio signal without known alignment, based on already identified fingerprints extracted from the audio signal. Software running on a smart phone or other battery-powered device cooperates with software running on an audio identification server.
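The two caching roles described above can be sketched as follows. The catalog layout, window size, and function name are assumptions made for illustration.

```python
def prefetch_fingerprints(identified_item, position, catalog, window=3):
    """After an audio item has been identified at `position` in the
    stream, cache the fingerprints expected next for tracking it (known
    alignment), plus opening fingerprints of other items for watching
    the signal (no known alignment). `catalog` maps each item to an
    ordered list of fingerprint codes."""
    return {
        "tracking": catalog[identified_item][position:position + window],
        "watching": {item: fps[:1] for item, fps in catalog.items()
                     if item != identified_item},
    }
```

On a battery-powered device, holding only this small cache locally lets most comparisons run on the phone, deferring to the audio identification server when the cache misses.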
Abstract:
A method of controlling an engagement state of an agent during a human-machine dialog is provided. The method can include: receiving a spoken request that is a conditional locking request, wherein the conditional locking request uses a natural language expression to explicitly specify a locking condition, which is a predicate; storing the predicate in a format that can be evaluated when needed by the agent; entering a conditionally locked state in response to the conditional locking request; in the conditionally locked state, receiving a multiplicity of requests without a need for a wakeup indicator; and, for a request from the multiplicity of requests, evaluating the predicate upon receiving the request and processing the request if the predicate is true.
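The state machine above can be sketched minimally. The class and method names are illustrative, and the predicate is represented as a callable over a request context rather than a stored natural-language form.

```python
class Agent:
    def __init__(self):
        self.locking_predicate = None  # None = normal (wakeup indicator required)

    def conditional_lock(self, predicate):
        """Store the predicate extracted from the conditional locking
        request and enter the conditionally locked state."""
        self.locking_predicate = predicate

    def handle(self, request, context):
        """In the conditionally locked state no wakeup indicator is
        needed: the stored predicate is evaluated for each incoming
        request, which is processed only when the predicate is true."""
        if self.locking_predicate is None:
            return None  # normal state: a wakeup indicator would be required
        if self.locking_predicate(context):
            return f"processed: {request}"
        return None  # predicate false: the request is not processed
```

For example, "only listen to Alice" would be stored as a predicate testing the speaker field of the request context.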
Abstract:
A method is described that includes processing text and speech from an input utterance using local overrides of default dictionary pronunciations. Applying this method, the word-level grammar used to process the utterance's tokens specifies at least one local word phonetic variant that applies within a specific production rule; within the local context of that production rule, the local word phonetic variant overrides one or more default dictionary phonetic versions of the word. This method can be applied to parsing utterances in which the pronunciation of some words depends on their syntactic or semantic context.
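The override lookup can be sketched as follows. The ARPAbet-style pronunciations and the rule and dictionary names are illustrative assumptions; the point is only the precedence of a rule-local variant over the dictionary.

```python
# Default dictionary: "read" is ambiguous between present and past tense.
DEFAULT_DICTIONARY = {
    "read": ["R IY D", "R EH D"],
    "the": ["DH AH", "DH IY"],
}

# Hypothetical word-level grammar: a production rule may carry local
# phonetic variants that apply only within that rule's context.
GRAMMAR_RULES = {
    "PAST_TENSE_VP": {"read": ["R EH D"]},
}

def pronunciations(word, rule):
    """Within the local context of `rule`, a local phonetic variant
    overrides the default dictionary entries; elsewhere the default
    dictionary pronunciations apply."""
    local = GRAMMAR_RULES.get(rule, {})
    if word in local:
        return local[word]
    return DEFAULT_DICTIONARY.get(word, [])
```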
Abstract:
A method is provided for advertisement selection. The method includes recognizing words from a user's speech over a large number of interactions, computing the number of unique words uttered during the interactions, classifying the user by that number, and selecting an advertisement targeted to users of that classification.
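The classification step can be sketched directly. The bucket names and threshold values are assumptions for illustration only.

```python
def classify_by_vocabulary(transcripts, thresholds=(50, 200)):
    """Compute the number of unique words across the user's recognized
    interactions and bucket the user by that count."""
    unique_words = set()
    for transcript in transcripts:
        unique_words.update(transcript.lower().split())
    count = len(unique_words)
    low, high = thresholds
    if count < low:
        return "small-vocabulary"
    if count < high:
        return "medium-vocabulary"
    return "large-vocabulary"
```

An advertisement inventory could then be keyed by these classes, so that selection is a lookup on the user's class.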
Abstract:
The technology disclosed relates to computer-implemented conversational agents, and particularly to detecting a point in the dialog (end of turn, or end of utterance) at which the agent can start responding to the user. The technology disclosed provides a method of incrementally parsing an input utterance with multiple parses operating in parallel. It includes detecting an interjection point in the input utterance when a pause exceeds a high threshold, or when a pause exceeds a low threshold and at least one of the parallel parses is determined to be interruptible by matching a complete sentence according to the grammar. The conversational agent starts responding to the user at a detected interjection point.
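The two-threshold decision can be sketched as a small predicate. The threshold values are illustrative assumptions, as is representing each parallel parse by a boolean interruptibility flag.

```python
def is_interjection_point(pause_ms, parses, low_ms=300, high_ms=1200):
    """Interject when the pause exceeds the high threshold, or when it
    exceeds the low threshold and at least one parallel parse is
    interruptible, i.e. has matched a complete sentence in the grammar.
    `parses` is a list of booleans (True = interruptible)."""
    if pause_ms > high_ms:
        return True
    return pause_ms > low_ms and any(parses)
```

The low threshold lets the agent respond quickly when a parse already forms a complete sentence, while the high threshold guarantees a response even when no parse is complete.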
Abstract:
Systems and methods are provided for presenting relevant information in response to natural language expressions. The expressions may be part of a spoken conversation between people who are either together or in separate locations. The information may be provided visually. Whether a piece of information is relevant enough to display can be conditioned on a model of the speaker's interests. The interest model can be based on a history of the speaker's expressions and on information from a user profile. The display of information can also be conditioned on the current conversation topic and on whether the same information has been displayed recently.
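The display-gating conditions can be combined in a single check, sketched below. The interest threshold, cooldown period, and data layout are all assumptions for illustration.

```python
def should_display(item, interest_model, current_topic, last_shown, now_s,
                   min_interest=0.5, cooldown_s=300):
    """Display only if the item matches the current conversation topic,
    the speaker's modeled interest in that topic is high enough, and the
    same item has not been displayed recently. `last_shown` maps item id
    to the time (seconds) it was last displayed."""
    if item["topic"] != current_topic:
        return False
    if interest_model.get(item["topic"], 0.0) < min_interest:
        return False
    shown_at = last_shown.get(item["id"])
    if shown_at is not None and now_s - shown_at < cooldown_s:
        return False
    return True
```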
Abstract:
A server receives a user audio stream, the stream comprising multiple utterances. A query-processing module of the server continuously listens to and processes the utterances. The processing includes parsing successive utterances and recognizing corresponding queries, taking appropriate actions while the utterances are being received. In some embodiments, a query may be parsed and executed before the previous query's execution is complete.
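The overlapped execution described above can be sketched with one worker thread per recognized query, so a query may run before the previous query's execution completes. The function name is an assumption, and `str.strip` stands in for real parsing and recognition.

```python
import threading

def process_stream(utterances, execute):
    """Parse successive utterances as they arrive and launch each
    recognized query on its own worker thread; results are collected
    in arrival order once all queries finish."""
    results = [None] * len(utterances)
    threads = []
    for i, utterance in enumerate(utterances):
        query = utterance.strip()  # stand-in for parsing the utterance

        def run(i=i, query=query):
            results[i] = execute(query)

        t = threading.Thread(target=run)
        t.start()  # execution may overlap with handling of later utterances
        threads.append(t)
    for t in threads:
        t.join()
    return results
```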