Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for assigning saliency weights to words of an ASR model. The saliency values assigned to words within an ASR model are based on human perception judgments of previous transcripts. These saliency values are applied as weights to modify an ASR model such that the results of the weighted ASR model in converting a spoken document to a transcript provide a more accurate and useful transcription to the user.
Abstract:
A low power sound recognition sensor is configured to receive an analog signal that may contain a signature sound. Sound parameter information is extracted from the analog signal and compared to a sound parameter reference stored locally with the sound recognition sensor to detect when the signature sound is received in the analog signal. A trigger signal is generated when a signature sound is detected. A portion of the extracted sound parameter information is sent to a remote training location for adaptive training when a signature sound detection error occurs. An updated sound parameter reference from the remote training location is received in response to the adaptive training.
Abstract:
A system and method is presented for performing dual mode speech recognition, employing a local recognition module on a mobile device and a remote recognition engine on a server device. The system accepts a spoken query from a user, and both the local recognition module and the remote recognition engine perform speech recognition operations on the query, returning a transcription and confidence score, subject to a latency cutoff time. If both sources successfully transcribe the query, then the system accepts the result having the higher confidence score. If only one source succeeds, then that result is accepted. In either case, if the remote recognition engine does succeed in transcribing the query, then a client vocabulary is updated if the remote system result includes information not present in the client vocabulary.
Abstract:
A mobile terminal including a wireless communication unit configured to provide wireless communication, a display, and a controller configured to, activate a mode for voice recognition in response to a touch input to a soft button displayed on the display or to a hard button on the mobile terminal, receive a first voice input associated with a phone call relating operation of the mobile terminal, display an indicator on the display indicating the voice input is being recognized by the mobile terminal, analyze the context of a voice command in the voice input, execute the call relating operation only if there is a single contact in a phonebook that matches the voice command in the first voice input, if there is no single contact that matches the voice command of the received voice input, display a plurality of candidates that is analyzed based on the voice command, receive a second input according to a plurality of candidates, and execute the call relating operation based on the second input.
Abstract:
Systems and methods of script identification in audio data obtained from audio data. The audio data is segmented into a plurality of utterances. A script model representative of a script text is obtained. The plurality of utterances are decoded with the script model. A determination is made if the script text occurred in the audio data.
Abstract:
Systems and methods of automated adaptation of a language model for transcription of audio data include obtaining audio data. The audio data is transcribed with a language model to produce a plurality of audio file transcriptions. A quality of the plurality of audio file transcriptions is evaluated. At least one best transcription from a plurality of audio file transcriptions is selected based upon the evaluated quality. Statistics are calculated from the selected at least one best transcription from the plurality of audio file transcriptions. The language model is modified from the calculated statistics.
Abstract:
A mobile terminal including an input unit configured to receive an input to activate a voice recognition function on the mobile terminal and a memory configured to store multiple domains related to menus and operations of the mobile terminal. It further includes a controller configured to access a specific domain among the multiple domains included in the memory based on the received input to activate the voice recognition function, to recognize user speech based on a language model and an acoustic model of the accessed domain, and to determine at least one menu and operation of the mobile terminal based on the accessed specific domain and the recognized user speech.
Abstract:
A method for configuring a speech recognition system comprises obtaining a speech sample utilised by a voice authentication system in a voice authentication process. The speech sample is processed to generate acoustic models for units of speech associated with the speech sample. The acoustic models are stored for subsequent use by the speech recognition system as part of a speech recognition process.
Abstract:
A voice command definition file (VCDF) declaratively defines voice commands for an application. For example, the VCDF may include definitions for: voice commands; one or more phrases/utterances that may be said to execute each of the commands; a navigation location to navigate to within the application (e.g. a page); phrase lists containing items that may be used as a parameter in a voice command; examples; feedback; and the like. A user may say a single utterance to launch the application, navigate to the associated location of the command and execute the command. The VCDF may define multiple ways to listen for a particular command. The VCDF may be edited/defined by a user and may include a user friendly name for an application. A speech engine loads the VCDF for use such that it may recognize the commands associated with an application. The definitions may be updated during runtime.
Abstract:
A clausifier and method of extracting clauses for spoken language understanding are disclosed. The method relates to generating a set of clauses from speech utterance text and comprises inserting at least one boundary tag in speech utterance text related to sentence boundaries, inserting at least one edit tag indicating a portion of the speech utterance text to remove, and inserting at least one conjunction tag within the speech utterance text. The result is a set of clauses that may be identified within the speech utterance text according to the inserted at least one boundary tag, at least one edit tag and at least one conjunction tag. The disclosed clausifier comprises a sentence boundary classifier, an edit detector classifier, and a conjunction detector classifier. The clausifier may comprise a single classifier or a plurality of classifiers to perform the steps of identifying sentence boundaries, editing text, and identifying conjunctions within the text.