Abstract:
A system includes acquisition of a domain grammar, determination of an interpolated grammar based on the domain grammar and a base grammar, determination of a delta domain grammar based on an augmented first grammar and the interpolated grammar, determination of an out-of-vocabulary class based on the domain grammar and the base grammar, insertion of the out-of-vocabulary class into a composed transducer composed of the augmented first grammar and one or more other transducers to generate an updated composed transducer, composition of the delta domain grammar and the updated composed transducer, and application of the composition of the delta domain grammar and the updated composed transducer to an output of an acoustic model.
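As a rough illustration of two of the steps named above (the interpolated grammar and the out-of-vocabulary class), the sketch below treats each grammar as a plain word-to-probability table. That representation, the function names, and the interpolation weight are assumptions made for illustration only; the abstract does not specify them, and the transducer composition steps are not modeled here.

```python
from typing import Dict, Set

Grammar = Dict[str, float]  # hypothetical stand-in: word -> probability


def interpolate(base: Grammar, domain: Grammar, weight: float = 0.3) -> Grammar:
    """Linearly interpolate two word distributions.

    `weight` is the share given to the domain grammar (assumed value).
    """
    vocab = base.keys() | domain.keys()
    return {
        w: (1.0 - weight) * base.get(w, 0.0) + weight * domain.get(w, 0.0)
        for w in vocab
    }


def oov_class(base: Grammar, domain: Grammar) -> Set[str]:
    """Words the domain grammar knows but the base grammar does not."""
    return set(domain) - set(base)


base = {"play": 0.5, "music": 0.5}
domain = {"play": 0.5, "zydeco": 0.5}
mixed = interpolate(base, domain, weight=0.5)
new_words = oov_class(base, domain)
```

If both inputs are proper distributions, the interpolated table also sums to one, which is why linear interpolation is a common way to blend a general model with a domain-specific one.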
Abstract:
An incremental speech recognition system incrementally decodes a spoken utterance, using an additional utterance decoder only when that decoder is likely to add significant benefit to the combined result. The available utterance decoders are ordered in a series based on accuracy, performance, diversity, and other factors. A recognition management engine coordinates decoding of the spoken utterance by the series of utterance decoders, combines the decoded utterances, and determines whether additional processing is likely to significantly improve the recognition result. If so, the recognition management engine engages the next utterance decoder and the cycle continues. If the accuracy cannot be significantly improved, the result is accepted and decoding stops. Accordingly, a decoded utterance with accuracy approaching the maximum for the series is obtained without decoding the spoken utterance using all utterance decoders in the series, thereby minimizing resource usage.
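The control loop described above can be sketched as follows. The majority-vote combiner, the agreement threshold used as a stand-in for "unlikely to improve further", and all names are assumptions for illustration; the abstract does not say how results are combined or how the stopping decision is made.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Decoder = Callable[[bytes], str]  # hypothetical decoder interface


def combine(hypotheses: Sequence[str]) -> Tuple[str, float]:
    """Majority vote; agreement is the share of decoders backing the winner."""
    best, votes = Counter(hypotheses).most_common(1)[0]
    return best, votes / len(hypotheses)


def incremental_decode(
    audio: bytes,
    decoders: Sequence[Decoder],
    min_decoders: int = 2,
    agreement_threshold: float = 0.75,
) -> Tuple[str, int]:
    """Run decoders in order and stop once the combined result looks stable.

    Returns the accepted text and the number of decoders actually used.
    """
    hypotheses: List[str] = []
    result = ""
    for decoder in decoders:
        hypotheses.append(decoder(audio))
        result, agreement = combine(hypotheses)
        # Assumed stopping rule: high agreement means another decoder
        # is unlikely to change the combined result.
        if len(hypotheses) >= min_decoders and agreement >= agreement_threshold:
            break
    return result, len(hypotheses)


agreeing = [lambda a: "turn on the light"] * 2 + [lambda a: "turn on the night"]
text, used = incremental_decode(b"", agreeing)
```

In this example the first two decoders agree, so the third is never invoked, which mirrors the resource saving the abstract describes.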
Abstract:
Techniques and technologies for diagnosing speech recognition errors are described. In an example implementation, a system for diagnosing speech recognition errors may include an error detection module configured to determine that a speech recognition result is at least partially erroneous, and a recognition error diagnostics module. The recognition error diagnostics module may be configured to (a) perform a first error analysis of the at least partially erroneous speech recognition result to provide a first error analysis result; (b) perform a second error analysis of the at least partially erroneous speech recognition result to provide a second error analysis result; and (c) determine at least one category of recognition error associated with the at least partially erroneous speech recognition result based on a combination of the first error analysis result and the second error analysis result.
Abstract:
Systems and methods are provided for acquiring training data and building an organizational-based language model based on the training data. Organizational data is generated via one or more applications associated with an organization. The collected organizational data is aggregated and filtered into training data, which is then used to train an organizational-based language model for speech processing.
Abstract:
A system includes acquisition of meeting data associated with a meeting, determination of a plurality of meeting participants based on the acquired meeting data, acquisition of e-mail data associated with each of the plurality of meeting participants, generation of a meeting language model based on the acquired e-mail data and the meeting data, and transcription of audio associated with the meeting based on the meeting language model.
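One way to picture the meeting language model is as word statistics pooled from the meeting data and from each participant's e-mail. The sketch below builds an add-one-smoothed unigram model; the unigram form, the smoothing, the tokenizer, and every name are illustrative assumptions, since the abstract does not describe the model type.

```python
import re
from collections import Counter
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def build_meeting_lm(
    meeting_text: str,
    participants: List[str],
    emails_by_participant: Dict[str, List[str]],
) -> Callable[[str], float]:
    """Return a function giving a smoothed unigram probability for a word."""
    counts = Counter(tokenize(meeting_text))
    # Only e-mail from people determined to be meeting participants is used.
    for person in participants:
        for message in emails_by_participant.get(person, []):
            counts.update(tokenize(message))
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def prob(word: str) -> float:
        return (counts[word.lower()] + 1) / (total + vocab_size)

    return prob


lm = build_meeting_lm(
    "Quarterly roadmap review",
    ["ana", "bo"],
    {
        "ana": ["Roadmap draft attached"],
        "bo": ["roadmap looks good"],
        "cy": ["kubernetes upgrade"],  # not a participant, so ignored
    },
)
```

Words that recur across the participants' e-mail (here "roadmap") receive more probability mass than words seen once or never, which is the effect a transcriber would exploit.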
Abstract:
A method for eyes-off training of a dictation system includes translating an audio signal featuring speech audio of a speaker into an initial recognized text using a previously-trained general language model. The initial recognized text is provided to the speaker for error correction. The audio signal is re-translated into an updated recognized text using a specialized language model biased to recognize words included in the corrected text. The general language model is retrained in an “eyes-off” manner, based on the audio signal and the updated recognized text.
Abstract:
Disclosed in various examples are methods, systems, and machine-readable mediums for providing improved computer implemented speech recognition by detecting and correcting speech recognition errors during a speech session. The system recognizes repeated speech commands from a user in a speech session that are similar or identical to each other. To correct these repeated errors, the system creates a customized language model that is then utilized by the language modeler to produce a refined prediction of the meaning of the repeated speech commands. The custom language model may comprise clusters of similar past predictions of speech commands from the speech session of the user.
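The detection of repeated, near-identical commands within a session can be sketched with a simple string-similarity clustering. The use of `difflib.SequenceMatcher`, the 0.8 threshold, and the function names are assumptions chosen for illustration; the abstract does not state how similarity is measured or how the clusters feed the custom language model.

```python
from difflib import SequenceMatcher
from typing import List


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Character-level similarity check between two recognized commands."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_session(predictions: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedily group predictions that resemble a cluster's first member."""
    clusters: List[List[str]] = []
    for text in predictions:
        for cluster in clusters:
            if similar(text, cluster[0], threshold):
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def repeated_clusters(
    predictions: List[str], min_repeats: int = 2, threshold: float = 0.8
) -> List[List[str]]:
    """Clusters large enough to suggest the user is repeating a command."""
    return [c for c in cluster_session(predictions, threshold) if len(c) >= min_repeats]


session = ["call tom", "call mom", "play jazz"]
suspects = repeated_clusters(session)
```

A cluster with two or more members is treated here as a hint that the earlier prediction was wrong and the user tried again; a real system would likely also consider timing and user feedback.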
Abstract:
A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus. The computer system may derive feature vectors from the unspeakable corpus and train a classifier to perform discriminative data selection for language modeling based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
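The filtering step that produces the unspeakable corpus can be sketched as below. Bag-of-words vectors, cosine similarity, and the 0.5 threshold are assumptions for illustration only; the abstract does not name the features or the similarity measure, and the later classifier training is not shown.

```python
import math
from collections import Counter
from typing import List


def featurize(text: str) -> Counter:
    """Hypothetical feature vector: lowercase bag of words."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def unspeakable_corpus(
    spoken: List[str], typed: List[str], threshold: float = 0.5
) -> List[str]:
    """Keep typed items that are not close to any spoken-corpus vector."""
    spoken_vecs = [featurize(s) for s in spoken]
    kept = []
    for item in typed:
        vec = featurize(item)
        if all(cosine(vec, s) < threshold for s in spoken_vecs):
            kept.append(item)
    return kept


spoken = ["what is the weather today"]
typed = ["what is the weather tomorrow", "SELECT * FROM users WHERE id=7"]
unspeakable = unspeakable_corpus(spoken, typed)
```

The typed query that resembles speech is dropped, while the SQL fragment survives as an example of text nobody would say aloud.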
Abstract:
Solutions for custom display post processing (DPP) in speech recognition (SR) use a customized multi-stage DPP pipeline that transforms a stream of SR tokens from lexical form to display form. A first transformation stage of the DPP pipeline passes the stream of tokens, in turn, through an upstream filter, a base model stage, and a downstream filter, and transforms a first aspect of the stream of tokens (e.g., disfluency, inverse text normalization (ITN), capitalization, etc.) from lexical form into display form. The upstream filter and/or the downstream filter alter the stream of tokens to change the default behavior of the DPP pipeline into custom behavior. Additional transformation stages of the DPP pipeline perform further transforms, allowing final text to be output in a display format that is customized for a specific user. This permits each user to efficiently leverage a common baseline DPP pipeline to produce a custom output.
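The stage structure described above (upstream filter, base model, downstream filter, repeated per stage) can be sketched as a small function pipeline. The `Stage` class, the toy ITN and capitalization models, and the two custom filters are all hypothetical names and behaviors invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[str]
Transform = Callable[[Tokens], Tokens]


def identity(tokens: Tokens) -> Tokens:
    return list(tokens)


@dataclass
class Stage:
    """One transformation stage: upstream filter, base model, downstream filter."""
    base_model: Transform
    upstream: Transform = identity
    downstream: Transform = identity

    def __call__(self, tokens: Tokens) -> Tokens:
        return self.downstream(self.base_model(self.upstream(tokens)))


def run_pipeline(stages: List[Stage], tokens: Tokens) -> Tokens:
    for stage in stages:
        tokens = stage(tokens)
    return tokens


# Shared baseline models (toy versions).
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}

def base_itn(tokens: Tokens) -> Tokens:
    return [NUMBER_WORDS.get(t, t) for t in tokens]

def base_capitalize(tokens: Tokens) -> Tokens:
    return [tokens[0].capitalize()] + tokens[1:] if tokens else []

# User-specific filters that customize the default behavior.
def drop_fillers(tokens: Tokens) -> Tokens:
    return [t for t in tokens if t not in {"um", "uh"}]

def brand_casing(tokens: Tokens) -> Tokens:
    return ["iPhone" if t.lower() == "iphone" else t for t in tokens]


stages = [
    Stage(base_itn, upstream=drop_fillers),
    Stage(base_capitalize, downstream=brand_casing),
]
display = " ".join(run_pipeline(stages, "um buy two iphone cases".split()))
```

Because the filters wrap the base models rather than replace them, different users could share `base_itn` and `base_capitalize` and differ only in the filters they attach.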