Abstract:
Systems and methods for locating the end of a keyword in voice sensing are provided. An example method includes receiving an acoustic signal that includes a keyword portion immediately followed by a query portion. The acoustic signal represents at least one captured sound. The method further includes determining the end of the keyword portion. The method further includes separating, using the end of the keyword portion, the query portion from the keyword portion of the acoustic signal. The method further includes providing the query portion, absent any part of the keyword portion, to an automatic speech recognition (ASR) system.
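The separation step can be sketched as a slice at the detected keyword endpoint. This is a minimal illustration, not the patented method; the function name, sample rate, and 0.75 s endpoint are assumptions for the example:

```python
import numpy as np

def separate_query(signal: np.ndarray, sample_rate: int, keyword_end_s: float) -> np.ndarray:
    """Drop everything up to the detected end of the keyword so that
    only the query portion is forwarded to the ASR system."""
    end_idx = int(keyword_end_s * sample_rate)
    return signal[end_idx:]

# 2 s of captured audio at 16 kHz; keyword assumed to end at 0.75 s
signal = np.zeros(2 * 16000, dtype=np.float32)
query = separate_query(signal, 16000, 0.75)  # 20000 samples remain
```

In a real system the endpoint would come from a keyword-spotting model rather than being supplied as a constant.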
Abstract:
A system can be configured to perform tasks such as converting recorded speech to a sequence of phonemes that represent the speech, converting an input sequence of graphemes into a target sequence of phonemes, translating an input sequence of words in one language into a corresponding sequence of words in another language, or predicting a target sequence of words that follow an input sequence of words in a language (e.g., a language model). In a speech recognizer, the recurrent neural network (RNN) system may be used to convert speech to a target sequence of phonemes in real time so that a transcription of the speech can be generated and presented to a user, even before the user has completed uttering the entire speech input.
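The streaming behavior can be illustrated with a toy single-layer RNN that emits one phoneme per input frame instead of waiting for the whole utterance; the weights, frame shapes, and phoneme vocabulary below are made up for the example:

```python
import numpy as np

def stream_phonemes(frames, Wx, Wh, Wo, phonemes):
    """Toy single-layer RNN: update the hidden state per frame and emit
    the best-scoring phoneme immediately, enabling real-time output."""
    h = np.zeros(Wh.shape[0])
    emitted = []
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)   # recurrent state update
        emitted.append(phonemes[int(np.argmax(Wo @ h))])
    return emitted
```

A production recognizer would use trained weights and a proper decoder, but the per-frame loop is what makes incremental transcription possible.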
Abstract:
A computer-implemented technique is described herein for detecting actionable items in speech. In one manner of operation, the technique entails: receiving utterance information that expresses at least one utterance made by one participant of a conversation to at least one other participant of the conversation; converting the utterance information into recognized speech information; using a machine-trained model to recognize at least one actionable item associated with the recognized speech information; and performing at least one computer-implemented action associated with the actionable item(s). The machine-trained model may correspond to a deep-structured convolutional neural network. In some implementations, the technique produces the machine-trained model using a source environment corpus that is not optimally suited for a target environment in which the model is intended to be applied. The technique further provides various adaptation techniques for adapting a source-environment model so that it better suits the target environment.
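The pipeline (recognize speech, classify it, run the mapped action) can be sketched as follows. The keyword-matching classifier is only a stand-in for the machine-trained model described above, and all names here are illustrative:

```python
def detect_actionable_item(recognized_text, classify, actions):
    """Apply a classifier (a stand-in for the machine-trained model) to
    recognized speech and perform the computer-implemented action that
    the predicted label maps to, if any."""
    label = classify(recognized_text)
    action = actions.get(label)
    return action(recognized_text) if action else None

# stand-in classifier; a real system would use the trained model
def classify(text):
    return "schedule_meeting" if "meeting" in text else "no_action"

actions = {"schedule_meeting": lambda t: ("calendar_event", t)}
result = detect_actionable_item("let's have a meeting on Friday", classify, actions)
# result == ("calendar_event", "let's have a meeting on Friday")
```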
Abstract:
A mechanism for compiling a generative description of an inference task into a neural network. First, an arbitrary generative probabilistic model from the exponential family is specified (or received). The model characterizes a conditional probability distribution for measurement data given a set of latent variables. A factor graph is generated for the generative probabilistic model. Each factor node of the factor graph is expanded into a corresponding sequence of arithmetic operations, based on a specified inference task and a kind of message passing algorithm. The factor graph and the sequences of arithmetic operations specify the structure of a neural network for performance of the inference task. A learning algorithm is executed to determine values of parameters of the neural network. The neural network is then ready for performing inference on operational measurements.
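For a chain-structured factor graph under sum-product message passing, the expansion of each factor node amounts to one matrix product, and composing these operations gives a feed-forward structure. A minimal sketch, where the chain topology and the final normalization are assumptions for the example:

```python
import numpy as np

def compile_chain(potentials):
    """Compile a chain factor graph into a sequence of arithmetic ops:
    each pairwise factor becomes one matrix product (sum-product
    marginalization); composing the ops yields a feed-forward network."""
    ops = [lambda m, P=P: P.T @ m for P in potentials]  # one "layer" per factor

    def network(message):
        for op in ops:
            message = op(message)
        return message / message.sum()  # marginal at the last variable

    return network
```

In the described mechanism, the factor potentials would be parameters learned by the subsequent learning algorithm rather than fixed matrices.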
Abstract:
Systems and methods for contactless speech recognition using lip-reading are provided. In various aspects, a speech recognition unit (112) is configured to receive, via a receiver (108), a Doppler-broadened reflected electromagnetic signal that has been modulated and reflected by the lip and facial movements of a speaking subject (104) and to output recognized speech based on an analysis of the received reflected signal. In one embodiment, the functionality of speech recognition unit (112) is implemented via a preprocessing unit (202), a Neural Network ("NNet") unit (204), and a Hidden Markov Model ("HMM") unit (206).
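A sketch of what the preprocessing stage (202) might compute, assuming the Doppler-broadened energy lies in a narrow band around a known carrier; the sample rate, carrier frequency, and bandwidth are illustrative:

```python
import numpy as np

def doppler_band_features(reflected, sample_rate, carrier_hz, band_hz=100.0):
    """Keep only the spectrum in a band around the carrier, where lip
    and facial motion appears as Doppler broadening; such a feature
    vector could feed the NNet (204) and HMM (206) stages."""
    spectrum = np.abs(np.fft.rfft(reflected))
    freqs = np.fft.rfftfreq(len(reflected), 1.0 / sample_rate)
    in_band = (freqs > carrier_hz - band_hz) & (freqs < carrier_hz + band_hz)
    return spectrum[in_band]
```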
Abstract:
In an automatic speech recognition (ASR) processing system, ASR processing may be configured to process speech based on multiple channels of audio received from a beamformer. The ASR processing system may include a microphone array and the beamformer to output multiple channels of audio such that each channel isolates audio in a particular direction. The multi-channel audio signals may include spoken utterances from one or more speakers as well as undesired audio, such as noise from a household appliance. The ASR device may simultaneously perform speech recognition on the multi-channel audio to provide more accurate speech recognition results.
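One beamformer channel can be sketched with delay-and-sum steering; the circular shift via np.roll and the integer-sample delays are simplifications, since a real beamformer would use fractional delays and proper boundary handling:

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Steer the array toward one direction: delay each microphone's
    signal so arrivals from that direction align, then average. One
    such output per steering direction forms the multi-channel input
    on which ASR runs in parallel."""
    out = np.zeros(len(mic_signals[0]))
    for sig, d in zip(mic_signals, delays):
        out += np.roll(sig, d)  # toy circular shift in whole samples
    return out / len(mic_signals)
```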