Abstract:
Representative embodiments disclose mechanisms to separate and recognize multiple audio sources (e.g., picking out individual speakers) in an environment where they overlap and interfere with each other. The architecture uses a microphone array to spatially separate out the audio signals. The spatially filtered signals are then input into a plurality of separators, so each signal is input into a corresponding signal. The separators use neural networks to separate out audio sources. The separators typically produce multiple output signals for the single input signals. A post selection processor then assesses the separator outputs to pick the signals with the highest quality output. These signals can be used in a variety of systems such as speech recognition, meeting transcription and enhancement, hearing aids, music information retrieval, speech enhancement and so forth.
Abstract:
Disclosed herein are systems, methods, and devices for optimally performing object identification employing a neural network (NN), for example a convolutional neural network (CNN). The aspects disclosed herein employ audio data captured by one or more microphones in to at least identify an object, or augment image capturing to perform the same. The audio data and the image data are each propagated to the NN, to perform object identification.
Abstract:
Described herein are systems and methods for providing hierarchical state tracking in a spoken dialogue system. A sequence of turns is received by a spoken dialogue system. Each turn includes a user utterance and a machine act. At each turn, a value pointer and a turn pointer are provided for that turn. The value pointer represents a probability distribution over the one or more words in the user utterance that indicates whether each word in the user utterance is a slot value for a slot. The turn pointer identifies which turn in a set of turns includes a currently-relevant slot value for the slot, where the set of turns includes a current turn for which the turn point is being provided, and all turns that precede the current turn.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for detecting voice activity. In one aspect, a method include actions of receiving, by a neural network included in an automated voice activity detection system, a raw audio waveform, processing, by the neural network, the raw audio waveform to determine whether the audio waveform includes speech, and provide, by the neural network, a classification of the raw audio waveform indicating whether the raw audio waveform includes speech.
Abstract:
According to some aspects, a method of classifying speech recognition results is provided, using a neural network comprising a plurality of interconnected network units, each network unit having one or more weight values, the method comprising using at least one computer, performing acts of providing a first vector as input to a first network layer comprising one or more network units of the neural network, transforming, by a first network unit of the one or more network units, the input vector to produce a plurality of values, the transformation being based at least in part on a plurality of weight values of the first network unit, sorting the plurality of values to produce a sorted plurality of values, and providing the sorted plurality of values as input to a second network layer of the neural network.
Abstract:
The task of relevance score assignment to a set of items onto which an artificial neural network is applied is obtained by redistributing an initial relevance score derived from the network output, onto the set of items by reversely propagating the initial relevance score through the artificial neural network so as to obtain a relevance score for each item. In particular, this reverse propagation is applicable to a broader set of artificial neural networks and/or at lower computational efforts by performing same in a manner so that for each neuron, preliminarily redistributed relevance scores of a set of downstream neighbor neurons of the respective neuron are distributed on a set of upstream neighbor neurons of the respective neuron according to a distribution function.
Abstract:
A wearable device for binaural audio is described. The wearable device includes a feedback mechanism, a microphone, an always on binaural recorder (AOBR), and a processor. The AOBR is to capture ambient noise via the microphone and interpret the ambient noise. An alert is issued by the processor to the feedback mechanism based on a notification detected via the microphone in the ambient noise.