ENHANCED MULTI-CHANNEL ACOUSTIC MODELS

    公开(公告)号:US20210295859A1

    公开(公告)日:2021-09-23

    申请号:US17303822

    申请日:2021-06-08

    Applicant: Google LLC

    Abstract: This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.

    Deliberation Model-Based Two-Pass End-To-End Speech Recognition

    公开(公告)号:US20210225369A1

    公开(公告)日:2021-07-22

    申请号:US17149018

    申请日:2021-01-14

    Applicant: Google LLC

    Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector. The method also includes decoding the first context vector and the second context vector at a context vector decoder to form a second-pass hypothesis

    Compressed recurrent neural network models

    公开(公告)号:US10515307B2

    公开(公告)日:2019-12-24

    申请号:US15172457

    申请日:2016-06-03

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for implementing long-short term memory layers with compressed gating functions. One of the systems includes a first long short-term memory (LSTM) layer, wherein the first LSTM layer is configured to, for each of the plurality of time steps, generate a new layer state and a new layer output by applying a plurality of gates to a current layer input, a current layer state, and a current layer output, each of the plurality of gates being configured to, for each of the plurality of time steps, generate a respective intermediate gate output vector by multiplying a gate input vector and a gate parameter matrix. The gate parameter matrix for at least one of the plurality of gates is a structured matrix or is defined by a compressed parameter matrix and a projection matrix.

    Fusion of acoustic and text representations in RNN-T

    公开(公告)号:US12211509B2

    公开(公告)日:2025-01-28

    申请号:US17821160

    申请日:2022-08-19

    Applicant: Google LLC

    Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.

    Text Injection For Training Auxiliary Tasks In Speech Recognition Models

    公开(公告)号:US20240296840A1

    公开(公告)日:2024-09-05

    申请号:US18592590

    申请日:2024-03-01

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/02 G10L15/063

    Abstract: A joint auxiliary task and ASR model includes an encoder to receive a sequence of acoustic frames and generate, at each of a plurality of output steps, a higher-order feature representation for a corresponding acoustic frame. The model also includes a multi-output HAT decoder to generate at each of the plurality of output steps a probability distribution over possible speech recognition hypotheses, and an indication of whether the output step corresponds to an auxiliary token associated with a particular auxiliary task. The model is trained by a JEIT training process based on: a paired training data set including paired audio data and transcriptions, the transcriptions annotated with ground-truth auxiliary tokens associated with the particular auxiliary task; and an unpaired training data set including textual utterances not paired with any corresponding audio data, the textual utterances annotated with the ground-truth auxiliary tokens associated with the particular auxiliary task.

    Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

    公开(公告)号:US20240290323A1

    公开(公告)日:2024-08-29

    申请号:US18660655

    申请日:2024-05-10

    Applicant: Google LLC

    CPC classification number: G10L15/063 G06N3/02 G10L15/16 G10L15/197 G10L15/22

    Abstract: A method of training a language model for rare-word speech recognition includes obtaining a set of training text samples, and obtaining a set of training utterances used for training a speech recognition model. Each training utterance in the plurality of training utterances includes audio data corresponding to an utterance and a corresponding transcription of the utterance. The method also includes applying rare word filtering on the set of training text samples to identify a subset of rare-word training text samples that include words that do not appear in the transcriptions from the set of training utterances or appear in the transcriptions from the set of training utterances less than a threshold number of times. The method further includes training the external language model on the transcriptions from the set of training utterances and the identified subset of rare-word training text samples.

Patent Agency Ranking