Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models

    Publication Number: US20230237993A1

    Publication Date: 2023-07-27

    Application Number: US18011571

    Filing Date: 2021-10-01

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/32 G10L15/22

    Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label, as well as a difference between the streaming prediction data and the contextual prediction data. The operations can include adjusting parameters of the speech recognition model.
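The training objective described in this abstract can be sketched as a combined loss: the contextual (full-context) mode is scored against the ground truth only, while the streaming mode is scored against both the ground truth and the contextual mode's prediction. The sketch below is a minimal plain-Python illustration under the assumption that the second comparison is a KL-style distillation term; the function names and the exact weighting are illustrative, not the claimed implementation:

```python
import math

def cross_entropy(pred, target_idx):
    # Negative log-likelihood of the target class under a probability vector.
    return -math.log(pred[target_idx])

def kl_divergence(p, q):
    # KL(p || q) between two probability vectors.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dual_mode_loss(contextual_pred, streaming_pred, target_idx, distill_weight=0.5):
    # Contextual (full-context) mode is evaluated against the ground truth only.
    contextual_loss = cross_entropy(contextual_pred, target_idx)
    # Streaming mode is evaluated against the ground truth AND pulled toward
    # the contextual prediction (a form of in-place distillation).
    streaming_loss = (cross_entropy(streaming_pred, target_idx)
                      + distill_weight * kl_divergence(contextual_pred, streaming_pred))
    # Both terms would drive a single shared set of model parameters.
    return contextual_loss + streaming_loss
```

When the streaming prediction matches the contextual one, the distillation term vanishes and only the two ground-truth terms remain.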

    Lookup-Table Recurrent Language Model

    Publication Number: US20220310067A1

    Publication Date: 2022-09-29

    Application Number: US17650566

    Filing Date: 2022-02-10

    Applicant: Google LLC

    Abstract: A computer-implemented method includes receiving audio data that corresponds to an utterance spoken by a user and captured by a user device. The method also includes processing the audio data to determine a candidate transcription that includes a sequence of tokens for the spoken utterance. For each token in the sequence of tokens, the method includes determining a token embedding for the corresponding token, determining an n-gram token embedding for a previous sequence of n-gram tokens, and concatenating the token embedding and the n-gram token embedding to generate a concatenated output for the corresponding token. The method also includes rescoring the candidate transcription for the spoken utterance by processing the concatenated output generated for each corresponding token in the sequence of tokens.
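The per-token feature construction described here (token embedding concatenated with an embedding looked up for the preceding n-gram) can be sketched in plain Python. The table names and the zero-vector fallback for unseen keys are illustrative assumptions; in the patent the lookups feed a recurrent rescoring model:

```python
def rescore_features(tokens, token_table, ngram_table, n=2, dim=4):
    # For each token, look up its own embedding and the embedding of the
    # preceding n-gram context, then concatenate the two vectors.
    zero = [0.0] * dim
    features = []
    for i, tok in enumerate(tokens):
        tok_emb = token_table.get(tok, zero)
        context = tuple(tokens[max(0, i - n):i])  # previous n-gram tokens
        ngram_emb = ngram_table.get(context, zero)
        features.append(tok_emb + ngram_emb)  # list concatenation = vector concat
    return features
```

Each concatenated vector would then be scored in sequence to produce a replacement language-model score for the candidate transcription.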

    Multichannel speech recognition using neural networks

    Publication Number: US11062725B2

    Publication Date: 2021-07-13

    Application Number: US16278830

    Filing Date: 2019-02-19

    Applicant: Google LLC

    Abstract: This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
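The two-stage front end described in this abstract, spatial filtering across raw channels followed by spectral filtering on frequency-domain data, can be sketched as filter-and-sum beamforming plus a DFT magnitude step. This is a minimal stdlib sketch; the filter shapes are tiny illustrative examples, and in the patent both stages are learned layers of the neural network:

```python
import cmath

def spatial_filter(ch1, ch2, w1, w2):
    # Filter-and-sum: apply a per-channel FIR filter to each raw
    # audio channel, then sum the filtered signals sample-by-sample.
    def fir(x, w):
        return [sum(w[k] * x[i - k] for k in range(len(w)) if i - k >= 0)
                for i in range(len(x))]
    y1, y2 = fir(ch1, w1), fir(ch2, w2)
    return [a + b for a, b in zip(y1, y2)]

def spectral_magnitudes(signal):
    # Naive DFT magnitudes: the frequency-domain representation of the
    # spatially filtered output that the spectral layer would process.
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n)))
            for f in range(n)]
```

Additional layers would then map these spectral features to sub-word unit predictions.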

    Large-Scale Multilingual Speech Recognition With A Streaming End-To-End Model

    Publication Number: US20200380215A1

    Publication Date: 2020-12-03

    Application Number: US16834342

    Filing Date: 2020-03-30

    Applicant: Google LLC

    Abstract: A method of transcribing speech using a multilingual end-to-end (E2E) speech recognition model includes receiving audio data for an utterance spoken in a particular native language, obtaining a language vector identifying the particular language, and processing, using the multilingual E2E speech recognition model, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance. The multilingual E2E speech recognition model includes a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language. The method also includes providing the transcription for output.
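The routing described here, a language identifier selecting language-specific adaptor modules inside a shared model, can be sketched with a standard bottleneck-adapter shape (down-project, nonlinearity, up-project, residual add). The matrix shapes and the dictionary keyed by language are illustrative assumptions, not the patented architecture:

```python
def adapter(x, down, up):
    # Bottleneck adapter: down-project, ReLU, up-project, residual add.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in down]
    out = [sum(w * hi for w, hi in zip(row, h)) for row in up]
    return [xi + oi for xi, oi in zip(x, out)]

def multilingual_forward(features, language, adapters):
    # The language vector (here reduced to a language key) routes every
    # acoustic frame through the adaptor module for that language,
    # while all other parameters stay shared across languages.
    down, up = adapters[language]
    return [adapter(frame, down, up) for frame in features]
```

Adapters for other languages sit unused in the same model, so adding a language only adds a small module rather than a whole network.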

    Processing audio waveforms
    Invention Grant

    Publication Number: US10403269B2

    Publication Date: 2019-09-03

    Application Number: US15080927

    Filing Date: 2016-03-25

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing audio waveforms. In some implementations, a time-frequency feature representation is generated based on audio data. The time-frequency feature representation is input to an acoustic model comprising a trained artificial neural network. The trained artificial neural network comprises a frequency convolution layer, a memory layer, and one or more hidden layers. An output that is based on output of the trained artificial neural network is received. A transcription is provided, where the transcription is determined based on the output of the acoustic model.
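The pipeline in this abstract, building a time-frequency representation and then convolving along the frequency axis, can be sketched in plain Python. Framing parameters and the kernel are illustrative; in the patent the convolution weights are learned and the result feeds a memory (recurrent) layer and hidden layers:

```python
import cmath

def time_frequency(waveform, frame_len, hop):
    # Slice the waveform into overlapping frames and take DFT magnitudes
    # per frame, yielding a (time x frequency) feature representation.
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    def dft_mag(frame):
        n = len(frame)
        return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                        for t in range(n)))
                for f in range(n // 2 + 1)]
    return [dft_mag(f) for f in frames]

def frequency_convolution(tf_map, kernel):
    # 1-D convolution along the frequency axis of each time frame,
    # the role the frequency convolution layer plays in the model.
    k = len(kernel)
    return [[sum(kernel[j] * row[i + j] for j in range(k))
             for i in range(len(row) - k + 1)]
            for row in tf_map]
```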

    Enhanced multi-channel acoustic models

    Publication Number: US10224058B2

    Publication Date: 2019-03-05

    Application Number: US15350293

    Filing Date: 2016-11-14

    Applicant: Google LLC

    Abstract: This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
