Large-Scale Multilingual Speech Recognition With A Streaming End-To-End Model

    Publication No.: US20200380215A1

    Publication Date: 2020-12-03

    Application No.: US16834342

    Application Date: 2020-03-30

    Applicant: Google LLC

    Abstract: A method of transcribing speech using a multilingual end-to-end (E2E) speech recognition model includes receiving audio data for an utterance spoken in a particular native language, obtaining a language vector identifying the particular language, and processing, using the multilingual E2E speech recognition model, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance. The multilingual E2E speech recognition model includes a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language. The method also includes providing the transcription for output.

    Large-scale multilingual speech recognition with a streaming end-to-end model

    Publication No.: US11468244B2

    Publication Date: 2022-10-11

    Application No.: US16834342

    Application Date: 2020-03-30

    Applicant: Google LLC

    Abstract: A method of transcribing speech using a multilingual end-to-end (E2E) speech recognition model includes receiving audio data for an utterance spoken in a particular native language, obtaining a language vector identifying the particular language, and processing, using the multilingual E2E speech recognition model, the language vector and acoustic features derived from the audio data to generate a transcription for the utterance. The multilingual E2E speech recognition model includes a plurality of language-specific adaptor modules that include one or more adaptor modules specific to the particular native language and one or more other adaptor modules specific to at least one other native language different than the particular native language. The method also includes providing the transcription for output.
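    The two records above cover the same application (US16834342): a streaming multilingual E2E model whose shared layers are conditioned on a language vector appended to the acoustic features, with small per-language adaptor modules specializing the shared network for each language. The sketch below illustrates that residual-adapter pattern in PyTorch. It is a hypothetical minimal example: the class names, bottleneck width, one-hot language vector, and the LSTM stand-in for a shared encoder layer are all assumptions for illustration, not the patented architecture.

```python
import torch
import torch.nn as nn

class LanguageAdapter(nn.Module):
    """Residual bottleneck adapter: LayerNorm -> down-project -> ReLU -> up-project -> skip.
    Hypothetical sketch; the patent's actual adaptor layout may differ."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(self.norm(x))))

class AdaptedEncoderLayer(nn.Module):
    """One shared encoder layer followed by a bank of language-specific adapters;
    only the adapter matching the utterance's language runs."""
    def __init__(self, d_model: int, bottleneck: int, languages: list[str]):
        super().__init__()
        self.shared = nn.LSTM(d_model, d_model, batch_first=True)  # stand-in for the shared layer
        self.adapters = nn.ModuleDict(
            {lang: LanguageAdapter(d_model, bottleneck) for lang in languages})

    def forward(self, x: torch.Tensor, lang: str) -> torch.Tensor:
        h, _ = self.shared(x)
        return self.adapters[lang](h)

# Language vector: a one-hot language ID appended to every acoustic frame.
feats = torch.randn(2, 50, 80)                       # (batch, frames, log-mel dims)
langs = ["en", "hi", "ta"]                           # illustrative language set
lang_vec = torch.eye(len(langs))[langs.index("hi")]  # one-hot for Hindi
x = torch.cat([feats, lang_vec.expand(2, 50, -1)], dim=-1)  # (2, 50, 83)

layer = AdaptedEncoderLayer(d_model=83, bottleneck=16, languages=langs)
print(layer(x, lang="hi").shape)  # torch.Size([2, 50, 83])
```

    A design note on why adapters fit this setting: since only the small adaptor weights are language-specific, the bulk of the model is trained jointly on data from all languages, and per-language capacity can be added without forking the entire model.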

    MINIMUM WORD ERROR RATE TRAINING FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS

    Publication No.: US20210358491A1

    Publication Date: 2021-11-18

    Application No.: US17443557

    Application Date: 2021-07-27

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses N-best lists of decoded hypotheses, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.

    Minimum word error rate training for attention-based sequence-to-sequence models

    Publication No.: US11107463B2

    Publication Date: 2021-08-31

    Application No.: US16529252

    Application Date: 2019-08-01

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses N-best lists of decoded hypotheses, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
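    Both records above describe minimum word error rate (MWER) training: the attention-based model is trained with a loss equal to the expected number of word errors over an N-best list of decoded hypotheses, with the hypothesis probabilities renormalized over the list and the mean error over the list subtracted as a baseline. The sketch below is a hypothetical minimal version in PyTorch; the function names, the toy 3-best list, and the assumption that per-hypothesis log-probabilities have already been computed by beam search are illustrative, not taken from the patent.

```python
import torch

def edit_distance(hyp: list[str], ref: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[-1]

def mwer_loss(log_probs: torch.Tensor,
              hyps: list[list[str]],
              ref: list[str]) -> torch.Tensor:
    """Expected word errors over an N-best list. Probabilities are renormalized
    over the list via softmax; the mean error is subtracted as a baseline, as in
    the usual MWER formulation."""
    errors = torch.tensor([float(edit_distance(h, ref)) for h in hyps])
    p_hat = torch.softmax(log_probs, dim=0)  # renormalize over the N-best list
    return torch.sum(p_hat * (errors - errors.mean()))

# Toy 3-best list with model log-probabilities for one utterance (hypothetical values).
log_probs = torch.tensor([-1.2, -1.5, -2.0], requires_grad=True)
hyps = [["the", "cat", "sat"], ["the", "cat", "sad"], ["a", "cat", "sat"]]
loss = mwer_loss(log_probs, hyps, ref=["the", "cat", "sat"])
loss.backward()  # gradients shift probability mass toward lower-error hypotheses
print(loss.item(), log_probs.grad)
```

    Because the renormalized probabilities sum to one over the list, subtracting the mean error shifts the loss by a constant for a fixed N-best list; it mirrors the baseline used in the standard MWER objective, where such a baseline chiefly reduces variance when the expectation is approximated by sampling rather than by an N-best list.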
