Large-scale language model data selection for rare-word speech recognition

    Publication No.: US12014725B2

    Publication Date: 2024-06-18

    Application No.: US17643861

    Filing Date: 2021-12-13

    Applicant: Google LLC

    Abstract: A method of training a language model for rare-word speech recognition includes obtaining a set of training text samples and obtaining a set of training utterances used for training a speech recognition model. Each training utterance in the set of training utterances includes audio data corresponding to an utterance and a corresponding transcription of the utterance. The method also includes applying rare-word filtering to the set of training text samples to identify a subset of rare-word training text samples that include words that either do not appear in the transcriptions from the set of training utterances or appear in those transcriptions fewer than a threshold number of times. The method further includes training an external language model on the transcriptions from the set of training utterances and the identified subset of rare-word training text samples.
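
    The filtering step can be pictured with the minimal sketch below. It assumes whitespace tokenization, lowercasing, and a hypothetical min_count threshold; none of these details are specified by the patent itself.

    # Hypothetical illustration of rare-word filtering, not the patented implementation.
    from collections import Counter

    def filter_rare_word_samples(text_samples, transcriptions, min_count=5):
        """Keep text samples containing at least one word that is absent from the
        training-utterance transcriptions or appears fewer than min_count times."""
        word_counts = Counter(
            word for transcription in transcriptions for word in transcription.lower().split()
        )
        return [
            sample for sample in text_samples
            if any(word_counts[word] < min_count for word in sample.lower().split())
        ]

    # Toy example: "xylophone" never appears in the transcriptions, so the first
    # sample is kept as a rare-word training text sample.
    transcriptions = ["turn on the lights", "play some music", "turn off the lights"]
    text_samples = ["play the xylophone concerto", "turn on the lights"]
    print(filter_rare_word_samples(text_samples, transcriptions, min_count=1))
    # -> ['play the xylophone concerto']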

    Lookup-Table Recurrent Language Model

    Publication No.: US20220310067A1

    Publication Date: 2022-09-29

    Application No.: US17650566

    Filing Date: 2022-02-10

    Applicant: Google LLC

    Abstract: A computer-implemented method includes receiving audio data that corresponds to an utterance spoken by a user and captured by a user device. The method also includes processing the audio data to determine a candidate transcription that includes a sequence of tokens for the spoken utterance. For each token in the sequence of tokens, the method includes determining a token embedding for the corresponding token, determining an n-gram token embedding for a previous sequence of n-gram tokens, and concatenating the token embedding and the n-gram token embedding to generate a concatenated output for the corresponding token. The method also includes rescoring the candidate transcription for the spoken utterance by processing the concatenated output generated for each corresponding token in the sequence of tokens.
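
    A rough sketch of the per-token concatenation step follows. The hash-bucketed n-gram lookup table, the embedding dimension, and the linear rescoring head are illustrative assumptions; the patent only specifies that a token embedding and an n-gram token embedding are concatenated and that the concatenated outputs are processed to rescore the candidate transcription.

    # Hypothetical sketch: token embedding + n-gram lookup-table embedding per token.
    import numpy as np

    EMB_DIM = 8                           # assumed embedding size
    rng = np.random.default_rng(0)

    vocab = ["<s>", "play", "the", "lights", "music"]
    token_table = {tok: rng.normal(size=EMB_DIM) for tok in vocab}   # token lookup table
    ngram_table = rng.normal(size=(128, EMB_DIM))                    # n-gram lookup table

    def ngram_embedding(prev_tokens, n=2):
        """Look up an embedding for the previous n-gram via a simple hash bucket."""
        context = tuple(prev_tokens[-n:]) if prev_tokens else ("<s>",)
        bucket = hash(context) % ngram_table.shape[0]
        return ngram_table[bucket]

    def concatenated_outputs(tokens):
        """Concatenate each token's embedding with the embedding of its preceding n-gram."""
        outputs = []
        for i, tok in enumerate(tokens):
            tok_emb = token_table.get(tok, np.zeros(EMB_DIM))
            ctx_emb = ngram_embedding(tokens[:i])
            outputs.append(np.concatenate([tok_emb, ctx_emb]))
        return np.stack(outputs)

    def rescore(tokens, weights):
        """Toy stand-in for rescoring: a linear score over the concatenated outputs."""
        return float(np.sum(concatenated_outputs(tokens) @ weights))

    # Rescore a candidate transcription with random (untrained) weights.
    weights = rng.normal(size=2 * EMB_DIM)
    print(rescore(["play", "the", "music"], weights))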

    Detecting Unintended Memorization in Language-Model-Fused ASR Systems

    Publication No.: US20230335126A1

    Publication Date: 2023-10-19

    Application No.: US18303296

    Filing Date: 2023-04-19

    Applicant: Google LLC

    Abstract: A method includes inserting a set of canary text samples into a corpus of training text samples and training an external language model on the corpus of training text samples and the inserted set of canary text samples. For each canary text sample, the method also includes generating a corresponding synthetic speech utterance and generating an initial transcription for the corresponding synthetic speech utterance. The method also includes rescoring the initial transcription generated for each corresponding synthetic speech utterance using the external language model. The method also includes determining a word error rate (WER) of the external language model based on the rescored initial transcriptions and the canary text samples, and detecting memorization of the canary text samples by the external language model based on the WER of the external language model.
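
    The memorization check can be illustrated with the short sketch below. It assumes a standard edit-distance WER and a hypothetical threshold-with-margin decision rule; the synthesis, first-pass transcription, and rescoring stages are represented only by their output strings and are not part of the sketch.

    # Hypothetical sketch of the canary WER comparison; not the patented pipeline.
    def word_error_rate(reference, hypothesis):
        """Standard edit-distance WER between a reference and a hypothesis string."""
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    def detect_memorization(canaries, rescored_transcriptions, baseline_wer, margin=0.1):
        """Flag memorization when the rescored canary WER falls well below the WER
        expected for text the language model has never seen (baseline_wer)."""
        wers = [word_error_rate(c, t) for c, t in zip(canaries, rescored_transcriptions)]
        canary_wer = sum(wers) / len(wers)
        return canary_wer < baseline_wer - margin, canary_wer

    # Toy example: the rescored transcription reproduces the canary exactly,
    # which is the signature of unintended memorization.
    canaries = ["my secret passphrase is purple walrus"]
    rescored = ["my secret passphrase is purple walrus"]
    print(detect_memorization(canaries, rescored, baseline_wer=0.35))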