Speech Recognition Using Unspoken Text and Speech Synthesis

    Publication Number: US20210350786A1

    Publication Date: 2021-11-11

    Application Number: US16869552

    Filing Date: 2020-05-07

    Applicant: Google LLC

    Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
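
    The adversarial loss term described above can be illustrated with a minimal, self-contained sketch. The toy generator and discriminator below, the feature sizes, and the binary cross-entropy formulation are illustrative assumptions; they stand in for the GAN-based TTS model and adversarial discriminator rather than reproducing the patented architecture.

```python
# Minimal sketch: a discriminator scores a non-synthetic speech representation
# against a synthetic one produced by a TTS generator, and the generator is
# updated with the resulting adversarial loss term. Shapes and modules are toys.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, TEXT_DIM = 80, 32  # toy feature sizes (assumptions)

tts_generator = nn.Sequential(nn.Linear(TEXT_DIM, 128), nn.ReLU(),
                              nn.Linear(128, FEAT_DIM))   # GAN-based TTS stand-in
discriminator = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(),
                              nn.Linear(128, 1))          # adversarial discriminator

g_opt = torch.optim.Adam(tts_generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def training_step(text_embedding, non_synthetic_speech):
    """One output step for one training text utterance (toy tensors)."""
    synthetic_speech = tts_generator(text_embedding)

    # Discriminator update: distinguish non-synthetic from synthetic speech.
    d_loss = (F.binary_cross_entropy_with_logits(
                  discriminator(non_synthetic_speech), torch.ones(1, 1)) +
              F.binary_cross_entropy_with_logits(
                  discriminator(synthetic_speech.detach()), torch.zeros(1, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Adversarial loss term for the TTS model: push the synthetic speech to be
    # indistinguishable (in acoustic-noise terms) from non-synthetic speech.
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(synthetic_speech), torch.ones(1, 1))
    g_opt.zero_grad()
    adv_loss.backward()
    g_opt.step()
    return adv_loss.item()

print(training_step(torch.randn(1, TEXT_DIM), torch.randn(1, FEAT_DIM)))
```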

    Language-agnostic Multilingual Modeling Using Effective Script Normalization

    Publication Number: US20210233510A1

    Publication Date: 2021-07-29

    Application Number: US17152760

    Filing Date: 2021-01-19

    Applicant: Google LLC

    Abstract: A method includes obtaining a plurality of training data sets, each associated with a respective native language and including a plurality of respective training data samples. For each respective training data sample of each training data set in the respective native language, the method includes transliterating the corresponding transcription in the respective native script into corresponding transliterated text representing the respective native language of the corresponding audio in a target script and associating the corresponding transliterated text in the target script with the corresponding audio in the respective native language to generate a respective normalized training data sample. The method also includes training, using the normalized training data samples, a multilingual end-to-end speech recognition model to predict speech recognition results in the target script for corresponding speech utterances spoken in any of the different native languages associated with the plurality of training data sets.
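
    A minimal sketch of the script-normalization step is shown below, assuming a toy character-level transliteration table into the Latin script; a real system would rely on a full transliteration model, and the function and data names here are hypothetical.

```python
# Minimal sketch: each training sample's native-script transcription is
# transliterated into a single target script and re-paired with its audio
# to form a normalized training data sample.
TOY_TRANSLIT = {"न": "na", "म": "ma", "स": "sa", "्": "", "त": "ta", "े": "e"}

def transliterate(native_text: str) -> str:
    """Toy character-level transliteration into the target (Latin) script."""
    return "".join(TOY_TRANSLIT.get(ch, ch) for ch in native_text)

def normalize_training_set(training_data_set):
    """training_data_set: list of (audio, native_script_transcription) pairs."""
    normalized = []
    for audio, native_transcription in training_data_set:
        target_script_text = transliterate(native_transcription)
        # Associate the transliterated text with the original audio.
        normalized.append((audio, target_script_text))
    return normalized

hindi_samples = [("audio_001.wav", "नमस्ते")]
print(normalize_training_set(hindi_samples))  # prints the (audio, Latin-script text) pair
```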

    USING NON-PARALLEL VOICE CONVERSION FOR SPEECH CONVERSION MODELS

    Publication Number: US20250095639A1

    Publication Date: 2025-03-20

    Application Number: US18962686

    Filing Date: 2024-11-27

    Applicant: Google LLC

    Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation. The method also includes determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term.
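
    The consistency term can be sketched as a divergence between the two hypothesis distributions; the symmetric KL form below is an assumption, since the abstract only states that the term is based on the first and second probability distributions.

```python
# Minimal sketch: the recognizer yields one distribution over hypotheses for
# the non-synthetic speech and one for the voice-converted (synthetic) speech
# of the same utterance; a symmetric KL divergence between them serves as the
# consistency term in this illustration.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def consistency_loss(non_synth_logits, synth_logits, eps=1e-12):
    p = softmax(non_synth_logits)   # distribution over non-synthetic hypotheses
    q = softmax(synth_logits)       # distribution over synthetic hypotheses
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)))
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)))
    return 0.5 * (kl_pq + kl_qp)    # symmetric so neither branch is privileged

# Toy logits over 4 candidate hypotheses at one output step.
print(consistency_loss(np.array([2.0, 0.5, -1.0, 0.0]),
                       np.array([1.8, 0.7, -0.9, 0.1])))
```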

    Zero-Shot Task Expansion of ASR Models Using Task Vectors

    Publication Number: US20250078813A1

    Publication Date: 2025-03-06

    Application Number: US18817181

    Filing Date: 2024-08-27

    Applicant: Google LLC

    Abstract: A method includes training, using an un-supervised learning technique, an auxiliary ASR model based on a first set of un-transcribed source task speech utterances to determine a first task vector, training, using the un-supervised learning technique, the auxiliary ASR model based on a second set of un-transcribed source task speech utterances to determine a second task vector, and training, using the un-supervised learning technique, the auxiliary ASR model based on un-transcribed target task speech utterances to determine a target task vector. The method also includes determining a first correlation between the first and target task vectors, determining a second correlation between the second and target task vectors, and adapting parameters of a trained primary ASR model based on the first and second task vectors and the first and second correlations to teach the primary ASR model to learn how to recognize speech associated with the target task.
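
    A minimal sketch of the task-vector arithmetic follows, assuming a task vector is the parameter delta obtained by adapting the auxiliary model to one task and that "correlation" is computed as cosine similarity; both choices are illustrative assumptions rather than the patented formulation.

```python
# Minimal sketch: source and target task vectors are parameter deltas, their
# correlations weight a combination of the source task vectors, and the primary
# ASR model's parameters are shifted by that weighted combination.
import numpy as np

def task_vector(adapted_params, base_params):
    return adapted_params - base_params

def correlation(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
base = rng.normal(size=1000)                                    # auxiliary model params
src1 = task_vector(base + 0.1 * rng.normal(size=1000), base)    # source task 1 vector
src2 = task_vector(base + 0.1 * rng.normal(size=1000), base)    # source task 2 vector
tgt  = task_vector(base + 0.1 * rng.normal(size=1000), base)    # target task vector

w1, w2 = correlation(src1, tgt), correlation(src2, tgt)

primary_params = rng.normal(size=1000)                          # trained primary ASR model
adapted_primary = primary_params + w1 * src1 + w2 * src2        # zero-shot task expansion
print(w1, w2, np.linalg.norm(adapted_primary - primary_params))
```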

    Multilingual Re-Scoring Models for Automatic Speech Recognition

    Publication Number: US20240420692A1

    Publication Date: 2024-12-19

    Application Number: US18818010

    Filing Date: 2024-08-28

    Applicant: Google LLC

    Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
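
    The second-pass combination can be sketched as a weighted sum of the three per-hypothesis scores; the linear interpolation and its weights below are assumptions, not the patented scoring function.

```python
# Minimal sketch: each of the N first-pass hypotheses receives an overall score
# combining the un-normalized likelihood, an external language-model score, and
# a standalone prior score; the highest-scoring hypothesis is the transcription.
def overall_score(likelihood, lm_score, prior_score,
                  lm_weight=1.0, prior_weight=1.0):
    return likelihood + lm_weight * lm_score + prior_weight * prior_score

# (hypothesis text, un-normalized likelihood, external LM score, standalone prior)
n_best = [
    ("play some jazz", -4.1, -2.0, -1.5),
    ("play sun jazz",  -3.9, -4.2, -3.0),
    ("clay some jazz", -4.0, -3.8, -2.7),
]

final_transcription = max(
    n_best, key=lambda h: overall_score(h[1], h[2], h[3]))[0]
print(final_transcription)  # "play some jazz"
```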

    USING TEXT-INJECTION TO RECOGNIZE SPEECH WITHOUT TRANSCRIPTION

    Publication Number: US20240304178A1

    Publication Date: 2024-09-12

    Application Number: US18439630

    Filing Date: 2024-02-12

    Applicant: Google LLC

    CPC classification number: G10L15/063 G10L15/22 G10L15/26

    Abstract: A method includes receiving training data including transcribed speech utterances spoken in a general domain, modified speech utterances in a target domain, and unspoken textual utterances corresponding to the transcriptions of the modified speech utterances in the target domain. The modified speech utterances include utterances spoken in the target domain that have been modified to obfuscate one or more classes of sensitive information recited in the utterances. The method also includes generating a corresponding alignment output for each unspoken textual utterance of the received training data using an alignment model. The method also includes training a speech recognition model on the alignment outputs generated for the corresponding unspoken textual utterances, the un-transcribed modified speech utterances, and the transcribed speech utterances to teach the speech recognition model to learn to recognize speech in the target domain and phrases within the one or more classes of sensitive information.
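
    A rough sketch of how the three training sources might be mixed into batches is given below; the toy alignment model and all names are hypothetical stand-ins, since the abstract does not specify the alignment model or feature formats.

```python
# Minimal sketch: alignment outputs from unspoken text, un-transcribed modified
# speech, and transcribed general-domain speech are pooled and sampled into one
# training batch for the speech recognition model.
import random

def toy_alignment_model(text: str, frames_per_char: int = 3):
    """Stand-in: upsample text tokens to a frame-aligned sequence."""
    return [ch for ch in text for _ in range(frames_per_char)]

def build_training_batch(transcribed_speech, modified_speech, unspoken_text,
                         batch_size=4, seed=0):
    pool = (
        [("speech", audio, text) for audio, text in transcribed_speech]   # general domain
        + [("speech", audio, None) for audio in modified_speech]          # target domain, no transcript
        + [("alignment", toy_alignment_model(t), t) for t in unspoken_text]
    )
    random.Random(seed).shuffle(pool)
    return pool[:batch_size]

batch = build_training_batch(
    transcribed_speech=[("general_001.wav", "turn on the lights")],
    modified_speech=["target_017.wav"],
    unspoken_text=["patient id [REDACTED] confirmed"],
)
for kind, features, target in batch:
    print(kind, target)
```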

    Training speech synthesis to generate distinct speech sounds

    Publication Number: US12087272B2

    Publication Date: 2024-09-10

    Application Number: US17756995

    Filing Date: 2019-12-13

    Applicant: Google LLC

    CPC classification number: G10L13/047 G10L13/086 G10L15/063 G10L15/16

    Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
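
    The phone-label loss can be sketched as a per-time-step cross-entropy after alignment; the mapping-network stand-in, the shapes, and the one-to-one alignment below are assumptions for illustration.

```python
# Minimal sketch: a phone label mapping network maps each predicted audio
# feature to a distribution over phone labels, which is compared against the
# aligned reference phone labels with cross-entropy to give the phone label loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHONES, FEAT_DIM, T = 40, 80, 12  # toy sizes (assumptions)

phone_label_mapping = nn.Linear(FEAT_DIM, NUM_PHONES)  # mapping network stand-in

predicted_audio_features = torch.randn(T, FEAT_DIM, requires_grad=True)  # TTS output stand-in
reference_phone_labels = torch.randint(0, NUM_PHONES, (T,))              # aligned references

phone_logits = phone_label_mapping(predicted_audio_features)  # (T, NUM_PHONES)
phone_label_loss = F.cross_entropy(phone_logits, reference_phone_labels)

# In the full method this loss would be backpropagated into the TTS model
# alongside its usual spectrogram loss; here it only reaches the toy tensors.
phone_label_loss.backward()
print(phone_label_loss.item())
```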

    Self-Training With Oracle And Top-Ranked Hypotheses

    Publication Number: US20240296832A1

    Publication Date: 2024-09-05

    Application Number: US18590918

    Filing Date: 2024-02-28

    Applicant: Google LLC

    CPC classification number: G10L15/063 G10L15/01 G10L15/16 G10L15/197

    Abstract: A method includes, for each training sample of a plurality of training samples, processing, using an RNN-T model, a corresponding sequence of acoustic frames to obtain an n-best list of speech recognition hypotheses, and, for each speech recognition hypothesis of the n-best list, determining a corresponding number of word errors relative to a corresponding ground-truth transcription. For a top-ranked hypothesis from the n-best list, the method includes determining a first loss based on the corresponding ground-truth transcription. The method includes identifying, as an oracle hypothesis, the speech recognition hypothesis from the n-best list having the smallest corresponding number of word errors relative to the corresponding ground-truth transcription, and determining a second loss for the oracle hypothesis based on the corresponding ground-truth transcription. The method includes determining a corresponding self-training combined loss based on the first and second losses, and training the model based on the corresponding self-training combined loss.
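
    A minimal sketch of the oracle selection and combined loss follows, using edit distance for word-error counting; the toy per-hypothesis losses and the equal weighting are assumptions rather than the patented formulation.

```python
# Minimal sketch: word errors are counted with edit distance, the oracle
# hypothesis is the n-best entry with the fewest errors relative to the
# ground truth, and the self-training combined loss is a weighted sum of the
# top-ranked hypothesis loss and the oracle hypothesis loss.
def word_errors(hyp: str, ref: str) -> int:
    h, r = hyp.split(), ref.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(r) + 1)]
         for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (h[i - 1] != r[j - 1]))
    return d[len(h)][len(r)]

def combined_self_training_loss(n_best, ground_truth, alpha=0.5):
    """n_best: list of (hypothesis_text, loss_vs_ground_truth), ranked best-first."""
    top_hypothesis, first_loss = n_best[0]                            # top-ranked hypothesis
    oracle_hypothesis, second_loss = min(
        n_best, key=lambda h: word_errors(h[0], ground_truth))        # oracle hypothesis
    return alpha * first_loss + (1.0 - alpha) * second_loss

n_best = [("play sum jazz music", 1.1),    # rank 1 (top-ranked)
          ("play some jazz music", 1.4),   # rank 2 (fewest word errors -> oracle)
          ("play some jas musik", 1.9)]    # rank 3
print(combined_self_training_loss(n_best, "play some jazz music"))  # 0.5*1.1 + 0.5*1.4
```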

    Using speech recognition to improve cross-language speech synthesis

    Publication Number: US11990117B2

    Publication Date: 2024-05-21

    Application Number: US17451613

    Filing Date: 2021-10-20

    Applicant: Google LLC

    CPC classification number: G10L13/047 G10L13/086 G10L13/10

    Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
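
    The cross-lingual consistency training loop can be sketched structurally as below; every model here is a toy stand-in, and the symmetric KL consistency term over per-step posteriors is an assumption about how the two recognition results are compared.

```python
# Minimal sketch: the same input text is synthesized twice from a multilingual
# TTS stand-in, once conditioned on a native speaker embedding and once on a
# cross-lingual speaker embedding, and the recognizer's per-step posteriors for
# the two versions are pulled toward each other by a consistency loss.
import numpy as np

rng = np.random.default_rng(0)

def multilingual_tts(text, speaker_embedding):             # stand-in TTS model
    return rng.normal(size=(len(text), 80)) + speaker_embedding.mean()

def asr_posteriors(speech_features, vocab_size=50):        # stand-in recognizer
    logits = rng.normal(size=(speech_features.shape[0], vocab_size))
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def consistency_loss(p, q, eps=1e-12):                     # mean symmetric KL
    return 0.5 * np.mean(np.sum(p * np.log((p + eps) / (q + eps)) +
                                q * np.log((q + eps) / (p + eps)), axis=-1))

text = "bonjour le monde"                                  # input text in the first language
native_speaker = rng.normal(size=16)                       # native speaker of the first language
cross_speaker = rng.normal(size=16)                        # native speaker of the second language

native_synth = multilingual_tts(text, native_speaker)
cross_synth = multilingual_tts(text, cross_speaker)
loss = consistency_loss(asr_posteriors(native_synth), asr_posteriors(cross_synth))
print(loss)  # would drive updates to the speech recognition model
```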
