Using Aligned Text and Speech Representations to Train Automatic Speech Recognition Models without Transcribed Speech Data

    Publication Number: US20240029715A1

    Publication Date: 2024-01-25

    Application Number: US18355508

    Application Date: 2023-07-20

    Applicant: Google LLC

    CPC classification number: G10L15/063

    Abstract: A method includes receiving training data that includes unspoken textual utterances in a target language. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. The method also includes generating a corresponding alignment output for each unspoken textual utterance using an alignment model trained on transcribed speech utterances in one or more training languages, each different from the target language. The method also includes generating a corresponding encoded textual representation for each alignment output using a text encoder, and training a speech recognition model on the encoded textual representations generated for the alignment outputs. Training the speech recognition model on these representations teaches it to recognize speech in the target language.
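
    A minimal PyTorch sketch of the training flow this abstract describes, using toy stand-in modules (an embedding table for the frozen alignment model, a GRU text encoder, a linear ASR head) and a placeholder cross-entropy objective; the patent does not disclose concrete architectures or losses, so every choice below is an assumption:

```python
import torch
import torch.nn as nn

vocab = 1000
alignment_model = nn.Embedding(vocab, 256)   # maps text tokens to alignment outputs
text_encoder = nn.GRU(256, 256, batch_first=True)
asr_head = nn.Linear(256, vocab)             # per-position sub-word logits

alignment_model.requires_grad_(False)        # trained on other languages, kept fixed
optimizer = torch.optim.Adam(
    list(text_encoder.parameters()) + list(asr_head.parameters()), lr=1e-4)

tokens = torch.randint(0, vocab, (8, 20))    # a batch of unspoken textual utterances
with torch.no_grad():
    alignment_out = alignment_model(tokens)  # corresponding alignment outputs
encoded_text, _ = text_encoder(alignment_out)   # encoded textual representations
logits = asr_head(encoded_text)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens)
loss.backward()
optimizer.step()
```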

    Generating diverse and natural text-to-speech samples

    Publication Number: US11475874B2

    Publication Date: 2022-10-18

    Application Number: US17163007

    Application Date: 2021-01-29

    Applicant: Google LLC

    Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
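
    The distinctive step here is vector quantization: each latent feature extracted from an aligned spectrogram portion is assigned its nearest codebook entry, and the decoder consumes speech embeddings concatenated with those quantized embeddings. A toy sketch of that quantize-and-decode step; the dimensions, nearest-neighbor assignment, and GRU decoder are illustrative stand-ins, not the patented architecture:

```python
import torch
import torch.nn as nn

num_units, d, codebook_size = 50, 64, 32
speech_embed = nn.Embedding(num_units, d)       # per-speech-unit embeddings
codebook = nn.Embedding(codebook_size, d)       # table of quantized embeddings
decoder = nn.GRU(2 * d, 80, batch_first=True)   # emits 80-bin mel frames

units = torch.randint(0, num_units, (1, 12))    # speech units for one training text
latent = torch.randn(1, 12, d)  # latent features from the aligned spectrogram portions

# Assign each latent feature its nearest codebook entry (vector quantization).
dists = torch.cdist(latent, codebook.weight.unsqueeze(0))
quantized = codebook(dists.argmin(dim=-1))

# Decode the concatenation of speech embeddings and quantized embeddings.
speech_sample, _ = decoder(torch.cat([speech_embed(units), quantized], dim=-1))
```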

    Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech

    Publication Number: US20220301543A1

    Publication Date: 2022-09-22

    Application Number: US17326542

    Application Date: 2021-05-21

    Applicant: Google LLC

    Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
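
    A compact sketch of the duration-predict-then-upsample idea at the heart of this abstract. The patent learns a soft interval representation with auxiliary attention; the hard repeat_interleave below is a common simplification that makes the frame-expansion step concrete, and all module choices are assumptions:

```python
import torch
import torch.nn as nn

dim, n_mels = 64, 80
duration_net = nn.Linear(dim, 1)        # duration model network (toy)
spec_decoder = nn.Linear(dim, n_mels)   # predicts mel frames from upsampled features

seq = torch.randn(6, dim)               # sequence representation for 6 phonemes
durations = duration_net(seq).squeeze(-1).exp().round().clamp(min=1).long()

# Upsample: repeat each phoneme representation for its predicted number of frames.
upsampled = seq.repeat_interleave(durations, dim=0)   # (num_frames, dim)
pred_mel = spec_decoder(upsampled)                    # predicted mel-spectrogram frames

ref_mel = torch.randn_like(pred_mel)    # reference mel-frequency spectrogram (toy)
spectrogram_loss = nn.functional.l1_loss(pred_mel, ref_mel)
```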

    Learning Word-Level Confidence for Subword End-To-End Automatic Speech Recognition

    Publication Number: US20220270597A1

    Publication Date: 2022-08-25

    Application Number: US17182592

    Application Date: 2021-02-23

    Applicant: Google LLC

    Abstract: A method includes receiving a speech recognition result, and using a confidence estimation module (CEM), for each sub-word unit in a sequence of hypothesized sub-word units for the speech recognition result: obtaining a respective confidence embedding that represents a set of confidence features; generating, using a first attention mechanism, a confidence feature vector; generating, using a second attention mechanism, an acoustic context vector; and generating, as output from an output layer of the CEM, a respective confidence output score for each corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by the output layer of the CEM. For each of the one or more words formed by the sequence of hypothesized sub-word units, the method also includes determining a respective word-level confidence score for the word. The method also includes determining an utterance-level confidence score by aggregating the word-level confidence scores.
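
    A toy sketch of the two-attention confidence estimation module: self-attention over per-sub-word confidence embeddings yields the confidence feature vector, cross-attention into the acoustic encodings yields the acoustic context vector, and a sigmoid output layer scores each sub-word. Taking each word-final sub-word's score as the word-level score is one common choice; the segmentation, sizes, and mean aggregation below are all illustrative:

```python
import torch
import torch.nn as nn

d = 32
self_attn = nn.MultiheadAttention(d, 4, batch_first=True)   # first attention mechanism
cross_attn = nn.MultiheadAttention(d, 4, batch_first=True)  # second attention mechanism
out_layer = nn.Linear(2 * d, 1)                             # CEM output layer

conf_embed = torch.randn(1, 7, d)   # confidence embeddings for 7 hypothesized sub-words
acoustic = torch.randn(1, 50, d)    # acoustic encoder outputs for the utterance

feat_vec, _ = self_attn(conf_embed, conf_embed, conf_embed)  # confidence feature vector
ctx_vec, _ = cross_attn(conf_embed, acoustic, acoustic)      # acoustic context vector
scores = torch.sigmoid(out_layer(torch.cat([feat_vec, ctx_vec], dim=-1))).squeeze(-1)

word_ends = [2, 4, 6]                # indices of word-final sub-words (toy segmentation)
word_scores = scores[0, word_ends]   # one word-level confidence score per word
utterance_score = word_scores.mean() # aggregate to an utterance-level confidence
```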

    Unsupervised Learning of Disentangled Speech Content and Style Representation

    Publication Number: US20220189456A1

    Publication Date: 2022-06-16

    Application Number: US17455667

    Application Date: 2021-11-18

    Applicant: Google LLC

    Abstract: A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech as output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.
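
    A minimal sketch of the encoder/decoder split, assuming a frame-level content encoder, an utterance-level style encoder, and a decoder that can mix content from one utterance with style from another. The modules are toy GRUs and the disentanglement training objectives are not shown:

```python
import torch
import torch.nn as nn

n_mels, d = 80, 64
content_encoder = nn.GRU(n_mels, d, batch_first=True)  # frame-level linguistic content
style_encoder = nn.GRU(n_mels, d, batch_first=True)    # utterance-level speaking style
decoder = nn.GRU(2 * d, n_mels, batch_first=True)

speech_a = torch.randn(1, 100, n_mels)   # input speech A
speech_b = torch.randn(1, 120, n_mels)   # input speech B (same or different utterance)

content, _ = content_encoder(speech_a)   # latent linguistic content of A
_, style = style_encoder(speech_b)       # latent speaking style of B (final hidden state)
style = style.transpose(0, 1).expand(-1, content.size(1), -1)

# Decode content from one utterance combined with style from another.
output_speech, _ = decoder(torch.cat([content, style], dim=-1))
```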

    Very deep convolutional neural networks for end-to-end speech recognition

    Publication Number: US11080599B2

    Publication Date: 2021-08-03

    Application Number: US16692538

    Application Date: 2019-11-22

    Applicant: Google LLC

    Abstract: A speech recognition neural network system includes an encoder neural network and a decoder neural network. The encoder neural network generates an encoded sequence from an input acoustic sequence that represents an utterance. The input acoustic sequence includes a respective acoustic feature representation at each of a plurality of input time steps, the encoded sequence includes a respective encoded representation at each of a plurality of time reduced time steps, and the number of time reduced time steps is less than the number of input time steps. The encoder neural network includes a time reduction subnetwork, a convolutional LSTM subnetwork, and a network in network subnetwork. The decoder neural network receives the encoded sequence and processes the encoded sequence to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings.
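
    A sketch of the encoder's shape bookkeeping: strided convolutions implement the time-reduction subnetwork (400 input time steps become 100 time-reduced steps), and a 1x1 convolution stands in for the network-in-network subnetwork. The convolutional LSTM subnetwork and the attention decoder are omitted for brevity, and all sizes are toy values:

```python
import torch
import torch.nn as nn

feat_dim, hidden = 40, 128
# Time-reduction subnetwork: two strided convolutions quarter the time axis.
time_reduce = nn.Sequential(
    nn.Conv1d(feat_dim, hidden, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
nin = nn.Conv1d(hidden, hidden, kernel_size=1)  # network-in-network: 1x1 convolution
decoder_proj = nn.Linear(hidden, 500)           # scores over a toy substring set

acoustic = torch.randn(1, feat_dim, 400)        # 400 input time steps
encoded = nin(time_reduce(acoustic))            # (1, 128, 100): 100 time-reduced steps
substring_scores = decoder_proj(encoded.transpose(1, 2))  # per-position substring scores
```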

    Advancing the use of text and speech in ASR pretraining with consistency and contrastive losses

    Publication Number: US12272363B2

    Publication Date: 2025-04-08

    Application Number: US17722264

    Application Date: 2022-04-15

    Applicant: Google LLC

    Abstract: A method includes receiving training data that includes unspoken textual utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
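
    A toy sketch of the joint pretraining step: synthesize speech from the unspoken textual utterances with a TTS stand-in, pool all three speech sources into one batch, and pretrain the audio encoder. A masked-reconstruction objective stands in here for the patent's consistency and contrastive losses, which the abstract does not specify:

```python
import torch
import torch.nn as nn

tts_model = nn.Embedding(1000, 80)      # toy TTS: maps token ids to mel-like frames
audio_encoder = nn.GRU(80, 64, batch_first=True)
proj = nn.Linear(64, 80)                # reconstruction head for the stand-in objective

unspoken_text = torch.randint(0, 1000, (4, 30))
synthetic = tts_model(unspoken_text)    # synthetic speech representations
untranscribed = torch.randn(4, 30, 80)  # un-transcribed non-synthetic speech
transcribed = torch.randn(4, 30, 80)    # transcribed non-synthetic speech

# Pool all three sources so the encoder learns shared speech/text representations.
batch = torch.cat([synthetic, untranscribed, transcribed], dim=0)
mask = torch.rand(batch.shape[:2]) < 0.3            # mask 30% of frames
masked = batch.masked_fill(mask.unsqueeze(-1), 0.0)
encoded, _ = audio_encoder(masked)
loss = nn.functional.mse_loss(proj(encoded)[mask], batch[mask])
loss.backward()
```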

Multilingual and Code-Switching ASR Using Large Language Model Generated Text

    Publication Number: US20250095637A1

    Publication Date: 2025-03-20

    Application Number: US18886581

    Application Date: 2024-09-16

    Applicant: Google LLC

    Abstract: A method includes receiving a textual prompt in a first language and obtaining a fine-tuned prompt embedding configured to guide a large language model (LLM) to generate text in a target language from textual prompts in the first language. The method also includes processing, using the LLM, the textual prompt conditioned on the fine-tuned prompt embedding to generate output text in the target language and concatenating the textual prompt and the generated output text to provide an unspoken textual utterance. The method also includes training a multilingual automatic speech recognition (ASR) model to learn how to recognize speech in the target language by injecting the unspoken textual utterance into a text encoder associated with the multilingual ASR model.
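
    A minimal prompt-tuning sketch: the LLM and its embeddings stay frozen while only the prompt embedding is trainable; it is prepended to the embedded textual prompt, the model generates target-language text, and the prompt plus generated text form the unspoken textual utterance. Every module below is a toy stand-in for a real LLM:

```python
import torch
import torch.nn as nn

vocab, d = 1000, 64
token_embed = nn.Embedding(vocab, d)
llm = nn.GRU(d, d, batch_first=True)    # toy stand-in for a frozen LLM
lm_head = nn.Linear(d, vocab)
for m in (token_embed, llm, lm_head):
    m.requires_grad_(False)             # the LLM itself is never updated

prompt_embedding = nn.Parameter(torch.randn(1, 5, d))   # the only tuned parameters
opt = torch.optim.Adam([prompt_embedding], lr=1e-3)

prompt_ids = torch.randint(0, vocab, (1, 10))   # textual prompt in the first language
inputs = torch.cat([prompt_embedding, token_embed(prompt_ids)], dim=1)
hidden, _ = llm(inputs)
next_token = lm_head(hidden[:, -1]).argmax(-1)  # greedy decoding, one step shown

# Prompt + generated text together form an unspoken textual utterance that is
# injected into the multilingual ASR model's text encoder (not shown here).
unspoken_utterance = torch.cat([prompt_ids, next_token.unsqueeze(0)], dim=1)
```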

    Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech

    Publication Number: US12249315B2

    Publication Date: 2025-03-11

    Application Number: US18499031

    Application Date: 2023-10-31

    Applicant: Google LLC

    Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
