Generating diverse and natural text-to-speech samples

    公开(公告)号:US11475874B2

    公开(公告)日:2022-10-18

    申请号:US17163007

    申请日:2021-01-29

    Applicant: Google LLC

    Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and a quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.

    Multilingual Re-Scoring Models for Automatic Speech Recognition

    公开(公告)号:US20220310081A1

    公开(公告)日:2022-09-29

    申请号:US17701635

    申请日:2022-03-22

    Applicant: Google LLC

    Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis, and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.

    Supervised and unsupervised training with contrastive loss over sequences

    公开(公告)号:US12230249B2

    公开(公告)日:2025-02-18

    申请号:US17655903

    申请日:2022-03-22

    Applicant: Google LLC

    Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining a L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.

    Mixture model attention for flexible streaming and non-streaming automatic speech recognition

    公开(公告)号:US12136415B2

    公开(公告)日:2024-11-05

    申请号:US17644343

    申请日:2021-12-15

    Applicant: Google LLC

    Abstract: A method for an automated speech recognition (ASR) model for unifying streaming and non-streaming speech recognition including receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of an automatic speech recognition (ASR) model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypothesis at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.

    Multilingual re-scoring models for automatic speech recognition

    公开(公告)号:US12080283B2

    公开(公告)日:2024-09-03

    申请号:US17701635

    申请日:2022-03-22

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/005 G10L15/16 G10L15/22

    Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.

    Multilingual Re-Scoring Models for Automatic Speech Recognition

    公开(公告)号:US20240203409A1

    公开(公告)日:2024-06-20

    申请号:US18589220

    申请日:2024-02-27

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/005 G10L15/16 G10L15/22

    Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.

    Speech Recognition Using Unspoken Text and Speech Synthesis

    公开(公告)号:US20220068255A1

    公开(公告)日:2022-03-03

    申请号:US17454536

    申请日:2021-11-11

    Applicant: Google LLC

    Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-Based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.

    Speech recognition using unspoken text and speech synthesis

    公开(公告)号:US11222620B2

    公开(公告)日:2022-01-11

    申请号:US16869552

    申请日:2020-05-07

    Applicant: Google LLC

    Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-Based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.

Patent Agency Ranking