-
Publication Number: US11475874B2
Publication Date: 2022-10-18
Application Number: US17163007
Application Date: 2021-01-29
Applicant: Google LLC
Inventor: Yu Zhang , Bhuvana Ramabhadran , Andrew Rosenberg , Yonghui Wu , Byungha Chun , Ron Weiss , Yuan Cao
Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
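The abstract above combines per-unit speech embeddings with quantized latent features before decoding. Below is a minimal sketch of that quantize-and-concatenate step, assuming nearest-neighbor vector quantization against a learned codebook; the dimensions and the quantization rule are assumptions for illustration, not specified by the patent.

```python
import torch

# Hypothetical dimensions; the abstract does not fix any of these.
NUM_UNITS, EMB_DIM, LATENT_DIM, CODEBOOK_SIZE = 50, 256, 64, 512

unit_embedding = torch.nn.Embedding(NUM_UNITS, EMB_DIM)  # per-speech-unit embeddings
codebook = torch.randn(CODEBOOK_SIZE, LATENT_DIM)        # learned VQ codebook (toy init)

def quantize(latents):
    """Assign each latent feature to its nearest codebook entry."""
    dists = torch.cdist(latents, codebook)   # (units, CODEBOOK_SIZE) L2 distances
    return codebook[dists.argmin(dim=-1)]    # quantized embeddings, (units, LATENT_DIM)

def decoder_input(unit_ids, latents):
    """Concatenate each speech-unit embedding with its quantized latent,
    forming the sequence the decoder consumes to generate the sample."""
    return torch.cat([unit_embedding(unit_ids), quantize(latents)], dim=-1)

# Toy call: 7 speech units with latent features already extracted and aligned.
out = decoder_input(torch.randint(0, NUM_UNITS, (7,)), torch.randn(7, LATENT_DIM))
print(out.shape)  # torch.Size([7, 320])
```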
-
Publication Number: US20220310081A1
Publication Date: 2022-09-29
Application Number: US17701635
Application Date: 2022-03-22
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/16 , G10L15/22 , G10L15/00
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
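As a rough illustration of the second-pass selection, here is a sketch that combines the three scores log-linearly with hypothetical weights; the abstract only states that the overall score is based on the three scores, not how they are weighted or combined.

```python
import math

def rescore(candidates, lm_weight=0.5, prior_weight=0.3):
    """Second-pass rescoring sketch: combine the first-pass likelihood,
    an external LM score, and a standalone prior score per hypothesis,
    then pick the candidate with the highest overall score. The additive
    combination and the weights are assumptions for illustration."""
    best, best_score = None, -math.inf
    for hyp in candidates:
        overall = (hyp["likelihood"]                 # un-normalized likelihood score
                   + lm_weight * hyp["lm_score"]     # external language model score
                   + prior_weight * hyp["prior"])    # standalone prior-statistics score
        if overall > best_score:
            best, best_score = hyp["text"], overall
    return best

# N candidate hypotheses from the first pass (toy log-scores).
n_best = [
    {"text": "recognize speech", "likelihood": -4.1, "lm_score": -2.0, "prior": -1.2},
    {"text": "wreck a nice beach", "likelihood": -4.0, "lm_score": -6.5, "prior": -3.8},
]
print(rescore(n_best))  # -> "recognize speech"
```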
-
Publication Number: US12230249B2
Publication Date: 2025-02-18
Application Number: US17655903
Application Date: 2022-03-22
Applicant: Google LLC
Inventor: Andrew Rosenberg , Bhuvana Ramabhadran , Zhehuai Chen , Yuan Wang , Yu Zhang , Jesse Emond
Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive audio data example into a contrastive loss space. The method also includes determining an L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.
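The per-utterance consistency loss lends itself to a short sketch: project both encoder-output sequences of the augmented pair, take per-frame L2 distances, and average. The linear projection head and the tensor shapes below are assumptions for illustration.

```python
import torch

def per_utterance_consistency_loss(enc_a, enc_b, projection):
    """Project the encoder outputs of two augmented copies of the same
    utterance into a contrastive loss space, compute the L2 distance
    between corresponding frames, and average over the utterance.
    `projection` is a hypothetical module; the patent does not fix its form."""
    proj_a = projection(enc_a)  # (frames, proj_dim)
    proj_b = projection(enc_b)
    l2 = torch.linalg.vector_norm(proj_a - proj_b, dim=-1)  # per-frame L2 distance
    return l2.mean()                                        # average over the utterance

# Toy usage with a linear projection head (an assumption, not the patent's choice).
projection = torch.nn.Linear(256, 128)
enc_a, enc_b = torch.randn(80, 256), torch.randn(80, 256)
loss = per_utterance_consistency_loss(enc_a, enc_b, projection)
```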
-
Publication Number: US12136415B2
Publication Date: 2024-11-05
Application Number: US17644343
Application Date: 2021-12-15
Applicant: Google LLC
Inventor: Kartik Audhkhasi , Bhuvana Ramabhadran , Tongzhou Chen , Pedro J. Moreno Mengibar
IPC: G10L15/16 , G06F1/03 , G06N3/04 , G06N3/0455 , G10L19/16
Abstract: A method for unifying streaming and non-streaming speech recognition with an automated speech recognition (ASR) model includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
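A minimal sketch of an attention PDF formed as a mixture of softmaxes over a context window, in the spirit of the MiMo attention named above; the component count, window size, and shapes are assumptions for illustration, not the patent's specifics.

```python
import torch
import torch.nn.functional as F

def mixture_of_softmaxes_attention(scores, mix_logits, window=16):
    """Compute an attention PDF as a convex combination of per-component
    softmaxes over a limited context window. `scores` holds per-component
    attention logits of shape (components, time); `mix_logits` weights
    the components."""
    scores = scores[:, -window:]               # restrict to the context window
    per_component = F.softmax(scores, dim=-1)  # one softmax per mixture component
    weights = F.softmax(mix_logits, dim=0)     # mixture weights sum to 1
    # Convex combination of component softmaxes -> still a valid PDF.
    return (weights[:, None] * per_component).sum(dim=0)

attn = mixture_of_softmaxes_attention(torch.randn(4, 32), torch.randn(4))
assert torch.isclose(attn.sum(), torch.tensor(1.0))
```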
-
Publication Number: US12080283B2
Publication Date: 2024-09-03
Application Number: US17701635
Application Date: 2022-03-22
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/00 , G10L15/16 , G10L15/22
CPC classification number: G10L15/197 , G10L15/005 , G10L15/16 , G10L15/22
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
-
Publication Number: US20240203409A1
Publication Date: 2024-06-20
Application Number: US18589220
Application Date: 2024-02-27
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/00 , G10L15/16 , G10L15/22
CPC classification number: G10L15/197 , G10L15/005 , G10L15/16 , G10L15/22
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
-
Publication Number: US12014729B2
Publication Date: 2024-06-18
Application Number: US17644344
Application Date: 2021-12-15
Applicant: Google LLC
Inventor: Kartik Audhkhasi , Bhuvana Ramabhadran , Tongzhou Chen , Pedro J. Moreno Mengibar
IPC: G10L15/16 , G06F1/03 , G06N3/04 , G06N3/0455 , G10L19/16
CPC classification number: G10L15/16 , G06F1/03 , G06N3/04 , G06N3/0455 , G10L19/167
Abstract: A method for unifying streaming and non-streaming speech recognition with an automated speech recognition (ASR) model includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
-
Publication Number: US11929060B2
Publication Date: 2024-03-12
Application Number: US17170836
Application Date: 2021-02-08
Applicant: Google LLC
Inventor: Zhehuai Chen , Andrew Rosenberg , Bhuvana Ramabhadran , Pedro Jose Moreno Mengibar
IPC: G10L15/06 , G06N3/04 , G06N3/044 , G06N3/045 , G06N3/08 , G06N3/088 , G10L13/02 , G10L15/16 , G10L15/197
CPC classification number: G10L15/063 , G06N3/044 , G06N3/045 , G06N3/088 , G10L13/02 , G10L15/16 , G10L15/197 , G10L2015/0635
Abstract: A method for training a speech recognition model includes receiving a set of training utterance pairs each including a non-synthetic speech representation and a synthetic speech representation of a same corresponding utterance. At each of a plurality of output steps for each training utterance pair in the set of training utterance pairs, the method also includes determining a consistent loss term for the corresponding training utterance pair based on a first probability distribution over possible non-synthetic speech recognition hypotheses generated for the corresponding non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses generated for the corresponding synthetic speech representation. The first and second probability distributions are generated for output by the speech recognition model. The method also includes updating parameters of the speech recognition model based on the consistent loss term determined at each of the plurality of output steps for each training utterance pair.
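One way to realize the consistent loss term described above is a divergence between the two per-step probability distributions; the sketch below uses KL divergence as an assumed choice, since the abstract does not name the measure.

```python
import torch
import torch.nn.functional as F

def consistent_loss(real_logits, synth_logits):
    """Consistency term between the distribution over hypotheses for the
    non-synthetic speech representation and the one for its synthetic
    counterpart, averaged over output steps. KL divergence is one common
    choice; the patent only says the term is based on the two distributions."""
    log_p_synth = F.log_softmax(synth_logits, dim=-1)
    p_real = F.softmax(real_logits, dim=-1)
    # KL(p_real || p_synth), averaged over output steps.
    return F.kl_div(log_p_synth, p_real, reduction="batchmean")

# Toy logits of shape (output_steps, vocabulary).
loss = consistent_loss(torch.randn(20, 100), torch.randn(20, 100))
```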
-
Publication Number: US20220068255A1
Publication Date: 2022-03-03
Application Number: US17454536
Application Date: 2021-11-11
Applicant: Google LLC
Inventor: Zhehuai Chen , Andrew M. Rosenberg , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
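A minimal sketch of the adversarial terms, assuming a binary real-vs-synthetic discriminator over spectrogram-like inputs; the abstract describes the loss only as measuring acoustic noise disparity, so the exact formulation below is an assumption.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, real_spec, synth_spec):
    """Compute the discriminator and generator (TTS) adversarial terms.
    The discriminator scores how non-synthetic a representation looks; the
    TTS model is updated to reduce the disparity the discriminator detects."""
    real_logit = discriminator(real_spec)    # higher => judged non-synthetic
    synth_logit = discriminator(synth_spec)
    # Discriminator: separate non-synthetic from synthetic representations.
    d_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(synth_logit, torch.zeros_like(synth_logit)))
    # Generator (TTS): make synthetic speech indistinguishable from non-synthetic.
    g_loss = F.binary_cross_entropy_with_logits(synth_logit, torch.ones_like(synth_logit))
    return d_loss, g_loss

# Toy usage: spectrograms of shape (frames, mel_bins) scored by a linear probe.
disc = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(100 * 80, 1))
d_loss, g_loss = adversarial_losses(disc, torch.randn(100, 80), torch.randn(100, 80))
```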
-
Publication Number: US11222620B2
Publication Date: 2022-01-11
Application Number: US16869552
Application Date: 2020-05-07
Applicant: Google LLC
Inventor: Zhehuai Chen , Andrew M. Rosenberg , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
-