-
Publication number: US12272348B2
Publication date: 2025-04-08
Application number: US17655030
Filing date: 2022-03-16
Applicant: Google LLC
Inventor: Bhuvana Ramabhadran , Zhehuai Chen , Fadi Biadsy , Pedro J. Moreno Mengibar
IPC: G10L13/027 , G10L13/047 , G10L15/16 , G10L15/22 , G10L25/18
Abstract: A method for speech conversion includes receiving, as input to an encoder of a speech conversion model, an input spectrogram corresponding to an utterance, the encoder including a stack of self-attention blocks. The method further includes generating, as output from the encoder, an encoded spectrogram and receiving, as input to a spectrogram decoder of the speech conversion model, the encoded spectrogram generated as output from the encoder. The method further includes generating, as output from the spectrogram decoder, an output spectrogram corresponding to a synthesized speech representation of the utterance.
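The abstract above describes an encoder built from a stack of self-attention blocks that maps an input spectrogram to an encoded spectrogram. A minimal, generic sketch of that flow (plain dot-product self-attention with residual connections; the toy spectrogram, dimensions, and block count are illustrative assumptions, not the patented architecture):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(frames):
    """Each output frame is an attention-weighted mix of all input frames
    (queries, keys, and values are the frames themselves in this sketch)."""
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [dot(q, k) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * f[i] for w, f in zip(weights, frames))
                    for i in range(d)])
    return out

def encoder(frames, num_blocks=3):
    """Stack of self-attention blocks with residual connections."""
    x = frames
    for _ in range(num_blocks):
        attn = self_attention(x)
        x = [[xi + ai for xi, ai in zip(xf, af)] for xf, af in zip(x, attn)]
    return x

spectrogram = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4]]  # 3 frames, 2 bins
encoded = encoder(spectrogram)
```

The encoded output keeps the input's frame count, which is what lets a spectrogram decoder consume it frame by frame.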
-
Publication number: US20250022458A1
Publication date: 2025-01-16
Application number: US18896830
Filing date: 2024-09-25
Applicant: Google LLC
Inventor: Kartik Audhkhasi , Bhuvana Ramabhadran , Tongzhou Chen , Pedro J. Moreno Mengibar
IPC: G10L15/16 , G06F1/03 , G06N3/04 , G06N3/0455 , G10L19/16
Abstract: A method for unifying streaming and non-streaming speech recognition in an automated speech recognition (ASR) model includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
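The final sentence describes an attention PDF computed as a mixture of softmax components over a context window. A hedged sketch of that idea (the score vectors and mixture weights are made-up inputs; the patented MiMo attention parameterization is not reproduced here):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_of_softmaxes(score_sets, mix_weights):
    """Attention PDF as a convex combination of per-component softmaxes
    computed over the same context window."""
    assert abs(sum(mix_weights) - 1.0) < 1e-9
    comps = [softmax(s) for s in score_sets]
    n = len(comps[0])
    return [sum(w * c[i] for w, c in zip(mix_weights, comps))
            for i in range(n)]

# Two mixture components scoring a 3-position context window.
pdf = mixture_of_softmaxes([[1.0, 0.0, -1.0], [-1.0, 0.5, 0.2]], [0.6, 0.4])
```

Because each component is itself a valid distribution and the mixture weights sum to one, the combined attention PDF is guaranteed to be a valid distribution over the window.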
-
Publication number: US12159617B2
Publication date: 2024-12-03
Application number: US17808091
Filing date: 2022-06-21
Applicant: Google LLC
Inventor: Zhehuai Chen , Bhuvana Ramabhadran , Andrew M. Rosenberg , Yu Zhang , Pedro J. Moreno Mengibar
IPC: G10L15/06 , G10L13/047 , G10L13/08 , G10L15/16
Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
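The pre-training setup above pools two unpaired sources: synthetic speech generated from unspoken text, and real un-transcribed audio. A loose sketch of that corpus construction (the `fake_tts` stub and all representations are hypothetical stand-ins, not the patent's text-to-speech model):

```python
def build_pretraining_corpus(unspoken_texts, untranscribed_audio, tts):
    """Convert each unpaired text to a synthetic speech representation,
    then pool it with real (un-transcribed) audio so the audio encoder
    is pre-trained on both kinds of unpaired data."""
    synthetic = [tts(t) for t in unspoken_texts]
    return synthetic + list(untranscribed_audio)

# Hypothetical TTS stub: maps each character to a one-dimensional "frame".
fake_tts = lambda text: [[float(ord(c) % 7)] for c in text]

corpus = build_pretraining_corpus(
    ["hi", "ok"],            # unspoken text utterances (no paired audio)
    [[[0.1], [0.2]]],        # un-transcribed speech (no paired text)
    fake_tts,
)
```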
-
Publication number: US20240296837A1
Publication date: 2024-09-05
Application number: US18589802
Filing date: 2024-02-28
Applicant: Google LLC
Inventor: Andrew M. Rosenberg , Yosuke Higuchi , Bhuvana Ramabhadran
Abstract: A method includes receiving a sequence of acoustic frames characterizing an utterance. During a first pass, the method includes generating first-pass audio encodings based on the sequence of acoustic frames using a stack of mask-conformer blocks of an acoustic encoder, generating a first-pass transcription of the utterance based on the first-pass audio encodings using a speech recognition decoder, and generating a first-pass masked output sequence using a mask-predict decoder of the acoustic encoder. During a second pass, the method includes generating second-pass audio encodings by performing cross-attention on the sequence of acoustic frames and the masked first-pass transcription using the stack of mask-conformer blocks of the acoustic encoder and generating a second-pass transcription of the utterance based on the second-pass audio encodings using the speech recognition decoder.
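The second pass above cross-attends over a masked version of the first-pass transcription. One common mask-predict convention, assumed here for illustration rather than taken from the patent, is to mask the lowest-confidence tokens so the second pass can re-predict them with acoustic context:

```python
def mask_low_confidence(tokens, confidences, mask_ratio=0.5, mask="<MASK>"):
    """Replace the lowest-confidence tokens of a first-pass hypothesis
    with a mask symbol for second-pass re-prediction."""
    n_mask = int(len(tokens) * mask_ratio)
    order = sorted(range(len(tokens)), key=lambda i: confidences[i])
    to_mask = set(order[:n_mask])
    return [mask if i in to_mask else t for i, t in enumerate(tokens)]

masked = mask_low_confidence(["the", "cat", "sat", "down"],
                             [0.9, 0.3, 0.8, 0.4], mask_ratio=0.5)
# masks the two least confident tokens, "cat" and "down"
```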
-
Publication number: US20240282292A1
Publication date: 2024-08-22
Application number: US18654278
Filing date: 2024-05-03
Applicant: Google LLC
Inventor: Zhehuai Chen , Bhuvana Ramabhadran , Andrew Rosenberg , Yu Zhang , Pedro J. Moreno Mengibar
IPC: G10L13/047 , G10L13/08 , G10L13/10
CPC classification number: G10L13/047 , G10L13/086 , G10L13/10
Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
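The consistency loss term above compares the recognition results for the native and cross-lingual renderings of the same text. A plausible instantiation, assumed here for illustration (the patent does not necessarily use this exact form), is a symmetric KL divergence between the two output distributions:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(native_dist, cross_lingual_dist):
    """Symmetric KL between the recognizer's output distributions for the
    native and cross-lingual synthesized renderings of the same text."""
    return 0.5 * (kl(native_dist, cross_lingual_dist) +
                  kl(cross_lingual_dist, native_dist))

loss = consistency_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
same = consistency_loss([0.5, 0.5], [0.5, 0.5])  # identical outputs
```

The loss is zero exactly when the two distributions agree, so minimizing it pushes the recognizer toward speaker-invariant behavior.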
-
Publication number: US20240153484A1
Publication date: 2024-05-09
Application number: US18494324
Filing date: 2023-10-25
Applicant: Google LLC
Inventor: Andrew M. Rosenberg , Takaaki Saeki , Zhehuai Chen , Byungha Chun , Bhuvana Ramabhadran
IPC: G10L13/047 , G10L15/06 , G10L15/16
CPC classification number: G10L13/047 , G10L15/063 , G10L15/16
Abstract: A method includes receiving training data that includes a plurality of sets of text-to-speech (TTS) spoken utterances, each set associated with a respective language and including TTS utterances of synthetic speech, where each TTS utterance includes a corresponding reference speech representation paired with a corresponding input text sequence. For each TTS utterance in each set of the TTS spoken training utterances of the received training data, the method includes generating a corresponding TTS encoded textual representation for the corresponding input text sequence, generating a corresponding speech encoding for the corresponding TTS utterance of synthetic speech, generating a shared encoder output, generating a predicted speech representation for the corresponding TTS utterance of synthetic speech, and determining a reconstruction loss. The method also includes training a TTS model based on the reconstruction losses determined for the TTS utterances in each set of the TTS spoken training utterances.
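The reconstruction loss above compares a predicted speech representation against its paired reference. A minimal sketch, assuming a standard mean-squared-error form (the patent's exact loss may differ):

```python
def reconstruction_loss(predicted, reference):
    """Mean squared error between predicted and reference speech
    representations, averaged over all frames and dimensions."""
    total, count = 0.0, 0
    for p_frame, r_frame in zip(predicted, reference):
        for p, r in zip(p_frame, r_frame):
            total += (p - r) ** 2
            count += 1
    return total / count

loss = reconstruction_loss([[0.0, 1.0], [2.0, 2.0]],
                           [[0.0, 1.0], [2.0, 0.0]])
# one of four elements differs by 2 → 4 / 4 = 1.0
```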
-
Publication number: US20230298565A1
Publication date: 2023-09-21
Application number: US17660487
Filing date: 2022-04-25
Applicant: Google LLC
Inventor: Andrew M. Rosenberg , Gary Wang , Bhuvana Ramabhadran , Fadi Biadsy
IPC: G10L15/06 , G10L15/197 , G10L13/02 , G10L19/038 , G10L15/22
CPC classification number: G10L15/063 , G10L15/197 , G10L13/02 , G10L19/038 , G10L15/22 , G10L2015/0635 , G10L2019/0001
Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation. The method also includes determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term.
-
Publication number: US20230274727A1
Publication date: 2023-08-31
Application number: US18312576
Filing date: 2023-05-04
Applicant: Google LLC
Inventor: Vijayaditya Peddinti , Bhuvana Ramabhadran , Andrew Rosenberg , Mateusz Golebiewski
IPC: G10L13/08 , G10L15/187
CPC classification number: G10L13/08 , G10L15/187
Abstract: A method for instantaneous learning in text-to-speech (TTS) during dialog includes receiving a user pronunciation of a particular word present in a query spoken by a user. The method also includes receiving a TTS pronunciation of the same particular word present in a TTS input, where the TTS pronunciation of the particular word is different than the user pronunciation of the particular word. The method also includes obtaining user pronunciation-related features and TTS pronunciation-related features associated with the particular word. The method also includes generating a pronunciation decision selecting whichever of the user pronunciation or the TTS pronunciation of the particular word is associated with the highest confidence. The method also includes providing TTS audio that includes a synthesized speech representation of the response to the query using the user pronunciation or the TTS pronunciation for the particular word.
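The pronunciation decision above selects whichever candidate carries the higher confidence. The selection step itself reduces to a comparison (the confidence scores and pronunciation strings here are hypothetical; how the confidences are derived from the pronunciation-related features is not shown):

```python
def choose_pronunciation(user_pron, user_conf, tts_pron, tts_conf):
    """Pick whichever pronunciation carries the higher confidence score,
    preferring the user's pronunciation on ties."""
    return user_pron if user_conf >= tts_conf else tts_pron

# The user's pronunciation wins because its confidence is higher.
pron = choose_pronunciation("/toh-MAH-toh/", 0.92, "/toh-MAY-toh/", 0.55)
```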
-
Publication number: US20230017892A1
Publication date: 2023-01-19
Application number: US17808091
Filing date: 2022-06-21
Applicant: Google LLC
Inventor: Zhehuai Chen , Bhuvana Ramabhadran , Andrew M. Rosenberg , Yu Zhang , Pedro J. Moreno Mengibar
IPC: G10L13/047 , G10L13/08
Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
-
Publication number: US20230013587A1
Publication date: 2023-01-19
Application number: US17722264
Filing date: 2022-04-15
Applicant: Google LLC
Inventor: Andrew Rosenberg , Zhehuai Chen , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar , Gary Wang , Yu Zhang
Abstract: A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
-