Patent search ap:("Google LLC") AND inv:"Zhehuai Chen" Page 1

1.

发明申请
Supervised and Unsupervised Training with Contrastive Loss Over Sequences 有权

公开(公告)号：US20250166614A1

公开(公告)日：2025-05-22

申请号：US19034304

申请日：2025-01-22

Applicant: Google LLC

Inventor： Andrew Rosenberg , Bhuvana Ramabhadran , Zhehuai Chen , Yuan Wang , Yu Zhang , Jesse Emond

IPC: G10L15/06 , G06N3/0464 , G06N3/09

Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining a L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.

2.

发明申请
Injecting Text in Self-Supervised Speech Pre-training 有权

公开(公告)号：US20250078807A1

公开(公告)日：2025-03-06

申请号：US18951572

申请日：2024-11-18

Applicant: Google LLC

Inventor： Zhehuai Chen , Bhuvana Ramabhadran , Andrew M. Rosenberg , Yu Zhang , Pedro J. Moreno Mengibar

IPC: G10L13/047 , G10L13/08

Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

3.

发明公开
Using Aligned Text and Speech Representations to Train Automatic Speech Recognition Models without Transcribed Speech Data 审中-公开

公开(公告)号：US20240029715A1

公开(公告)日：2024-01-25

申请号：US18355508

申请日：2023-07-20

Applicant: Google LLC

Inventor： Andrew Rosenberg , Zhehuai Chen , Ankur Bapna , Yu Zhang , Bhuvana Ramabhadran

IPC: G10L15/06

CPC classification number: G10L15/063

Abstract: A method includes receiving training data that includes unspoken textual utterances in a target language. Each unspoken textual utterance not paired with any corresponding spoken utterance of non-synthetic speech. The method also includes generating a corresponding alignment output for each unspoken textual utterance using an alignment model trained on transcribed speech utterance in one or more training languages each different than the target language. The method also includes generating a corresponding encoded textual representation for each alignment output using a text encoder and training a speech recognition model on the encoded textual representations generated for the alignment outputs. Training the speech recognition model teaches the speech recognition model to learn how to recognize speech in the target language.

4.

发明授权
Advancing the use of text and speech in ASR pretraining with consistency and contrastive losses 有权

公开(公告)号：US12272363B2

公开(公告)日：2025-04-08

申请号：US17722264

申请日：2022-04-15

Applicant: Google LLC

Inventor： Andrew Rosenberg , Zhehuai Chen , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar , Yuan Wang , Yu Zhang

IPC: G10L19/00 , G10L13/02 , G10L15/06 , G10L15/26

Abstract: A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

5.

发明公开
Joint Speech and Text Streaming Model for ASR 审中-公开

公开(公告)号：US20240028829A1

公开(公告)日：2024-01-25

申请号：US18346232

申请日：2023-07-01

Applicant: Google LLC

Inventor： Tara N. Sainath , Zhouyuan Huo , Zhehuai Chen , Yu Zhang , Weiran Wang , Trevor Strohman , Rohit Prakash Prabhavalkar , Bo Li , Ankur Bapna

IPC: G06F40/284 , G06F40/40

CPC classification number: G06F40/284 , G06F40/40

Abstract: A method includes receiving training data that includes a set of unspoken textual utterances. For each respective unspoken textual utterance, the method includes, tokenizing the respective textual utterance into a sequence of sub-word units, generating a first higher order textual feature representation for a corresponding sub-word unit tokenized from the respective unspoken textual utterance, receiving the first higher order textual feature representation generated by a text encoder, and generating a first probability distribution over possible text units. The method also includes training an encoder based on the first probability distribution over possible text units generated by a first-pass decoder for each respective unspoken textual utterance in the set of unspoken textual utterances.

6.

发明公开
Unsupervised Data Selection via Discrete Speech Representation for Automatic Speech Recognition 审中-公开

公开(公告)号：US20240013777A1

公开(公告)日：2024-01-11

申请号：US18320458

申请日：2023-05-19

Applicant: Google LLC

Inventor： Zhiyun Lu , Yu Zhang , Wei Han , Yongqiang Wang , Parisa Haghani , Zhehuai Chen

IPC: G10L15/16 , G10L15/06

CPC classification number: G10L15/16 , G10L15/063

Abstract: A method includes obtaining a corpus of unlabeled training data including a plurality of spoken utterances, each corresponding spoken utterance of the plurality of spoken utterances includes audio data characterizing the corresponding spoken utterance. The method also includes receiving a target domain. The method also includes selecting, using a contrastive data selection model, a subset of the utterances from the corpus of unlabeled training data that correspond to the target domain. The method includes training an automatic speech recognition (ASR) model on the subset of utterances.

7.

发明授权
Speech recognition using unspoken text and speech synthesis 有权

公开(公告)号：US11837216B2

公开(公告)日：2023-12-05

申请号：US18168969

申请日：2023-02-14

Applicant: Google LLC

Inventor： Zhehuai Chen , Andrew M. Rosenberg , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar

IPC: G10L13/00 , G10L13/08 , G10L15/06

CPC classification number: G10L13/00 , G10L13/08 , G10L15/063

Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-Based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.

8.

发明授权
Speech recognition using unspoken text and speech synthesis 有权

公开(公告)号：US11605368B2

公开(公告)日：2023-03-14

申请号：US17454536

申请日：2021-11-11

Applicant: Google LLC

Inventor： Zhehuai Chen , Andrew M. Rosenberg , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar

IPC: G10L15/16 , G10L13/00 , G10L13/08 , G10L15/06

Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-Based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.

9.

发明授权
Supervised and unsupervised training with contrastive loss over sequences 有权

公开(公告)号：US12230249B2

公开(公告)日：2025-02-18

申请号：US17655903

申请日：2022-03-22

Applicant: Google LLC

Inventor： Andrew Rosenberg , Bhuvana Ramabhadran , Zhehuai Chen , Yuan Wang , Yu Zhang , Jesse Emond

IPC: G10L15/06 , G10L13/02 , G10L15/16 , G10L15/22

Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining a L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.

10.

发明申请
Automatic Speech Recognition Accuracy With Multimodal Embeddings Search 有权

公开(公告)号：US20250006217A1

公开(公告)日：2025-01-02

申请号：US18344007

申请日：2023-06-29

Applicant: Google LLC

Inventor： Christopher Li , Kyle Scott Kastner , Yuan Wang , Zhehuai Chen , Andrew Maxwell Rosenberg , Heng Su , Qian Chen , Leonid Aleksandrovich Velikovich , Patrick Maxim Rondon , Diamantino Antonio Caseiro , Zelin Wu

IPC: G10L25/30 , G10L15/26

Abstract: A method includes receiving training data that includes a set of transcribed speech utterances where each respective transcribed speech utterance is paired with a corresponding transcription. For each respective transcribed speech utterance, the method includes generating an encoded audio representation and an encoded textual representation, generating a higher order audio feature representation for a corresponding encoded audio representation, generating a higher order textual feature representation for a corresponding encoded textual representation, and determining a loss for the respective transcribed speech utterance based on the higher order audio feature representation and the higher order textual feature representation. The method also includes training a speech encoder and a text encoder of a correction model based on the loss determined for each transcribed speech utterance of the set of transcribed speech utterances.

Search Results

Country/Region

Patent validity

Application date

Publication (announcement) day

applicant

The country/region where the applicant is located

Inventor

IPC

IPC Department

IPC class

IPC subclass

IPC group

IPC team

Appearance classification