-
Publication Number: US12100382B2
Publication Date: 2024-09-24
Application Number: US17492543
Filing Date: 2021-10-01
Applicant: Google LLC
Inventor: Yu Zhang, Isaac Elias, Byungha Chun, Ye Jia, Yonghui Wu, Mike Chrzanowski, Jonathan Shen
IPC: G10L13/027, G10L13/04
CPC classification number: G10L13/027, G10L13/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
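The upsampling step is the core of this abstract: each text-element representation is repeated for as many output frames as its predicted duration. A minimal sketch of that step, assuming integer frame counts and hypothetical tensor shapes (the patent does not prescribe an implementation):

```python
import torch

def upsample_by_duration(text_reprs: torch.Tensor,
                         durations: torch.Tensor) -> torch.Tensor:
    """Repeat each text-element representation by its predicted duration.

    text_reprs: (num_input_steps, hidden_dim) -- output of the first network.
    durations:  (num_input_steps,) integer frame counts from the second network.
    Returns the intermediate sequence of shape (sum(durations), hidden_dim).
    """
    return torch.repeat_interleave(text_reprs, durations, dim=0)

# Hypothetical example: 4 text elements, hidden size 8.
reprs = torch.randn(4, 8)
durs = torch.tensor([3, 1, 5, 2])   # predicted frames per element
intermediate = upsample_by_duration(reprs, durs)
print(intermediate.shape)           # torch.Size([11, 8])
```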
-
Publication Number: US20240304185A1
Publication Date: 2024-09-12
Application Number: US18598885
Filing Date: 2024-03-07
Applicant: Google LLC
Inventor: Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays
IPC: G10L15/197, G10L15/02, G10L15/06
CPC classification number: G10L15/197, G10L15/02, G10L15/063
Abstract: A method for a multilingual automatic speech recognition (ASR) model includes receiving a sequence of acoustic frames characterizing an utterance of speech. At each of a plurality of output steps, the method further includes generating a first higher order feature representation for an acoustic frame by a first encoder that includes a first plurality of multi-head attention layers; generating a second higher order feature representation for a corresponding first higher order feature representation by a second encoder that includes a second plurality of multi-head attention layers; and generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on the second higher order feature representation and a sequence of N previous non-blank symbols. A gating layer of each respective mixture-of-experts (MoE) layer is configured to dynamically route an output from a previous multi-head attention layer at each of the plurality of output steps to a respective pair of feed-forward expert networks.
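The final sentence describes top-2 routing inside a mixture-of-experts layer. A minimal sketch of such a gating layer; the class, its dimensions, and the expert count are illustrative assumptions, not the patent's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Illustrative MoE layer: a gating network routes each frame's
    representation to its two highest-scoring feed-forward experts."""

    def __init__(self, dim: int = 256, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, dim) -- output of the previous attention layer.
        scores = F.softmax(self.gate(x), dim=-1)   # (num_frames, num_experts)
        weights, idx = scores.topk(2, dim=-1)      # route to a pair of experts
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```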
-
Publication Number: US20240290321A1
Publication Date: 2024-08-29
Application Number: US18585168
Filing Date: 2024-02-23
Applicant: Google LLC
Inventor: Yongqiang Wang, Yu Zhang, Wei Han, Parisa Haghani, Pedro J. Moreno Mengibar
CPC classification number: G10L15/063, G10L15/26
Abstract: A method includes receiving training data including a corpus of multilingual unspoken textual utterances, a corpus of multilingual un-transcribed non-synthetic speech utterances, and a corpus of multilingual transcribed non-synthetic speech utterances. For each un-transcribed non-synthetic speech utterance, the method includes generating a target quantized vector token and a target token index, generating contrastive context vectors from corresponding masked audio features, and deriving a contrastive loss term. The method also includes generating an alignment output, generating a first probability distribution over possible speech recognition hypotheses for the alignment output, and determining an alignment output loss term. The method also includes generating a second probability distribution over possible speech recognition hypotheses and determining a non-synthetic speech loss term. The method also includes pre-training an audio encoder based on the contrastive loss term, the alignment output loss term, and the non-synthetic speech loss term.
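The contrastive loss term pairs each contrastive context vector with its target quantized vector token. A sketch of one plausible InfoNCE-style formulation and of combining the three loss terms; the similarity measure, temperature, and weights are assumptions, not the patent's definitions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context: torch.Tensor, targets: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Each context vector (from masked audio features) should match its
    target quantized vector against the other targets in the batch.
    context, targets: (batch, dim)."""
    sim = F.cosine_similarity(context.unsqueeze(1), targets.unsqueeze(0), dim=-1)
    labels = torch.arange(context.size(0))   # true target index per context
    return F.cross_entropy(sim / temperature, labels)

def pretraining_loss(l_contrastive, l_alignment, l_non_synthetic,
                     w1=1.0, w2=1.0, w3=1.0):
    # Weighted sum of the three terms the abstract names; weights assumed.
    return w1 * l_contrastive + w2 * l_alignment + w3 * l_non_synthetic
```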
-
Publication Number: US20240282292A1
Publication Date: 2024-08-22
Application Number: US18654278
Filing Date: 2024-05-03
Applicant: Google LLC
Inventor: Zhehuai Chen, Bhuvana Ramabhadran, Andrew Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
IPC: G10L13/047, G10L13/08, G10L13/10
CPC classification number: G10L13/047, G10L13/086, G10L13/10
Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
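A sketch of one plausible reading of the consistent loss term: penalize divergence between the recognizer's distributions over hypotheses for the native and cross-lingual renderings of the same text. The choice of KL divergence is an assumption; the patent does not fix the exact distance:

```python
import torch
import torch.nn.functional as F

def consistent_loss(native_logits: torch.Tensor,
                    cross_lingual_logits: torch.Tensor) -> torch.Tensor:
    """native_logits, cross_lingual_logits: (steps, vocab) recognizer outputs
    on the native and cross-lingual synthesized speech for the same text."""
    p_native = F.log_softmax(native_logits, dim=-1)
    p_cross = F.log_softmax(cross_lingual_logits, dim=-1)
    return F.kl_div(p_cross, p_native, log_target=True, reduction="batchmean")
```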
-
Publication Number: US20240161730A1
Publication Date: 2024-05-16
Application Number: US18421116
Filing Date: 2024-01-24
Applicant: Google LLC
Inventor: Isaac Elias, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu, Byungha Chun
IPC: G10L13/08, G06F40/126, G06N3/044, G06N3/045, G06N3/08, G06N3/088, G10L13/047, G10L21/10
CPC classification number: G10L13/08, G06F40/126, G06N3/044, G06N3/045, G06N3/08, G06N3/088, G10L13/047, G10L21/10, G06N3/048
Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.
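A sketch of the two training terms the abstract names, combining a spectrogram loss over the predicted mel-frequency spectrogram sequences with a phoneme duration loss; the L1/L2 norms and the relative weighting are assumptions, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def tts_training_loss(pred_spectrograms, ref_spectrogram,
                      pred_durations, ref_durations,
                      duration_weight: float = 1.0):
    """pred_spectrograms: list of (frames, mels) predictions (the abstract
    allows one or more); ref_spectrogram: (frames, mels) reference;
    pred_durations, ref_durations: (num_phonemes,) frame counts."""
    spectrogram_loss = sum(F.l1_loss(p, ref_spectrogram)
                           for p in pred_spectrograms)
    duration_loss = F.mse_loss(pred_durations, ref_durations)
    return spectrogram_loss + duration_weight * duration_loss
```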
-
Publication Number: US20240062743A1
Publication Date: 2024-02-22
Application Number: US18499031
Filing Date: 2023-10-31
Applicant: Google LLC
Inventor: Isaac Elias, Byungha Chun, Jonathan Shen, Ye Jia, Yu Zhang, Yonghui Wu
Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
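A sketch of a differentiable, duration-driven upsampler in the spirit of this abstract: cumulative durations define the interval representation, and each output frame attends to tokens with Gaussian weights around the interval centers. The Gaussian weighting and fixed sigma are illustrative assumptions, standing in for the learned auxiliary attention context:

```python
import torch

def gaussian_upsample(seq: torch.Tensor, durations: torch.Tensor,
                      sigma: float = 1.0) -> torch.Tensor:
    """Upsample a (num_tokens, dim) sequence representation to frames.

    durations: (num_tokens,) predicted phoneme durations in frames; they may
    be non-integer, since the whole pipeline stays differentiable.
    """
    durations = durations.to(seq.dtype)
    ends = torch.cumsum(durations, dim=0)        # interval right edges
    centers = ends - durations / 2.0             # interval centers
    num_frames = int(torch.round(ends[-1]).item())
    t = torch.arange(num_frames, dtype=seq.dtype) + 0.5  # frame midpoints
    # (num_frames, num_tokens): each frame's soft weights over tokens.
    logits = -((t[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2)
    weights = torch.softmax(logits, dim=-1)
    return weights @ seq                         # (num_frames, dim)
```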
-
Publication Number: US20240029716A1
Publication Date: 2024-01-25
Application Number: US18480827
Filing Date: 2023-10-04
Applicant: Google LLC
Inventor: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao
CPC classification number: G10L15/063, G10L15/083, G10L15/18, G06N3/045
Abstract: A method for training a streaming automatic speech recognition (ASR) student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming ASR teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.
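A sketch of the distillation recipe: pseudo-label each unlabeled utterance with the non-streaming teachers, then train the streaming student on the resulting pairs. All object names and the majority-vote combination are hypothetical stand-ins; the patent does not specify how teacher hypotheses are merged:

```python
# Minimal distillation loop. `teacher_models`, `student`, `transcribe`, and
# `train_step` are hypothetical stand-ins, not APIs from the patent.

def distill_streaming_student(student, teacher_models, unlabeled_utterances,
                              combine=lambda hyps: max(set(hyps),
                                                       key=hyps.count)):
    """Pseudo-label each unlabeled utterance with the non-streaming teachers,
    then train the streaming student on the (audio, transcript) pairs.
    `combine` is an assumed majority vote over teacher hypotheses."""
    pairs = []
    for audio in unlabeled_utterances:
        hypotheses = [teacher.transcribe(audio) for teacher in teacher_models]
        pairs.append((audio, combine(hypotheses)))
    for audio, transcript in pairs:
        student.train_step(audio, transcript)
    return student
```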
-
Publication Number: US20230410032A1
Publication Date: 2023-12-21
Application Number: US18317614
Filing Date: 2023-05-15
Applicant: Google LLC
Inventor: Yu Zhang
IPC: G06Q10/0835
CPC classification number: G06Q10/08355
Abstract: A user places an order on a merchant website associated with a merchant system via a user computing device. The user selects an option for delivery to the user computing device's location within a delivery area during a delivery time window and authorizes a delivery system to log the location of the user computing device during, and/or for a period of time before, the delivery time window. When the delivery time window arrives, the delivery system provides a delivery route to a delivery agent computing device. When the delivery agent arrives at the user computing device's location, the user receives an alert that the delivery agent has arrived and receives a package from the delivery agent. If the user does not remain within the delivery area, the user may cancel the order and the delivery, may reschedule the delivery, and/or may accept delivery of the order to a fixed shipping address.
-
Publication Number: US20230017892A1
Publication Date: 2023-01-19
Application Number: US17808091
Filing Date: 2022-06-21
Applicant: Google LLC
Inventor: Zhehuai Chen, Bhuvana Ramabhadran, Andrew M. Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
IPC: G10L13/047, G10L13/08
Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
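A sketch of the data flow described above: synthesize speech for every unspoken text utterance, then pre-train the encoder on synthetic and real speech together. `tts_model`, `audio_encoder`, and their methods are hypothetical placeholders, not named in the patent:

```python
# Hypothetical pre-training data flow over the two corpora in the abstract.

def pretrain_audio_encoder(audio_encoder, tts_model,
                           unspoken_texts, untranscribed_speech):
    # Synthesize a speech representation for every unspoken text utterance.
    synthetic_speech = [tts_model.synthesize(text) for text in unspoken_texts]

    # Pre-train on both synthetic and real (non-synthetic) speech so the
    # encoder learns shared speech and text representations.
    for utterance in synthetic_speech + list(untranscribed_speech):
        audio_encoder.pretrain_step(utterance)
    return audio_encoder
```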
-
Publication Number: US20230013587A1
Publication Date: 2023-01-19
Application Number: US17722264
Filing Date: 2022-04-15
Applicant: Google LLC
Inventor: Andrew Rosenberg, Zhehuai Chen, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar, Gary Wang, Yu Zhang
Abstract: A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
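A variant of the previous sketch for the three corpora this abstract adds, where transcribed non-synthetic speech contributes its paired transcription. Object names remain hypothetical placeholders:

```python
# Pre-training over three corpora: synthetic speech (paired with its source
# text), un-transcribed real speech, and transcribed real speech.

def pretrain_with_transcribed_speech(audio_encoder, tts_model, unspoken_texts,
                                     untranscribed_speech, transcribed_speech):
    synthetic = [(tts_model.synthesize(text), text) for text in unspoken_texts]
    unsupervised = [(utt, None) for utt in untranscribed_speech]
    supervised = list(transcribed_speech)   # (utterance, transcription) pairs

    for utterance, text in synthetic + unsupervised + supervised:
        # A paired text (synthetic or transcribed) lets the encoder tie speech
        # to text; None falls back to speech-only self-supervision.
        audio_encoder.pretrain_step(utterance, text=text)
    return audio_encoder
```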