Supervised and Unsupervised Training with Contrastive Loss Over Sequences

    Publication No.: US20220310065A1

    Publication Date: 2022-09-29

    Application No.: US17655903

    Filing Date: 2022-03-22

    Applicant: Google LLC

    Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining an L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.
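
    A minimal sketch of the per-utterance consistency loss this abstract describes, assuming the two augmented copies have already been encoded and projected into the contrastive loss space. Tensor shapes, names, and the weighting of the final objective are illustrative assumptions, not details from the patent:

```python
import torch

def per_utterance_consistency_loss(proj_a: torch.Tensor,
                                   proj_b: torch.Tensor) -> torch.Tensor:
    """Average L2 distance between corresponding projected encoder outputs.

    proj_a, proj_b: (num_frames, proj_dim) sequences of encoder outputs for
    the two augmented copies, already projected into the contrastive space.
    """
    # L2 distance between each pair of corresponding frames in the sequences.
    frame_distances = torch.linalg.vector_norm(proj_a - proj_b, ord=2, dim=-1)
    # Per-utterance consistency loss: average the per-frame L2 distances.
    return frame_distances.mean()

# Illustrative usage; during training this would be combined with the
# supervised loss, e.g. total = supervised_loss + weight * consistency_loss.
proj_a, proj_b = torch.randn(100, 64), torch.randn(100, 64)
loss = per_utterance_consistency_loss(proj_a, proj_b)
```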

    Building a text-to-speech system from a small amount of speech data

    Publication No.: US11335321B2

    Publication Date: 2022-05-17

    Application No.: US17005974

    Filing Date: 2020-08-28

    Applicant: Google LLC

    Abstract: A method of building a text-to-speech (TTS) system from a small amount of speech data includes receiving a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker, where the assortment of speakers does not include the target speaker. The method further includes training a TTS model using the first plurality of recorded speech samples from the assortment of speakers. Here, the trained TTS model is configured to output synthetic speech as an audible representation of a text input. The method also includes re-training the trained TTS model using the second plurality of recorded speech samples from the target speaker combined with the first plurality of recorded speech samples from the assortment of speakers. Here, the re-trained TTS model is configured to output synthetic speech resembling speaking characteristics of the target speaker.
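
    A rough sketch of the two-phase recipe the abstract describes: train on the assortment of speakers, then re-train on the target speaker's samples pooled with the assortment. The model, loss, datasets, learning rates, and epoch counts below are all stand-ins, not details from the patent:

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def train_phase(model, dataset, epochs, lr):
    """One supervised training phase; the L1 spectrogram loss is a stand-in."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    for _ in range(epochs):
        for text_feats, target_spec in loader:
            opt.zero_grad()
            loss = nn.functional.l1_loss(model(text_feats), target_spec)
            loss.backward()
            opt.step()

# Stand-ins for real data: (text features, target spectrogram frames).
assortment_ds = TensorDataset(torch.randn(64, 16), torch.randn(64, 80))
target_speaker_ds = TensorDataset(torch.randn(8, 16), torch.randn(8, 80))
tts_model = nn.Linear(16, 80)   # placeholder for a real TTS network

# Phase 1: train on the assortment of speakers (target speaker excluded).
train_phase(tts_model, assortment_ds, epochs=3, lr=1e-3)
# Phase 2: re-train on the target speaker's samples pooled with the assortment.
train_phase(tts_model, ConcatDataset([target_speaker_ds, assortment_ds]),
            epochs=3, lr=1e-4)
```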

    Parallel Tacotron: Non-Autoregressive and Controllable TTS

    Publication No.: US20220122582A1

    Publication Date: 2022-04-21

    Application No.: US17327076

    Filing Date: 2021-05-21

    Applicant: Google LLC

    Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and reference phoneme durations. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.
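
    A compact sketch of the combined objective: a spectrogram loss over one or more predicted mel-spectrogram sequences plus a phoneme duration loss. The use of L1 for spectrograms and mean-squared error for durations is an assumption for illustration only:

```python
import torch
from torch import nn

def parallel_tts_loss(pred_specs, ref_spec, pred_durations, ref_durations):
    """Final spectrogram loss plus phoneme duration loss.

    pred_specs: iterable of predicted mel-spectrogram sequences, each shaped
    like ref_spec (frames, mel_bins); pred/ref_durations: (num_phonemes,).
    """
    # Spectrogram loss against the reference for each predicted sequence.
    spec_loss = sum(nn.functional.l1_loss(p, ref_spec) for p in pred_specs)
    # Duration loss between predicted and reference phoneme durations.
    dur_loss = nn.functional.mse_loss(pred_durations, ref_durations)
    return spec_loss + dur_loss

# Illustrative usage: two predicted 80-bin spectrogram sequences, 30 phonemes.
specs = [torch.randn(120, 80), torch.randn(120, 80)]
loss = parallel_tts_loss(specs, torch.randn(120, 80),
                         torch.rand(30), torch.rand(30))
```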

    Using Speech Recognition to Improve Cross-Language Speech Synthesis

    Publication No.: US20220122581A1

    Publication Date: 2022-04-21

    Application No.: US17451613

    Filing Date: 2021-10-20

    Applicant: Google LLC

    Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistency loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistency loss term.
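
    The abstract does not specify the form of the consistency loss term; the sketch below assumes it compares the per-frame output distributions of the two recognition results with a symmetric KL divergence, and that the two synthesized representations have been aligned to the same number of frames:

```python
import torch
from torch import nn

def consistency_loss(logits_native: torch.Tensor,
                     logits_cross: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the per-frame output distributions the
    recognizer produces for the native and cross-lingual synthesized speech.

    logits_*: (num_frames, vocab_size); both representations are assumed to
    have been aligned to the same number of frames.
    """
    log_p = nn.functional.log_softmax(logits_native, dim=-1)
    log_q = nn.functional.log_softmax(logits_cross, dim=-1)
    # kl_div(input, target) computes KL(target || input); log_target=True
    # means both arguments are log-probabilities.
    kl_pq = nn.functional.kl_div(log_q, log_p, log_target=True,
                                 reduction="batchmean")
    kl_qp = nn.functional.kl_div(log_p, log_q, log_target=True,
                                 reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

loss = consistency_loss(torch.randn(50, 128), torch.randn(50, 128))
```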

    Processing Text Sequences Using Neural Networks

    Publication No.: US20200026765A1

    Publication Date: 2020-01-23

    Application No.: US16338174

    Filing Date: 2017-10-03

    Applicant: Google LLC

    Abstract: A computer-implemented method trains a neural network that is configured to generate a score distribution over a set of multiple output positions. The neural network is configured to process a network input to generate a respective score distribution for each of a plurality of output positions, including a respective score for each token in a predetermined set of tokens that includes n-grams of multiple different sizes. Example methods described herein provide trained neural networks that produce results with improved accuracy compared to the state of the art, e.g., more accurate translations or more accurate speech recognition.
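
    A toy sketch of the output layer the abstract implies: each output position receives a score distribution over a token set that mixes n-grams of different sizes. The token list, dimensions, and module names are hypothetical:

```python
import torch
from torch import nn

# Hypothetical mixed-granularity token set: characters plus longer n-grams.
TOKENS = ["a", "b", "c", " ", "th", "the", "ing", "tion"]

class ScoreDistributionHead(nn.Module):
    """Maps a network state at one output position to a score distribution
    over the token set, so any position can emit an n-gram of any size."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # One normalized score per token, including multi-character n-grams.
        return torch.softmax(self.proj(states), dim=-1)

head = ScoreDistributionHead(hidden_dim=256, vocab_size=len(TOKENS))
scores = head(torch.randn(5, 256))   # score distributions for 5 positions
```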

    Very deep convolutional neural networks for end-to-end speech recognition

    Publication No.: US10510004B2

    Publication Date: 2019-12-17

    Application No.: US16380101

    Filing Date: 2019-04-10

    Applicant: Google LLC

    Abstract: A speech recognition neural network system includes an encoder neural network and a decoder neural network. The encoder neural network generates an encoded sequence from an input acoustic sequence that represents an utterance. The input acoustic sequence includes a respective acoustic feature representation at each of a plurality of input time steps, the encoded sequence includes a respective encoded representation at each of a plurality of time-reduced time steps, and the number of time-reduced time steps is less than the number of input time steps. The encoder neural network includes a time reduction subnetwork, a convolutional LSTM subnetwork, and a network-in-network subnetwork. The decoder neural network receives the encoded sequence and processes the encoded sequence to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings.
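
    A skeletal rendering of the encoder's three stages. The stride-2 convolutions provide the time reduction (here 4x, so 100 input steps become 25 time-reduced steps); a plain LSTM stands in for the patent's convolutional LSTM subnetwork; and 1x1 convolutions approximate the network-in-network subnetwork. All dimensions are illustrative:

```python
import torch
from torch import nn

class SkeletalEncoder(nn.Module):
    """Rough shape of the described encoder: a time-reduction stage (strided
    convolutions), a recurrent stage (a standard LSTM standing in for the
    convolutional LSTM), and network-in-network-style 1x1 convolutions."""

    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        # Time reduction: each stride-2 conv halves the number of time steps.
        self.time_reduce = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Network-in-network: 1x1 convolutions mixing channels per time step.
        self.nin = nn.Sequential(nn.Conv1d(hidden, hidden, 1), nn.ReLU())

    def forward(self, acoustic: torch.Tensor) -> torch.Tensor:
        # acoustic: (batch, time, feat_dim)
        x = self.time_reduce(acoustic.transpose(1, 2))      # (B, H, time/4)
        x, _ = self.lstm(x.transpose(1, 2))                 # (B, time/4, H)
        return self.nin(x.transpose(1, 2)).transpose(1, 2)  # (B, time/4, H)

enc = SkeletalEncoder()
encoded = enc(torch.randn(2, 100, 80))  # 100 input steps -> 25 reduced steps
```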

    Joint unsupervised and supervised training for multilingual ASR

    Publication No.: US12249317B2

    Publication Date: 2025-03-11

    Application No.: US17929934

    Filing Date: 2022-09-06

    Applicant: Google LLC

    Abstract: A method includes receiving audio features and generating a latent speech representation based on the audio features. The method also includes generating a target quantized vector token and a target token index for a corresponding latent speech representation. The method also includes generating a contrastive context vector for a corresponding unmasked or masked latent speech representation and deriving a contrastive self-supervised loss based on the corresponding contrastive context vector and the corresponding target quantized vector token. The method also includes generating a high-level context vector based on the contrastive context vector and, for each high-level context vector, learning to predict the target token index at the corresponding time step using a cross-entropy loss. The method also includes predicting speech recognition hypotheses for the utterance and training a multilingual automatic speech recognition (ASR) model using an unsupervised loss and a supervised loss.
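
    A minimal sketch of how the unsupervised terms (contrastive loss plus cross-entropy over target quantized-token indices) might combine with the supervised ASR loss. The abstract gives no weighting scheme, and the contrastive and supervised terms are assumed to be computed elsewhere:

```python
import torch
from torch import nn

def joint_asr_loss(contrastive_loss: torch.Tensor,
                   token_logits: torch.Tensor,
                   target_token_indices: torch.Tensor,
                   supervised_loss: torch.Tensor,
                   w_unsup: float = 1.0, w_sup: float = 1.0) -> torch.Tensor:
    """Unsupervised loss (contrastive term + cross-entropy over target
    quantized-token indices) combined with the supervised ASR loss."""
    # Cross-entropy teaches each high-level context vector to predict the
    # target token index at its time step.
    ce = nn.functional.cross_entropy(token_logits, target_token_indices)
    return w_unsup * (contrastive_loss + ce) + w_sup * supervised_loss

# Illustrative usage: 50 time steps, a 512-entry quantizer codebook.
logits = torch.randn(50, 512)            # from high-level context vectors
targets = torch.randint(0, 512, (50,))   # target quantized-token indices
total = joint_asr_loss(torch.tensor(0.7), logits, targets, torch.tensor(2.3))
```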

    Injecting text in self-supervised speech pre-training

    Publication No.: US12159617B2

    Publication Date: 2024-12-03

    Application No.: US17808091

    Filing Date: 2022-06-21

    Applicant: Google LLC

    Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
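
    A small sketch of the data-assembly idea: each pre-training batch for the audio encoder mixes synthetic speech generated from unspoken text with real un-transcribed speech. The TTS interface and feature shapes below are assumptions:

```python
import torch

def build_pretraining_batch(tts, unspoken_texts, untranscribed_audio):
    """Assemble one audio-encoder pre-training batch from both data sources:
    synthetic speech synthesized from unspoken text, plus real un-transcribed
    speech. `tts` is assumed to map a text string to speech features."""
    synthetic = [tts(text) for text in unspoken_texts]
    # The self-supervised objective then runs over both kinds of utterance,
    # pushing the encoder toward shared speech and text representations.
    return synthetic + list(untranscribed_audio)

# Stand-ins: a fake "TTS model" emitting random frames, and real features.
def fake_tts(text: str) -> torch.Tensor:
    return torch.randn(5 * len(text), 80)   # ~5 frames per character, 80 mels

batch = build_pretraining_batch(fake_tts,
                                ["hello world", "an unspoken sentence"],
                                [torch.randn(120, 80), torch.randn(90, 80)])
```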
