Patent search ap:("Google LLC") AND inv:"Yu Zhang" Page 6

51.

发明授权
Training speech synthesis to generate distinct speech sounds 有权

公开(公告)号：US12087272B2

公开(公告)日：2024-09-10

申请号：US17756995

申请日：2019-12-13

Applicant: Google LLC

Inventor： Andrew Rosenberg , Bhuvana Ramabhadran , Fadi Biadsy , Yu Zhang

IPC: G10L15/16 , G10L13/047 , G10L13/08 , G10L15/06

CPC classification number: G10L13/047 , G10L13/086 , G10L15/063 , G10L15/16

Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.

52.

发明授权
Phonemes and graphemes for neural text-to-speech 有权

公开(公告)号：US12020685B2

公开(公告)日：2024-06-25

申请号：US17643684

申请日：2021-12-10

Applicant: Google LLC

Inventor： Ye Jia , Byungha Chun , Yu Zhang , Jonathan Shen , Yonghui Wu

IPC: G10L13/08 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047

CPC classification number: G10L13/086 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047

Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.

53.

发明公开
PARAMETER-EFFICIENT MODEL REPROGRAMMING FOR CROSS-LINGUAL SPEECH RECOGNITION 审中-公开

公开(公告)号：US20240185841A1

公开(公告)日：2024-06-06

申请号：US18490808

申请日：2023-10-20

Applicant: Google LLC

Inventor： Bo Li , Yu Zhang , Nanxin Chen , Rohit Prakash Prabhavalkar , Chao-Han Huck Yang , Tara N. Sainath , Trevor Strohman

IPC: G10L15/065 , G10L15/00

CPC classification number: G10L15/065 , G10L15/005

Abstract: A method includes obtaining an ASR model trained to recognize speech in a first language and receiving transcribed training utterances in a second language. The method also includes integrating the ASR model with an input reprogramming module and a latent reprogramming module. The method also includes adapting the ASR model to learn how to recognize speech in the second language by training the input reprogramming module and the latent reprogramming module while parameters of the ASR model are frozen.

54.

发明授权
Using speech recognition to improve cross-language speech synthesis 有权

公开(公告)号：US11990117B2

公开(公告)日：2024-05-21

申请号：US17451613

申请日：2021-10-20

Applicant: Google LLC

Inventor： Zhehuai Chen , Bhuvana Ramabhadran , Andrew Rosenberg , Yu Zhang , Pedro J. Moreno Mengibar

IPC: G10L13/047 , G10L13/08 , G10L13/10

CPC classification number: G10L13/047 , G10L13/086 , G10L13/10

Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.

55.

发明公开
Augmentation of Audiographic Images for Improved Machine Learning 审中-公开

公开(公告)号：US20230359898A1

公开(公告)日：2023-11-09

申请号：US18350464

申请日：2023-07-11

Applicant: Google LLC

Inventor： Daniel Sung-Joon Park , Quoc Le , William Chan , Ekin Dogus Cubuk , Barret Zoph , Yu Zhang , Chung-Cheng Chiu

IPC: G06V10/774 , G06N20/00 , G10L15/16 , G10L15/06 , G10L15/12 , G10L15/28 , G06V10/82

CPC classification number: G06N3/084 , G06N20/00 , G10L15/16 , G10L15/063 , G10L15/12 , G06V10/7747 , G10L15/28 , G06V10/82 , G06F18/2148

Abstract: Generally, the present disclosure is directed to systems and methods that generate augmented training data for machine-learned models via application of one or more augmentation techniques to audiographic images that visually represent audio signals. In particular, the present disclosure provides a number of novel augmentation operations which can be performed directly upon the audiographic image (e.g., as opposed to the raw audio data) to generate augmented training data that results in improved model performance. As an example, the audiographic images can be or include one or more spectrograms or filter bank sequences.

56.

发明公开
CONDITIONAL OUTPUT GENERATION THROUGH DATA DENSITY GRADIENT ESTIMATION 审中-公开

公开(公告)号：US20230325658A1

公开(公告)日：2023-10-12

申请号：US18010426

申请日：2021-09-02

Applicant: Google LLC

Inventor： Nanxin Chen , Byungha Chun , William Chan , Ron J. Weiss , Mohammad Norouzi , Yu Zhang , Yonghui Wu

IPC: G06V10/82 , G06N3/08 , G10L13/02 , G10L25/18 , G10L25/30 , G06V10/764 , G06V10/26

CPC classification number: G06N3/08 , G06V10/26 , G06V10/764 , G06V10/82 , G10L13/02 , G10L25/18 , G10L25/30

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating outputs conditioned on network inputs using neural networks. In one aspect, a method comprises obtaining the network input; initializing a current network output; and generating the final network output by updating the current network output at each of a plurality of iterations, wherein each iteration corresponds to a respective noise level, and wherein the updating comprises, at each iteration: processing a model input for the iteration comprising (i) the current network output and (ii) the network input using a noise estimation neural network that is configured to process the model input to generate a noise output, wherein the noise output comprises a respective noise estimate for each value in the current network output; and updating the current network output using the noise estimate and the noise level for the iteration.

57.

发明公开
Alignment Prediction to Inject Text into Automatic Speech Recognition Training 审中-公开

公开(公告)号：US20230317059A1

公开(公告)日：2023-10-05

申请号：US18168470

申请日：2023-02-13

Applicant: Google LLC

Inventor： Andrew M Rosenberg , Zhehuai Chen , Yu Zhang , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar

IPC: G10L15/197 , G06F40/289 , G10L15/16 , G10L15/06

CPC classification number: G10L15/063 , G06F40/289 , G10L15/16 , G10L15/197 , G10L2015/0635

Abstract: A method includes receiving training data that includes unspoken textual utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance paired with a corresponding transcription. The method also includes generating a corresponding alignment output for each unspoken textual utterance of the received training data using an alignment model. The method also includes pre-training an audio encoder on the alignment outputs generated for corresponding to the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

58.

发明申请
Two-Level Speech Prosody Transfer 有权

公开(公告)号：US20230064749A1

公开(公告)日：2023-03-02

申请号：US18054604

申请日：2022-11-11

Applicant: Google LLC

Inventor： Lev Finkelstein , Chun-an Chan , Byungha Chun , Ye Jia , Yu Zhang , Robert Andrew James Clark , Vincent Wan

IPC: G10L13/10 , G10L13/02 , G10L17/18

Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

59.

发明授权
Multilingual speech synthesis and cross-language voice cloning 有权

公开(公告)号：US11580952B2

公开(公告)日：2023-02-14

申请号：US16855042

申请日：2020-04-22

Applicant: Google LLC

Inventor： Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran

IPC: G10L13/00 , G10L13/047

Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

60.

发明授权
Two-level speech prosody transfer 有权

公开(公告)号：US11514888B2

公开(公告)日：2022-11-29

申请号：US16992410

申请日：2020-08-13

Applicant: Google LLC

Inventor： Lev Finkelstein , Chun-An Chan , Byungha Chun , Ye Jia , Yu Zhang , Robert Andrew James Clark , Vincent Wan

IPC: G10L13/10 , G10L13/02 , G10L17/18

Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation tor the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

Search Results

Country/Region

Patent validity

Application date

Publication (announcement) day

applicant

The country/region where the applicant is located

Inventor

IPC

IPC Department

IPC class

IPC subclass

IPC group

IPC team

Appearance classification