-
Publication No.: US20230009613A1
Publication Date: 2023-01-12
Application No.: US17756995
Application Date: 2019-12-13
Applicant: Google LLC
Inventor: Andrew Rosenberg , Bhuvana Ramabhadran , Fadi Biadsy , Yu Zhang
IPC: G10L13/047 , G10L13/08 , G10L15/16 , G10L15/06
Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
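The phone-label supervision described above can be sketched minimally: a mapping network turns each predicted audio feature into a phone distribution, and a cross-entropy loss compares it against the time-aligned reference phone label. All dimensions and the single linear layer below are invented for illustration, not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the patent): 4 time steps,
# 8-dim audio features, 5 phone classes.
T, F, P = 4, 8, 5

predicted_audio_features = rng.normal(size=(T, F))
reference_phone_labels = np.array([0, 2, 2, 4])  # one label per time step

# Toy "phone label mapping network": a single linear layer + softmax
# mapping each predicted audio feature to a phone distribution.
W = rng.normal(size=(F, P)) * 0.1

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

phone_probs = softmax(predicted_audio_features @ W)      # (T, P)
predicted_phone_labels = phone_probs.argmax(axis=-1)     # (T,)

# Phone label loss: mean cross-entropy between the predicted phone
# distributions and the aligned reference phone labels; the TTS model
# would be updated to reduce this quantity.
phone_label_loss = -np.mean(
    np.log(phone_probs[np.arange(T), reference_phone_labels] + 1e-12)
)
```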
-
Publication No.: US20220351713A1
Publication Date: 2022-11-03
Application No.: US17813361
Application Date: 2022-07-19
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Ignacio Lopez Moreno , Fei Ren , Yu Zhang , Quan Wang , Patrick An Phu Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
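The data flow in this abstract (speaker encoder produces a fixed speaker vector; the spectrogram generator consumes text plus that vector) can be sketched with stand-in functions. Both functions below are hypothetical placeholders for the trained engines, with invented dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)

def speaker_encoder(audio_frames, W):
    # Pool variable-length audio into a fixed-size speaker vector,
    # then L2-normalize so only voice characteristics remain.
    v = audio_frames.mean(axis=0) @ W
    return v / np.linalg.norm(v)

def spectrogram_generator_input(text_embeddings, speaker_vector):
    # Condition every text position on the same speaker vector by
    # concatenation, a common way to inject speaker identity.
    T = text_embeddings.shape[0]
    tiled = np.tile(speaker_vector, (T, 1))
    return np.concatenate([text_embeddings, tiled], axis=1)

W_enc = rng.normal(size=(16, 8))
target_audio = rng.normal(size=(50, 16))   # 50 frames of reference speech
text_emb = rng.normal(size=(12, 32))       # 12 encoded text positions

spk = speaker_encoder(target_audio, W_enc)
conditioned = spectrogram_generator_input(text_emb, spk)
```

Note the key property this sketch preserves: the speaker vector has a fixed size regardless of how much reference audio was provided.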
-
Publication No.: US11488575B2
Publication Date: 2022-11-01
Application No.: US17055951
Application Date: 2019-05-17
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Ignacio Lopez Moreno , Fei Ren , Yu Zhang , Quan Wang , Patrick Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
-
Publication No.: US20220343894A1
Publication Date: 2022-10-27
Application No.: US17348118
Application Date: 2021-06-15
Applicant: Google LLC
Inventor: Thibault Doutre , Wei Han , Min Ma , Zhiyun Lu , Chung-Cheng Chiu , Ruoming Pang , Arun Narayanan , Ananya Misra , Yu Zhang , Liangliang Cao
Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.
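The distillation recipe above can be sketched as: run each unlabeled utterance through an ensemble of teacher models, average their output distributions, and pair the utterance with the resulting pseudo-transcription for student training. The toy "teachers" below are random linear models with invented sizes, standing in for the non-streaming ASR teachers.

```python
import numpy as np

rng = np.random.default_rng(2)

def teacher(utterance, W):
    # Hypothetical teacher: maps per-frame features to per-frame
    # token probabilities via a linear layer + softmax.
    logits = utterance @ W
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_teachers, frames, feat, vocab = 3, 6, 10, 4
teacher_weights = [rng.normal(size=(feat, vocab)) for _ in range(n_teachers)]
unlabeled_utterances = [rng.normal(size=(frames, feat)) for _ in range(5)]

# Generate pseudo-transcriptions with the teacher ensemble: average the
# teacher distributions, then take the most likely token per frame.
student_training_pairs = []
for utt in unlabeled_utterances:
    avg_probs = np.mean([teacher(utt, W) for W in teacher_weights], axis=0)
    transcription = avg_probs.argmax(axis=-1)  # one token id per frame
    student_training_pairs.append((utt, transcription))
```

A streaming student would then be trained on `student_training_pairs` exactly as if the transcriptions were human labels.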
-
Publication No.: US20220310080A1
Publication Date: 2022-09-29
Application No.: US17643826
Application Date: 2021-12-11
Applicant: Google LLC
Inventor: David Qiu , Yanzhang He , Yu Zhang , Qiujia Li , Liangliang Cao , Ian McGraw
IPC: G10L15/197 , G10L15/06 , G10L15/22 , G10L15/02 , G10L15/16 , G10L15/30 , G10L15/32 , G10L15/04 , G06N3/08
Abstract: A method includes receiving a speech recognition result corresponding to a transcription of an utterance spoken by a user. For each sub-word unit in a sequence of hypothesized sub-word units of the speech recognition result, using a confidence estimation module to: obtain a respective confidence embedding associated with the corresponding output step when the corresponding sub-word unit was output from the first speech recognizer; generate a confidence feature vector; generate an acoustic context vector; and generate a respective confidence output score for the corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by the output layer of the confidence estimation module. The method also includes determining, based on the respective confidence output score generated for each sub-word unit in the sequence of hypothesized sub-word units, an utterance-level confidence score for the transcription of the utterance.
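The two-stage scoring described here (per-sub-word confidence scores, then an utterance-level aggregate) can be sketched as follows. The dimensions, the single linear output layer, and the product-based aggregation are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical inputs: each hypothesized sub-word unit has a confidence
# embedding and an acoustic context vector (sizes invented).
n_subwords, emb_dim, ctx_dim = 5, 6, 4
confidence_embeddings = rng.normal(size=(n_subwords, emb_dim))
acoustic_context = rng.normal(size=(n_subwords, ctx_dim))

# Toy output layer: concatenate both vectors per sub-word and map the
# result to a confidence score in (0, 1).
w_out = rng.normal(size=(emb_dim + ctx_dim,)) * 0.1
features = np.concatenate([confidence_embeddings, acoustic_context], axis=1)
per_unit_confidence = sigmoid(features @ w_out)          # (n_subwords,)

# One simple aggregation: treat the utterance as correct only if every
# sub-word is, i.e. take the product of the per-unit scores.
utterance_confidence = float(np.prod(per_unit_confidence))
```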
-
Publication No.: US20220246132A1
Publication Date: 2022-08-04
Application No.: US17163007
Application Date: 2021-01-29
Applicant: Google LLC
Inventor: Yu Zhang , Bhuvana Ramabhadran , Andrew Rosenberg , Yonghui Wu , Byungha Chun , Ron Weiss , Yuan Cao
IPC: G10L13/047 , G10L25/18 , G10L13/10 , G10L15/06 , G06N3/08
Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
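The quantization step above (assigning a quantized embedding to each latent feature) is a nearest-neighbour lookup into a codebook. The sketch below uses an invented codebook size and dimensions, and shows the concatenation the decoder would consume.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical codebook of quantized embeddings (size and dims invented).
codebook = rng.normal(size=(8, 3))          # 8 quantized embeddings, dim 3
latent_features = rng.normal(size=(5, 3))   # one latent per speech unit

# Assign each latent feature its nearest codebook entry (Euclidean).
dists = np.linalg.norm(
    latent_features[:, None, :] - codebook[None, :, :], axis=-1
)
codes = dists.argmin(axis=1)                # index of assigned embedding
quantized = codebook[codes]                 # (5, 3)

# The decoder input concatenates each speech-unit embedding with its
# assigned quantized embedding.
speech_embeddings = rng.normal(size=(5, 4))
decoder_input = np.concatenate([speech_embeddings, quantized], axis=1)
```

Sampling different codebook entries at inference time is one way such a model yields diverse renditions of the same text.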
-
Publication No.: US20220012537A1
Publication Date: 2022-01-13
Application No.: US17487548
Application Date: 2021-09-28
Applicant: Google LLC
Inventor: Daniel Sung-Joon Park , Quoc V. Le , William Chan , Ekin Dogus Cubuk , Barret Zoph , Yu Zhang , Chung-Cheng Chiu
Abstract: Generally, the present disclosure is directed to systems and methods that generate augmented training data for machine-learned models via application of one or more augmentation techniques to audiographic images that visually represent audio signals. In particular, the present disclosure provides a number of novel augmentation operations which can be performed directly upon the audiographic image (e.g., as opposed to the raw audio data) to generate augmented training data that results in improved model performance. As an example, the audiographic images can be or include one or more spectrograms or filter bank sequences.
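Masking operations applied directly to the audiographic image are easy to sketch: zero out a random block of consecutive time frames and a random band of frequency channels. The mask widths below are invented; only the idea of operating on the spectrogram rather than the raw audio is taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(5)

def augment_spectrogram(spec, max_time_mask=10, max_freq_mask=8, rng=rng):
    # Operate on a copy so the original training example is preserved.
    spec = spec.copy()
    T, F = spec.shape
    # Time mask: zero out a random block of consecutive frames.
    t = rng.integers(1, max_time_mask + 1)
    t0 = rng.integers(0, T - t + 1)
    spec[t0:t0 + t, :] = 0.0
    # Frequency mask: zero out a random band of channels.
    f = rng.integers(1, max_freq_mask + 1)
    f0 = rng.integers(0, F - f + 1)
    spec[:, f0:f0 + f] = 0.0
    return spec

spectrogram = rng.normal(size=(100, 80))    # 100 frames x 80 mel bins
augmented = augment_spectrogram(spectrogram)
```

Each call produces a differently masked copy, so one labeled example yields many augmented training examples.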
-
Publication No.: US12260851B2
Publication Date: 2025-03-25
Application No.: US17305809
Application Date: 2021-07-14
Applicant: Google LLC
Inventor: Lev Finkelstein , Chun-an Chan , Byungha Chun , Norman Casagrande , Yu Zhang , Robert Andrew James Clark , Vincent Wan
IPC: G10L13/00 , G10L13/047 , G10L13/08
Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text utterance that clones the voice of the target speaker in the second accent/dialect.
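The conditioning inputs named above (a speaker embedding plus an accent/dialect identifier) can be assembled as a simple concatenation; the accent inventory, embedding size, and one-hot encoding below are all invented for illustration.

```python
import numpy as np

# Hypothetical accent/dialect inventory.
accents = ["en-US", "en-GB", "en-AU"]

def conditioning_inputs(speaker_embedding, accent):
    # Encode the accent/dialect identifier as a one-hot vector and
    # concatenate it with the target-speaker embedding, so the TTS
    # system receives both "who" and "which accent" at once.
    one_hot = np.zeros(len(accents))
    one_hot[accents.index(accent)] = 1.0
    return np.concatenate([speaker_embedding, one_hot])

speaker_embedding = np.ones(4) * 0.5        # stand-in target-speaker vector
cond = conditioning_inputs(speaker_embedding, "en-GB")
```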
-
Publication No.: US20240339106A1
Publication Date: 2024-10-10
Application No.: US18746809
Application Date: 2024-06-18
Applicant: Google LLC
Inventor: Ye Jia , Byungha Chun , Yu Zhang , Jonathan Shen , Yonghui Wu
IPC: G10L13/08 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
CPC classification number: G10L13/086 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
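The phoneme-to-grapheme correspondence described here hinges on knowing which word each phoneme token came from. A minimal sketch, with an invented two-word example and per-token word indices:

```python
# Hypothetical token inventories: one grapheme token per word, and a
# phoneme token sequence with the word index each phoneme belongs to.
grapheme_tokens = ["hello", "world"]
phoneme_tokens = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
phoneme_word_ids = [0, 0, 0, 0, 1, 1, 1, 1]   # word index per phoneme

# For each phoneme token, determine the grapheme token representing the
# same word; this pairing is what the output encoder embedding is built on.
phoneme_to_grapheme = [
    (p, grapheme_tokens[w]) for p, w in zip(phoneme_tokens, phoneme_word_ids)
]
```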
-
Publication No.: US12087273B2
Publication Date: 2024-09-10
Application No.: US18161217
Application Date: 2023-01-30
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L21/00 , G10L13/00 , G10L13/047
CPC classification number: G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.