Parallel Tacotron Non-Autoregressive and Controllable TTS

    Publication No.: US20220122582A1

    Publication Date: 2022-04-21

    Application No.: US17327076

    Filing Date: 2021-05-21

    Applicant: Google LLC

    Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.
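The abstract above combines two training signals: a final spectrogram loss over one or more predicted mel-spectrogram sequences and a per-phoneme duration loss. The following is a minimal sketch of such a combined objective; the function names, the L1/L2 loss choices, and the `dur_weight` term are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def spectrogram_loss(predicted, reference):
    """L1 distance between a predicted and a reference mel spectrogram."""
    return float(np.mean(np.abs(predicted - reference)))

def duration_loss(predicted_durations, reference_durations):
    """L2 loss on per-phoneme durations (e.g. in frames)."""
    return float(np.mean((predicted_durations - reference_durations) ** 2))

def total_loss(pred_specs, ref_spec, pred_dur, ref_dur, dur_weight=1.0):
    # The abstract mentions one or more predicted spectrogram sequences;
    # here the final spectrogram loss simply averages over them.
    spec = np.mean([spectrogram_loss(s, ref_spec) for s in pred_specs])
    return spec + dur_weight * duration_loss(pred_dur, ref_dur)
```

In a real system both terms would be backpropagated through the TTS model; this sketch only shows how the two losses combine into one scalar.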

    Multi-dialect and multilingual speech recognition

    Publication No.: US11238845B2

    Publication Date: 2022-02-01

    Application No.: US16684483

    Filing Date: 2019-11-14

    Applicant: GOOGLE LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
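The abstract mentions cluster adaptive training, in which a layer's effective weights are an interpolation of several cluster weight matrices, with the interpolation vector selecting for a particular language or dialect. A minimal sketch of that interpolation step follows; the shapes and names are assumptions for illustration, not details from the patent.

```python
import numpy as np

def cluster_adaptive_layer(x, cluster_weights, interpolation):
    """Combine per-cluster weight matrices into one effective matrix
    via a per-language/dialect interpolation vector, then apply it.

    cluster_weights: (num_clusters, out_dim, in_dim)
    interpolation:   (num_clusters,), e.g. learned per dialect
    """
    # Weighted sum over the cluster axis: W = sum_i interp[i] * W_i
    W = np.tensordot(interpolation, cluster_weights, axes=1)
    return W @ x
```

With a uniform interpolation vector over identical cluster matrices, the layer reduces to a single ordinary linear layer, which is a quick sanity check on the combination step.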

    MINIMUM WORD ERROR RATE TRAINING FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS

    Publication No.: US20210358491A1

    Publication Date: 2021-11-18

    Application No.: US17443557

    Filing Date: 2021-07-27

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses N-best lists of decoded hypotheses, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.

    Minimum word error rate training for attention-based sequence-to-sequence models

    Publication No.: US11107463B2

    Publication Date: 2021-08-31

    Application No.: US16529252

    Filing Date: 2019-08-01

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses N-best lists of decoded hypotheses, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.

    Implicit bridging of machine learning tasks

    Publication No.: US10713593B2

    Publication Date: 2020-07-14

    Application No.: US15394708

    Filing Date: 2016-12-29

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for performing machine learning tasks. One method includes receiving (i) a model input, and (ii) data identifying a first machine learning task to be performed on the model input to generate a first type of model output for the model input; augmenting the model input with an identifier for the first machine learning task to generate an augmented model input; and processing the augmented model input using a machine learning model, wherein the machine learning model has been trained on training data to perform a plurality of machine learning tasks including the first machine learning task, and wherein the machine learning model has been configured through training to process the augmented model input to generate a machine learning model output of the first type for the model input.
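The core mechanism in this abstract is augmenting the model input with a task identifier so that a single multi-task model knows which task to perform. A minimal sketch of that augmentation step is below; the token format (angle brackets, reminiscent of the target-language tokens used in multilingual NMT) is an assumption for illustration.

```python
def augment_with_task(tokens, task_id):
    """Prepend a task-identifier token to the model input.

    The same multi-task model then sees, e.g., '<2es> hello world'
    for English-to-Spanish translation. The '<...>' format is an
    illustrative convention, not one specified by the patent.
    """
    return [f"<{task_id}>"] + list(tokens)
```

Because the task is encoded in the input itself, the trained model can sometimes generalize to task/input combinations never explicitly seen in training, which is the "implicit bridging" idea.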

    Implicit bridging of machine learning tasks

    Publication No.: US10679148B2

    Publication Date: 2020-06-09

    Application No.: US16402787

    Filing Date: 2019-05-03

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for performing machine learning tasks. One method includes receiving (i) a model input, and (ii) data identifying a first machine learning task to be performed on the model input to generate a first type of model output for the model input; augmenting the model input with an identifier for the first machine learning task to generate an augmented model input; and processing the augmented model input using a machine learning model. An exemplary system applying implicit bridging for machine learning tasks, as described in this specification, trains a machine learning model to perform certain types of machine learning tasks without requiring explicit training data for the certain types of machine learning tasks to be used during training.

    MULTI-DIALECT AND MULTILINGUAL SPEECH RECOGNITION

    Publication No.: US20200160836A1

    Publication Date: 2020-05-21

    Application No.: US16684483

    Filing Date: 2019-11-14

    Applicant: GOOGLE LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.

    SYNTHESIZING SPEECH FROM TEXT USING NEURAL NETWORKS

    Publication No.: US20200051583A1

    Publication Date: 2020-02-13

    Application No.: US16058640

    Filing Date: 2018-08-08

    Applicant: Google LLC

    Abstract: Methods, systems, and computer program products for generating, from an input character sequence, an output sequence of audio data representing the input character sequence. The output sequence of audio data includes a respective audio output sample for each of a number of time steps. One example method includes, for each of the time steps: generating a mel-frequency spectrogram for the time step by processing a representation of a respective portion of the input character sequence using a decoder neural network; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting the audio output sample for the time step from the possible audio output samples in accordance with the probability distribution.
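The abstract describes a per-time-step pipeline: a decoder network produces a mel-spectrogram frame, a vocoder network maps that frame to a probability distribution over possible audio output samples, and one sample is drawn from the distribution. The sketch below shows that control flow with stand-in callables for the two networks; everything here is an illustrative assumption about structure, not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(char_reprs_per_step, decoder, vocoder, sample_space):
    """One audio sample per time step, as in the abstract.

    char_reprs_per_step: per-step representations of the input characters
    decoder:  callable producing a mel-spectrogram frame for a step
    vocoder:  callable mapping that frame to a distribution over samples
    sample_space: the possible audio output sample values
    """
    audio = []
    for rep in char_reprs_per_step:
        mel = decoder(rep)               # mel-spectrogram frame
        probs = vocoder(mel)             # distribution over samples
        audio.append(rng.choice(sample_space, p=probs))
    return audio
```

In a real system the decoder and vocoder are trained neural networks (e.g. Tacotron-style and WaveNet-style respectively); the loop only illustrates how their outputs chain together per time step.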
