-
公开(公告)号:US11488575B2
公开(公告)日:2022-11-01
申请号:US17055951
申请日:2019-05-17
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Ignacio Lopez Moreno , Fei Ren , Yu Zhang , Quan Wang , Patrick Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
-
公开(公告)号:US20220310072A1
公开(公告)日:2022-09-29
申请号:US17616129
申请日:2020-06-03
Applicant: GOOGLE LLC
Inventor: Tara N. Sainath , Ruoming Pang , David Rybach , Yanzhang He , Rohit Prabhavalkar , Wei Li , Mirkó Visontai , Qiao Liang , Trevor Strohman , Yonghui Wu , Ian C. McGraw , Chung-Cheng Chiu
Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transformer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a listen attend spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.
-
公开(公告)号:US20220246132A1
公开(公告)日:2022-08-04
申请号:US17163007
申请日:2021-01-29
Applicant: Google LLC
Inventor: Yu Zhang , Bhuvana Ramabhadran , Andrew Rosenberg , Yonghui Wu , Byungha Chun , Ron Weiss , Yuan Cao
IPC: G10L13/047 , G10L25/18 , G10L13/10 , G10L15/06 , G06N3/08
Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and a quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
-
公开(公告)号:US20210366463A1
公开(公告)日:2021-11-25
申请号:US17391799
申请日:2021-08-02
Applicant: Google LLC
Inventor: Samuel Bengio , Yuxuan Wang , Zongheng Yang , Zhifeng Chen , Yonghui Wu , Ioannis Agiomyrgiannakis , Ron J. Weiss , Navdeep Jaitly , Ryan M. Rifkin , Robert Andrew James Clark , Quoc V. Le , Russell J. Ryan , Ying Xiao
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
-
公开(公告)号:US11113480B2
公开(公告)日:2021-09-07
申请号:US16336870
申请日:2017-09-25
Applicant: GOOGLE LLC
Inventor: Mohammad Norouzi , Zhifeng Chen , Yonghui Wu , Michael Schuster , Quoc V. Le
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for neural machine translation. One of the systems includes an encoder neural network comprising: an input forward long short-term memory (LSTM) layer configured to process each input token in the input sequence in a forward order to generate a respective forward representation of each input token, an input backward LSTM layer configured to process each input token in a backward order to generate a respective backward representation of each input token and a plurality of hidden LSTM layers configured to process a respective combined representation of each of the input tokens in the forward order to generate a respective encoded representation of each of the input tokens; and a decoder subsystem configured to receive the respective encoded representations and to process the encoded representations to generate an output sequence.
-
公开(公告)号:US20200098350A1
公开(公告)日:2020-03-26
申请号:US16696101
申请日:2019-11-26
Applicant: Google LLC
Inventor: Samuel Bengio , Yuxuan Wang , Zongheng Yang , Zhifeng Chen , Yonghui Wu , Ioannis Agiomyrgiannakis , Ron J. Weiss , Navdeep Jaitly , Ryan M. Rifkin , Robert Andrew James Clark , Quoc V. Le , Russell J. Ryan , Ying Xiao
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
-
公开(公告)号:US20200043483A1
公开(公告)日:2020-02-06
申请号:US16529252
申请日:2019-08-01
Applicant: Google LLC
Inventor: Rohit Prakash Prabhavalkar , Tara N. Sainath , Yonghui Wu , Patrick An Phu Nguyen , Zhifeng Chen , Chung-Cheng Chiu , Anjuli Patricia Kannan
IPC: G10L15/197 , G10L15/16 , G10L15/22 , G10L15/06 , G10L15/02
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses N-best lists of decoded hypotheses, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
-
公开(公告)号:US20200034435A1
公开(公告)日:2020-01-30
申请号:US16336870
申请日:2017-09-25
Applicant: GOOGLE LLC
Inventor: Mohammad Norouzi , Zhifeng Chen , Yonghui Wu , Michael Schuster , Quoc V. Le
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for neural machine translation. One of the systems includes an encoder neural network comprising: an input forward long short-term memory (LSTM) layer configured to process each input token in the input sequence in a forward order to generate a respective forward representation of each input token, an input backward LSTM layer configured to process each input token in a backward order to generate a respective backward representation of each input token and a plurality of hidden LSTM layers configured to process a respective combined representation of each of the input tokens in the forward order to generate a respective encoded representation of each of the input tokens; and a decoder subsystem configured to receive the respective encoded representations and to process the encoded representations to generate an output sequence.
-
公开(公告)号:US20200027444A1
公开(公告)日:2020-01-23
申请号:US16516390
申请日:2019-07-19
Applicant: Google LLC
Inventor: Rohit Prakash Prabhavalkar , Zhifeng Chen , Bo Li , Chung-Cheng Chiu , Kanury Kanishka Rao , Yonghui Wu , Ron J. Weiss , Navdeep Jaitly , Michiel A.U. Bacchiani , Tara N. Sainath , Jan Kazimierz Chorowski , Anjuli Patricia Kannan , Ekaterina Gonina , Patrick An Phu Nguyen
Abstract: Methods, systems, and apparatus, including computer-readable media, for performing speech recognition using sequence-to-sequence models. An automated speech recognition (ASR) system receives audio data for an utterance and provides features indicative of acoustic characteristics of the utterance as input to an encoder. The system processes an output of the encoder using an attender to generate a context vector and generates speech recognition scores using the context vector and a decoder trained using a training process that selects at least one input to the decoder with a predetermined probability. An input to the decoder during training is selected between input data based on a known value for an element in a training example, and input data based on an output of the decoder for the element in the training example. A transcription is generated for the utterance using word elements selected based on the speech recognition scores. The transcription is provided as an output of the ASR system.
-
公开(公告)号:US20190311708A1
公开(公告)日:2019-10-10
申请号:US16447862
申请日:2019-06-20
Applicant: Google LLC
Inventor: Samy Bengio , Yuxuan Wang , Zongheng Yang , Zhifeng Chen , Yonghui Wu , Ioannis Agiomyrgiannakis , Ron J. Weiss , Navdeep Jaitly , Ryan M. Rifkin , Robert Andrew James Clark , Quoc V. Le , Russell J. Ryan , Ying Xiao
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
-
-
-
-
-
-
-
-
-