-
Publication No.: US20240127791A1
Publication Date: 2024-04-18
Application No.: US18516069
Filing Date: 2023-11-21
Applicant: Google LLC
Inventor: Samuel Bengio , Yuxuan Wang , Zongheng Yang , Zhifeng Chen , Yonghui Wu , Ioannis Agiomyrgiannakis , Ron J. Weiss , Navdeep Jaitly , Ryan M. Rifkin , Robert Andrew James Clark , Quoc V. Le , Russell J. Ryan , Ying Xiao
CPC classification number: G10L13/08 , G06N3/045 , G06N3/08 , G06N3/084 , G10L13/04 , G10L15/16 , G10L25/18 , G10L25/30
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
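The abstract above describes a sequence-to-sequence recurrent network that maps a character sequence to a spectrogram of the corresponding utterance. Below is a minimal sketch of that data flow in Python with numpy, assuming single-layer vanilla-RNN encoder and decoder cells, random untrained weights, and illustrative sizes (VOCAB, EMB, HID, N_MEL); it shows only the shape of the computation, not the patented architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, EMB, HID, N_MEL = 64, 32, 128, 80

# Randomly initialized parameters: illustrative only, not trained weights.
char_emb = rng.normal(0, 0.1, (VOCAB, EMB))
W_enc = rng.normal(0, 0.1, (HID, EMB + HID))
W_dec = rng.normal(0, 0.1, (HID, N_MEL + HID))
W_out = rng.normal(0, 0.1, (N_MEL, HID))

def rnn_step(W, x, h):
    """One vanilla-RNN step: h' = tanh(W @ [x; h])."""
    return np.tanh(W @ np.concatenate([x, h]))

def chars_to_spectrogram(char_ids, n_frames=20):
    # Encoder: consume the character sequence into a summary state.
    h = np.zeros(HID)
    for c in char_ids:
        h = rnn_step(W_enc, char_emb[c], h)
    # Decoder: autoregressively emit one spectrogram frame per step,
    # feeding the previous frame back in.
    frames, prev, d = [], np.zeros(N_MEL), h
    for _ in range(n_frames):
        d = rnn_step(W_dec, prev, d)
        prev = W_out @ d
        frames.append(prev)
    return np.stack(frames)          # (n_frames, N_MEL) spectrogram

spec = chars_to_spectrogram(rng.integers(0, VOCAB, size=12))
print(spec.shape)                    # (20, 80)
```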
-
Publication No.: US20240062743A1
Publication Date: 2024-02-22
Application No.: US18499031
Filing Date: 2023-10-31
Applicant: Google LLC
Inventor: Isaac Elias , Byungha Chun , Jonathan Shen , Ye Jia , Yu Zhang , Yonghui Wu
Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
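The key step in this abstract is upsampling the encoded phoneme sequence according to predicted per-phoneme durations so that the output has one element per spectrogram frame. A small sketch of that step, assuming durations are rounded to whole frames and simply repeated with numpy (the names `upsample_by_duration`, `encoded`, and `durations` are illustrative, not from the patent):

```python
import numpy as np

def upsample_by_duration(encoded, durations):
    """Repeat each phoneme's encoding by its predicted duration (in frames).

    encoded:   (num_phonemes, dim) sequence representation
    durations: (num_phonemes,) predicted per-phoneme frame counts
    returns:   (sum(rounded durations), dim) upsampled output
    """
    frames = np.rint(durations).astype(int).clip(min=1)
    return np.repeat(encoded, frames, axis=0)

enc = np.random.default_rng(1).normal(size=(5, 16))   # 5 phonemes, dim 16
durs = np.array([3.2, 1.0, 4.7, 2.1, 6.0])            # predicted durations
up = upsample_by_duration(enc, durs)
print(up.shape)   # (17, 16): one row per output spectrogram frame
```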
-
Publication No.: US20220310059A1
Publication Date: 2022-09-29
Application No.: US17643684
Filing Date: 2021-12-10
Applicant: Google LLC
Inventor: Ye Jia , Byungha Chun , Yu Zhang , Jonathan Shen , Yonghui Wu
IPC: G10L13/08 , G06F40/279 , G06F40/263 , G06N3/08
Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
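The abstract describes pairing each phoneme token with the grapheme token that represents the same word and producing an output embedding from that relationship. A hypothetical sketch, assuming each token carries a word index and that "based on a relationship" is realised here as simple embedding addition (purely an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 8

# Hypothetical input: each token records which word it came from.
grapheme_tokens = [  # (word_index, embedding)
    (0, rng.normal(size=DIM)), (1, rng.normal(size=DIM))]
phoneme_tokens = [   # several phoneme tokens may map to the same word
    (0, rng.normal(size=DIM)), (0, rng.normal(size=DIM)),
    (1, rng.normal(size=DIM)), (1, rng.normal(size=DIM))]

def align_phonemes_to_graphemes(phonemes, graphemes):
    """For each phoneme token, find the grapheme token representing the
    same word and combine the two embeddings (here: simple addition)."""
    by_word = {w: emb for w, emb in graphemes}
    return np.stack([emb + by_word[w] for w, emb in phonemes])

out = align_phonemes_to_graphemes(phoneme_tokens, grapheme_tokens)
print(out.shape)   # (4, 8): one output embedding per phoneme token
```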
-
Publication No.: US20220207321A1
Publication Date: 2022-06-30
Application No.: US17139525
Filing Date: 2020-12-31
Applicant: Google LLC
Inventor: Anmol Gulati , Ruoming Pang , Niki Parmar , Jiahui Yu , Wei Han , Chung-Cheng Chiu , Yu Zhang , Yonghui Wu , Shibo Wang , Weikeng Qin , Zhengdong Zhang
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
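As a rough illustration of the block structure named in the abstract (feed-forward blocks, a self-attention block for global interactions, and a convolution block for local correlations), here is a simplified, untrained conformer-style block in numpy; layer normalisation, multiple attention heads, dropout, and gating are omitted, and all weights and sizes are placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 10, 16                        # sequence length, model dimension
W_ff1 = rng.normal(0, 0.1, (D, D))
W_ff2 = rng.normal(0, 0.1, (D, D))
W_qkv = rng.normal(0, 0.1, (3 * D, D))
conv_k = rng.normal(0, 0.1, (3, D))  # depthwise kernel, width 3

def feed_forward(x, W):
    return np.maximum(x @ W.T, 0.0)            # linear + ReLU

def self_attention(x):
    q, k, v = np.split(x @ W_qkv.T, 3, axis=-1)
    scores = q @ k.T / np.sqrt(D)              # global pairwise interactions
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return attn @ v

def depthwise_conv(x):
    pad = np.pad(x, ((1, 1), (0, 0)))          # local, offset-based context
    return np.stack([(pad[i:i + 3] * conv_k).sum(0) for i in range(T)])

def conformer_block(x):
    x = x + 0.5 * feed_forward(x, W_ff1)       # first (half-step) feed-forward
    x = x + self_attention(x)                  # self-attention block
    x = x + depthwise_conv(x)                  # convolution block
    x = x + 0.5 * feed_forward(x, W_ff2)       # second feed-forward
    return x

print(conformer_block(rng.normal(size=(T, D))).shape)   # (10, 16)
```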
-
Publication No.: US11335333B2
Publication Date: 2022-05-17
Application No.: US16717746
Filing Date: 2019-12-17
Applicant: Google LLC
Inventor: Wei Han , Chung-Cheng Chiu , Yu Zhang , Yonghui Wu , Patrick Nguyen , Sergey Kishchenko
Abstract: A method includes obtaining audio data for a long-form utterance and segmenting the audio data for the long-form utterance into a plurality of overlapping segments. The method also includes, for each overlapping segment of the plurality of overlapping segments: providing features indicative of acoustic characteristics of the long-form utterance represented by the corresponding overlapping segment as input to an encoder neural network; processing an output of the encoder neural network using an attender neural network to generate a context vector; and generating word elements using the context vector and a decoder neural network. The method also includes generating a transcription for the long-form utterance by merging the word elements from the plurality of overlapping segments and providing the transcription as an output of the automated speech recognition system.
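The abstract's main idea is to split long-form audio into overlapping segments, recognise each segment separately, and merge the resulting word elements into one transcription. A toy sketch of the segmentation and a deliberately naive merge (a real system would merge by aligning or rescoring the overlapping hypotheses, which the abstract does not detail):

```python
def overlapping_segments(samples, seg_len, overlap):
    """Split a long-form utterance into fixed-length segments that
    overlap by `overlap` samples, so no speech is lost at boundaries."""
    step = seg_len - overlap
    return [samples[i:i + seg_len]
            for i in range(0, max(len(samples) - overlap, 1), step)]

def merge_word_elements(per_segment_words, overlap_words=1):
    """Naively merge per-segment hypotheses by dropping the words that
    re-appear from the overlapped region of the previous segment."""
    merged = list(per_segment_words[0])
    for words in per_segment_words[1:]:
        merged.extend(words[overlap_words:])
    return " ".join(merged)

audio = list(range(100))                       # stand-in for audio samples
segs = overlapping_segments(audio, seg_len=40, overlap=10)
print(len(segs), [len(s) for s in segs])       # 3 segments of 40 samples each

hyps = [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
print(merge_word_elements(hyps))               # "the quick brown fox"
```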
-
Publication No.: US20220108680A1
Publication Date: 2022-04-07
Application No.: US17492543
Filing Date: 2021-10-01
Applicant: Google LLC
Inventor: Yu Zhang , Isaac Elias , Byungha Chun , Ye Jia , Yonghui Wu , Mike Chrzanowski , Jonathan Shen
IPC: G10L13/027 , G10L13/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
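One way to realise "upsampling the modified input sequence according to the predicted durations" when the predictions are fractional is Gaussian-weighted upsampling, sketched below; the abstract does not commit to this particular scheme, and `sigma` and the frame-midpoint convention are assumptions made for illustration:

```python
import numpy as np

def gaussian_upsample(reps, durations, sigma=1.0):
    """Upsample per-element representations into per-frame intermediates by
    weighting each element with a Gaussian centred on its predicted span.

    reps:      (num_elements, dim) modified input sequence
    durations: (num_elements,) predicted durations in frames (may be fractional)
    """
    ends = np.cumsum(durations)
    centers = ends - durations / 2.0                # centre frame of each element
    n_frames = int(np.round(ends[-1]))
    t = np.arange(n_frames) + 0.5                   # frame midpoints
    w = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / sigma) ** 2)
    w = w / w.sum(axis=1, keepdims=True)            # normalise over elements
    return w @ reps                                 # (n_frames, dim)

reps = np.random.default_rng(4).normal(size=(4, 8))
frames = gaussian_upsample(reps, np.array([2.4, 1.1, 3.0, 2.5]))
print(frames.shape)   # (9, 8): one intermediate element per output frame
```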
-
Publication No.: US20220083746A1
Publication Date: 2022-03-17
Application No.: US17459041
Filing Date: 2021-08-27
Applicant: Google LLC
Inventor: Zhifeng Chen , Macduff Richard Hughes , Yonghui Wu , Michael Schuster , Xu Chen , Llion Owen Jones , Niki J. Parmar , George Foster , Orhan Firat , Ankur Bapna , Wolfgang Macherey , Melvin Jose Johnson Premkumar
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for machine translation using neural networks. In some implementations, a text in one language is translated into a second language using a neural network model. The model can include an encoder neural network comprising a plurality of bidirectional recurrent neural network layers, which processes the text to produce encoding vectors. The encoding vectors are processed using a multi-headed attention module configured to generate multiple attention context vectors for each encoding vector. A decoder neural network generates a sequence of decoder output vectors using the attention context vectors. The decoder output vectors can represent distributions over various language elements of the second language, allowing a translation of the text into the second language to be determined based on the sequence of decoder output vectors.
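The distinguishing detail in this abstract is a multi-headed attention module that produces multiple attention context vectors for each encoding vector. A minimal numpy sketch of per-head context computation for a single decoder step, with random placeholder weights and arbitrary sizes (none of the names come from the patent):

```python
import numpy as np

rng = np.random.default_rng(5)
SRC_T, D, HEADS = 6, 16, 4
HEAD_D = D // HEADS

enc = rng.normal(size=(SRC_T, D))       # encoder outputs (one per source token)
dec_state = rng.normal(size=D)          # current decoder state
W_q = rng.normal(0, 0.1, (HEADS, HEAD_D, D))
W_k = rng.normal(0, 0.1, (HEADS, HEAD_D, D))
W_v = rng.normal(0, 0.1, (HEADS, HEAD_D, D))

def multi_head_context(dec_state, enc):
    """Compute one attention context vector per head over the encoder
    outputs, then concatenate them for the decoder."""
    contexts = []
    for h in range(HEADS):
        q = W_q[h] @ dec_state                      # (HEAD_D,)
        k = enc @ W_k[h].T                          # (SRC_T, HEAD_D)
        v = enc @ W_v[h].T                          # (SRC_T, HEAD_D)
        scores = k @ q / np.sqrt(HEAD_D)            # (SRC_T,)
        weights = np.exp(scores) / np.exp(scores).sum()
        contexts.append(weights @ v)                # per-head context vector
    return np.concatenate(contexts)                 # (D,)

print(multi_head_context(dec_state, enc).shape)     # (16,)
```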
-
Publication No.: US20210295858A1
Publication Date: 2021-09-23
Application No.: US17222736
Filing Date: 2021-04-05
Applicant: Google LLC
Inventor: Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Michael Schuster , Navdeep Jaitly , Zongheng Yang , Zhifeng Chen , Yu Zhang , Yuxuan Wang , Russell John Wyatt Skerry-Ryan , Ryan M. Rifkin , Ioannis Agiomyrgiannakis
Abstract: Methods, systems, and computer program products for generating, from an input character sequence, an output sequence of audio data representing the input character sequence. The output sequence of audio data includes a respective audio output sample for each of a number of time steps. One example method includes, for each of the time steps: generating a mel-frequency spectrogram for the time step by processing a representation of a respective portion of the input character sequence using a decoder neural network; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting the audio output sample for the time step from the possible audio output samples in accordance with the probability distribution.
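The abstract separates the decoder, which emits a mel-frequency spectrogram frame per time step, from the vocoder, which turns each frame into a probability distribution over possible audio output samples. The stub below illustrates only that second half, assuming the audio samples are quantised into QUANT levels and using a random linear projection as a stand-in for the vocoder neural network:

```python
import numpy as np

rng = np.random.default_rng(6)
N_MEL, QUANT = 80, 256                       # mel bins, quantised audio levels

W_voc = rng.normal(0, 0.1, (QUANT, N_MEL))   # stand-in "vocoder" projection

def sample_audio_from_mel(mel_frame):
    """Turn one mel-spectrogram frame into a distribution over QUANT
    possible (quantised) audio sample values, then draw a sample."""
    logits = W_voc @ mel_frame
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over sample values
    return rng.choice(QUANT, p=probs)            # pick per the distribution

mel_frame = rng.normal(size=N_MEL)               # hypothetical decoder output
print(sample_audio_from_mel(mel_frame))          # index of the chosen sample
```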
-
Publication No.: US11107457B2
Publication Date: 2021-08-31
Application No.: US16696101
Filing Date: 2019-11-26
Applicant: Google LLC
Inventor: Samuel Bengio , Yuxuan Wang , Zongheng Yang , Zhifeng Chen , Yonghui Wu , Ioannis Agiomyrgiannakis , Ron J. Weiss , Navdeep Jaitly , Ryan M. Rifkin , Robert Andrew James Clark , Quoc V. Le , Russell J. Ryan , Ying Xiao
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
-
Publication No.: US20210209315A1
Publication Date: 2021-07-08
Application No.: US17056554
Filing Date: 2020-03-07
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Melvin Johnson , Fadi Biadsy , Ron Weiss , Wolfgang Macherey
Abstract: The present disclosure provides systems and methods that train and use machine-learned models such as, for example, sequence-to-sequence models, to perform direct and text-free speech-to-speech translation. In particular, aspects of the present disclosure provide an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.
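The point of this abstract is that the model maps source-language speech directly to target-language speech, with no intermediate text representation. A toy attention-based sequence-to-sequence sketch of that idea in numpy, with vanilla-RNN cells, dot-product attention, untrained weights, and made-up sizes; the output spectrogram would still need a vocoder to become a waveform:

```python
import numpy as np

rng = np.random.default_rng(7)
N_MEL, HID = 80, 64
W_enc = rng.normal(0, 0.1, (HID, N_MEL + HID))
W_dec = rng.normal(0, 0.1, (HID, N_MEL + HID + HID))
W_out = rng.normal(0, 0.1, (N_MEL, HID))

def translate_speech(src_spec, out_frames=12):
    """Map source-language spectrogram frames directly to target-language
    spectrogram frames with attention; no text is produced in between."""
    # Encoder RNN over the source spectrogram.
    h, enc_states = np.zeros(HID), []
    for frame in src_spec:
        h = np.tanh(W_enc @ np.concatenate([frame, h]))
        enc_states.append(h)
    enc_states = np.stack(enc_states)
    # Attention-based decoder emitting target spectrogram frames.
    d, prev, out = h, np.zeros(N_MEL), []
    for _ in range(out_frames):
        scores = enc_states @ d                            # dot-product attention
        weights = np.exp(scores) / np.exp(scores).sum()
        context = weights @ enc_states                     # attention context
        d = np.tanh(W_dec @ np.concatenate([prev, d, context]))
        prev = W_out @ d
        out.append(prev)
    return np.stack(out)         # target-language spectrogram, for a vocoder

tgt = translate_speech(rng.normal(size=(30, N_MEL)))
print(tgt.shape)                 # (12, 80)
```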