-
Publication Number: US20230178068A1
Publication Date: 2023-06-08
Application Number: US18161217
Application Date: 2023-01-30
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L13/047
CPC classification number: G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker is a native speaker of a second language different from the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
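The abstract describes conditioning a TTS model on a speaker embedding so that speech synthesized in the first language carries the target speaker's voice. Below is a minimal numpy sketch of that conditioning step; the function names, array shapes, and the concatenation scheme are assumptions for illustration and are not taken from the patent.

```python
import numpy as np

def synthesize_features(text_ids: np.ndarray,
                        speaker_embedding: np.ndarray,
                        token_table: np.ndarray,
                        output_proj: np.ndarray) -> np.ndarray:
    """Toy stand-in for a TTS model conditioned on a speaker embedding.

    text_ids:          (T,) integer ids for the input text sequence
    speaker_embedding: (S,) voice characteristics of the target speaker
    token_table:       (V, H) embedding table for the text tokens
    output_proj:       (H + S, F) projection to audio features (e.g. mel bins)
    Returns a (T, F) output audio feature representation.
    """
    text_states = token_table[text_ids]                              # (T, H)
    # Attach the speaker embedding to every text state so the decoder side
    # sees the target voice at each step.
    conditioned = np.concatenate(
        [text_states, np.tile(speaker_embedding, (len(text_ids), 1))],
        axis=-1)                                                     # (T, H + S)
    return conditioned @ output_proj                                 # (T, F)
```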
-
Publication Number: US20220246132A1
Publication Date: 2022-08-04
Application Number: US17163007
Application Date: 2021-01-29
Applicant: Google LLC
Inventor: Yu Zhang , Bhuvana Ramabhadran , Andrew Rosenberg , Yonghui Wu , Byungha Chun , Ron Weiss , Yuan Cao
IPC: G10L13/047 , G10L25/18 , G10L13/10 , G10L15/06 , G06N3/08
Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
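The training process above hinges on assigning a quantized embedding (a codebook entry) to each latent feature and then decoding the concatenation of speech embeddings and quantized embeddings. A small numpy sketch of that assignment and concatenation follows; nearest-neighbour lookup is assumed as the quantization rule, and all names are invented for illustration.

```python
import numpy as np

def assign_quantized_embeddings(latents: np.ndarray,
                                codebook: np.ndarray) -> np.ndarray:
    """Pick the nearest codebook entry for each latent feature.

    latents:  (U, D) one latent feature per speech unit, extracted from the
              spectrogram portion aligned with that unit
    codebook: (K, D) table of quantized embeddings
    """
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (U, K)
    return codebook[dists.argmin(axis=1)]                                     # (U, D)

def build_decoder_input(speech_embeddings: np.ndarray,
                        quantized: np.ndarray) -> np.ndarray:
    """Concatenate each speech unit's embedding with its quantized embedding,
    forming the sequence the decoder consumes to generate the speech sample."""
    return np.concatenate([speech_embeddings, quantized], axis=-1)            # (U, E + D)
```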
-
Publication Number: US20220245917A1
Publication Date: 2022-08-04
Application Number: US17559633
Application Date: 2021-12-22
Applicant: Google LLC
Inventor: Byungha Chun , Hideto Kazawa , Jun Suzuki , Yusuke Oda
IPC: G06V10/22 , G06V10/82 , G06V10/774 , G06F40/284
Abstract: Systems and methods of the present disclosure can include a computer-implemented method. The method can include obtaining a machine-learned model comprising one or more layers. At least a first layer of the one or more layers can be configured to receive a set of query vectors respectively associated with layer inputs, determine respective similarity measures between a plurality of key vectors and the query vectors, apply a normalization operation to the respective similarity measures, and determine an output based on the normalized respective similarity measures and a plurality of class labels respectively associated with the plurality of key vectors.
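The layer described above amounts to attention over a memory of labeled key vectors. The sketch below assumes dot-product similarity and softmax normalization; the abstract only says "similarity measures" and "a normalization operation", so both choices are placeholders.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def labeled_key_layer(queries: np.ndarray,
                      keys: np.ndarray,
                      class_labels: np.ndarray) -> np.ndarray:
    """queries:      (Q, D) query vectors derived from the layer inputs
       keys:         (K, D) key vectors
       class_labels: (K, C) class labels associated with the key vectors
       Returns a (Q, C) output per query."""
    sims = queries @ keys.T              # similarity measures, (Q, K)
    weights = softmax(sims, axis=-1)     # normalization over the similarities
    return weights @ class_labels        # mix labels by normalized similarity
```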
-
Publication Number: US12260851B2
Publication Date: 2025-03-25
Application Number: US17305809
Application Date: 2021-07-14
Applicant: Google LLC
Inventor: Lev Finkelstein , Chun-an Chan , Byungha Chun , Norman Casagrande , Yu Zhang , Robert Andrew James Clark , Vincent Wan
IPC: G10L13/00 , G10L13/047 , G10L13/08
Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different from the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text utterance that clones the voice of the target speaker in the second accent/dialect.
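The training half of this method pairs each transcript with a synthesized version of the same speaker's speech in the second accent/dialect. A compact sketch of that data-preparation step, with the accent-conversion function and all type names assumed for illustration:

```python
from typing import Callable, List, Tuple
import numpy as np

def build_accent_training_pairs(
        training_data: List[Tuple[np.ndarray, str]],
        to_second_accent: Callable[[np.ndarray], np.ndarray],
) -> List[Tuple[str, np.ndarray]]:
    """For each (audio in the first accent, transcript) pair, generate a
    training synthesized speech representation in the second accent and pair
    it with the corresponding transcript; the TTS system is then trained on
    these (transcript, synthesized speech) pairs."""
    return [(transcript, to_second_accent(audio))
            for audio, transcript in training_data]
```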
-
Publication Number: US20240339106A1
Publication Date: 2024-10-10
Application Number: US18746809
Application Date: 2024-06-18
Applicant: Google LLC
Inventor: Ye Jia , Byungha Chun , Yu Zhang , Jonathan Shen , Yonghui Wu
IPC: G10L13/08 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
CPC classification number: G10L13/086 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
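Generating the output encoder embedding requires, for each phoneme token, locating a grapheme token of the same underlying word. A minimal numpy sketch of that per-token alignment, assuming each token carries a word index (the abstract does not specify how word membership is tracked):

```python
import numpy as np

def align_phonemes_to_graphemes(phoneme_word_ids: np.ndarray,
                                grapheme_word_ids: np.ndarray,
                                grapheme_tokens: np.ndarray) -> np.ndarray:
    """For every phoneme token, return the embedding of a grapheme token
    representing the same word.

    phoneme_word_ids:  (P,) word index of each phoneme token
    grapheme_word_ids: (G,) word index of each grapheme token
    grapheme_tokens:   (G, D) grapheme token embeddings
    Assumes every word has at least one grapheme token.
    """
    aligned = np.empty((len(phoneme_word_ids), grapheme_tokens.shape[1]))
    for i, word in enumerate(phoneme_word_ids):
        # Index of the first grapheme token belonging to this word.
        g = int(np.argmax(grapheme_word_ids == word))
        aligned[i] = grapheme_tokens[g]
    return aligned
```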
-
Publication Number: US12087273B2
Publication Date: 2024-09-10
Application Number: US18161217
Application Date: 2023-01-30
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L21/00 , G10L13/00 , G10L13/047
CPC classification number: G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker is a native speaker of a second language different from the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
-
Publication Number: US12020685B2
Publication Date: 2024-06-25
Application Number: US17643684
Application Date: 2021-12-10
Applicant: Google LLC
Inventor: Ye Jia , Byungha Chun , Yu Zhang , Jonathan Shen , Yonghui Wu
IPC: G10L13/08 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
CPC classification number: G10L13/086 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
-
Publication Number: US20230325658A1
Publication Date: 2023-10-12
Application Number: US18010426
Application Date: 2021-09-02
Applicant: Google LLC
Inventor: Nanxin Chen , Byungha Chun , William Chan , Ron J. Weiss , Mohammad Norouzi , Yu Zhang , Yonghui Wu
CPC classification number: G06N3/08 , G06V10/26 , G06V10/764 , G06V10/82 , G10L13/02 , G10L25/18 , G10L25/30
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating outputs conditioned on network inputs using neural networks. In one aspect, a method comprises obtaining the network input; initializing a current network output; and generating the final network output by updating the current network output at each of a plurality of iterations, wherein each iteration corresponds to a respective noise level, and wherein the updating comprises, at each iteration: processing a model input for the iteration comprising (i) the current network output and (ii) the network input using a noise estimation neural network that is configured to process the model input to generate a noise output, wherein the noise output comprises a respective noise estimate for each value in the current network output; and updating the current network output using the noise estimate and the noise level for the iteration.
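The generation loop in this abstract is an iterative refinement procedure: start from a noisy output and repeatedly remove an estimated noise component, one noise level per iteration. A hedged numpy sketch of that loop follows; the initialization and the exact update rule are assumptions, since the abstract only states that the output is updated using the noise estimate and the iteration's noise level.

```python
from typing import Callable, Sequence, Tuple
import numpy as np

def iterative_refinement(
        network_input: np.ndarray,
        estimate_noise: Callable[[np.ndarray, np.ndarray, float], np.ndarray],
        noise_levels: Sequence[float],
        output_shape: Tuple[int, ...]) -> np.ndarray:
    """Generate a network output conditioned on `network_input`.

    estimate_noise(current_output, network_input, level) must return a
    per-value noise estimate with the same shape as the current output.
    """
    # Initialize the current network output (standard Gaussian noise assumed).
    current = np.random.randn(*output_shape)
    for level in noise_levels:
        noise_estimate = estimate_noise(current, network_input, level)
        # Update step: remove a noise-level-scaled portion of the estimate
        # (a simplification of the patent's update rule).
        current = current - level * noise_estimate
    return current
```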
-
Publication Number: US20230064749A1
Publication Date: 2023-03-02
Application Number: US18054604
Application Date: 2022-11-11
Applicant: Google LLC
Inventor: Lev Finkelstein , Chun-an Chan , Byungha Chun , Ye Jia , Yu Zhang , Robert Andrew James Clark , Vincent Wan
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
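The second TTS model splits into an encoder that compresses the intermediate synthesized speech representation into an utterance embedding carrying the intended prosody, and a decoder that combines that embedding with the input text. The sketch below uses mean pooling and linear projections purely as stand-ins for those two portions; none of these choices come from the patent.

```python
import numpy as np

def encode_utterance(intermediate_speech: np.ndarray,
                     proj: np.ndarray) -> np.ndarray:
    """Encoder portion (stand-in): pool the intermediate synthesized speech
    representation (frames x features) into a fixed-size utterance embedding
    that specifies the intended prosody."""
    return intermediate_speech.mean(axis=0) @ proj                    # (E,)

def decode_expressive_speech(text_states: np.ndarray,
                             utterance_embedding: np.ndarray,
                             output_proj: np.ndarray) -> np.ndarray:
    """Decoder portion (stand-in): condition every text state on the utterance
    embedding so the output audio carries the intended prosody alongside the
    target voice's speaker characteristics."""
    n_steps = text_states.shape[0]
    conditioned = np.concatenate(
        [text_states, np.tile(utterance_embedding, (n_steps, 1))], axis=-1)
    return conditioned @ output_proj                                  # (T, F_out)
```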
-
Publication Number: US11580952B2
Publication Date: 2023-02-14
Application Number: US16855042
Application Date: 2020-04-22
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L13/00 , G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker is a native speaker of a second language different from the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.