-
Publication Number: US20230178068A1
Publication Date: 2023-06-08
Application Number: US18161217
Application Date: 2023-01-30
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L13/047
CPC classification number: G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker is a native speaker of a second language different from the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
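The abstract describes conditioning a TTS model on a speaker embedding so that speech synthesized in the first language carries the target speaker's voice. Below is a minimal numpy sketch of that conditioning step; the function names, array shapes, and the concatenation scheme are assumptions for illustration and are not taken from the patent.

```python
import numpy as np

def synthesize_features(text_ids: np.ndarray,
                        speaker_embedding: np.ndarray,
                        token_table: np.ndarray,
                        output_proj: np.ndarray) -> np.ndarray:
    """Toy stand-in for a TTS model conditioned on a speaker embedding.

    text_ids:          (T,) integer ids for the input text sequence
    speaker_embedding: (S,) voice characteristics of the target speaker
    token_table:       (V, H) embedding table for the text tokens
    output_proj:       (H + S, F) projection to audio features (e.g. mel bins)
    Returns a (T, F) output audio feature representation.
    """
    text_states = token_table[text_ids]                              # (T, H)
    # Attach the speaker embedding to every text state so the decoder side
    # sees the target voice at each step.
    conditioned = np.concatenate(
        [text_states, np.tile(speaker_embedding, (len(text_ids), 1))],
        axis=-1)                                                     # (T, H + S)
    return conditioned @ output_proj                                 # (T, F)
```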
-
Publication Number: US20220246132A1
Publication Date: 2022-08-04
Application Number: US17163007
Application Date: 2021-01-29
Applicant: Google LLC
Inventor: Yu Zhang , Bhuvana Ramabhadran , Andrew Rosenberg , Yonghui Wu , Byungha Chun , Ron Weiss , Yuan Cao
IPC: G10L13/047 , G10L25/18 , G10L13/10 , G10L15/06 , G06N3/08
Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
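The training process above hinges on assigning a quantized embedding (a codebook entry) to each latent feature and then decoding the concatenation of speech embeddings and quantized embeddings. A small numpy sketch of that assignment and concatenation follows; nearest-neighbour lookup is assumed as the quantization rule, and all names are invented for illustration.

```python
import numpy as np

def assign_quantized_embeddings(latents: np.ndarray,
                                codebook: np.ndarray) -> np.ndarray:
    """Pick the nearest codebook entry for each latent feature.

    latents:  (U, D) one latent feature per speech unit, extracted from the
              spectrogram portion aligned with that unit
    codebook: (K, D) table of quantized embeddings
    """
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (U, K)
    return codebook[dists.argmin(axis=1)]                                     # (U, D)

def build_decoder_input(speech_embeddings: np.ndarray,
                        quantized: np.ndarray) -> np.ndarray:
    """Concatenate each speech unit's embedding with its quantized embedding,
    forming the sequence the decoder consumes to generate the speech sample."""
    return np.concatenate([speech_embeddings, quantized], axis=-1)            # (U, E + D)
```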
-
Publication Number: US20220245917A1
Publication Date: 2022-08-04
Application Number: US17559633
Application Date: 2021-12-22
Applicant: Google LLC
Inventor: Byungha Chun , Hideto Kazawa , Jun Suzuki , Yusuke Oda
IPC: G06V10/22 , G06V10/82 , G06V10/774 , G06F40/284
Abstract: Systems and methods of the present disclosure can include a computer-implemented method. The method can include obtaining a machine-learned model comprising one or more layers. At least a first layer of the one or more layers can be configured to receive a set of query vectors respectively associated with layer inputs, determine respective similarity measures between a plurality of key vectors and the query vectors, apply a normalization operation to the respective similarity measures, and determine an output based on the normalized respective similarity measures and a plurality of class labels respectively associated with the plurality of key vectors.
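The layer described above amounts to attention over a memory of labeled key vectors. The sketch below assumes dot-product similarity and softmax normalization; the abstract only says "similarity measures" and "a normalization operation", so both choices are placeholders.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def labeled_key_layer(queries: np.ndarray,
                      keys: np.ndarray,
                      class_labels: np.ndarray) -> np.ndarray:
    """queries:      (Q, D) query vectors derived from the layer inputs
       keys:         (K, D) key vectors
       class_labels: (K, C) class labels associated with the key vectors
       Returns a (Q, C) output per query."""
    sims = queries @ keys.T              # similarity measures, (Q, K)
    weights = softmax(sims, axis=-1)     # normalization over the similarities
    return weights @ class_labels        # mix labels by normalized similarity
```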
-
Publication Number: US12260851B2
Publication Date: 2025-03-25
Application Number: US17305809
Application Date: 2021-07-14
Applicant: Google LLC
Inventor: Lev Finkelstein , Chun-an Chan , Byungha Chun , Norman Casagrande , Yu Zhang , Robert Andrew James Clark , Vincent Wan
IPC: G10L13/00 , G10L13/047 , G10L13/08
Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different from the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text utterance that clones the voice of the target speaker in the second accent/dialect.
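The training half of this method pairs each transcript with a synthesized version of the same speaker's speech in the second accent/dialect. A compact sketch of that data-preparation step, with the accent-conversion function and all type names assumed for illustration:

```python
from typing import Callable, List, Tuple
import numpy as np

def build_accent_training_pairs(
        training_data: List[Tuple[np.ndarray, str]],
        to_second_accent: Callable[[np.ndarray], np.ndarray],
) -> List[Tuple[str, np.ndarray]]:
    """For each (audio in the first accent, transcript) pair, generate a
    training synthesized speech representation in the second accent and pair
    it with the corresponding transcript; the TTS system is then trained on
    these (transcript, synthesized speech) pairs."""
    return [(transcript, to_second_accent(audio))
            for audio, transcript in training_data]
```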
-
Publication Number: US20240339106A1
Publication Date: 2024-10-10
Application Number: US18746809
Application Date: 2024-06-18
Applicant: Google LLC
Inventor: Ye Jia , Byungha Chun , Yu Zhang , Jonathan Shen , Yonghui Wu
IPC: G10L13/08 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
CPC classification number: G10L13/086 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
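Generating the output encoder embedding requires, for each phoneme token, locating a grapheme token of the same underlying word. A minimal numpy sketch of that per-token alignment, assuming each token carries a word index (the abstract does not specify how word membership is tracked):

```python
import numpy as np

def align_phonemes_to_graphemes(phoneme_word_ids: np.ndarray,
                                grapheme_word_ids: np.ndarray,
                                grapheme_tokens: np.ndarray) -> np.ndarray:
    """For every phoneme token, return the embedding of a grapheme token
    representing the same word.

    phoneme_word_ids:  (P,) word index of each phoneme token
    grapheme_word_ids: (G,) word index of each grapheme token
    grapheme_tokens:   (G, D) grapheme token embeddings
    Assumes every word has at least one grapheme token.
    """
    aligned = np.empty((len(phoneme_word_ids), grapheme_tokens.shape[1]))
    for i, word in enumerate(phoneme_word_ids):
        # Index of the first grapheme token belonging to this word.
        g = int(np.argmax(grapheme_word_ids == word))
        aligned[i] = grapheme_tokens[g]
    return aligned
```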
-
Publication Number: US12087273B2
Publication Date: 2024-09-10
Application Number: US18161217
Application Date: 2023-01-30
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L21/00 , G10L13/00 , G10L13/047
CPC classification number: G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker is a native speaker of a second language different from the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
-
Publication Number: US12020685B2
Publication Date: 2024-06-25
Application Number: US17643684
Application Date: 2021-12-10
Applicant: Google LLC
Inventor: Ye Jia , Byungha Chun , Yu Zhang , Jonathan Shen , Yonghui Wu
IPC: G10L13/08 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
CPC classification number: G10L13/086 , G06F40/263 , G06F40/279 , G06N3/08 , G10L13/047
Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
-
Publication Number: US20230325658A1
Publication Date: 2023-10-12
Application Number: US18010426
Application Date: 2021-09-02
Applicant: Google LLC
Inventor: Nanxin Chen , Byungha Chun , William Chan , Ron J. Weiss , Mohammad Norouzi , Yu Zhang , Yonghui Wu
CPC classification number: G06N3/08 , G06V10/26 , G06V10/764 , G06V10/82 , G10L13/02 , G10L25/18 , G10L25/30
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating outputs conditioned on network inputs using neural networks. In one aspect, a method comprises obtaining the network input; initializing a current network output; and generating the final network output by updating the current network output at each of a plurality of iterations, wherein each iteration corresponds to a respective noise level, and wherein the updating comprises, at each iteration: processing a model input for the iteration comprising (i) the current network output and (ii) the network input using a noise estimation neural network that is configured to process the model input to generate a noise output, wherein the noise output comprises a respective noise estimate for each value in the current network output; and updating the current network output using the noise estimate and the noise level for the iteration.
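The generation loop in this abstract is an iterative refinement procedure: start from a noisy output and repeatedly remove an estimated noise component, one noise level per iteration. A hedged numpy sketch of that loop follows; the initialization and the exact update rule are assumptions, since the abstract only states that the output is updated using the noise estimate and the iteration's noise level.

```python
from typing import Callable, Sequence, Tuple
import numpy as np

def iterative_refinement(
        network_input: np.ndarray,
        estimate_noise: Callable[[np.ndarray, np.ndarray, float], np.ndarray],
        noise_levels: Sequence[float],
        output_shape: Tuple[int, ...]) -> np.ndarray:
    """Generate a network output conditioned on `network_input`.

    estimate_noise(current_output, network_input, level) must return a
    per-value noise estimate with the same shape as the current output.
    """
    # Initialize the current network output (standard Gaussian noise assumed).
    current = np.random.randn(*output_shape)
    for level in noise_levels:
        noise_estimate = estimate_noise(current, network_input, level)
        # Update step: remove a noise-level-scaled portion of the estimate
        # (a simplification of the patent's update rule).
        current = current - level * noise_estimate
    return current
```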
-
Publication Number: US20230064749A1
Publication Date: 2023-03-02
Application Number: US18054604
Application Date: 2022-11-11
Applicant: Google LLC
Inventor: Lev Finkelstein , Chun-an Chan , Byungha Chun , Ye Jia , Yu Zhang , Robert Andrew James Clark , Vincent Wan
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
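The second TTS model splits into an encoder that compresses the intermediate synthesized speech representation into an utterance embedding carrying the intended prosody, and a decoder that combines that embedding with the input text. The sketch below uses mean pooling and linear projections purely as stand-ins for those two portions; none of these choices come from the patent.

```python
import numpy as np

def encode_utterance(intermediate_speech: np.ndarray,
                     proj: np.ndarray) -> np.ndarray:
    """Encoder portion (stand-in): pool the intermediate synthesized speech
    representation (frames x features) into a fixed-size utterance embedding
    that specifies the intended prosody."""
    return intermediate_speech.mean(axis=0) @ proj                    # (E,)

def decode_expressive_speech(text_states: np.ndarray,
                             utterance_embedding: np.ndarray,
                             output_proj: np.ndarray) -> np.ndarray:
    """Decoder portion (stand-in): condition every text state on the utterance
    embedding so the output audio carries the intended prosody alongside the
    target voice's speaker characteristics."""
    n_steps = text_states.shape[0]
    conditioned = np.concatenate(
        [text_states, np.tile(utterance_embedding, (n_steps, 1))], axis=-1)
    return conditioned @ output_proj                                  # (T, F_out)
```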
-
Publication Number: US11580952B2
Publication Date: 2023-02-14
Application Number: US16855042
Application Date: 2020-04-22
Applicant: Google LLC
Inventor: Yu Zhang , Ron J. Weiss , Byungha Chun , Yonghui Wu , Zhifeng Chen , Russell John Wyatt Skerry-Ryan , Ye Jia , Andrew M. Rosenberg , Bhuvana Ramabhadran
IPC: G10L13/00 , G10L13/047
Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker is a native speaker of a second language different from the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.