-
Publication No.: US11869483B2
Publication date: 2024-01-09
Application No.: US17496636
Filing date: 2021-10-07
Applicant: Nvidia Corporation
Inventor: Kevin Shih , Jose Rafael Valle Gomes da Costa , Rohan Badlani , Adrian Lancucki , Wei Ping , Bryan Catanzaro
IPC: G10L13/00 , G10L13/08 , G10L13/10 , G10L13/047 , G10L25/90 , G06N3/045 , G06N3/08 , G10L13/033
CPC classification number: G10L13/047 , G06N3/045 , G06N3/08 , G10L13/0335 , G10L13/08 , G10L25/90
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
-
Publication No.: US20230402028A1
Publication date: 2023-12-14
Application No.: US18457221
Filing date: 2023-08-28
Applicant: Nvidia Corporation
Inventor: Kevin Shih , Jose Rafael Valle Gomes da Costa , Rohan Badlani , Adrian Lancucki , Wei Ping , Bryan Catanzaro
IPC: G10L13/047 , G10L13/033 , G10L13/08 , G06N3/08 , G06N3/045 , G10L25/90
CPC classification number: G10L13/047 , G10L13/0335 , G10L13/08 , G06N3/08 , G06N3/045 , G10L25/90
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
-
Publication No.: US20230113950A1
Publication date: 2023-04-13
Application No.: US17496569
Filing date: 2021-10-07
Applicant: Nvidia Corporation
Inventor: Kevin Shih , Jose Rafael Valle Gomes da Costa , Rohan Badlani , Adrian Lancucki , Wei Ping , Bryan Catanzaro
IPC: G10L13/047 , G10L25/90
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
-
Publication No.: US20230110905A1
Publication date: 2023-04-13
Application No.: US17496636
Filing date: 2021-10-07
Applicant: Nvidia Corporation
Inventor: Kevin Shih , Jose Rafael Valle Gomes da Costa , Rohan Badlani , Adrian Lancucki , Wei Ping , Bryan Catanzaro
IPC: G10L13/08 , G10L13/047 , G10L13/033 , G06N3/08 , G06N3/04
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
-
Publication No.: US20240038212A1
Publication date: 2024-02-01
Application No.: US18099840
Filing date: 2023-01-20
Applicant: NVIDIA Corporation
Inventor: Kevin Shih , José Rafael Valle Gomes da Costa , Rohan Badlani , João Felipe Santos , Bryan Catanzaro
IPC: G10L13/027 , G10L13/08 , G10L25/30
CPC classification number: G10L13/027 , G10L13/08 , G10L25/30
Abstract: Disclosed are apparatuses, systems, and techniques that may use machine learning for implementing generative text-to-speech models. The techniques include identifying a mapping of speech characteristics (SC) on a target distribution of a latent variable using a non-linear transformation for at least a subset of the SC. Parameters of the non-linear transformation are determined using a neural network that approximates statistics of the SC with statistics predicted for the SC based on the identified mapping and the target distribution of the latent variable.
-
Publication No.: US20230035306A1
Publication date: 2023-02-02
Application No.: US17382027
Filing date: 2021-07-21
Applicant: Nvidia Corporation
Inventor: Ming-Yu Liu , Koki Nagano , Yeongho Seol , Jose Rafael Valle Gomes da Costa , Jaewoo Seo , Ting-Chun Wang , Arun Mallya , Sameh Khamis , Wei Ping , Rohan Badlani , Kevin Jonathan Shih , Bryan Catanzaro , Simon Yuen , Jan Kautz
Abstract: Apparatuses, systems, and techniques are presented to generate media content. In at least one embodiment, a first neural network is used to generate first video information based, at least in part, upon voice information corresponding to one or more users, and a second neural network is used to generate second video information corresponding to the one or more users based, at least in part, upon the first video information and one or more images corresponding to the one or more users.
-
Publication No.: US20250118286A1
Publication date: 2025-04-10
Application No.: US18483342
Filing date: 2023-10-09
Applicant: NVIDIA Corporation
IPC: G10L13/047 , G10L13/08 , G10L13/10 , G10L17/02 , G10L25/18
Abstract: In various examples, synthesizing speech in multiple languages in conversational AI systems and applications is described herein. Systems and methods are disclosed that use one or more models to synthesize speech from a first language spoken by a speaker to a second, target language selected by the speaker. In some examples, to perform the translation, the model(s) may disentangle one or more attributes associated with speech from speakers, such as speakers' identities, speakers' accents, and text associated with the speech. Additionally, the model(s) may allow for fine-grained control of additional attributes associated with output speech, such as one or more frequencies, one or more energies, and one or more phoneme durations. Furthermore, the model(s) may be configured to use the accent associated with the target language when generating text, such as when aligning text encodings with one or more phonemes.
-
Publication No.: US20230419947A1
Publication date: 2023-12-28
Application No.: US18449969
Filing date: 2023-08-15
Applicant: Nvidia Corporation
Inventor: Kevin Shih , Jose Rafael Valle Gomes da Costa , Rohan Badlani , Adrian Lancucki , Wei Ping , Bryan Catanzaro
IPC: G10L13/047 , G10L25/90 , G06N3/045 , G06N3/08 , G10L13/033 , G10L13/08
CPC classification number: G10L13/047 , G10L25/90 , G10L13/08 , G06N3/08 , G10L13/0335 , G06N3/045
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
-
Publication No.: US11769481B2
Publication date: 2023-09-26
Application No.: US17496569
Filing date: 2021-10-07
Applicant: Nvidia Corporation
Inventor: Kevin Shih , Jose Rafael Valle Gomes da Costa , Rohan Badlani , Adrian Lancucki , Wei Ping , Bryan Catanzaro
IPC: G10L13/00 , G10L13/10 , G10L13/06 , G10L13/07 , G10L13/047 , G10L25/90 , G06N3/045 , G06N3/08 , G10L13/033 , G10L13/08
CPC classification number: G10L13/047 , G06N3/045 , G06N3/08 , G10L13/0335 , G10L13/08 , G10L25/90
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.