-
Publication No.: US20230035306A1
Publication Date: 2023-02-02
Application No.: US17382027
Filing Date: 2021-07-21
Applicant: Nvidia Corporation
Inventor: Ming-Yu Liu , Koki Nagano , Yeongho Seol , Jose Rafael Valle Gomes da Costa , Jaewoo Seo , Ting-Chun Wang , Arun Mallya , Sameh Khamis , Wei Ping , Rohan Badlani , Kevin Jonathan Shih , Bryan Catanzaro , Simon Yuen , Jan Kautz
Abstract: Apparatuses, systems, and techniques are presented to generate media content. In at least one embodiment, a first neural network is used to generate first video information based, at least in part, upon voice information corresponding to one or more users, and a second neural network is used to generate second video information corresponding to the one or more users based, at least in part, upon the first video information and one or more images corresponding to the one or more users.
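The abstract describes a two-stage pipeline: one network maps voice input to intermediate video information, and a second network combines that intermediate result with reference images of the user. Below is a minimal PyTorch sketch of that two-stage idea; the module names, tensor shapes, and layer sizes are illustrative assumptions, not the patented architecture.

```python
# Minimal sketch of a two-stage audio-to-video pipeline (shapes and names are assumptions).
import torch
import torch.nn as nn

class AudioToMotionNet(nn.Module):
    """First network: maps voice features to intermediate video information
    (modeled here as per-frame motion/keypoint vectors)."""
    def __init__(self, audio_dim=80, hidden_dim=256, motion_dim=68 * 2):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, motion_dim)

    def forward(self, audio_feats):           # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.head(h)                   # (B, T, motion_dim)

class MotionToVideoNet(nn.Module):
    """Second network: combines the intermediate motion with a reference image
    of the user to produce per-frame video features."""
    def __init__(self, motion_dim=68 * 2, img_channels=3, out_dim=512):
        super().__init__()
        self.img_enc = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fuse = nn.Linear(64 + motion_dim, out_dim)

    def forward(self, motion, ref_image):     # motion: (B, T, motion_dim), ref: (B, 3, H, W)
        img_code = self.img_enc(ref_image)                         # (B, 64)
        img_code = img_code.unsqueeze(1).expand(-1, motion.size(1), -1)
        return self.fuse(torch.cat([motion, img_code], dim=-1))    # (B, T, out_dim)

# Example: one second of audio features at 30 fps and a 128x128 reference portrait.
audio = torch.randn(1, 30, 80)
ref = torch.randn(1, 3, 128, 128)
motion = AudioToMotionNet()(audio)
video_feats = MotionToVideoNet()(motion, ref)
print(video_feats.shape)  # torch.Size([1, 30, 512])
```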
-
Publication No.: US20210358188A1
Publication Date: 2021-11-18
Application No.: US17318871
Filing Date: 2021-05-12
Applicant: NVIDIA Corporation
Inventor: Rev Lebaredian , Simon Yuen , Santanu Dutta , Jonathan Michael Cohen , Ratin Kumar
Abstract: In various examples, a virtually animated and interactive agent may be rendered for visual and audible communication with one or more users of an application. For example, a conversational artificial intelligence (AI) assistant may be rendered and displayed for visual communication in addition to audible communication with end-users. As such, the AI assistant may leverage the visual domain—in addition to the audible domain—to more clearly communicate with users, including interacting with a virtual environment in which the AI assistant is rendered. Similarly, the AI assistant may leverage audio, video, and/or text inputs from a user to determine a request, mood, gesture, and/or posture of a user for more accurately responding to and interacting with the user.
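As a rough illustration of the multimodal input side described above, the following sketch fuses audio, video, and text features to predict a user's request (intent) and mood. The encoders, dimensions, and label counts are assumptions for illustration only, not the disclosed system.

```python
# Illustrative fusion of audio, video, and text cues into intent and mood predictions.
import torch
import torch.nn as nn

class MultimodalIntentNet(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, text_dim=300,
                 hidden=256, num_intents=10, num_moods=6):
        super().__init__()
        self.audio_fc = nn.Linear(audio_dim, hidden)
        self.video_fc = nn.Linear(video_dim, hidden)
        self.text_fc = nn.Linear(text_dim, hidden)
        self.intent_head = nn.Linear(3 * hidden, num_intents)   # user request
        self.mood_head = nn.Linear(3 * hidden, num_moods)       # user mood

    def forward(self, audio, video, text):
        fused = torch.cat([self.audio_fc(audio),
                           self.video_fc(video),
                           self.text_fc(text)], dim=-1)
        return self.intent_head(fused), self.mood_head(fused)

net = MultimodalIntentNet()
intent_logits, mood_logits = net(torch.randn(1, 80), torch.randn(1, 512), torch.randn(1, 300))
```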
-
Publication No.: US20250045996A1
Publication Date: 2025-02-06
Application No.: US18921922
Filing Date: 2024-10-21
Applicant: NVIDIA Corporation
Inventor: Rev Lebaredian , Simon Yuen , Santanu Dutta , Jonathan Michael Cohen , Ratin Kumar
Abstract: In various examples, a virtually animated and interactive agent may be rendered for visual and audible communication with one or more users of an application. For example, a conversational artificial intelligence (AI) assistant may be rendered and displayed for visual communication in addition to audible communication with end-users. As such, the AI assistant may leverage the visual domain—in addition to the audible domain—to more clearly communicate with users, including interacting with a virtual environment in which the AI assistant is rendered. Similarly, the AI assistant may leverage audio, video, and/or text inputs from a user to determine a request, mood, gesture, and/or posture of a user for more accurately responding to and interacting with the user.
-
Publication No.: US20240013462A1
Publication Date: 2024-01-11
Application No.: US17859615
Filing Date: 2022-07-07
Applicant: Nvidia Corporation
Inventor: Yeongho Seol , Simon Yuen , Dmitry Aleksandrovich Korobchenko , Mingquan Zhou , Ronan Browne , Wonmin Byeon
CPC classification number: G06T13/205 , G06T13/40 , G06T17/20 , G10L25/63 , G10L15/16
Abstract: A deep neural network can be trained to output motion or deformation information for a character that is representative of the character uttering speech contained in audio input, which is accurate for an emotional state of the character. The character can have different facial components or regions (e.g., head, skin, eyes, tongue) modeled separately, such that the network can output motion or deformation information for each of these different facial components. During training, the network can be provided with emotion and/or style vectors that indicate information to be used in generating realistic animation for input speech, as may relate to one or more emotions to be exhibited by the character, a relative weighting of those emotions, and any style or adjustments to be made to how the character expresses that emotional state. The network output can be provided to a renderer to generate audio-driven facial animation that is emotion-accurate.
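Below is a minimal sketch of the conditioning scheme the abstract describes: an emotion-weight vector and a style vector modulate the audio features, and separate output heads produce motion or deformation information per facial component. All dimensions, head names, and output sizes are assumptions, not the claimed network.

```python
# Emotion/style-conditioned audio-driven face animation with per-component output heads.
import torch
import torch.nn as nn

class EmotionConditionedFaceNet(nn.Module):
    def __init__(self, audio_dim=80, emotion_dim=8, style_dim=4, hidden=256):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.cond_fc = nn.Linear(emotion_dim + style_dim, hidden)
        # One head per facial component, each predicting its own motion/deformation vector.
        self.heads = nn.ModuleDict({
            "head": nn.Linear(2 * hidden, 6),      # rigid head motion
            "skin": nn.Linear(2 * hidden, 272),    # blendshape / vertex offsets
            "eyes": nn.Linear(2 * hidden, 4),      # gaze angles
            "tongue": nn.Linear(2 * hidden, 10),
        })

    def forward(self, audio_feats, emotion_weights, style):
        h, _ = self.audio_rnn(audio_feats)                          # (B, T, hidden)
        cond = self.cond_fc(torch.cat([emotion_weights, style], -1))
        cond = cond.unsqueeze(1).expand(-1, h.size(1), -1)
        x = torch.cat([h, cond], dim=-1)
        return {name: head(x) for name, head in self.heads.items()}

net = EmotionConditionedFaceNet()
out = net(torch.randn(1, 30, 80),
          torch.tensor([[0.0, 0.7, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0]]),  # relative emotion weights
          torch.tensor([[1.0, 0.0, 0.0, 0.0]]))                      # style adjustment
print({k: v.shape for k, v in out.items()})
```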
-
Publication No.: US20250061634A1
Publication Date: 2025-02-20
Application No.: US18457251
Filing Date: 2023-08-28
Applicant: Nvidia Corporation
Inventor: Zhengyu Huang , Rui Zhang , Tao Li , Yingying Zhong , Weihua Zhang , Junjie Lai , Yeongho Seol , Dmitry Korobchenko , Simon Yuen
Abstract: Systems and methods of the present disclosure include animating virtual avatars or agents according to input audio and one or more selected or determined emotions and/or styles. For example, a deep neural network can be trained to output motion or deformation information for a character that is representative of the character uttering speech contained in audio input. The character can have different facial components or regions (e.g., head, skin, eyes, tongue) modeled separately, such that the network can output motion or deformation information for each of these different facial components. During training, the network can use a transformer-based audio encoder with locked parameters to train an associated decoder using a weighted feature vector. The network output can be provided to a renderer to generate audio-driven facial animation that is emotion-accurate.
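The abstract mentions a locked (frozen) transformer-based audio encoder whose features are blended into a weighted feature vector used to train an associated decoder. A hedged PyTorch sketch of that training setup follows; the layer counts, dimensions, and per-layer weighting scheme are assumptions rather than the disclosed method.

```python
# Training a decoder on top of a frozen transformer audio encoder, blending
# the encoder's per-layer outputs with learned weights into one feature vector.
import torch
import torch.nn as nn

class FrozenEncoderDecoder(nn.Module):
    def __init__(self, feat_dim=256, num_layers=4, out_dim=272):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        for p in self.encoder.parameters():   # lock the pretrained encoder
            p.requires_grad = False
        # Learnable weights that blend the per-layer features into one vector.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, out_dim)

    def forward(self, audio_feats):           # (B, T, feat_dim)
        layer_outputs, x = [], audio_feats
        for layer in self.encoder.layers:     # collect each locked layer's output
            x = layer(x)
            layer_outputs.append(x)
        stacked = torch.stack(layer_outputs, dim=0)             # (L, B, T, D)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        weighted = (w * stacked).sum(dim=0)                     # weighted feature vector
        h, _ = self.decoder(weighted)
        return self.head(h)                                     # per-frame deformations

model = FrozenEncoderDecoder()
optim = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
pred = model(torch.randn(2, 30, 256))
```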
-
Publication No.: US12205210B2
Publication Date: 2025-01-21
Application No.: US17318871
Filing Date: 2021-05-12
Applicant: NVIDIA Corporation
Inventor: Rev Lebaredian , Simon Yuen , Santanu Dutta , Jonathan Michael Cohen , Ratin Kumar
Abstract: In various examples, a virtually animated and interactive agent may be rendered for visual and audible communication with one or more users of an application. For example, a conversational artificial intelligence (AI) assistant may be rendered and displayed for visual communication in addition to audible communication with end-users. As such, the AI assistant may leverage the visual domain—in addition to the audible domain—to more clearly communicate with users, including interacting with a virtual environment in which the AI assistant is rendered. Similarly, the AI assistant may leverage audio, video, and/or text inputs from a user to determine a request, mood, gesture, and/or posture of a user for more accurately responding to and interacting with the user.
-
Publication No.: US20240233229A1
Publication Date: 2024-07-11
Application No.: US18007867
Filing Date: 2021-11-08
Applicant: NVIDIA Corporation
Inventor: Evgeny Aleksandrovich Tumanov , Dmitry Aleksandrovich Korobchenko , Simon Yuen , Kevin Margo
CPC classification number: G06T13/205 , G06T13/40
Abstract: In various examples, animations may be generated using audio-driven body animation synthesized with voice tempo. For example, full body animation may be driven from an audio input representative of recorded speech, where voice tempo (e.g., a number of phonemes per unit time) may be used to generate a 1D audio signal for comparing to datasets including data samples that each include an animation and a corresponding 1D audio signal. One or more loss functions may be used to compare the 1D audio signal from the input audio to the audio signals of the datasets, as well as to compare joint information of joints of an actor between animations of two or more data samples, in order to identify optimal transition points between the animations. The animations may then be stitched together—e.g., using interpolation and/or a neural network trained to seamlessly stitch sequences together—using the transition points.
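A rough numpy sketch of the matching step described above: derive a 1D voice-tempo signal (phonemes per unit time) from the input audio, then score each dataset sample's stored tempo signal with a simple L2 loss to pick the closest animation. The frame rate, smoothing window, and loss are illustrative assumptions; the joint-based transition-point comparison and the stitching step are omitted.

```python
# Voice-tempo matching: build a 1D phonemes-per-second signal and find the
# dataset animation whose tempo signal is closest under an L2 loss.
import numpy as np

def voice_tempo_signal(phoneme_times, duration_s, fps=30):
    """1D signal of phonemes per unit time, sampled at the animation frame rate."""
    frames = int(duration_s * fps)
    signal = np.zeros(frames)
    for t in phoneme_times:                  # seconds at which each phoneme starts
        idx = min(int(t * fps), frames - 1)
        signal[idx] += 1.0
    # Sum over a one-second sliding window -> phonemes per second.
    window = np.ones(fps)
    return np.convolve(signal, window, mode="same")

def best_matching_sample(input_signal, dataset_signals):
    """Index of the dataset animation whose tempo signal is closest (L2) to the input."""
    losses = []
    for sig in dataset_signals:
        n = min(len(sig), len(input_signal))
        losses.append(np.mean((sig[:n] - input_signal[:n]) ** 2))
    return int(np.argmin(losses))

input_sig = voice_tempo_signal([0.1, 0.25, 0.4, 0.8, 1.1], duration_s=2.0)
dataset = [voice_tempo_signal(np.random.uniform(0, 2.0, size=8).tolist(), 2.0) for _ in range(5)]
print("best sample:", best_matching_sample(input_sig, dataset))
```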
-
Publication No.: US20230144458A1
Publication Date: 2023-05-11
Application No.: US18051209
Filing Date: 2022-10-31
Applicant: NVIDIA Corporation
Inventor: Alexander Malafeev , Shalini De Mello , Jaewoo Seo , Umar Iqbal , Koki Nagano , Jan Kautz , Simon Yuen
CPC classification number: G06V40/174 , G06V40/171 , G06V40/165 , G06V10/82 , G06T13/40
Abstract: In examples, locations of facial landmarks may be applied to one or more machine learning models (MLMs) to generate output data indicating profiles corresponding to facial expressions, such as facial action coding system (FACS) values. The output data may be used to determine geometry of a model. For example, video frames depicting one or more faces may be analyzed to determine the locations. The facial landmarks may be normalized and then applied to the MLM(s) to infer the profile(s), which may then be used to animate the model for expression retargeting from the video. The MLM(s) may include sub-networks that each analyze a set of input data corresponding to a region of the face to determine profiles that correspond to the region. The profiles from the sub-networks, along with global locations of facial landmarks, may be used by a subsequent network to infer the profiles for the overall face.
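A compact sketch of the region-wise structure described in the abstract: normalized landmarks for each facial region feed a small sub-network, and a final network combines the regional profiles with the global landmark locations to infer full-face FACS-style values. The region definitions, landmark count, and layer sizes are assumptions, not the disclosed model.

```python
# Per-region sub-networks over normalized landmarks, combined by a global network
# that outputs full-face FACS-style expression profiles.
import torch
import torch.nn as nn

class RegionNet(nn.Module):
    def __init__(self, num_landmarks, num_profiles):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_landmarks * 2, 64), nn.ReLU(),
            nn.Linear(64, num_profiles),
        )
    def forward(self, pts):                   # (B, num_landmarks, 2)
        return self.net(pts.flatten(1))

class FaceProfileNet(nn.Module):
    def __init__(self, regions, total_landmarks=68, num_facs=46):
        super().__init__()
        self.regions = regions                # {"mouth": (landmark indices, num profiles), ...}
        self.subnets = nn.ModuleDict({
            name: RegionNet(len(idx), n) for name, (idx, n) in regions.items()
        })
        regional = sum(n for _, n in regions.values())
        self.global_net = nn.Sequential(
            nn.Linear(regional + total_landmarks * 2, 128), nn.ReLU(),
            nn.Linear(128, num_facs),
        )

    def forward(self, landmarks):             # (B, 68, 2), already normalized
        profiles = [self.subnets[name](landmarks[:, idx])
                    for name, (idx, _) in self.regions.items()]
        x = torch.cat(profiles + [landmarks.flatten(1)], dim=-1)
        return self.global_net(x)             # full-face FACS-style values

regions = {"mouth": (list(range(48, 68)), 12), "eyes": (list(range(36, 48)), 8)}
net = FaceProfileNet(regions)
facs = net(torch.randn(2, 68, 2))
print(facs.shape)  # torch.Size([2, 46])
```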
-