-
Publication No.: US20240078423A1
Publication Date: 2024-03-07
Application No.: US17893026
Filing Date: 2022-08-22
Applicant: NVIDIA Corporation
Inventor: Xiaojian Ma , Weili Nie , Zhiding Yu , Huaizu Jiang , Chaowei Xiao , Yuke Zhu , Anima Anandkumar
Abstract: A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts and associated features may be created and used to train the global and local tasks, which may then enable the ViT to perform visual relational reasoning faster, without supervision, and outside of a synthetic domain.
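The global task described above pulls images that share a concept toward a semantically consistent representation. A minimal NumPy sketch of one such clustering-style objective follows; the function name, the centroid formulation, and the use of explicit concept labels are illustrative assumptions, not the patent's actual training procedure:

```python
import numpy as np

def global_concept_loss(embeddings, concept_ids):
    """Toy clustering objective: penalize spread of same-concept features.

    embeddings:  (N, D) array of image-level ViT features
    concept_ids: (N,) array of integer concept labels
    """
    loss = 0.0
    for c in np.unique(concept_ids):
        feats = embeddings[concept_ids == c]
        centroid = feats.mean(axis=0)           # concept prototype
        loss += np.sum((feats - centroid) ** 2)  # pull features to prototype
    return loss / len(embeddings)
```

Minimizing a loss of this shape drives same-concept embeddings together; the local, object-centric correspondence task would operate analogously on patch tokens rather than image-level features.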
-
Publication No.: US10986325B2
Publication Date: 2021-04-20
Application No.: US16569104
Filing Date: 2019-09-12
Applicant: NVIDIA Corporation
Inventor: Deqing Sun , Varun Jampani , Erik Gundersen Learned-Miller , Huaizu Jiang
IPC: H04N13/122 , H04N13/128 , G06N3/08 , H04N13/00
Abstract: Scene flow represents the three-dimensional (3D) structure and frame-to-frame movement of objects in a video sequence and is used to track objects and estimate speeds for autonomous driving applications. Scene flow is recovered by a neural network system from a video sequence captured from at least two viewpoints (e.g., cameras), such as the left and right eyes of a viewer. An encoder portion of the system extracts features from frames of the video sequence. The features are input to a first decoder to predict optical flow and a second decoder to predict disparity. The optical flow represents pixel movement in (x,y) and the disparity represents pixel movement in z (depth). When combined, the optical flow and disparity represent the scene flow.
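The abstract's final step combines the two decoder outputs into a single 3D motion field. A minimal sketch of that combination (function name and the use of a disparity difference as the depth component are illustrative assumptions; the patent's system learns these quantities with neural decoders):

```python
import numpy as np

def combine_scene_flow(optical_flow, disparity_t0, disparity_t1):
    """Stack (x, y) pixel motion with a depth-motion channel.

    optical_flow: (H, W, 2) per-pixel motion in x and y
    disparity_t0, disparity_t1: (H, W) disparity at consecutive frames;
        their difference approximates per-pixel motion in z (depth)
    """
    dz = (disparity_t1 - disparity_t0)[..., None]   # (H, W, 1)
    return np.concatenate([optical_flow, dz], axis=-1)  # (H, W, 3) scene flow
```

In the patented system the optical-flow and disparity decoders share encoder features, so the combination above happens on jointly learned predictions rather than independently estimated maps.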
-
Publication No.: US10776688B2
Publication Date: 2020-09-15
Application No.: US16169851
Filing Date: 2018-10-24
Applicant: NVIDIA Corporation
Inventor: Huaizu Jiang , Deqing Sun , Varun Jampani
Abstract: Video interpolation is used to predict one or more intermediate frames at timesteps defined between two consecutive frames. A first neural network model approximates optical flow data defining motion between the two consecutive frames. A second neural network model refines the optical flow data and predicts visibility maps for each timestep. The two consecutive frames are warped according to the refined optical flow data for each timestep to produce pairs of warped frames for each timestep. The second neural network model then fuses the pair of warped frames based on the visibility maps to produce the intermediate frame for each timestep. Artifacts caused by motion boundaries and occlusions are reduced in the predicted intermediate frames.
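The fusion step above blends the two warped frames using the predicted visibility maps and the timestep. A minimal sketch of one plausible time-weighted fusion, assuming visibility values in [0, 1] and that the second frame's visibility is the complement of the first (the function name and the complement assumption are illustrative, not the patent's exact formulation):

```python
import numpy as np

def fuse_warped_frames(warped0, warped1, vis0, t):
    """Blend two warped frames into the intermediate frame at timestep t.

    warped0, warped1: (H, W, 3) frames warped toward timestep t in [0, 1]
    vis0: (H, W, 1) visibility of frame-0 pixels at timestep t
    """
    vis1 = 1.0 - vis0                 # assume complementary visibility
    w0 = (1.0 - t) * vis0             # favor frame 0 near t = 0
    w1 = t * vis1                     # favor frame 1 near t = 1
    return (w0 * warped0 + w1 * warped1) / (w0 + w1 + 1e-8)
```

Weighting by visibility suppresses pixels occluded at the intermediate timestep, which is how the method reduces artifacts at motion boundaries and occlusions.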
-
Publication No.: US20190138889A1
Publication Date: 2019-05-09
Application No.: US16169851
Filing Date: 2018-10-24
Applicant: NVIDIA Corporation
Inventor: Huaizu Jiang , Deqing Sun , Varun Jampani
Abstract: Video interpolation is used to predict one or more intermediate frames at timesteps defined between two consecutive frames. A first neural network model approximates optical flow data defining motion between the two consecutive frames. A second neural network model refines the optical flow data and predicts visibility maps for each timestep. The two consecutive frames are warped according to the refined optical flow data for each timestep to produce pairs of warped frames for each timestep. The second neural network model then fuses the pair of warped frames based on the visibility maps to produce the intermediate frame for each timestep. Artifacts caused by motion boundaries and occlusions are reduced in the predicted intermediate frames.
-
Publication No.: US20240062534A1
Publication Date: 2024-02-22
Application No.: US17893038
Filing Date: 2022-08-22
Applicant: NVIDIA Corporation
Inventor: Xiaojian Ma , Weili Nie , Zhiding Yu , Huaizu Jiang , Chaowei Xiao , Yuke Zhu , Anima Anandkumar
CPC classification number: G06V10/82 , G06V10/255 , G06V10/94
Abstract: A vision transformer (ViT) is a deep learning model that performs one or more vision processing tasks. ViTs may be modified to include a global task that clusters images with the same concept together to produce semantically consistent relational representations, as well as a local task that guides the ViT to discover object-centric semantic correspondence across images. A database of concepts and associated features may be created and used to train the global and local tasks, which may then enable the ViT to perform visual relational reasoning faster, without supervision, and outside of a synthetic domain.
-
Publication No.: US20200084427A1
Publication Date: 2020-03-12
Application No.: US16569104
Filing Date: 2019-09-12
Applicant: NVIDIA Corporation
Inventor: Deqing Sun , Varun Jampani , Erik Gundersen Learned-Miller , Huaizu Jiang
IPC: H04N13/122 , H04N13/128 , G06N3/08
Abstract: Scene flow represents the three-dimensional (3D) structure and frame-to-frame movement of objects in a video sequence and is used to track objects and estimate speeds for autonomous driving applications. Scene flow is recovered by a neural network system from a video sequence captured from at least two viewpoints (e.g., cameras), such as the left and right eyes of a viewer. An encoder portion of the system extracts features from frames of the video sequence. The features are input to a first decoder to predict optical flow and a second decoder to predict disparity. The optical flow represents pixel movement in (x,y) and the disparity represents pixel movement in z (depth). When combined, the optical flow and disparity represent the scene flow.
-