Abstract:
Systems and methods are disclosed for road scene understanding of vehicles in traffic by capturing images of traffic with a camera coupled to a vehicle; generating a continuous occlusion model for traffic participants to enhance point track association accuracy without distinguishing between moving and static objects; applying the continuous occlusion model to handle visibility constraints in object tracks; and combining point track association and soft object track modeling to improve 3D localization accuracy.
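As a minimal sketch of the idea (the Gaussian-ellipsoid occupancy, the sampling scheme, and all names below are illustrative assumptions, not the patented formulation), a continuous occlusion model can softly down-weight a point track association by the probability that another traffic participant blocks the camera's view of the tracked point:

```python
import numpy as np

def gaussian_occupancy(point, center, cov):
    """Soft occupancy of a traffic participant at a 3D point (illustrative)."""
    d = point - center
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def visibility_weight(camera, target, participants, n_samples=32):
    """Approximate probability the ray camera->target is NOT blocked.

    `participants` is a list of (center, cov) soft-ellipsoid occupancies.
    The blocking probability is approximated by the maximum soft occupancy
    among samples along the ray.
    """
    ts = np.linspace(0.05, 0.95, n_samples)
    occlusion = 0.0
    for t in ts:
        p = camera + t * (target - camera)
        occ = sum(gaussian_occupancy(p, c, S) for c, S in participants)
        occlusion = max(occlusion, min(occ, 1.0))
    return 1.0 - occlusion  # weight applied to a point track association score
```

Multiplying association scores by this weight gates occluded tracks softly instead of discarding them outright, and the occupancy form is the same whether the occluder is moving or static.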
Abstract:
A computer-implemented method for training a convolutional neural network (CNN) is presented. The method includes extracting coordinates of corresponding points in the first and second locations, identifying positive points in the first and second locations, identifying negative points in the first and second locations, training features that correspond to positive points in the first and second locations to move closer to each other, and training features that correspond to negative points in the first and second locations to move away from each other.
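The pull/push objective described here is a contrastive loss over dense CNN features. A minimal PyTorch sketch (the margin value and all tensor shapes are assumptions for illustration, not the patented formulation):

```python
import torch
import torch.nn.functional as F

def correspondence_loss(feat1, feat2, pos1, pos2, neg1, neg2, margin=1.0):
    """Contrastive loss over dense feature maps (a sketch).

    feat1, feat2: [C, H, W] feature maps of the two inputs.
    pos1/pos2:    [N, 2] integer (row, col) coords of corresponding points.
    neg1/neg2:    [M, 2] coords of non-corresponding points.
    """
    f1p = feat1[:, pos1[:, 0], pos1[:, 1]]          # [C, N]
    f2p = feat2[:, pos2[:, 0], pos2[:, 1]]
    f1n = feat1[:, neg1[:, 0], neg1[:, 1]]          # [C, M]
    f2n = feat2[:, neg2[:, 0], neg2[:, 1]]

    # Positive pairs are pulled together...
    pull = (f1p - f2p).pow(2).sum(0).mean()
    # ...negative pairs are pushed at least `margin` apart.
    push = F.relu(margin - (f1n - f2n).pow(2).sum(0).sqrt()).pow(2).mean()
    return pull + push
```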
Abstract:
A computer-implemented method for training a deep learning network is presented. The method includes receiving a first image and a second image, mining exemplar thin-plate spline (TPS) transformations to generate point correspondences between the first and second images, using the artificial point correspondences to train the deep learning network, learning and applying the TPS transformation output through a spatial transformer, and applying heuristics for selecting an acceptable set of images to match for accurate reconstruction. The deep learning network learns to warp points in the first image to points in the second image.
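For reference, a 2D thin-plate spline is fit from control-point correspondences by solving a small linear system. The following numpy sketch (regularization value and names are our assumptions) fits a TPS and warps query points from the first image toward the second:

```python
import numpy as np

def tps_kernel(r2):
    # U(r) = r^2 log r^2, with U(0) = 0
    out = np.zeros_like(r2)
    nz = r2 > 0
    out[nz] = r2[nz] * np.log(r2[nz])
    return out

def fit_tps(src, dst, reg=1e-6):
    """Fit a 2D thin-plate spline sending control points src -> dst."""
    n = len(src)
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = tps_kernel(d2) + reg * np.eye(n)
    P = np.hstack([np.ones((n, 1)), src])          # affine part
    L = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    Y = np.vstack([dst, np.zeros((3, 2))])
    params = np.linalg.solve(L, Y)
    return params[:n], params[n:]                  # warp weights, affine

def warp_points(pts, src, w, a):
    """Apply the fitted TPS to an [M, 2] array of query points."""
    d2 = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    return tps_kernel(d2) @ w + np.hstack([np.ones((len(pts), 1)), pts]) @ a
```

Exemplar warps mined this way supply the artificial point correspondences; the spatial transformer then learns to reproduce such warps differentiably.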
Abstract:
A method for performing three-dimensional (3D) localization requiring only a single camera, including capturing images from the single camera; generating a cue combination from sparse features, dense stereo and object bounding boxes; correcting for scale in monocular structure from motion (SFM) by using the cue combination to estimate a ground plane; and performing localization by combining SFM, the ground plane and the object bounding boxes to produce a 3D object localization.
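The scale correction rests on a simple observation: monocular SFM is ambiguous up to a global scale, but the camera's true height above the road is known from its mounting, so the ratio of true to estimated ground-plane height fixes the scale. A sketch under that assumption (cue names, values and weights below are illustrative):

```python
import numpy as np

def fuse_ground_height(cues):
    """Combine per-cue ground-plane height estimates into one value.

    `cues` maps a cue name (sparse features, dense stereo, object boxes)
    to an (estimate, confidence) pair; the weighting is illustrative.
    """
    heights = np.array([h for h, _ in cues.values()])
    weights = np.array([w for _, w in cues.values()])
    return float((heights * weights).sum() / weights.sum())

# Known camera mounting height (meters) vs. height recovered by SFM:
true_height = 1.7
est_height = fuse_ground_height({
    "sparse_features": (0.90, 0.5),
    "dense_stereo":    (0.80, 1.0),
    "object_boxes":    (0.85, 0.8),
})
scale = true_height / est_height  # rescales SFM translation and structure
```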
Abstract:
Systems and methods are disclosed for determining three-dimensional (3D) shape by capturing with a camera a plurality of images of an object in differential motion; deriving a general relation that relates spatial and temporal image derivatives to BRDF derivatives; exploiting rank deficiency to eliminate BRDF terms and recover depth or normals under directional lighting; and using a depth-normal-BRDF relation to recover depth or normals under unknown arbitrary lighting.
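The general relation can be sketched as follows (the notation is ours, not the patent's). Writing the intensity at the image of a surface point as its reflected radiance ρ, which depends on the surface point x, normal n and lighting s through the BRDF, the chain rule ties spatial and temporal image derivatives to BRDF derivatives:

```latex
\[
\frac{d}{dt}\, I\big(u(t), v(t), t\big)
  \;=\; I_u\,\dot{u} + I_v\,\dot{v} + I_t
  \;=\; \frac{d}{dt}\,\rho\big(\mathbf{x},\mathbf{n},\mathbf{s}\big).
\]
```

Stacking this relation across several differential motions yields a linear system in the unknown BRDF-derivative terms; its rank deficiency is what permits eliminating those terms and solving for depth or normals alone.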
Abstract:
Systems and methods include detecting one or more objects in an image and generating one or more captions for the image. One or more predicted categories of the one or more objects detected in the image are matched against the one or more captions. From the one or more predicted categories, a category that is not successfully predicted in the image is identified. Data is curated to improve the category that is not successfully predicted in the image. A perception model is finetuned using the curated data.
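A minimal sketch of the matching step (the function name and synonym-lookup interface are hypothetical): categories that the captions mention but the detector misses are the ones to curate data for:

```python
def find_missed_categories(detections, captions, category_synonyms):
    """Identify object categories mentioned in captions but not detected.

    detections:        set of predicted category names for the image.
    captions:          list of caption strings for the same image.
    category_synonyms: maps each category to words that refer to it.
    """
    caption_text = " ".join(captions).lower()
    mentioned = {
        cat for cat, words in category_synonyms.items()
        if any(w in caption_text for w in words)
    }
    return mentioned - detections  # categories the detector failed on

missed = find_missed_categories(
    detections={"car", "person"},
    captions=["A person walks a dog past a parked car."],
    category_synonyms={"car": ["car"], "person": ["person"], "dog": ["dog"]},
)
# -> {"dog"}: curate more dog examples, then finetune the perception model
```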
Abstract:
A computer-implemented method for synthesizing an image includes capturing data from a scene and fusing grid-based representations of the scene from different encodings to inherit beneficial properties of the different encodings. The encodings include a Lidar encoding and a high-definition map encoding. Rays are rendered from the fused grid-based representations. A density and color are determined for points in the rays. Volume rendering is employed for the rays with the density and color. An image is synthesized from the volume-rendered rays with the density and the color.
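The rendering step is standard volume-rendering quadrature; a numpy sketch, assuming the fused grid has already produced per-sample densities and colors along each ray:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite one ray by standard volume-rendering quadrature (a sketch).

    sigmas: [N] densities at samples along the ray.
    colors: [N, 3] colors at the same samples.
    deltas: [N] distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                # opacity per step
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]
    weights = trans * alphas                               # contribution
    return (weights[:, None] * colors).sum(axis=0)         # composited RGB
```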
Abstract:
Systems and methods for automatic multi-modality sensor calibration with near-infrared (NIR) images. Image keypoints from collected images and NIR keypoints from the NIR images can be detected. A deep-learning-based neural network that learns relation graphs between the image keypoints and the NIR keypoints can match the image keypoints and the NIR keypoints. Three-dimensional (3D) points from 3D point cloud data can be filtered based on corresponding 3D points from the NIR keypoints (NIR-to-3D points) to obtain filtered NIR-to-3D points. An extrinsic calibration can be optimized based on a reprojection error computed from the filtered NIR-to-3D points to obtain an optimized extrinsic calibration for an autonomous entity control system. An entity can be controlled by employing the optimized extrinsic calibration for the autonomous entity control system.
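The final optimization minimizes the reprojection error of the filtered NIR-to-3D points over the extrinsic parameters. A generic sketch using scipy (the rotation-vector parameterization and the zero initialization are our assumptions):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, K, pts3d, pts2d):
    """Residuals between projected 3D points and matched image keypoints.

    params: 6-vector [rotation (rotvec), translation] of the extrinsics.
    K:      [3, 3] camera intrinsics; pts3d: [N, 3]; pts2d: [N, 2].
    """
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    cam = pts3d @ R.T + t                 # 3D points in the camera frame
    proj = cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]     # perspective division
    return (proj - pts2d).ravel()

def refine_extrinsics(K, pts3d, pts2d, init=np.zeros(6)):
    res = least_squares(reprojection_residuals, init, args=(K, pts3d, pts2d))
    return res.x  # optimized extrinsic calibration
```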
Abstract:
Systems and methods for generating adversarial driving scenarios for autonomous vehicles. An artificial intelligence model can compute an adversarial loss function by minimizing the distance between predicted adversarially perturbed trajectories and corresponding generated neighbor future trajectories from the input data. A traffic violation loss function can be computed based on whether observed adversarial agents adhere to driving rules from the input data. A comfort loss function can be computed based on the predicted driving characteristics of adversarial vehicles relevant to the comfort of hypothetical passengers from the input data. A planner module can be trained for autonomous vehicles based on a combined loss function of the adversarial loss function, the traffic violation loss function and the comfort loss function to generate adversarial driving scenarios. An autonomous vehicle can be controlled based on trajectories generated in the adversarial driving scenarios.
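A minimal sketch of the combined objective (the weights, tensor shapes, and the jerk-based comfort proxy are illustrative assumptions, not the patented formulation):

```python
import torch

def combined_loss(pred_adv_traj, neighbor_future_traj,
                  violation_score, jerk,
                  w_adv=1.0, w_rule=1.0, w_comfort=0.1):
    """Weighted combination of the three losses (a sketch).

    pred_adv_traj, neighbor_future_traj: [T, 2] trajectories.
    violation_score: scalar penalty for breaking driving rules.
    jerk: [T] longitudinal jerk of the adversarial vehicle.
    """
    l_adv = (pred_adv_traj - neighbor_future_traj).norm(dim=-1).mean()
    l_rule = violation_score
    l_comfort = jerk.abs().mean()
    return w_adv * l_adv + w_rule * l_rule + w_comfort * l_comfort
```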
Abstract:
Methods and systems for detecting faults include capturing an image of a scene using a camera. The image is embedded using a segmentation model that includes an image branch having an image embedding layer that embeds images into a joint latent space and a text branch having a text embedding layer that embeds text into the joint latent space. Semantic information is generated for a region of the image corresponding to a predetermined static object using the embedded image. A fault of the camera is identified based on a discrepancy between the semantic information and semantic information of the predetermined static object. The fault of the camera is corrected.
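A sketch of the discrepancy check (the threshold and interface are assumptions; the embeddings would come from the image and text branches of the trained segmentation model):

```python
import torch
import torch.nn.functional as F

def camera_fault_flag(region_embedding, expected_text_embedding,
                      threshold=0.5):
    """Flag a camera fault when a region's embedding no longer matches the
    text embedding of the static object expected there (a sketch).

    Both embeddings live in the model's joint latent space.
    """
    sim = F.cosine_similarity(region_embedding, expected_text_embedding,
                              dim=-1)
    return sim < threshold  # True -> discrepancy, report and correct fault
```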