PERFORMING SEMANTIC SEGMENTATION TRAINING WITH IMAGE/TEXT PAIRS

    Publication No.: US20230177810A1

    Publication Date: 2023-06-08

    Application No.: US17853631

    Application Date: 2022-06-29

    CPC classification number: G06V10/774 G06V10/26

    Abstract: Semantic segmentation is the task of providing pixel-wise annotations for a given image. To train a machine learning environment to perform semantic segmentation, image/caption pairs are retrieved from one or more databases. Each of these image/caption pairs includes an image and an associated textual caption. The image portion of each pair is passed to an image encoder of the machine learning environment, which outputs potential pixel groupings (e.g., potential segments of pixels) within each image, while nouns are extracted from the caption portion and converted into text prompts, which are then passed to a text encoder that outputs a corresponding text representation. Contrastive loss operations are then performed on features extracted from these pixel groupings and text representations to determine, for each noun of each caption, the extracted feature that most closely matches the extracted features of the associated image.
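
    A minimal sketch of the noun-to-grouping contrastive step described above, assuming pooled per-grouping features from the image encoder and one embedding per noun prompt from the text encoder; the function name, tensor shapes, and temperature are illustrative assumptions rather than the patented implementation:

```python
import torch.nn.functional as F

def noun_grouping_contrastive_loss(group_feats, text_feats, temperature=0.07):
    # group_feats: (G, D) pooled features, one per candidate pixel grouping (hypothetical interface).
    # text_feats:  (N, D) embeddings, one per noun prompt extracted from the caption.
    group_feats = F.normalize(group_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sim = text_feats @ group_feats.t() / temperature   # (N, G) noun-to-grouping similarities
    # Treat each noun's best-matching grouping as its positive and contrast it against the rest.
    targets = sim.argmax(dim=-1)
    return F.cross_entropy(sim, targets)
```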

    FUTURE OBJECT TRAJECTORY PREDICTIONS FOR AUTONOMOUS MACHINE APPLICATIONS

    Publication No.: US20230088912A1

    Publication Date: 2023-03-23

    Application No.: US17952866

    Application Date: 2022-09-26

    Abstract: In various examples, historical trajectory information of objects in an environment may be tracked by an ego-vehicle and encoded into a state feature. The encoded state features for each of the objects observed by the ego-vehicle may be used—e.g., by a bi-directional long short-term memory (LSTM) network—to encode a spatial feature. The encoded spatial feature and the encoded state feature for an object may be used to predict lateral and/or longitudinal maneuvers for the object, and the combination of this information may be used to determine future locations of the object. The future locations may be used by the ego-vehicle to determine a path through the environment, or may be used by a simulation system to control virtual objects—according to trajectories determined from the future locations—through a simulation environment.
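
    A minimal sketch of the described flow, assuming a per-object LSTM state encoder, a bi-directional LSTM over the set of observed objects as the spatial encoder, and simple lateral/longitudinal maneuver heads; layer sizes and maneuver class counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    def __init__(self, state_dim=4, hidden=64, lateral_classes=3, longitudinal_classes=2):
        super().__init__()
        self.state_encoder = nn.LSTM(state_dim, hidden, batch_first=True)
        # Bi-directional LSTM over the sequence of observed objects encodes a spatial feature.
        self.spatial_encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.lateral_head = nn.Linear(hidden * 3, lateral_classes)
        self.longitudinal_head = nn.Linear(hidden * 3, longitudinal_classes)

    def forward(self, histories):
        # histories: (num_objects, time_steps, state_dim) past states tracked by the ego-vehicle.
        _, (state_feat, _) = self.state_encoder(histories)              # (1, num_objects, hidden)
        state_feat = state_feat.squeeze(0)                               # (num_objects, hidden)
        spatial, _ = self.spatial_encoder(state_feat.unsqueeze(0))       # (1, num_objects, 2*hidden)
        combined = torch.cat([state_feat, spatial.squeeze(0)], dim=-1)   # (num_objects, 3*hidden)
        # Maneuver predictions; future locations would be decoded from these plus the state feature.
        return self.lateral_head(combined), self.longitudinal_head(combined)
```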

    PRUNING A VISION TRANSFORMER
    Invention Application

    Publication No.: US20230080247A1

    Publication Date: 2023-03-16

    Application No.: US17551005

    Application Date: 2021-12-14

    Abstract: A vision transformer is a deep learning model used to perform vision processing tasks such as image recognition. Vision transformers are currently designed with a plurality of same-size blocks that perform the vision processing tasks. However, some portions of these blocks are unnecessary and not only slow down the vision transformer but also use more memory than required. In response, parameters of these blocks are analyzed to determine a score for each parameter, and if the score falls below a threshold, the parameter is removed from the associated block. This reduces the size of the resulting vision transformer, which cuts unnecessary memory usage and increases performance.
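
    A minimal sketch of the pruning step, assuming weight magnitude as the per-parameter score (the abstract does not specify the scoring function); parameters scoring below the threshold are masked out of a block:

```python
import torch

@torch.no_grad()
def prune_block_parameters(block, threshold=1e-2):
    # block: one transformer block (an nn.Module); the magnitude score is an illustrative choice.
    removed = 0
    for param in block.parameters():
        mask = param.abs() >= threshold        # keep parameters whose score clears the threshold
        param.mul_(mask.to(param.dtype))       # zero out (effectively remove) the rest
        removed += (~mask).sum().item()
    return removed
```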

    LEARNING CONTRASTIVE REPRESENTATION FOR SEMANTIC CORRESPONDENCE

    Publication No.: US20230074706A1

    Publication Date: 2023-03-09

    Application No.: US17412091

    Application Date: 2021-08-25

    Abstract: A multi-level contrastive training strategy for training a neural network relies on image pairs (no other labels) to learn semantic correspondences at the image level and region or pixel level. The neural network is trained using contrasting image pairs including different objects and corresponding image pairs including different views of the same object. Conceptually, contrastive training pulls corresponding image pairs closer and pushes contrasting image pairs apart. An image-level contrastive loss is computed from the outputs (predictions) of the neural network and used to update parameters (weights) of the neural network via backpropagation. The neural network is also trained via pixel-level contrastive learning using only image pairs. Pixel-level contrastive learning receives an image pair, where each image includes an object in a particular category.
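
    A minimal sketch of the image-level contrastive loss, assuming a batch in which row i of `anchors` and row i of `positives` come from corresponding views of the same object and all other pairings act as contrasting pairs; the temperature and function name are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def image_level_contrastive_loss(anchors, positives, temperature=0.1):
    anchors = F.normalize(anchors, dim=-1)              # (B, D) image-level features
    positives = F.normalize(positives, dim=-1)          # (B, D) features of corresponding views
    logits = anchors @ positives.t() / temperature      # (B, B) all pairwise similarities
    labels = torch.arange(anchors.size(0), device=anchors.device)
    # Pulls corresponding pairs (the diagonal) closer and pushes contrasting pairs apart.
    return F.cross_entropy(logits, labels)
```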

    View synthesis for dynamic scenes
    Invention Grant

    Publication No.: US11546568B1

    Publication Date: 2023-01-03

    Application No.: US16811356

    Application Date: 2020-03-06

    Abstract: Apparatuses, systems, and techniques are presented to perform monocular view synthesis of a dynamic scene. Single- and multi-view depth information can be determined for a collection of images of a dynamic scene, and a blender network can be used to combine image features for foreground, background, and missing image regions using fused depth maps inferred from the single- and multi-view depth information.
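
    A minimal sketch of one plausible depth-fusion step, assuming a per-pixel confidence map for the multi-view estimate; the fused map is the kind of input a blender network could consume when combining foreground, background, and missing regions. The confidence map and its provenance are illustrative assumptions, not the published method:

```python
import torch

def fuse_depth(single_view_depth, multi_view_depth, multi_view_confidence):
    # All tensors: (H, W). Trust the multi-view estimate where its confidence is high,
    # falling back to the single-view estimate elsewhere (e.g., dynamic or occluded regions).
    w = multi_view_confidence.clamp(0.0, 1.0)
    return w * multi_view_depth + (1.0 - w) * single_view_depth
```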

    Cross-domain image processing for object re-identification

    Publication No.: US11367268B2

    Publication Date: 2022-06-21

    Application No.: US16998890

    Application Date: 2020-08-20

    Abstract: Object re-identification refers to a process by which images that contain an object of interest are retrieved from a set of images captured using disparate cameras or in disparate environments. Object re-identification has many useful applications, particularly as it is applied to people (e.g., person tracking). Current re-identification processes rely on convolutional neural networks (CNNs) that learn re-identification for a particular object class from labeled training data specific to a certain domain (e.g., environment), but do not generalize well to other domains. The present disclosure provides cross-domain disentanglement of identification-related and identification-unrelated factors. In particular, the disentanglement is performed using a labeled image set and an unlabeled image set, respectively captured from different domains but for the same object class. The identification-related features may then be used to train a neural network to perform re-identification of objects in that object class from images captured from the second domain.
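
    A minimal sketch of the disentanglement idea, assuming separate encoders for identification-related and identification-unrelated factors, a toy decoder for reconstruction, and an identity classifier supervised only on labeled (source-domain) batches; all layer shapes, the 64x32 image size, and the number of identities are illustrative assumptions, not the patented architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Disentangler(nn.Module):
    def __init__(self, feat_dim=128, num_ids=751):
        super().__init__()
        def encoder():
            return nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.id_encoder = encoder()      # id-related factors (appearance tied to identity)
        self.style_encoder = encoder()   # id-unrelated factors (pose, background, lighting)
        self.decoder = nn.Linear(feat_dim * 2, 3 * 64 * 32)   # toy decoder back to a 64x32 image
        self.id_classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, images, labels=None):
        f_id, f_style = self.id_encoder(images), self.style_encoder(images)
        recon = self.decoder(torch.cat([f_id, f_style], dim=-1)).view(-1, 3, 64, 32)
        # Reconstruction requires both factor sets; it applies to labeled and unlabeled domains alike.
        loss = F.mse_loss(recon, F.interpolate(images, size=(64, 32)))
        if labels is not None:           # identity supervision only where labels exist (source domain)
            loss = loss + F.cross_entropy(self.id_classifier(f_id), labels)
        return loss
```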

    Articulated body mesh estimation using three-dimensional (3D) body keypoints

    Publication No.: US11361507B1

    Publication Date: 2022-06-14

    Application No.: US17315060

    Application Date: 2021-05-07

    Abstract: Estimating a three-dimensional (3D) pose and shape of an articulated body mesh is useful for many different applications including health and fitness, entertainment, and computer graphics. A set of estimated 3D keypoint positions for a human body structure are processed to compute parameters defining the pose and shape of a parametric human body mesh using a set of geometric operations. During processing, 3D keypoints are extracted from the parametric human body mesh and a set of rotations are computed to align the extracted 3D keypoints with the estimated 3D keypoints. The set of rotations may correctly position a particular 3D keypoint location at a “joint”, but an arbitrary number of rotations of the “joint” keypoint may produce a twist in a connection to a child keypoint. Rules are applied to the set of rotations to resolve ambiguous twists and articulate the parametric human body mesh according to the computed parameters.
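
    A minimal sketch of the swing-alignment step implied above, assuming each bone is represented by the vector from a joint keypoint to its child keypoint; the shortest rotation that aligns the mesh bone with the estimated bone leaves the twist about the bone axis undetermined, which is why the described rules are needed. The function name and axis-angle output format are illustrative assumptions:

```python
import torch

def swing_rotation(mesh_bone, target_bone, eps=1e-8):
    # mesh_bone, target_bone: (3,) vectors from a joint keypoint to its child keypoint.
    a = mesh_bone / (mesh_bone.norm() + eps)
    b = target_bone / (target_bone.norm() + eps)
    axis = torch.linalg.cross(a, b)                      # rotation axis perpendicular to both bones
    angle = torch.atan2(axis.norm(), torch.dot(a, b))    # robust angle between the two bones
    # Axis-angle rotation aligning the bones; twist about the bone axis remains ambiguous.
    return axis / (axis.norm() + eps) * angle
```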

    Three-dimensional object reconstruction from a video

    Publication No.: US11354847B2

    Publication Date: 2022-06-07

    Application No.: US16945455

    Application Date: 2020-07-31

    Abstract: A three-dimensional (3D) object reconstruction neural network system learns to predict a 3D shape representation of an object from a video that includes the object. The 3D reconstruction technique may be used for content creation, such as generation of 3D characters for games, movies, and 3D printing. When 3D characters are generated from video, the content may also include motion of the character, as predicted based on the video. The 3D object reconstruction technique exploits temporal consistency to reconstruct a dynamic 3D representation of the object from an unlabeled video. Specifically, an object in a video has a consistent shape and consistent texture across multiple frames. Texture, base shape, and part correspondence invariance constraints may be applied to fine-tune the neural network system. The reconstruction technique generalizes well, particularly for non-rigid objects.
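
    A minimal sketch of the temporal-consistency constraint, assuming per-frame predictions that expose a base shape and a texture map for the same object in adjacent, unlabeled frames; the dictionary keys and loss weighting are illustrative assumptions:

```python
import torch.nn.functional as F

def temporal_consistency_loss(pred_t, pred_t1):
    # pred_t / pred_t1: per-frame predictions, e.g. {"base_shape": (V, 3), "texture": (C, H, W)}.
    # Encourages the reconstructed object to keep a consistent base shape and texture over time.
    shape_loss = F.mse_loss(pred_t["base_shape"], pred_t1["base_shape"])
    texture_loss = F.l1_loss(pred_t["texture"], pred_t1["texture"])
    return shape_loss + texture_loss
```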
