DIFFUSION-BASED OPEN-VOCABULARY SEGMENTATION

    Publication Number: US20240153093A1

    Publication Date: 2024-05-09

    Application Number: US18310414

    Filing Date: 2023-05-01

    CPC classification number: G06T7/10 G06V10/40 G06T2207/20081 G06T2207/20084

    Abstract: An open-vocabulary diffusion-based panoptic segmentation system is not limited to performing segmentation using only object categories seen during training; it can also successfully segment object categories seen only during testing and inference. In contrast with conventional techniques, a text-conditioned diffusion (generative) model is used to perform the segmentation. The text-conditioned diffusion model is pre-trained to generate images from text captions, and in doing so computes internal representations that provide spatially well-differentiated object features. The internal representations computed within the diffusion model comprise object masks and a semantic visual representation of each object. The semantic visual representation may be extracted from the diffusion model and used in conjunction with a text representation of a category label to classify the object. Objects are classified by associating the text representations of category labels with the object masks and their semantic visual representations to produce panoptic segmentation data.
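
    As a concrete illustration of the classification step described in the abstract, the following is a minimal, hypothetical sketch: the diffusion model's internal feature map is average-pooled over each object mask to obtain a semantic visual representation, which is then matched against the text embeddings of category labels by cosine similarity. All function names, shapes, and tensors here are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch of mask-pooled diffusion features matched to label text
# embeddings; random stand-in tensors replace the actual diffusion and text
# encoders described in the abstract.
import torch
import torch.nn.functional as F

def mask_pool(feature_map, masks):
    """Average the per-pixel diffusion features inside each object mask.

    feature_map: (d, H, W) internal representation from the diffusion model
    masks:       (n, H, W) binary object masks
    returns:     (n, d) one semantic visual representation per object
    """
    d, h, w = feature_map.shape
    flat = feature_map.reshape(d, h * w)              # (d, HW)
    m = masks.reshape(masks.shape[0], h * w).float()  # (n, HW)
    return (m @ flat.T) / m.sum(dim=1, keepdim=True).clamp(min=1)

def classify(mask_features, label_embeddings):
    """Assign each mask the category whose text embedding is most similar."""
    v = F.normalize(mask_features, dim=-1)
    t = F.normalize(label_embeddings, dim=-1)
    return (v @ t.T).argmax(dim=-1)                   # label index per mask

# Toy usage with random stand-ins for the real model outputs.
features = torch.randn(256, 64, 64)   # stand-in for diffusion UNet features
masks = torch.rand(3, 64, 64) > 0.5   # stand-in for predicted object masks
labels = torch.randn(10, 256)         # stand-in for category text embeddings
print(classify(mask_pool(features, masks), labels))
```

    Because the category labels enter only as text embeddings, new categories can be handled at test time simply by encoding their names, which is what makes the vocabulary open.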

    PERFORMING SEMANTIC SEGMENTATION TRAINING WITH IMAGE/TEXT PAIRS

    Publication Number: US20230177810A1

    Publication Date: 2023-06-08

    Application Number: US17853631

    Filing Date: 2022-06-29

    CPC classification number: G06V10/774 G06V10/26

    Abstract: Semantic segmentation is the task of providing pixel-wise annotations for a given image. To train a machine learning environment to perform semantic segmentation, image/caption pairs are retrieved from one or more databases; each pair includes an image and an associated textual caption. The image portion of each pair is passed to an image encoder of the machine learning environment, which outputs potential pixel groupings (e.g., potential segments of pixels) within the image. Nouns are extracted from the caption portion and converted to text prompts, which are then passed to a text encoder that outputs a corresponding text representation. Contrastive loss operations are then performed on the features extracted from these pixel groupings and text representations to determine, for each noun of each caption, the extracted feature that most closely matches the extracted features for the associated image.
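
    The matching and contrastive steps described above can be sketched as follows. This is a simplified, hypothetical illustration: each noun's text feature is matched to the best-scoring pixel grouping in its paired image, and the matched visual features are contrasted against all nouns in the batch with a standard InfoNCE-style loss. The function name, shapes, and temperature are assumptions, not the patent's specification.

```python
# Hypothetical sketch: match each caption noun to its closest pixel grouping,
# then apply a contrastive (InfoNCE-style) loss across the batch.
import torch
import torch.nn.functional as F

def noun_group_contrastive_loss(group_feats, noun_feats, temperature=0.07):
    """group_feats: list of (g_i, d) pixel-grouping features, one per image
    noun_feats:  list of (n_i, d) text features of that caption's nouns
    """
    matched, targets, offset = [], [], 0
    all_nouns = F.normalize(torch.cat(noun_feats), dim=-1)    # (N, d)
    for g, t in zip(group_feats, noun_feats):
        g = F.normalize(g, dim=-1)
        t = F.normalize(t, dim=-1)
        idx = (t @ g.T).argmax(dim=-1)    # best-matching group per noun
        matched.append(g[idx])            # matched visual features, (n_i, d)
        targets.append(torch.arange(offset, offset + t.shape[0]))
        offset += t.shape[0]
    matched = torch.cat(matched)                              # (N, d)
    logits = matched @ all_nouns.T / temperature              # (N, N)
    # each matched visual feature should score highest against its own noun
    return F.cross_entropy(logits, torch.cat(targets))

# Toy usage: a batch of two image/caption pairs with random stand-in features.
groups = [torch.randn(8, 128), torch.randn(5, 128)]  # pixel-grouping features
nouns = [torch.randn(3, 128), torch.randn(2, 128)]   # noun text features
print(noun_group_contrastive_loss(groups, nouns))
```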
