TEXT-CONDITIONED VISUAL ATTENTION FOR MULTIMODAL MACHINE LEARNING MODELS

    Publication No.: US20250022263A1

    Publication Date: 2025-01-16

    Application No.: US18351211

    Filing Date: 2023-07-12

    Applicant: Adobe Inc.

    Abstract: The present disclosure relates to systems, non-transitory computer-readable media, and methods for conditioning images on modification texts to generate multi-modal gradient attention maps. In particular, in some embodiments, the disclosed systems generate, utilizing a vision-language neural network of an image-text comparison machine learning model, a reference text-image feature vector based on a reference image and a modification text. Additionally, in some embodiments, the disclosed systems generate, utilizing the vision-language neural network of the image-text comparison machine learning model, a target text-image feature vector based on a target image and the modification text. Moreover, in some implementations, the disclosed systems generate, from the reference text-image feature vector and the target text-image feature vector, a multi-modal gradient attention map reflecting a visual grounding of the image-text comparison machine learning model relative to the modification text.
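
    The flow in the abstract (encode the reference image with the modification text, encode the target image with the same text, then derive a gradient-based attention map from the two feature vectors) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the disclosure: the toy encoder `fuse`, the projection matrix `W`, the dot-product similarity, and the Grad-CAM-style channel weighting are hypothetical stand-ins for the vision-language neural network and for however the claimed map is actually computed.

```python
import numpy as np

def fuse(patches, text, W):
    """Hypothetical text-conditioned encoder: project patches, pool, gate by text."""
    acts = patches @ W               # (n_patches, k) per-patch activations
    vec = acts.mean(axis=0) * text   # (k,) text-image feature vector
    return acts, vec

def gradient_attention_map(ref_patches, tgt_patches, text, W):
    """Grad-CAM-style map over reference patches (illustrative assumption only)."""
    ref_acts, _ = fuse(ref_patches, text, W)
    _, tgt_vec = fuse(tgt_patches, text, W)
    n = ref_acts.shape[0]
    # With similarity s = ref_vec . tgt_vec, the gradient of s with respect to
    # ref_acts[i, c] is text[c] * tgt_vec[c] / n for every patch i.
    grad = np.tile(text * tgt_vec / n, (n, 1))
    channel_weights = grad.mean(axis=0)                 # one weight per channel
    heat = np.maximum(ref_acts @ channel_weights, 0.0)  # keep positive evidence
    peak = heat.max()
    return heat / peak if peak > 0 else heat            # scale to [0, 1]

# Toy inputs: four reference patches, two target patches, two feature channels.
W = np.eye(2)
text = np.array([1.0, 0.0])  # the "modification text" gates channel 0 only
ref = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 5.0]])
tgt = np.array([[1.0, 1.0], [1.0, 1.0]])
attention = gradient_attention_map(ref, tgt, text, W)
print(attention)  # the last reference patch receives the largest weight
```

    In this toy setting the text embedding zeroes out the second channel, so only reference patches with positive activation in the first channel are highlighted. That mirrors, in a very reduced form, the idea of a map that reflects visual grounding relative to the modification text.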
