-
Publication No.: US20240428468A1
Publication Date: 2024-12-26
Application No.: US18337634
Filing Date: 2023-06-20
Applicant: Adobe Inc.
Inventor: Aishwarya Agarwal, Srikrishna Karanam, Joseph Koonthanam Jose, Apoorv Umang Saxena, Koustava Goswami, Balaji Vasan Srinivasan
IPC: G06T11/00, G06N3/0455
Abstract: The present disclosure relates to systems, methods, and non-transitory computer-readable media that utilize an attention segregation loss and/or an attention retention loss at inference time of a diffusion neural network to generate a text-conditioned image. In particular, in some embodiments, the disclosed systems utilize the attention segregation loss to reduce overlap between concepts by comparing, at a given denoising step, the attention maps for multiple concepts of a text query. Further, in some embodiments, the disclosed systems utilize the attention retention loss to improve information retention for concepts across denoising steps by comparing attention maps between different denoising steps. Accordingly, in some embodiments, by utilizing the attention segregation loss and the attention retention loss, the disclosed systems accurately maintain multiple concepts from a text query when generating a text-conditioned image.
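Illustrative sketch (not part of the patent record): the abstract names the two losses but gives no formulas, so the following Python is only one plausible reading. It assumes each concept has a 2D attention map of non-negative weights, and it assumes an IoU-style overlap (sum of element-wise minima over sum of element-wise maxima) as the comparison. The names `overlap`, `segregation_loss`, and `retention_loss` are hypothetical, and how such losses would steer the diffusion latent at inference time is not shown.

```python
from itertools import combinations
from typing import List

# Assumption: an attention map is a 2D grid of non-negative weights.
AttentionMap = List[List[float]]


def overlap(a: AttentionMap, b: AttentionMap) -> float:
    """IoU-style overlap (an assumed measure): sum of minima / sum of maxima."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    inter = sum(min(x, y) for x, y in zip(flat_a, flat_b))
    union = sum(max(x, y) for x, y in zip(flat_a, flat_b))
    return inter / union if union > 0 else 0.0


def segregation_loss(maps_at_step: List[AttentionMap]) -> float:
    """Mean pairwise overlap between concept maps at ONE denoising step.

    Lower values mean the concepts attend to more distinct image regions.
    """
    pairs = list(combinations(maps_at_step, 2))
    if not pairs:
        return 0.0
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)


def retention_loss(prev_maps: List[AttentionMap],
                   curr_maps: List[AttentionMap]) -> float:
    """Mean (1 - overlap) of each concept's map across TWO denoising steps.

    Lower values mean each concept keeps attending to the same region.
    """
    pairs = list(zip(prev_maps, curr_maps))
    if not pairs:
        return 0.0
    return sum(1.0 - overlap(p, c) for p, c in pairs) / len(pairs)


if __name__ == "__main__":
    # Two hypothetical concepts attending to opposite corners of a 2x2 grid.
    cat = [[1.0, 0.0], [0.0, 0.0]]
    dog = [[0.0, 0.0], [0.0, 1.0]]
    print(segregation_loss([cat, dog]))            # disjoint maps
    print(retention_loss([cat, dog], [cat, dog]))  # maps unchanged between steps
```

Under these assumptions, both quantities lie in [0, 1], so they could be summed or weighted; the weighting and the optimization step are left open because the abstract does not specify them.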
-
Publication No.: US20250005296A1
Publication Date: 2025-01-02
Application No.: US18342954
Filing Date: 2023-06-28
Applicant: Adobe Inc.
Inventor: Koustava Goswami, Srikrishna Karanam, Joseph Koonthanam Jose, Prateksha Udhayanan, Balaji Vasan Srinivasan
Abstract: The present disclosure relates to systems, methods, and non-transitory computer-readable media that implement a vision language machine learning model to generate text representations of an input digital image from localized context tokens. In particular, in some embodiments, the disclosed systems generate image patch feature representations that represent patches from an input image. Further, in some embodiments, the disclosed systems generate localized context tokens from the image patch feature representations and prompt context tokens. Moreover, in some embodiments, the disclosed systems use a text encoder of the vision language machine learning model to generate a text representation from the localized context tokens.
-