-
Publication No.: US20230115551A1
Publication date: 2023-04-13
Application No.: US17499193
Filing date: 2021-10-12
Applicant: ADOBE INC.
Inventor: Hailin Jin, Bryan Russell, Reuben Xin Hong Tan
Abstract: Methods, systems, and computer storage media are provided for multi-modal localization. Input data comprising two modalities, such as image data and corresponding text or audio data, may be received. A phrase may be extracted from the text or audio data, and a neural network system may be utilized to spatially and temporally localize the phrase within the image data. The neural network system may include a plurality of cross-modal attention layers that each compare features across the first and second modalities without comparing features of the same modality. Using the cross-modal attention layers, a region or subset of pixels within one or more frames of the image data may be identified as corresponding to the phrase, and a localization indicator may be presented for display with the image data. Embodiments may also include unsupervised training of the neural network system.
-
Publication No.: US12118787B2
Publication date: 2024-10-15
Application No.: US17499193
Filing date: 2021-10-12
Applicant: ADOBE INC.
Inventor: Hailin Jin, Bryan Russell, Reuben Xin Hong Tan
IPC: G06K9/00, G06F18/214, G06F18/22, G06N3/04, G06V20/40, G10L15/02, G10L15/16, G10L15/19, G10L15/26
CPC classification number: G06V20/41, G06F18/214, G06F18/22, G06N3/04, G06V20/46, G10L15/02, G10L15/16, G10L15/19, G10L15/26
Abstract: Methods, systems, and computer storage media are provided for multi-modal localization. Input data comprising two modalities, such as image data and corresponding text or audio data, may be received. A phrase may be extracted from the text or audio data, and a neural network system may be utilized to spatially and temporally localize the phrase within the image data. The neural network system may include a plurality of cross-modal attention layers that each compare features across the first and second modalities without comparing features of the same modality. Using the cross-modal attention layers, a region or subset of pixels within one or more frames of the image data may be identified as corresponding to the phrase, and a localization indicator may be presented for display with the image data. Embodiments may also include unsupervised training of the neural network system.
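Illustrative note: the abstract's idea of comparing features only across modalities, never within one, can be sketched roughly as below. This is a minimal, untrained, single-layer toy and not the patented system. Scaled dot-product scoring, the mean-over-words pooling, the function names (`cross_modal_attention`, `localize_phrase`), and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys):
    """Attend from one modality (queries) to the other (keys).

    Every score pairs a query vector with a key vector of the OTHER
    modality; no query-query or key-key similarity is ever computed.
    Returns (attention weights, attended features).
    """
    d = len(keys[0])
    scale = math.sqrt(d)  # assumed scaled dot-product scoring
    weights, attended = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        attended.append(
            [sum(w[j] * keys[j][i] for j in range(len(keys))) for i in range(d)]
        )
    return weights, attended


def localize_phrase(word_feats, region_feats):
    """Pick the region with the highest mean attention from the phrase's words."""
    weights, _ = cross_modal_attention(word_feats, region_feats)
    n_regions = len(region_feats)
    mean_w = [sum(w[j] for w in weights) / len(weights) for j in range(n_regions)]
    best = max(range(n_regions), key=lambda j: mean_w[j])
    return best, mean_w


# Hypothetical toy features: three image regions and a two-word phrase.
region_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
word_feats = [[0.0, 2.0], [0.1, 1.5]]

best_region, mean_weights = localize_phrase(word_feats, region_feats)
print(best_region, [round(w, 3) for w in mean_weights])
```

In this toy, both words align most strongly with the second region (index 1), so that region would receive the localization indicator. A temporal variant could apply the same scoring to region features gathered from several frames.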
-