-
Publication No.: US20240370718A1
Publication date: 2024-11-07
Application No.: US18400477
Filing date: 2023-12-29
Applicant: Salesforce, Inc.
Inventor: Artemis Panagopoulou , Le Xue , Ning Yu , Junnan Li , Dongxu Li , Silvio Savarese , Shafiq Rayhan Joty , Ran Xu , Caiming Xiong , Juan Carlos Niebles Duque
IPC: G06N3/08 , G06N3/0455
Abstract: Embodiments described herein provide a method of generating a multi-modal task output to a text instruction relating to inputs of multiple different modalities (e.g., text, audio, video, 3D). The method comprises receiving, via a data interface, a first input of a first modality, a second input of a second modality and the text instruction relating to the first and the second inputs; encoding, by a first multimodal encoder adapted for the first modality, the first input of the first modality into a first encoded representation conditioned on the text instruction; encoding, by a second multimodal encoder adapted for the second modality, the second input of the second modality into a second encoded representation conditioned on the text instruction; and generating, by a neural network based language model, the multi-modal task output based on an input combining the first encoded representation, the second encoded representation, and the text instruction.
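The data flow in this abstract can be illustrated with a rough Python sketch: two modality-specific encoders, each conditioned on the text instruction, feed one combined input to a language model. Every name here (`encode_conditioned`, `build_lm_input`) and the toy arithmetic standing in for the encoders are hypothetical and not taken from the application; a real system would use trained neural encoders.

```python
from typing import List, Sequence


def encode_conditioned(modality: str, payload: Sequence[float],
                       instruction: str, num_tokens: int = 2) -> List[List[float]]:
    """Toy stand-in for a modality-specific multimodal encoder.

    Returns `num_tokens` feature vectors that depend on both the
    input payload and the text instruction.
    """
    # Hypothetical conditioning signal: a bounded number derived from the instruction.
    cond = (sum(ord(c) for c in instruction) % 97) / 97.0
    mean = sum(payload) / len(payload)
    return [[mean + cond * (i + 1), float(len(modality))] for i in range(num_tokens)]


def build_lm_input(encoded: List[List[List[float]]],
                   instruction_tokens: List[str]) -> List[object]:
    """Concatenate every encoded representation, then the instruction tokens."""
    sequence: List[object] = []
    for representation in encoded:
        sequence.extend(representation)
    sequence.extend(instruction_tokens)
    return sequence


instruction = "Describe what is happening"
audio_repr = encode_conditioned("audio", [0.1, 0.3, 0.2], instruction)
video_repr = encode_conditioned("video", [0.5, 0.7], instruction)
lm_input = build_lm_input([audio_repr, video_repr], instruction.split())
print(len(lm_input))
```

The order in which the representations and the instruction are concatenated is an assumption of this sketch; the abstract only states that the language model input combines them.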
-
Publication No.: US20240161369A1
Publication date: 2024-05-16
Application No.: US18498768
Filing date: 2023-10-31
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi , Dongxu Li
CPC classification number: G06T11/60 , G06T9/00 , G06V10/761 , G06V10/82
Abstract: Embodiments described herein provide systems and methods of subject-driven image generation. In at least one embodiment, a system receives, via a data interface, an image containing a subject, a text description of the subject in the image, and a text prompt relating to a different rendition of the subject. The system encodes, via an image encoder, the image into an image feature vector. The system encodes, via a text encoder, the text description into a text feature vector. The system generates, by a multimodal encoder, a vector representation of the subject based on the image feature vector and the text feature vector. The system generates, by a neural network based image generation model, an output image based on an input combining the text prompt and the vector representation.
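A minimal sketch of the last two steps of this abstract, assuming the image and text feature vectors have already been computed: the two are fused into one subject vector, which is then combined with the prompt embeddings. The element-wise mean used for fusion and the helper names `fuse_subject` and `build_generation_input` are hypothetical placeholders, not the multimodal encoder described in the application.

```python
from typing import List

Vector = List[float]


def fuse_subject(image_feature: Vector, text_feature: Vector) -> Vector:
    """Hypothetical stand-in for the multimodal encoder: element-wise mean."""
    if len(image_feature) != len(text_feature):
        raise ValueError("feature vectors must have the same length")
    return [(a + b) / 2.0 for a, b in zip(image_feature, text_feature)]


def build_generation_input(prompt_embeddings: List[Vector],
                           subject_vector: Vector) -> List[Vector]:
    """Append the subject vector to the prompt embeddings as one extra token."""
    return prompt_embeddings + [subject_vector]


image_feature = [0.2, 0.8, 0.4]   # toy output of an image encoder
text_feature = [0.6, 0.0, 0.4]    # toy output of a text encoder
subject = fuse_subject(image_feature, text_feature)

prompt_embeddings = [[0.1, 0.1, 0.1], [0.9, 0.5, 0.3]]  # toy prompt tokens
conditioning = build_generation_input(prompt_embeddings, subject)
```

Here `conditioning` is what an image generation model would receive as its conditioning sequence; appending the subject vector as a single token is one possible way to combine the two and is assumed for illustration only.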
-
Publication No.: US11989941B2
Publication date: 2024-05-21
Application No.: US17566173
Filing date: 2021-12-30
Applicant: Salesforce, Inc.
Inventor: Dongxu Li , Junnan Li , Chu Hong Hoi
IPC: G06V10/00 , G06F40/279 , G06F40/284 , G06V10/26 , G06V10/74 , G06V10/774 , G06V10/776 , G06V10/80 , G06V20/40
CPC classification number: G06V20/41 , G06F40/279 , G06F40/284 , G06V10/26 , G06V10/761 , G06V10/774 , G06V10/776 , G06V10/806 , G06V20/46 , G06V20/47
Abstract: Embodiments describe a method of video-text pre-learning to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align and prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
-
Publication No.: US20240160858A1
Publication date: 2024-05-16
Application No.: US18505982
Filing date: 2023-11-09
Applicant: Salesforce, Inc.
Inventor: Wenliang Dai , Junnan Li , Chu Hong Hoi , Dongxu Li
IPC: G06F40/40 , G06V10/774 , G06V10/82 , G06V20/70
CPC classification number: G06F40/40 , G06V10/774 , G06V10/82 , G06V20/70
Abstract: Embodiments described herein provide a method of generating a vision-language task output to a text instruction relating to an input image, the method comprising receiving, via a data interface, the input image and the text instruction comprising an instruction relating to the image. The method further includes encoding, via an image encoder, the image into a first image representation. The method further includes generating, by a multimodal encoder, a second image representation based on cross-attending the first image representation to the text instruction. The method further includes generating, by a neural network based language model, a vision-language task output in response to the text instruction based on an input combining the second image representation and the text instruction.
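The cross-attention step named in this abstract can be illustrated with plain scaled dot-product attention. The sketch below is written in pure Python with toy two-dimensional vectors. It assumes that the queries come from the instruction side and the keys and values from the first image representation; the abstract does not spell out which side supplies the queries, so this is only one possible reading. Learned projection matrices are omitted.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], keys: List[Vector],
                 values: List[Vector]) -> List[Vector]:
    """Scaled dot-product cross-attention: each query attends over all keys."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


image_features = [[1.0, 0.0], [0.0, 1.0]]   # toy first image representation
instruction_queries = [[1.0, 0.0]]          # toy instruction-derived query
second_representation = cross_attend(instruction_queries, image_features, image_features)
```

Because the single query is aligned with the first image feature, that feature receives the larger attention weight, so the output leans toward it.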
-
Publication No.: US12198432B2
Publication date: 2025-01-14
Application No.: US17566061
Filing date: 2021-12-30
Applicant: Salesforce, Inc.
Inventor: Dongxu Li , Junnan Li , Chu Hong Hoi
IPC: G06V20/40 , G06F40/279 , G06F40/284 , G06V10/26 , G06V10/74 , G06V10/774 , G06V10/776 , G06V10/80
Abstract: Embodiments describe a method of video-text pre-learning to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align and prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.