-
Publication Number: US12056610B2
Publication Date: 2024-08-06
Application Number: US17005763
Application Date: 2020-08-28
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06N3/08 , G06F18/21 , G06F18/214 , G06F18/2431
CPC classification number: G06N3/08 , G06F18/2148 , G06F18/217 , G06F18/2431
Abstract: A learning mechanism is provided that learns from partially-labeled web images while correcting noisy labels during training. Specifically, the mechanism employs a momentum prototype that represents the common characteristics of a specific class. One training objective is to minimize the difference between the normalized embedding of a training image sample and the momentum prototype of the corresponding class. Meanwhile, during training, the momentum prototype is used to generate a pseudo label for the training image sample, which can then be used to identify and remove out-of-distribution (OOD) samples, thereby correcting the noisy labels from the original partially-labeled training images. The momentum prototype for each class is in turn constantly updated based on the embeddings of new training samples and their pseudo labels.
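For illustration only, a minimal PyTorch sketch of the kind of momentum-prototype update, pseudo-labelling, and out-of-distribution filtering the abstract describes; every name, shape, and threshold below is a hypothetical assumption rather than the patented implementation:

import torch
import torch.nn.functional as F

def update_prototypes(prototypes, embeddings, pseudo_labels, momentum=0.99):
    # Exponential-moving-average ("momentum") update of one prototype per class.
    for emb, cls in zip(embeddings, pseudo_labels):
        prototypes[cls] = momentum * prototypes[cls] + (1.0 - momentum) * emb
    return F.normalize(prototypes, dim=1)  # keep prototypes on the unit sphere

def pseudo_label_and_filter(prototypes, embeddings, ood_threshold=0.3):
    # Assign each sample to its most similar prototype; flag samples whose best
    # similarity is too low as likely out-of-distribution (OOD).
    sims = embeddings @ prototypes.t()        # cosine similarities (both normalized)
    best_sim, pseudo_labels = sims.max(dim=1)
    keep = best_sim >= ood_threshold
    return pseudo_labels, keep

# Toy usage with random data standing in for normalized image embeddings.
num_classes, dim = 10, 128
prototypes = F.normalize(torch.randn(num_classes, dim), dim=1)
embeddings = F.normalize(torch.randn(32, dim), dim=1)
pseudo_labels, keep = pseudo_label_and_filter(prototypes, embeddings)
# Training objective: pull each kept embedding toward its class prototype.
loss = (1 - (embeddings[keep] * prototypes[pseudo_labels[keep]]).sum(dim=1)).mean()
prototypes = update_prototypes(prototypes, embeddings[keep], pseudo_labels[keep])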
-
Publication Number: US11989941B2
Publication Date: 2024-05-21
Application Number: US17566173
Application Date: 2021-12-30
Applicant: Salesforce, Inc.
Inventor: Dongxu Li , Junnan Li , Chu Hong Hoi
IPC: G06V10/00 , G06F40/279 , G06F40/284 , G06V10/26 , G06V10/74 , G06V10/774 , G06V10/776 , G06V10/80 , G06V20/40
CPC classification number: G06V20/41 , G06F40/279 , G06F40/284 , G06V10/26 , G06V10/761 , G06V10/774 , G06V10/776 , G06V10/806 , G06V20/46 , G06V20/47
Abstract: Embodiments described herein provide a method of video-text pre-training that effectively learns cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides video and language pre-training that encodes the frames and the text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
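As a rough illustration of the architecture the abstract outlines (independent video and text encoders followed by a cross-modal fusion step), here is a hypothetical PyTorch sketch; the module names, layer counts, and dimensions are assumptions, not the patented design:

import torch
import torch.nn as nn

class VideoTextModel(nn.Module):
    def __init__(self, dim=256, vocab=30522, num_heads=4):
        super().__init__()
        self.frame_proj = nn.Linear(2048, dim)  # per-frame features -> shared dim
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers=2)
        self.text_embed = nn.Embedding(vocab, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers=2)
        # Multi-modal encoder: text tokens cross-attend to video frame embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats, text_ids):
        v = self.video_encoder(self.frame_proj(frame_feats))  # (B, frames, dim)
        t = self.text_encoder(self.text_embed(text_ids))      # (B, tokens, dim)
        fused, _ = self.cross_attn(query=t, key=v, value=v)   # cross-modal interaction
        return fused

# Toy usage: 4 sparsely sampled frames with 2048-d features and a 12-token caption.
model = VideoTextModel()
out = model(torch.randn(2, 4, 2048), torch.randint(0, 30522, (2, 12)))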
-
Publication Number: US20240161520A1
Publication Date: 2024-05-16
Application Number: US18160664
Application Date: 2023-01-27
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06V20/70 , G06F40/10 , G06V10/74 , G06V10/764 , G06V10/774
CPC classification number: G06V20/70 , G06F40/10 , G06V10/74 , G06V10/764 , G06V10/774
Abstract: Embodiments described herein provide a multimodal vision-language model. The multimodal vision-language model contains a Generalist Multimodal Transformer capable of completing multiple tasks using the same set of parameters learned from pre-training. The Generalist Multimodal Transformer allows alignment between frozen, unimodal encoders, such as image encoders and large language models. The Generalist Multimodal Transformer eliminates the need for fine-tuning the image encoders and large language models.
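A hypothetical sketch of the frozen-encoder alignment idea the abstract describes: a small trainable module bridges a frozen image encoder and a frozen language model, so neither large model is fine-tuned. All class names, dimensions, and the stand-in models are assumptions for illustration:

import torch
import torch.nn as nn

class FrozenBridge(nn.Module):
    def __init__(self, image_encoder, language_model, img_dim=1024, lm_dim=768,
                 num_queries=32, num_heads=8):
        super().__init__()
        self.image_encoder = image_encoder
        self.language_model = language_model
        for p in self.image_encoder.parameters():
            p.requires_grad = False  # image encoder stays frozen
        for p in self.language_model.parameters():
            p.requires_grad = False  # language model stays frozen
        self.queries = nn.Parameter(torch.randn(num_queries, lm_dim))
        self.img_proj = nn.Linear(img_dim, lm_dim)
        self.cross_attn = nn.MultiheadAttention(lm_dim, num_heads, batch_first=True)

    def forward(self, images, text_embeds):
        with torch.no_grad():
            img_feats = self.image_encoder(images)       # (B, N, img_dim), frozen
        kv = self.img_proj(img_feats)                    # project into the LM's space
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        visual_tokens, _ = self.cross_attn(q, kv, kv)    # learned queries read the image
        lm_input = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(lm_input)             # frozen LM consumes both

# Toy usage with stand-in "frozen" models (a real system would load pretrained ones).
bridge = FrozenBridge(nn.Linear(512, 1024), nn.Linear(768, 768))
out = bridge(torch.randn(2, 49, 512), torch.randn(2, 12, 768))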
-
Publication Number: US20240160858A1
Publication Date: 2024-05-16
Application Number: US18505982
Application Date: 2023-11-09
Applicant: Salesforce, Inc.
Inventor: Wenliang Dai , Junnan Li , Chu Hong Hoi , Dongxu Li
IPC: G06F40/40 , G06V10/774 , G06V10/82 , G06V20/70
CPC classification number: G06F40/40 , G06V10/774 , G06V10/82 , G06V20/70
Abstract: Embodiments described herein provide a method of generating a vision-language task output in response to a text instruction relating to an input image. The method comprises receiving, via a data interface, the input image and the text instruction relating to the image. The method further includes encoding, via an image encoder, the image into a first image representation. The method further includes generating, by a multimodal encoder, a second image representation based on cross-attending the first image representation to the text instruction. The method further includes generating, by a neural-network-based language model, a vision-language task output in response to the text instruction based on an input combining the second image representation and the text instruction.
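For illustration, a hypothetical forward-pass sketch that mirrors the steps recited in the abstract (encode the image, cross-attend the image representation to the instruction, then feed the combination to a language model); every module below is a stand-in, not the claimed implementation:

import torch
import torch.nn as nn

dim = 512
image_encoder = nn.Linear(768, dim)                 # stand-in image encoder
instr_embed = nn.Embedding(30522, dim)              # stand-in instruction embedding
cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
language_model = nn.Linear(dim, dim)                # stand-in for a language model

image_patches = torch.randn(1, 49, 768)             # patch features of the input image
instruction_ids = torch.randint(0, 30522, (1, 16))  # tokenized text instruction

first_rep = image_encoder(image_patches)            # first image representation
instr = instr_embed(instruction_ids)
# Second image representation: the image features cross-attend to the instruction,
# making the extracted visual features instruction-aware.
second_rep, _ = cross_attn(query=first_rep, key=instr, value=instr)
lm_input = torch.cat([second_rep, instr], dim=1)    # combine with the instruction
task_output = language_model(lm_input)              # vision-language task output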
-
Publication Number: US20240160853A1
Publication Date: 2024-05-16
Application Number: US18160722
Application Date: 2023-01-27
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06F40/40 , G06F40/126 , G06F40/284 , G06F40/35 , G06N20/00 , G06T9/00
CPC classification number: G06F40/40 , G06F40/126 , G06F40/284 , G06F40/35 , G06N20/00 , G06T9/00
Abstract: Embodiments described herein provide a multimodal vision-language model. The multimodal vision-language model contains a Generalist Multimodal Transformer capable of completing multiple tasks using the same set of parameters learned from pre-training. The Generalist Multimodal Transformer allows alignment between frozen, unimodal encoders, such as image encoders and large language models. The Generalist Multimodal Transformer eliminates the need for fine-tuning the image encoders and large language models.
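This abstract repeats the frozen-encoder design sketched above; a complementary, equally hypothetical sketch of the training setup it implies is shown below: with the unimodal models frozen, only the small alignment transformer is handed to the optimizer, which is what removes the need to fine-tune the image encoder or the large language model. All names and sizes are assumptions:

import torch
import torch.nn as nn

frozen_image_encoder = nn.Linear(512, 768).requires_grad_(False)   # stand-in, frozen
frozen_language_model = nn.Linear(768, 768).requires_grad_(False)  # stand-in, frozen
generalist_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, 8, batch_first=True), num_layers=2)

trainable = list(generalist_transformer.parameters())  # the only trainable weights
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters; frozen models excluded")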
-
Publication Number: US20230237773A1
Publication Date: 2023-07-27
Application Number: US17745634
Application Date: 2022-05-16
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06V10/774 , G06V10/764
CPC classification number: G06V10/774 , G06V10/764
Abstract: Embodiments described herein provide bootstrapping language-image pre-training for unified vision-language understanding and generation (BLIP), a unified VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP enables a wider range of downstream tasks, improving on the shortcomings of existing models.
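As a loose illustration of the unified understanding-and-generation aspect, here is a hypothetical PyTorch sketch in which one shared image-text backbone feeds both an image-text matching head (understanding) and a token-prediction head (generation); none of the names or sizes come from the patent:

import torch
import torch.nn as nn

class UnifiedVLP(nn.Module):
    def __init__(self, dim=256, vocab=30522, num_heads=4):
        super().__init__()
        self.img_proj = nn.Linear(768, dim)
        self.txt_embed = nn.Embedding(vocab, dim)
        self.fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.itm_head = nn.Linear(dim, 2)      # understanding: match / no-match
        self.lm_head = nn.Linear(dim, vocab)   # generation: next-token logits

    def forward(self, patch_feats, text_ids):
        img = self.img_proj(patch_feats)                        # (B, N, dim)
        txt = self.txt_embed(text_ids)                          # (B, T, dim)
        fused, _ = self.fusion(query=txt, key=img, value=img)   # text attends to image
        itm_logits = self.itm_head(fused[:, 0])                 # pooled token -> matching score
        gen_logits = self.lm_head(fused)                        # per-token vocabulary logits
        return itm_logits, gen_logits

# Toy usage with random patch features and a 12-token caption.
model = UnifiedVLP()
itm, gen = model(torch.randn(2, 49, 768), torch.randint(0, 30522, (2, 12)))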