-
1.
Publication No.: US20240289606A1
Publication Date: 2024-08-29
Application No.: US18174547
Filing Date: 2023-02-24
Applicant: Salesforce, Inc.
Inventor: Yue Wang , Hung Le , Akhilesh Deepak Gotmare , Junnan Li , Chu Hong Hoi
IPC: G06N3/08
CPC classification number: G06N3/08
Abstract: Embodiments described herein provide a mixture-of-encoder-decoder Transformer framework for multi-task pretraining and flexible finetuning for both code understanding and generation tasks. Specifically, the framework is built on multimodal encoder and decoder modules. During pre-training, the encoder-decoder framework is trained with multiple learning objectives, including a diverse set of self-supervised tasks, over two major stages of pretraining on unimodal and bimodal data.
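Below is a minimal PyTorch sketch of what one multi-task pretraining step over such an encoder-decoder Transformer could look like. The toy vocabulary, model sizes, and the two illustrative objectives (span denoising and causal LM) are assumptions for illustration, not the patented recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1000, 64

class EncDec(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.transformer = nn.Transformer(
            d_model=DIM, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, src_ids, tgt_ids):
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                             tgt_mask=causal)
        return self.lm_head(h)               # (batch, tgt_len, vocab)

model = EncDec()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One illustrative step per task; the real recipe stages such objectives
# over unimodal (code-only) and bimodal (code + text) data.
for task in ("span_denoising", "causal_lm"):
    src = torch.randint(0, VOCAB, (8, 32))   # toy corrupted/context tokens
    tgt = torch.randint(0, VOCAB, (8, 16))   # toy target tokens
    logits = model(src, tgt)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), tgt.reshape(-1))
    loss.backward()
opt.step(); opt.zero_grad()
```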
-
2.
Publication No.: US20240161369A1
Publication Date: 2024-05-16
Application No.: US18498768
Filing Date: 2023-10-31
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi , Dongxu Li
CPC classification number: G06T11/60 , G06T9/00 , G06V10/761 , G06V10/82
Abstract: Embodiments described herein provide systems and methods of subject-driven image generation. In at least one embodiment, a system receives, via a data interface, an image containing a subject, a text description of the subject in the image, and a text prompt relating to a different rendition of the subject. The system encodes, via an image encoder, the image into an image feature vector. The system encodes, via a text encoder, the text description into a text feature vector. The system generates, by a multimodal encoder, a vector representation of the subject based on the image feature vector and the text feature vector. The system generates, by a neural network based image generation model, an output image based on an input combining the text prompt and the vector representation.
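A hedged sketch of the described pipeline follows. The tiny encoders, the fusion by concatenation, and the single linear layer standing in for the image generation model are all illustrative assumptions.

```python
import torch
import torch.nn as nn

DIM, VOCAB = 64, 1000

class MeanPoolTextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
    def forward(self, ids):                   # (batch, seq) -> (batch, DIM)
        return self.embed(ids).mean(dim=1)

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM))
text_encoder = MeanPoolTextEncoder()
multimodal_encoder = nn.Sequential(nn.Linear(2 * DIM, DIM), nn.GELU(),
                                   nn.Linear(DIM, DIM))
generator = nn.Linear(2 * DIM, 3 * 32 * 32)   # stand-in for the image
                                              # generation model

subject_img = torch.rand(1, 3, 32, 32)        # image containing the subject
subject_txt = torch.randint(0, VOCAB, (1, 8)) # description of the subject
prompt = torch.randint(0, VOCAB, (1, 12))     # new-rendition text prompt

img_feat = image_encoder(subject_img)
txt_feat = text_encoder(subject_txt)
subject_rep = multimodal_encoder(torch.cat([img_feat, txt_feat], dim=-1))
cond = torch.cat([text_encoder(prompt), subject_rep], dim=-1)
out_image = generator(cond).view(1, 3, 32, 32)
```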
-
3.
Publication No.: US20240054350A1
Publication Date: 2024-02-15
Application No.: US18064122
Filing Date: 2022-12-09
Applicant: Salesforce, Inc.
Inventor: Yutong Dai , Zeyuan Chen , Junnan Li
IPC: G06N3/098
CPC classification number: G06N3/098
Abstract: Embodiments described herein provide systems and methods for federated learning. A central system may store a neural network model which has a body of a number of layers, and a classification layer of class prototypes that classifies the latent representations output by the body of the model. The central system may initialize the class prototypes so that they are uniformly distributed in the representation space. The model and class prototypes may be broadcast to a number of client systems, which update the body of the model locally while keeping the class prototypes fixed. The clients may return information to the central system, including updated local model parameters and a local representation of the classes based on the latent representations of items in the local training data. Based on the information from the clients, the neural network model may be updated, and this process may be repeated iteratively.
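The round below sketches this scheme end to end. The orthogonal prototype initialization, toy client data, and plain FedAvg-style aggregation are assumptions chosen to keep the example self-contained.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, CLASSES = 16, 4

# Fixed class prototypes spread out in representation space; orthogonal
# initialization is one simple stand-in for the uniform placement.
prototypes = nn.init.orthogonal_(torch.empty(CLASSES, DIM))

def make_body():
    return nn.Sequential(nn.Linear(8, DIM), nn.ReLU(), nn.Linear(DIM, DIM))

global_body = make_body()

def client_update(body, x, y, steps=5):
    body = copy.deepcopy(body)                 # local copy of the body
    opt = torch.optim.SGD(body.parameters(), lr=0.1)
    for _ in range(steps):
        z = F.normalize(body(x), dim=-1)
        logits = z @ prototypes.T              # prototypes stay fixed
        loss = F.cross_entropy(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
    return body.state_dict()

# One federated round over toy clients, then FedAvg-style aggregation.
clients = [(torch.randn(32, 8), torch.randint(0, CLASSES, (32,)))
           for _ in range(3)]
states = [client_update(global_body, x, y) for x, y in clients]
avg = {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}
global_body.load_state_dict(avg)
```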
-
4.
Publication No.: US20230419652A1
Publication Date: 2023-12-28
Application No.: US17934671
Filing Date: 2022-09-23
Applicant: Salesforce, Inc.
Inventor: Anthony Meng Huat Tiong , Junnan Li , Chu Hong Hoi
IPC: G06V10/86 , G06N3/04 , G06V10/82 , G06V10/774 , G06V10/26
CPC classification number: G06V10/86 , G06N3/0454 , G06V10/82 , G06V10/774 , G06V10/26
Abstract: Embodiments described herein provide a zero-shot visual question answering (VQA) framework, which conjoins foundation network models with zero additional training. A first image and a question relating to the first image are received. The first image is divided into a plurality of image patches. A plurality of relevant image patches that are relevant to the question are determined, using a first neural network model, from the plurality of image patches. A plurality of image captions are generated, using a second neural network model, based on the plurality of relevant image patches. An answer to the question is generated based on the plurality of image captions.
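A sketch of the patch-select-caption-answer pipeline might look like the following, with random scores and string stubs standing in for the neural models; only the patch splitting is concrete.

```python
import torch

def split_into_patches(image, patch=16):
    # (C, H, W) -> (num_patches, C, patch, patch)
    c, h, w = image.shape
    windows = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return windows.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

def relevance_scores(patches, question):
    # Stand-in for the first neural network (e.g., an image-text
    # matching model); random scores keep the sketch self-contained.
    return torch.rand(patches.shape[0])

def caption(patch_batch):
    # Stand-in for the second neural network (a captioning model).
    return [f"caption for patch {i}" for i in range(patch_batch.shape[0])]

def answer_from_captions(captions, question):
    # Stand-in for the answer-generation model.
    return "answer derived from: " + "; ".join(captions[:2])

image = torch.rand(3, 64, 64)
question = "What color is the cat?"
patches = split_into_patches(image)
scores = relevance_scores(patches, question)
top = patches[scores.topk(k=4).indices]   # keep the most relevant patches
print(answer_from_captions(caption(top), question))
```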
-
5.
Publication No.: US20240370718A1
Publication Date: 2024-11-07
Application No.: US18400477
Filing Date: 2023-12-29
Applicant: Salesforce, Inc.
Inventor: Artemis Panagopoulou , Le Xue , Ning Yu , Junnan Li , Dongxu Li , Silvio Savarese , Shafiq Rayhan Joty , Ran Xu , Caiming Xiong , Juan Carlos Niebles Duque
IPC: G06N3/08 , G06N3/0455
Abstract: Embodiments described herein provide a method of generating a multi-modal task output in response to a text instruction relating to inputs of multiple different modalities (e.g., text, audio, video, 3D). The method comprises receiving, via a data interface, a first input of a first modality, a second input of a second modality, and the text instruction relating to the first and second inputs; encoding, by a first multimodal encoder adapted for the first modality, the first input into a first encoded representation conditioned on the text instruction; encoding, by a second multimodal encoder adapted for the second modality, the second input into a second encoded representation conditioned on the text instruction; and generating, by a neural network based language model, the multi-modal task output based on an input combining the first encoded representation, the second encoded representation, and the text instruction.
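One plausible reading of this architecture in PyTorch is sketched below, with cross-attention standing in for the instruction conditioning and a small Transformer standing in for the language model; both are assumptions, as are the feature dimensions.

```python
import torch
import torch.nn as nn

DIM, VOCAB = 64, 1000
embed = nn.Embedding(VOCAB, DIM)     # shared token embedding (assumption)

class InstructionConditionedEncoder(nn.Module):
    """Encodes one modality while attending to the instruction tokens."""
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, DIM)
        self.attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)

    def forward(self, x, instr_emb):
        q = self.proj(x)                       # modality tokens as queries
        out, _ = self.attn(q, instr_emb, instr_emb)
        return out                             # conditioned representation

audio_enc = InstructionConditionedEncoder(in_dim=40)    # e.g. mel features
video_enc = InstructionConditionedEncoder(in_dim=512)   # e.g. frame features
llm_body = nn.TransformerEncoder(               # stand-in for the LLM
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
lm_head = nn.Linear(DIM, VOCAB)

instruction = torch.randint(0, VOCAB, (1, 10))
audio = torch.rand(1, 30, 40)
video = torch.rand(1, 12, 512)

instr_emb = embed(instruction)
a = audio_enc(audio, instr_emb)      # first encoded representation
v = video_enc(video, instr_emb)      # second encoded representation
logits = lm_head(llm_body(torch.cat([a, v, instr_emb], dim=1)))
```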
-
6.
Publication No.: US20240312128A1
Publication Date: 2024-09-19
Application No.: US18493035
Filing Date: 2023-10-24
Applicant: Salesforce, Inc.
Inventor: Le Xue , Ning Yu , Shu Zhang , Junnan Li , Caiming Xiong , Silvio Savarese , Juan Carlos Niebles Duque , Ran Xu
Abstract: A method of training a neural network based three-dimensional (3D) encoder is provided. A first plurality of samples of a training dataset are generated using a first 3D model. An image generator with multi-view rendering is used to generate a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model. A first language model is used to generate a plurality of texts corresponding to the plurality of 2D images respectively. A first text for a first image is generated by using one or more text descriptions generated by the first language model. A point cloud is generated by randomly sampling points in the 3D model. The first plurality of samples are generated using the plurality of 2D images, the corresponding plurality of texts, and the point cloud. The neural network based 3D encoder is trained using the training dataset including the first plurality of samples.
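The data-generation loop could be sketched as follows; the renderer, the captioner, and the vertex-sampling scheme are stand-ins, since the abstract does not pin down those components.

```python
import torch

def render_views(vertices, num_views=4):
    # Stand-in for the multi-view renderer; returns toy RGB images
    # from different viewpoints of the 3D model.
    return [torch.rand(3, 224, 224) for _ in range(num_views)]

def describe(image):
    # Stand-in for the first language model; may return several
    # candidate text descriptions for one view.
    return ["a rendered view of the object"]

def sample_point_cloud(vertices, n=1024):
    idx = torch.randint(0, vertices.shape[0], (n,))
    return vertices[idx]               # random sampling of model points

vertices = torch.rand(5000, 3)         # toy 3D model
samples = [
    {"image": img,
     "text": " ".join(describe(img)),  # text built from the descriptions
     "points": sample_point_cloud(vertices)}
    for img in render_views(vertices)
]
# `samples` would then form part of the training dataset for the
# neural network based 3D encoder.
```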
-
7.
Publication No.: US20230359900A1
Publication Date: 2023-11-09
Application No.: US17827339
Filing Date: 2022-05-27
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06N3/08 , G06V10/75 , G06V10/82 , G06V10/764
CPC classification number: G06N3/088 , G06V10/751 , G06V10/82 , G06V10/764
Abstract: Embodiments described herein provide masked self-training (MaST), an unsupervised learning approach that leverages two complementary sources of supervision: pseudo-labels and raw image pixels. Specifically, MaST jointly optimizes three objectives to finetune a pre-trained classification model on unlabeled images: (1) a self-training objective to learn global task-specific class predictions; (2) a masked image modeling objective to learn local pixel-level information; and (3) a global-local feature alignment objective to bridge the knowledge learned from the two sources of supervision.
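A compact sketch of the three joint objectives on toy patch features follows. Computing pseudo-labels from the current head (rather than, say, a momentum teacher) and the specific alignment loss are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, CLASSES, NPATCH, PIX = 64, 10, 16, 32

backbone = nn.Linear(PIX, DIM)      # stand-in per-patch encoder
cls_head = nn.Linear(DIM, CLASSES)
pix_head = nn.Linear(DIM, PIX)      # reconstructs masked patch pixels

patches = torch.rand(8, NPATCH, PIX)          # unlabeled image patches
mask = torch.rand(8, NPATCH) < 0.5            # random patch mask

feats = backbone(patches)                     # (8, NPATCH, DIM)
global_feat = feats.mean(dim=1)               # global image feature

# (1) self-training on pseudo-labels; here they come from the current
# head itself, a simplification of the actual pseudo-labeling scheme.
with torch.no_grad():
    pseudo = cls_head(global_feat).argmax(dim=-1)
loss_st = F.cross_entropy(cls_head(global_feat), pseudo)

# (2) masked image modeling: reconstruct pixels of the masked patches.
masked = patches.clone()
masked[mask] = 0.0
recon = pix_head(backbone(masked))
loss_mim = F.mse_loss(recon[mask], patches[mask])

# (3) global-local alignment: pull local patch features toward the
# (detached) global feature to bridge the two supervision sources.
target = global_feat.detach().unsqueeze(1).expand_as(feats)
loss_align = F.mse_loss(feats, target)

(loss_st + loss_mim + loss_align).backward()
```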
-
8.
Publication No.: US20230237772A1
Publication Date: 2023-07-27
Application No.: US17745540
Filing Date: 2022-05-16
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06V10/774 , G06F40/284 , G06F40/126 , G06T9/00 , G06V10/80
CPC classification number: G06V10/774 , G06F40/284 , G06F40/126 , G06T9/00 , G06V10/803
Abstract: Embodiments described herein provide Bootstrapping Language-Image Pre-training (BLIP), a unified vision-language pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. BLIP enables a wider range of downstream tasks than existing models, which typically excel at only one of these two task families.
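Unified understanding-plus-generation pretraining of this kind is commonly framed as a sum of contrastive, matching, and image-grounded language modeling objectives; the sketch below shows that structure. The tiny encoders, positive-only matching loss, and unshifted LM targets are deliberate simplifications, not BLIP's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, VOCAB = 64, 1000

img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM))
txt_embed = nn.Embedding(VOCAB, DIM)
itm_head = nn.Linear(2 * DIM, 2)          # image-text matching head
lm_head = nn.Linear(2 * DIM, VOCAB)       # image-grounded decoder head

image = torch.rand(4, 3, 32, 32)
tokens = torch.randint(0, VOCAB, (4, 8))

v = F.normalize(img_enc(image), dim=-1)
t = F.normalize(txt_embed(tokens).mean(1), dim=-1)

# Understanding objectives: contrastive (ITC) + matching (ITM); the
# matching loss here uses only positive pairs for brevity.
sim = v @ t.T / 0.07
loss_itc = F.cross_entropy(sim, torch.arange(4))
pair = torch.cat([v, t], dim=-1)
loss_itm = F.cross_entropy(itm_head(pair), torch.ones(4, dtype=torch.long))

# Generation objective: language modeling conditioned on the image
# (next-token shift omitted for brevity).
ctx = torch.cat([v.unsqueeze(1).expand(-1, 8, -1), txt_embed(tokens)], -1)
loss_lm = F.cross_entropy(lm_head(ctx).reshape(-1, VOCAB), tokens.reshape(-1))

(loss_itc + loss_itm + loss_lm).backward()
```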
-
9.
Publication No.: US12198432B2
Publication Date: 2025-01-14
Application No.: US17566061
Filing Date: 2021-12-30
Applicant: Salesforce, Inc.
Inventor: Dongxu Li , Junnan Li , Chu Hong Hoi
IPC: G06V20/40 , G06F40/279 , G06F40/284 , G06V10/26 , G06V10/74 , G06V10/774 , G06V10/776 , G06V10/80
Abstract: Embodiments described herein provide a method of video-text pre-training to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
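A skeletal version of the encode-independently-then-fuse design might look like this. The frame projection, encoder depths, and concatenation-based fusion are assumptions, and prompting entity modeling is only noted in a comment.

```python
import torch
import torch.nn as nn

DIM, VOCAB = 64, 1000

def enc_layer():
    return nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)

frame_proj = nn.Linear(3 * 32 * 32, DIM)
video_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)
text_embed = nn.Embedding(VOCAB, DIM)
text_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)
cross_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)

frames = torch.rand(2, 6, 3 * 32 * 32)     # 6 sparsely sampled frames
tokens = torch.randint(0, VOCAB, (2, 12))   # paired text tokens

v = video_encoder(frame_proj(frames))       # frames and text are encoded
t = text_encoder(text_embed(tokens))        # independently, as described
joint = cross_encoder(torch.cat([v, t], dim=1))   # cross-modal interaction
# Prompting entity modeling would add a loss on `joint` asking the model
# to identify entities in video regions from prompt-generated pseudo-labels.
```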
-
10.
Publication No.: US12112523B2
Publication Date: 2024-10-08
Application No.: US17589725
Filing Date: 2022-01-31
Applicant: Salesforce, Inc.
Inventor: Shu Zhang , Junnan Li , Ran Xu , Caiming Xiong , Chetan Ramaiah
IPC: G06V10/776 , G06F16/56 , G06F16/583 , G06F40/126 , G06F40/166 , G06F40/284 , G06V10/74 , G06V10/80
CPC classification number: G06V10/776 , G06F16/56 , G06F16/5846 , G06F40/126 , G06F40/166 , G06F40/284 , G06V10/761 , G06V10/806
Abstract: Embodiments described herein provide a CROss-Modal Distribution Alignment (CROMDA) model for vision-language pretraining, which can be used for retrieval downstream tasks. In the CROMDA model, global cross-modal representations are aligned on each unimodality. Specifically, a uni-modal global similarity between an image/text and the image/text feature queue is computed. A softmax-normalized distribution is then generated based on the computed similarity, thereby exploiting the global structure of the queue. CROMDA then aligns the two distributions and learns a modality-invariant global representation. In this way, CROMDA obtains an invariance property in each modality: images with similar text representations should themselves be similar, and vice versa.
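The distribution-alignment step can be written down directly, as below; the temperature, queue size, and symmetric KL divergence are assumptions where the abstract leaves the details open.

```python
import torch
import torch.nn.functional as F

DIM, QUEUE, TAU = 64, 256, 0.07

img = F.normalize(torch.randn(8, DIM), dim=-1)       # image features
txt = F.normalize(torch.randn(8, DIM), dim=-1)       # paired text features
img_queue = F.normalize(torch.randn(QUEUE, DIM), dim=-1)
txt_queue = F.normalize(torch.randn(QUEUE, DIM), dim=-1)

# Uni-modal global similarities against each feature queue, turned into
# softmax-normalized distributions over the queue entries.
p_img = F.softmax(img @ img_queue.T / TAU, dim=-1)
p_txt = F.softmax(txt @ txt_queue.T / TAU, dim=-1)

# Align the two distributions; a symmetric KL divergence is used here,
# though the claimed method may use a different divergence.
loss = 0.5 * (F.kl_div(p_img.log(), p_txt, reduction="batchmean")
              + F.kl_div(p_txt.log(), p_img, reduction="batchmean"))
```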