-
1.
Publication No.: US20240289606A1
Publication Date: 2024-08-29
Application No.: US18174547
Filing Date: 2023-02-24
Applicant: Salesforce, Inc.
Inventor: Yue Wang , Hung Le , Akhilesh Deepak Gotmare , Junnan Li , Chu Hong Hoi
IPC: G06N3/08
CPC classification number: G06N3/08
Abstract: Embodiments described herein provide a mixture-of-encoder-decoder Transformer framework for multi-task pretraining and flexible finetuning for both code understanding and generation tasks. Specifically, the framework is built on multimodal encoder and decoder modules. During pre-training, the encoder-decoder framework is trained with multiple learning objectives, including a diverse set of self-supervised tasks, over two major stages of pretraining on unimodal and bimodal data.
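Below is a minimal PyTorch sketch of what one multi-task pretraining step over such an encoder-decoder Transformer could look like. The toy vocabulary, model sizes, and the two illustrative objectives (span denoising and causal LM) are assumptions for illustration, not the patented recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1000, 64

class EncDec(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.transformer = nn.Transformer(
            d_model=DIM, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, src_ids, tgt_ids):
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                             tgt_mask=causal)
        return self.lm_head(h)               # (batch, tgt_len, vocab)

model = EncDec()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One illustrative step per task; the real recipe stages such objectives
# over unimodal (code-only) and bimodal (code + text) data.
for task in ("span_denoising", "causal_lm"):
    src = torch.randint(0, VOCAB, (8, 32))   # toy corrupted/context tokens
    tgt = torch.randint(0, VOCAB, (8, 16))   # toy target tokens
    logits = model(src, tgt)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), tgt.reshape(-1))
    loss.backward()
opt.step(); opt.zero_grad()
```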
-
2.
Publication No.: US20240161369A1
Publication Date: 2024-05-16
Application No.: US18498768
Filing Date: 2023-10-31
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi , Dongxu Li
CPC classification number: G06T11/60 , G06T9/00 , G06V10/761 , G06V10/82
Abstract: Embodiments described herein provide systems and methods of subject-driven image generation. In at least one embodiment, a system receives, via a data interface, an image containing a subject, a text description of the subject in the image, and a text prompt relating to a different rendition of the subject. The system encodes, via an image encoder, the image into an image feature vector. The system encodes, via a text encoder, the text description into a text feature vector. The system generates, by a multimodal encoder, a vector representation of the subject based on the image feature vector and the text feature vector. The system generates, by a neural network based image generation model, an output image based on an input combining the text prompt and the vector representation.
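A hedged sketch of the described pipeline follows. The tiny encoders, the fusion by concatenation, and the single linear layer standing in for the image generation model are all illustrative assumptions.

```python
import torch
import torch.nn as nn

DIM, VOCAB = 64, 1000

class MeanPoolTextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
    def forward(self, ids):                   # (batch, seq) -> (batch, DIM)
        return self.embed(ids).mean(dim=1)

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM))
text_encoder = MeanPoolTextEncoder()
multimodal_encoder = nn.Sequential(nn.Linear(2 * DIM, DIM), nn.GELU(),
                                   nn.Linear(DIM, DIM))
generator = nn.Linear(2 * DIM, 3 * 32 * 32)   # stand-in for the image
                                              # generation model

subject_img = torch.rand(1, 3, 32, 32)        # image containing the subject
subject_txt = torch.randint(0, VOCAB, (1, 8)) # description of the subject
prompt = torch.randint(0, VOCAB, (1, 12))     # new-rendition text prompt

img_feat = image_encoder(subject_img)
txt_feat = text_encoder(subject_txt)
subject_rep = multimodal_encoder(torch.cat([img_feat, txt_feat], dim=-1))
cond = torch.cat([text_encoder(prompt), subject_rep], dim=-1)
out_image = generator(cond).view(1, 3, 32, 32)
```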
-
3.
Publication No.: US20240054350A1
Publication Date: 2024-02-15
Application No.: US18064122
Filing Date: 2022-12-09
Applicant: Salesforce, Inc.
Inventor: Yutong Dai , Zeyuan Chen , Junnan Li
IPC: G06N3/098
CPC classification number: G06N3/098
Abstract: Embodiments described herein provide systems and methods for federated learning. A central system may store a neural network model which has a body of a number of layers, and a classification layer of class prototypes that classifies the latent representations output by the body of the model. The central system may initialize the class prototypes so that they are uniformly distributed in the representation space. The model and class prototypes may be broadcast to a number of client systems, which update the body of the model locally while keeping the class prototypes fixed. The clients may return information to the central system, including updated local model parameters and a local representation of the classes based on the latent representations of items in the local training data. Based on the information from the clients, the neural network model may be updated, and this process may be repeated iteratively.
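The round below sketches this scheme end to end. The orthogonal prototype initialization, toy client data, and plain FedAvg-style aggregation are assumptions chosen to keep the example self-contained.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, CLASSES = 16, 4

# Fixed class prototypes spread out in representation space; orthogonal
# initialization is one simple stand-in for the uniform placement.
prototypes = nn.init.orthogonal_(torch.empty(CLASSES, DIM))

def make_body():
    return nn.Sequential(nn.Linear(8, DIM), nn.ReLU(), nn.Linear(DIM, DIM))

global_body = make_body()

def client_update(body, x, y, steps=5):
    body = copy.deepcopy(body)                 # local copy of the body
    opt = torch.optim.SGD(body.parameters(), lr=0.1)
    for _ in range(steps):
        z = F.normalize(body(x), dim=-1)
        logits = z @ prototypes.T              # prototypes stay fixed
        loss = F.cross_entropy(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
    return body.state_dict()

# One federated round over toy clients, then FedAvg-style aggregation.
clients = [(torch.randn(32, 8), torch.randint(0, CLASSES, (32,)))
           for _ in range(3)]
states = [client_update(global_body, x, y) for x, y in clients]
avg = {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}
global_body.load_state_dict(avg)
```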
-
4.
Publication No.: US20230419652A1
Publication Date: 2023-12-28
Application No.: US17934671
Filing Date: 2022-09-23
Applicant: Salesforce, Inc.
Inventor: Anthony Meng Huat Tiong , Junnan Li , Chu Hong Hoi
IPC: G06V10/86 , G06N3/04 , G06V10/82 , G06V10/774 , G06V10/26
CPC classification number: G06V10/86 , G06N3/0454 , G06V10/82 , G06V10/774 , G06V10/26
Abstract: Embodiments described herein provide a zero-shot visual question answering (VQA) framework, which conjoins foundation network models with zero additional training. A first image and a question relating to the first image are received. The first image is divided into a plurality of image patches. A plurality of relevant image patches that are relevant to the question are determined, using a first neural network model, from the plurality of image patches. A plurality of image captions are generated, using a second neural network model, based on the plurality of relevant image patches. An answer to the question is generated based on the plurality of image captions.
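A sketch of the patch-select-caption-answer pipeline might look like the following, with random scores and string stubs standing in for the neural models; only the patch splitting is concrete.

```python
import torch

def split_into_patches(image, patch=16):
    # (C, H, W) -> (num_patches, C, patch, patch)
    c, h, w = image.shape
    windows = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return windows.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

def relevance_scores(patches, question):
    # Stand-in for the first neural network (e.g., an image-text
    # matching model); random scores keep the sketch self-contained.
    return torch.rand(patches.shape[0])

def caption(patch_batch):
    # Stand-in for the second neural network (a captioning model).
    return [f"caption for patch {i}" for i in range(patch_batch.shape[0])]

def answer_from_captions(captions, question):
    # Stand-in for the answer-generation model.
    return "answer derived from: " + "; ".join(captions[:2])

image = torch.rand(3, 64, 64)
question = "What color is the cat?"
patches = split_into_patches(image)
scores = relevance_scores(patches, question)
top = patches[scores.topk(k=4).indices]   # keep the most relevant patches
print(answer_from_captions(caption(top), question))
```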
-
5.
Publication No.: US20240370718A1
Publication Date: 2024-11-07
Application No.: US18400477
Filing Date: 2023-12-29
Applicant: Salesforce, Inc.
Inventor: Artemis Panagopoulou , Le Xue , Ning Yu , Junnan Li , Dongxu Li , Silvio Savarese , Shafiq Rayhan Joty , Ran Xu , Caiming Xiong , Juan Carlos Niebles Duque
IPC: G06N3/08 , G06N3/0455
Abstract: Embodiments described herein provide a method of generating a multi-modal task output in response to a text instruction relating to inputs of multiple different modalities (e.g., text, audio, video, 3D). The method comprises receiving, via a data interface, a first input of a first modality, a second input of a second modality, and the text instruction relating to the first and second inputs; encoding, by a first multimodal encoder adapted for the first modality, the first input into a first encoded representation conditioned on the text instruction; encoding, by a second multimodal encoder adapted for the second modality, the second input into a second encoded representation conditioned on the text instruction; and generating, by a neural network based language model, the multi-modal task output based on an input combining the first encoded representation, the second encoded representation, and the text instruction.
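One plausible reading of this architecture in PyTorch is sketched below, with cross-attention standing in for the instruction conditioning and a small Transformer standing in for the language model; both are assumptions, as are the feature dimensions.

```python
import torch
import torch.nn as nn

DIM, VOCAB = 64, 1000
embed = nn.Embedding(VOCAB, DIM)     # shared token embedding (assumption)

class InstructionConditionedEncoder(nn.Module):
    """Encodes one modality while attending to the instruction tokens."""
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, DIM)
        self.attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)

    def forward(self, x, instr_emb):
        q = self.proj(x)                       # modality tokens as queries
        out, _ = self.attn(q, instr_emb, instr_emb)
        return out                             # conditioned representation

audio_enc = InstructionConditionedEncoder(in_dim=40)    # e.g. mel features
video_enc = InstructionConditionedEncoder(in_dim=512)   # e.g. frame features
llm_body = nn.TransformerEncoder(               # stand-in for the LLM
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
lm_head = nn.Linear(DIM, VOCAB)

instruction = torch.randint(0, VOCAB, (1, 10))
audio = torch.rand(1, 30, 40)
video = torch.rand(1, 12, 512)

instr_emb = embed(instruction)
a = audio_enc(audio, instr_emb)      # first encoded representation
v = video_enc(video, instr_emb)      # second encoded representation
logits = lm_head(llm_body(torch.cat([a, v, instr_emb], dim=1)))
```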
-
6.
Publication No.: US20240312128A1
Publication Date: 2024-09-19
Application No.: US18493035
Filing Date: 2023-10-24
Applicant: Salesforce, Inc.
Inventor: Le Xue , Ning Yu , Shu Zhang , Junnan Li , Caiming Xiong , Silvio Savarese , Juan Carlos Niebles Duque , Ran Xu
Abstract: A method of training a neural network based three-dimensional (3D) encoder is provided. A first plurality of samples of a training dataset are generated using a first 3D model. An image generator with multi-view rendering is used to generate a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model. A first language model is used to generate a plurality of texts corresponding to the plurality of 2D images respectively. A first text for a first image is generated by using one or more text descriptions generated by the first language model. A point cloud is generated by randomly sampling points in the 3D model. The first plurality of samples are generated using the plurality of 2D images, the corresponding plurality of texts, and the point cloud. The neural network based 3D encoder is trained using the training dataset including the first plurality of samples.
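The data-generation loop could be sketched as follows; the renderer, the captioner, and the vertex-sampling scheme are stand-ins, since the abstract does not pin down those components.

```python
import torch

def render_views(vertices, num_views=4):
    # Stand-in for the multi-view renderer; returns toy RGB images
    # from different viewpoints of the 3D model.
    return [torch.rand(3, 224, 224) for _ in range(num_views)]

def describe(image):
    # Stand-in for the first language model; may return several
    # candidate text descriptions for one view.
    return ["a rendered view of the object"]

def sample_point_cloud(vertices, n=1024):
    idx = torch.randint(0, vertices.shape[0], (n,))
    return vertices[idx]               # random sampling of model points

vertices = torch.rand(5000, 3)         # toy 3D model
samples = [
    {"image": img,
     "text": " ".join(describe(img)),  # text built from the descriptions
     "points": sample_point_cloud(vertices)}
    for img in render_views(vertices)
]
# `samples` would then form part of the training dataset for the
# neural network based 3D encoder.
```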
-
7.
Publication No.: US20230359900A1
Publication Date: 2023-11-09
Application No.: US17827339
Filing Date: 2022-05-27
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06N3/08 , G06V10/75 , G06V10/82 , G06V10/764
CPC classification number: G06N3/088 , G06V10/751 , G06V10/82 , G06V10/764
Abstract: Embodiments described herein provide masked self-training (MaST), an unsupervised learning approach that leverages two complementary sources of supervision: pseudo-labels and raw image pixels. Specifically, MaST jointly optimizes three objectives to finetune a pre-trained classification model on unlabeled images: (1) a self-training objective to learn global task-specific class predictions; (2) a masked image modeling objective to learn local pixel-level information; and (3) a global-local feature alignment objective to bridge the knowledge learned from the two sources of supervision.
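A compact sketch of the three joint objectives on toy patch features follows. Computing pseudo-labels from the current head (rather than, say, a momentum teacher) and the specific alignment loss are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, CLASSES, NPATCH, PIX = 64, 10, 16, 32

backbone = nn.Linear(PIX, DIM)      # stand-in per-patch encoder
cls_head = nn.Linear(DIM, CLASSES)
pix_head = nn.Linear(DIM, PIX)      # reconstructs masked patch pixels

patches = torch.rand(8, NPATCH, PIX)          # unlabeled image patches
mask = torch.rand(8, NPATCH) < 0.5            # random patch mask

feats = backbone(patches)                     # (8, NPATCH, DIM)
global_feat = feats.mean(dim=1)               # global image feature

# (1) self-training on pseudo-labels; here they come from the current
# head itself, a simplification of the actual pseudo-labeling scheme.
with torch.no_grad():
    pseudo = cls_head(global_feat).argmax(dim=-1)
loss_st = F.cross_entropy(cls_head(global_feat), pseudo)

# (2) masked image modeling: reconstruct pixels of the masked patches.
masked = patches.clone()
masked[mask] = 0.0
recon = pix_head(backbone(masked))
loss_mim = F.mse_loss(recon[mask], patches[mask])

# (3) global-local alignment: pull local patch features toward the
# (detached) global feature to bridge the two supervision sources.
target = global_feat.detach().unsqueeze(1).expand_as(feats)
loss_align = F.mse_loss(feats, target)

(loss_st + loss_mim + loss_align).backward()
```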
-
8.
Publication No.: US20230237772A1
Publication Date: 2023-07-27
Application No.: US17745540
Filing Date: 2022-05-16
Applicant: Salesforce, Inc.
Inventor: Junnan Li , Chu Hong Hoi
IPC: G06V10/774 , G06F40/284 , G06F40/126 , G06T9/00 , G06V10/80
CPC classification number: G06V10/774 , G06F40/284 , G06F40/126 , G06T9/00 , G06V10/803
Abstract: Embodiments described herein provide Bootstrapping Language-Image Pre-training (BLIP), a unified vision-language pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. BLIP enables a wider range of downstream tasks than existing models, which typically excel at only one of these two task families.
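Unified understanding-plus-generation pretraining of this kind is commonly framed as a sum of contrastive, matching, and image-grounded language modeling objectives; the sketch below shows that structure. The tiny encoders, positive-only matching loss, and unshifted LM targets are deliberate simplifications, not BLIP's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, VOCAB = 64, 1000

img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM))
txt_embed = nn.Embedding(VOCAB, DIM)
itm_head = nn.Linear(2 * DIM, 2)          # image-text matching head
lm_head = nn.Linear(2 * DIM, VOCAB)       # image-grounded decoder head

image = torch.rand(4, 3, 32, 32)
tokens = torch.randint(0, VOCAB, (4, 8))

v = F.normalize(img_enc(image), dim=-1)
t = F.normalize(txt_embed(tokens).mean(1), dim=-1)

# Understanding objectives: contrastive (ITC) + matching (ITM); the
# matching loss here uses only positive pairs for brevity.
sim = v @ t.T / 0.07
loss_itc = F.cross_entropy(sim, torch.arange(4))
pair = torch.cat([v, t], dim=-1)
loss_itm = F.cross_entropy(itm_head(pair), torch.ones(4, dtype=torch.long))

# Generation objective: language modeling conditioned on the image
# (next-token shift omitted for brevity).
ctx = torch.cat([v.unsqueeze(1).expand(-1, 8, -1), txt_embed(tokens)], -1)
loss_lm = F.cross_entropy(lm_head(ctx).reshape(-1, VOCAB), tokens.reshape(-1))

(loss_itc + loss_itm + loss_lm).backward()
```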
-
9.
Publication No.: US12198432B2
Publication Date: 2025-01-14
Application No.: US17566061
Filing Date: 2021-12-30
Applicant: Salesforce, Inc.
Inventor: Dongxu Li , Junnan Li , Chu Hong Hoi
IPC: G06V20/40 , G06F40/279 , G06F40/284 , G06V10/26 , G06V10/74 , G06V10/774 , G06V10/776 , G06V10/80
Abstract: Embodiments described herein provide a method of video-text pre-training to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
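A skeletal version of the encode-independently-then-fuse design might look like this. The frame projection, encoder depths, and concatenation-based fusion are assumptions, and prompting entity modeling is only noted in a comment.

```python
import torch
import torch.nn as nn

DIM, VOCAB = 64, 1000

def enc_layer():
    return nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)

frame_proj = nn.Linear(3 * 32 * 32, DIM)
video_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)
text_embed = nn.Embedding(VOCAB, DIM)
text_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)
cross_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)

frames = torch.rand(2, 6, 3 * 32 * 32)     # 6 sparsely sampled frames
tokens = torch.randint(0, VOCAB, (2, 12))   # paired text tokens

v = video_encoder(frame_proj(frames))       # frames and text are encoded
t = text_encoder(text_embed(tokens))        # independently, as described
joint = cross_encoder(torch.cat([v, t], dim=1))   # cross-modal interaction
# Prompting entity modeling would add a loss on `joint` asking the model
# to identify entities in video regions from prompt-generated pseudo-labels.
```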
-
10.
Publication No.: US12112523B2
Publication Date: 2024-10-08
Application No.: US17589725
Filing Date: 2022-01-31
Applicant: Salesforce, Inc.
Inventor: Shu Zhang , Junnan Li , Ran Xu , Caiming Xiong , Chetan Ramaiah
IPC: G06V10/776 , G06F16/56 , G06F16/583 , G06F40/126 , G06F40/166 , G06F40/284 , G06V10/74 , G06V10/80
CPC classification number: G06V10/776 , G06F16/56 , G06F16/5846 , G06F40/126 , G06F40/166 , G06F40/284 , G06V10/761 , G06V10/806
Abstract: Embodiments described herein provide a CROss-Modal Distribution Alignment (CROMDA) model for vision-language pretraining, which can be used for retrieval downstream tasks. In the CROMDA model, global cross-modal representations are aligned on each unimodality. Specifically, a uni-modal global similarity between an image/text and the image/text feature queue is computed. A softmax-normalized distribution is then generated based on the computed similarity, thereby exploiting the global structure of the queue. CROMDA then aligns the two distributions and learns a modality-invariant global representation. In this way, CROMDA obtains an invariance property in each modality: images with similar text representations should themselves be similar, and vice versa.
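The distribution-alignment step can be written down directly, as below; the temperature, queue size, and symmetric KL divergence are assumptions where the abstract leaves the details open.

```python
import torch
import torch.nn.functional as F

DIM, QUEUE, TAU = 64, 256, 0.07

img = F.normalize(torch.randn(8, DIM), dim=-1)       # image features
txt = F.normalize(torch.randn(8, DIM), dim=-1)       # paired text features
img_queue = F.normalize(torch.randn(QUEUE, DIM), dim=-1)
txt_queue = F.normalize(torch.randn(QUEUE, DIM), dim=-1)

# Uni-modal global similarities against each feature queue, turned into
# softmax-normalized distributions over the queue entries.
p_img = F.softmax(img @ img_queue.T / TAU, dim=-1)
p_txt = F.softmax(txt @ txt_queue.T / TAU, dim=-1)

# Align the two distributions; a symmetric KL divergence is used here,
# though the claimed method may use a different divergence.
loss = 0.5 * (F.kl_div(p_img.log(), p_txt, reduction="batchmean")
              + F.kl_div(p_txt.log(), p_img, reduction="batchmean"))
```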