Patent search ap:("Salesforce Page Inc.") AND inv:"Dongxu Li"

1.

Invention application
SYSTEMS AND METHODS FOR MULTI-MODAL LANGUAGE MODELS (Status: Active)

Publication (announcement) number: US20240370718A1

Publication (announcement) date: 2024-11-07

Application number: US18400477

Filing date: 2023-12-29

Applicant: Salesforce, Inc.

Inventor: Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Silvio Savarese, Shafiq Rayhan Joty, Ran Xu, Caiming Xiong, Juan Carlos Niebles Duque

IPC: G06N3/08, G06N3/0455

Abstract: Embodiments described herein provide a method of generating a multi-modal task output to a text instruction relating to inputs of multiple different modalities (e.g., text, audio, video, 3D). The method comprises receiving, via a data interface, a first input of a first modality, a second input of a second modality and the text instruction relating to the first and the second inputs; encoding, by a first multimodal encoder adapted for the first modality, the first input of the first modality into a first encoded representation conditioned on the text instruction; encoding, by a second multimodal encoder adapted for the second modality, the second input of the second modality into a second encoded representation conditioned on the text instruction; and generating, by a neural network based language model, the multi-modal task output based on an input combining the first encoded representation, the second encoded representation, and the text instruction.
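The data flow in this abstract can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `run_multimodal_task`, the `Encoder` signature, and the toy encoders and language model are all hypothetical stand-ins chosen only to show how per-modality encodings conditioned on the instruction are combined with the instruction before reaching the language model.

```python
from typing import Callable, Dict, List

Vector = List[float]
# Hypothetical signature: an encoder maps (raw input, instruction) to vectors.
Encoder = Callable[[object, str], List[Vector]]
LanguageModel = Callable[[List[Vector], str], str]


def run_multimodal_task(
    inputs: Dict[str, object],
    instruction: str,
    encoders: Dict[str, Encoder],
    language_model: LanguageModel,
) -> str:
    """Encode each input with its modality-specific encoder, then hand the
    combined encodings plus the instruction to the language model."""
    encoded: List[Vector] = []
    for modality, raw in inputs.items():
        # Each encoder sees the instruction, so its output is conditioned on it.
        encoded.extend(encoders[modality](raw, instruction))
    return language_model(encoded, instruction)


def toy_encoder(scale: float) -> Encoder:
    """Placeholder encoder (assumption): emits one 2-d vector per input."""
    def encode(raw: object, instruction: str) -> List[Vector]:
        return [[scale * len(str(raw)), float(len(instruction))]]
    return encode


def toy_language_model(prefix: List[Vector], instruction: str) -> str:
    """Placeholder language model (assumption): just reports what it received."""
    return f"{len(prefix)} encoded inputs for: {instruction}"


result = run_multimodal_task(
    inputs={"audio": "bark.wav", "video": "dog.mp4"},
    instruction="What is happening?",
    encoders={"audio": toy_encoder(0.1), "video": toy_encoder(0.2)},
    language_model=toy_language_model,
)
print(result)
```

Keeping the encoders in a dictionary keyed by modality mirrors the abstract's "first" and "second" multimodal encoders and lets further modalities be added without changing the control flow.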

2.

Invention publication
SYSTEMS AND METHODS FOR SUBJECT-DRIVEN IMAGE GENERATION (Status: Under examination, published)

Publication (announcement) number: US20240161369A1

Publication (announcement) date: 2024-05-16

Application number: US18498768

Filing date: 2023-10-31

Applicant: Salesforce, Inc.

Inventor: Junnan Li, Chu Hong Hoi, Dongxu Li

IPC: G06T11/60, G06T9/00, G06V10/74, G06V10/82

CPC classification number: G06T11/60, G06T9/00, G06V10/761, G06V10/82

Abstract: Embodiments described herein provide systems and methods of subject-driven image generation. In at least one embodiment, a system receives, via a data interface, an image containing a subject, a text description of the subject in the image, and a text prompt relating to a different rendition of the subject. The system encodes, via an image encoder, the image into an image feature vector. The system encodes, via a text encoder, the text description into a text feature vector. The system generates, by a multimodal encoder, a vector representation of the subject based on the image feature vector and the text feature vector. The system generates, by a neural network based image generation model, an output image based on an input combining the text prompt and the vector representation.

3.

Granted invention
Systems and methods for video and language pre-training (Status: Active)

Publication (announcement) number: US11989941B2

Publication (announcement) date: 2024-05-21

Application number: US17566173

Filing date: 2021-12-30

Applicant: Salesforce, Inc.

Inventor: Dongxu Li, Junnan Li, Chu Hong Hoi

IPC: G06V10/00, G06F40/279, G06F40/284, G06V10/26, G06V10/74, G06V10/774, G06V10/776, G06V10/80, G06V20/40

CPC classification number: G06V20/41, G06F40/279, G06F40/284, G06V10/26, G06V10/761, G06V10/774, G06V10/776, G06V10/806, G06V20/46, G06V20/47

Abstract: Embodiments describe a method of video-text pre-learning to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align and prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.

4.

Invention publication
SYSTEMS AND METHODS FOR VISION-LANGUAGE MODEL INSTRUCTION TUNING (Status: Under examination, published)

Publication (announcement) number: US20240160858A1

Publication (announcement) date: 2024-05-16

Application number: US18505982

Filing date: 2023-11-09

Applicant: Salesforce, Inc.

Inventor: Wenliang Dai, Junnan Li, Chu Hong Hoi, Dongxu Li

IPC: G06F40/40, G06V10/774, G06V10/82, G06V20/70

CPC classification number: G06F40/40, G06V10/774, G06V10/82, G06V20/70

Abstract: Embodiments described herein provide a method of generating a vision-language task output to a text instruction relating to an input image, the method comprising receiving, via a data interface, the input image and the text instruction comprising an instruction relating to the image. The method further includes encoding, via an image encoder, the image into a first image representation. The method further includes generating, by a multimodal encoder, a second image representation based on cross-attending the first image representation to the text instruction. The method further includes generating, by a neural network based language model, a vision-language task output in response to the text instruction based on an input combining the second image representation and the text instruction.
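The cross-attention step named in this abstract can be illustrated with a minimal single-head scaled dot-product sketch. Several things here are assumptions rather than details from the patent: the queries are taken from instruction-token embeddings and the keys and values from image features, no learned projection matrices are applied, and the names `cross_attend` and `softmax` are illustrative only.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], keys_values: List[Vector]) -> List[Vector]:
    """For each query, return an attention-weighted mix of the key/value vectors.

    Assumption: keys and values are the same vectors (no learned projections).
    """
    dim = len(keys_values[0])
    outputs: List[Vector] = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return outputs


# Toy data (assumption): two image-patch features and one instruction token.
image_patches = [[1.0, 0.0], [0.0, 1.0]]
instruction_tokens = [[10.0, 0.0]]

attended = cross_attend(instruction_tokens, image_patches)
print(attended)
```

In this toy case the single instruction token aligns with the first patch, so the output is dominated by that patch's feature vector. A production model would add learned query, key, and value projections and multiple heads.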

5.

Granted invention
Systems and methods for video and language pre-training (Status: Active)

Publication (announcement) number: US12198432B2

Publication (announcement) date: 2025-01-14

Application number: US17566061

Filing date: 2021-12-30

Applicant: Salesforce, Inc.

Inventor: Dongxu Li, Junnan Li, Chu Hong Hoi

IPC: G06V20/40, G06F40/279, G06F40/284, G06V10/26, G06V10/74, G06V10/774, G06V10/776, G06V10/80

Abstract: Embodiments describe a method of video-text pre-learning to effectively learn cross-modal representations from sparse video frames and text. Specifically, an align and prompt framework provides a video and language pre-training framework that encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
