Invention Grant
- Patent Title: Systems and methods for video and language pre-training
- Application No.: US17566061
- Application Date: 2021-12-30
- Publication No.: US12198432B2
- Publication Date: 2025-01-14
- Inventor: Dongxu Li, Junnan Li, Chu Hong Hoi
- Applicant: Salesforce, Inc.
- Applicant Address: US CA San Francisco
- Assignee: Salesforce, Inc.
- Current Assignee: Salesforce, Inc.
- Current Assignee Address: US CA San Francisco
- Agency: Haynes and Boone, LLP
- Main IPC: G06V20/40
- IPC: G06V20/40; G06F40/279; G06F40/284; G06V10/26; G06V10/74; G06V10/774; G06V10/776; G06V10/80

Abstract:
Embodiments described herein provide a method of video-text pre-training that effectively learns cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework provides video and language pre-training by encoding the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
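
The data flow in the abstract (independent encoding of frames and text, followed by a cross-modal step) can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the toy embedding table, and the use of a single weight-free cross-attention step in place of the transformer-based encoders are all assumptions made for illustration. Prompting entity modeling is not covered here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec)) or 1.0
    return [x / norm for x in vec]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encode_frames(frames):
    # Hypothetical stand-in for the video encoder: each sparse frame
    # (already a feature vector in this toy) is encoded without seeing the text.
    return [l2_normalize(f) for f in frames]

def encode_text(tokens, table):
    # Hypothetical stand-in for the text encoder: lookup in a toy embedding table.
    return [l2_normalize(table[t]) for t in tokens]

def cross_modal(text_emb, frame_emb):
    # Toy cross-attention with no learned weights: every text token
    # attends over all frame embeddings and takes their weighted average.
    dim = len(frame_emb[0])
    fused = []
    for q in text_emb:
        weights = softmax([dot(q, k) for k in frame_emb])
        fused.append([sum(w * k[d] for w, k in zip(weights, frame_emb))
                      for d in range(dim)])
    return fused

# Two sparse frames and two text tokens, all made-up values.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
table = {"dog": [1.0, 0.0, 0.0], "runs": [0.0, 1.0, 0.0]}

frame_emb = encode_frames(frames)
text_emb = encode_text(["dog", "runs"], table)
fused = cross_modal(text_emb, frame_emb)
```

In this toy setup, the token "dog" places most of its attention weight on the first frame and "runs" on the second, which is the kind of cross-modal interaction the multi-modal encoder is described as capturing.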
Public/Granted literature
- US20230154146A1 SYSTEMS AND METHODS FOR VIDEO AND LANGUAGE PRE-TRAINING, Public/Granted day: 2023-05-18