Systems and methods for video and language pre-training

Granted invention patent

US12198432B2 Systems and methods for video and language pre-training (legal status: in force)


Patent Title: Systems and methods for video and language pre-training
Application No.: US17566061

Application Date: 2021-12-30
Publication No.: US12198432B2

Publication Date: 2025-01-14
Inventors: Dongxu Li, Junnan Li, Chu Hong Hoi
Applicant: Salesforce, Inc.
Applicant Address: US CA San Francisco
Assignee: Salesforce, Inc.
Current Assignee: Salesforce, Inc.
Current Assignee Address: US CA San Francisco
Agency: Haynes and Boone, LLP
Main IPC: G06V20/40
IPC: G06V20/40; G06F40/279; G06F40/284; G06V10/26; G06V10/74; G06V10/774; G06V10/776; G06V10/80


Abstract:

Embodiments describe a method of video-text pre-training for effectively learning cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework for video and language pre-training encodes the frames and the text independently, using a transformer-based video encoder and a text encoder, respectively. A multi-modal encoder is then employed to capture cross-modal interactions between a plurality of video frames and a plurality of texts. The pre-training includes prompting entity modeling, which enables the model to capture fine-grained region-entity alignment.
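The abstract states only that prompting entity modeling lets the model capture fine-grained region-entity alignment. The Python sketch below illustrates one way such prompt-based supervision could work, under stated assumptions: a video region is embedded, compared by cosine similarity against text prompts built from a template such as "A video of a {entity}.", and the temperature-scaled softmax over those similarities is treated as a soft pseudo-label that a multi-modal encoder's prediction is trained against with cross-entropy. The toy encoders, the prompt template, the temperature value, and every function name here are hypothetical stand-ins, not the patented implementation.

```python
import math
import random

# Hypothetical sketch only: the toy encoders, prompt template, and temperature
# are assumptions for illustration, not details taken from the patent.
DIM = 8  # toy embedding size (assumption)


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def toy_text_encoder(text):
    """Stand-in for a text encoder: a deterministic pseudo-random embedding per string."""
    rng = random.Random(text)  # seeding with the string keeps outputs reproducible
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(DIM)])


def toy_video_encoder(frame_features):
    """Stand-in for a video encoder: mean-pool per-frame feature vectors, then normalize."""
    pooled = [sum(f[i] for f in frame_features) / len(frame_features) for i in range(DIM)]
    return l2_normalize(pooled)


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max score for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entity_pseudo_labels(region_frames, entities, template="A video of a {}.", temperature=0.07):
    """Soft entity labels for a video region, scored against one text prompt per entity."""
    region = toy_video_encoder(region_frames)
    prompts = [toy_text_encoder(template.format(e)) for e in entities]
    sims = [sum(r * p for r, p in zip(region, prompt)) for prompt in prompts]
    return softmax(sims, temperature)


def soft_cross_entropy(predicted, target, eps=1e-12):
    """Cross-entropy of a predicted distribution against soft target labels."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(predicted, target))


if __name__ == "__main__":
    entities = ["cat", "dog", "car"]
    # Pretend the region's frame features match the "dog" prompt exactly.
    frames = [toy_text_encoder("A video of a dog.")] * 3
    labels = entity_pseudo_labels(frames, entities)
    print({e: round(p, 3) for e, p in zip(entities, labels)})
    # Score a uniform prediction from a (hypothetical) multi-modal head against the soft labels.
    print(round(soft_cross_entropy([1 / 3] * 3, labels), 4))
```

Running the script prints the soft labels for the three entities, with the highest probability assigned to "dog", followed by the cross-entropy of a uniform prediction against those labels. In a full system the toy encoders would be replaced by the transformer-based video and text encoders described in the abstract; using soft labels rather than a single hard entity keeps some signal when a region plausibly matches several prompts.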

Related publications

US20230154146A1 SYSTEMS AND METHODS FOR VIDEO AND LANGUAGE PRE-TRAINING, publication date: 2023-05-18

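IPC classification:

G	Physics
G06	Computing; calculating or counting
G06V	Image or video recognition or understanding
G06V20/00	Scenes; scene-specific elements (controlling digital cameras H04N5/232)
G06V20/40	. In video content (extraction of superimposed text G06V20/62) (video retrieval G06F16/70) (processing of video elementary streams in video servers H04N21/234) (processing of video elementary streams in video clients H04N21/44)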