VIDEO-TEXT MODELING WITH ZERO-SHOT TRANSFER FROM CONTRASTIVE CAPTIONERS

    Publication number: US20250124708A1

    Publication date: 2025-04-17

    Application number: US18694604

    Application date: 2023-12-08

    Applicant: Google LLC

    Abstract: Provided is an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. Some example implementations include a model which can be referred to as VideoCoCa. Example implementations reuse a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with little or minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, aspects of the present disclosure leverage findings that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to “flattened frame embeddings”, yielding a strong zero-shot transfer baseline for many video-text tasks.
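The core observation in the abstract is that CoCa's attentional pooling layers can be applied directly to frame embeddings that are flattened across time into one token sequence. The following is a minimal NumPy sketch of that idea only; it is illustrative, not the patented implementation, and the shapes, single-query pooler, and random initialization are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries):
    """Pool a token sequence with learned queries via scaled dot-product attention.

    tokens:  (L, d) flattened frame embeddings
    queries: (n_q, d) learned query vectors
    returns: (n_q, d) pooled embeddings
    """
    d = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)   # (n_q, L) attention logits
    weights = softmax(scores, axis=-1)         # attention over ALL frames' tokens
    return weights @ tokens                    # (n_q, d)

rng = np.random.default_rng(0)
T, N, d = 4, 16, 8                       # frames, tokens per frame, embedding dim
frame_tokens = rng.normal(size=(T, N, d))    # per-frame image-encoder outputs
flat = frame_tokens.reshape(T * N, d)        # "flattened frame embeddings"
query = rng.normal(size=(1, d))              # e.g. a single contrastive-pooling query
pooled = attentional_pool(flat, query)       # video-level embedding, shape (1, d)
```

Because pooling attends over the concatenated tokens of all frames, the pretrained image-level pooler can consume video input without any new cross-frame fusion module, which is the zero-shot transfer property the abstract describes.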

    VIDEO LOCALIZATION USING ARTIFICIAL INTELLIGENCE

    Publication number: US20240371164A1

    Publication date: 2024-11-07

    Application number: US18652703

    Application date: 2024-05-01

    Applicant: Google LLC

    Abstract: Methods and systems for video localization using artificial intelligence are provided herein. A set of video embeddings representing features of one or more video frames of a media item and a set of textual embeddings corresponding to an event associated with the media item are obtained. Fused video-textual data is generated based on the set of video embeddings and the set of textual embeddings. The fused video-textual data indicates features of the video frames of the media item and textual data pertaining to the media item. The fused video-textual data is provided as an input to an artificial intelligence (AI) model trained to perform multiple video localization tasks with respect to media items of a platform. One or more outputs of the AI model are obtained. A segment of the media item that depicts the event is determined based on the one or more outputs of the AI model.
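To make the pipeline in this abstract concrete, the sketch below fuses per-frame video embeddings with a text embedding for an event query and returns the frame span most related to the event. It is a toy stand-in, not the claimed AI model: the cosine-similarity "fusion", the mean-based threshold, and all names are assumptions for illustration.

```python
import numpy as np

def localize_event(video_emb, text_emb):
    """Return the (start, end) frame span whose frames best match the event text.

    video_emb: (T, d) per-frame video embeddings
    text_emb:  (d,)  embedding of the event description
    """
    # Toy fusion: cosine similarity of each frame against the text query
    # (a learned multimodal model would replace this step).
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = v @ t                           # (T,) per-frame relevance
    mask = scores > scores.mean()            # frames deemed relevant to the event
    idx = np.flatnonzero(mask)
    return (int(idx[0]), int(idx[-1])) if idx.size else None

# Hypothetical example: 8 frames, embedding dim 4; frames 2-4 depict the event.
video_emb = np.tile(np.array([0.0, 1.0, 0.0, 0.0]), (8, 1))
video_emb[2:5] = np.array([1.0, 0.0, 0.0, 0.0])  # event-aligned frames
text_emb = np.array([1.0, 0.0, 0.0, 0.0])        # event query embedding
segment = localize_event(video_emb, text_emb)    # (start, end) frame indices
```

In this toy setup the event-aligned frames score 1.0 while the rest score 0.0, so thresholding at the mean recovers the contiguous span of frames 2 through 4.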
