-
Publication No.: US20250124708A1
Publication Date: 2025-04-17
Application No.: US18694604
Filing Date: 2023-12-08
Applicant: Google LLC
Inventor: Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Jiahui Yu
IPC: G06V20/40, G06F16/583
Abstract: Provided is an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning, and video question-answering. Some example implementations include a model which can be referred to as VideoCoCa. Example implementations reuse a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or a perceiver resampler) and finetune the modified architecture on video-text data, aspects of the present disclosure leverage findings that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks.
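The idea of pooling over "flattened frame embeddings" can be illustrated with a minimal sketch. It is not the disclosed implementation: the single query vector, the single attention head, the absence of learned key/value projections, and the function names `flatten_frames` and `attentional_pool` are all assumptions made for illustration.

```python
import math

def flatten_frames(frame_tokens):
    """Concatenate per-frame token embeddings into one long token sequence."""
    return [tok for frame in frame_tokens for tok in frame]

def attentional_pool(query, tokens):
    """Single-query, single-head attention pooling (no learned projections).

    Returns the pooled embedding and the softmax attention weights.
    """
    d = len(query)
    # Scaled dot-product score between the query and every token.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    # Numerically stable softmax over all tokens of all frames.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the token embeddings.
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens))
              for i in range(d)]
    return pooled, weights

# Hypothetical toy input: 2 frames, 3 tokens per frame, embedding size 2.
video = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]],
]
tokens = flatten_frames(video)          # 6 tokens in one sequence
pooled, weights = attentional_pool([1.0, 0.0], tokens)
```

Because the pooler attends over a variable-length token sequence, the same function accepts the tokens of one image or the flattened tokens of many frames without any architectural change, which is the property the abstract relies on.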
-
Publication No.: US20250111671A1
Publication Date: 2025-04-03
Application No.: US18900457
Filing Date: 2024-09-27
Applicant: Google LLC
Inventor: Tao Zhu, Jiahui Yu, Jingchen Feng, Kai Chen, Pooya Abolghasemi, Gagan Bansal, Jieren Xu, Hui Miao, Yaping Zhang, Shuchao Bi, Yonghui Wu, Claire Cui, Rohan Anil
IPC: G06V20/40, G06F40/284, G10L25/57
Abstract: Methods and systems for media item characterization based on multimodal embeddings are provided herein. A media item including a sequence of video frames is identified. A set of video embeddings representing visual features of the sequence of video frames is obtained. A set of audio embeddings representing audio features of the sequence of video frames is obtained. A set of audiovisual embeddings is generated based on the set of video embeddings and the set of audio embeddings. Each of the set of audiovisual embeddings represents a visual feature and an audio feature of a respective video frame of the sequence of video frames. One or more media characteristics associated with the media item are determined based on the set of audiovisual embeddings.
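A minimal sketch of the per-frame fusion step follows. The abstract states only that the audiovisual embeddings are "generated based on" the video and audio embeddings; fusing by concatenation, summarizing by mean pooling, and the names `fuse_audiovisual` and `mean_pool` are assumptions chosen for illustration.

```python
def fuse_audiovisual(video_embeddings, audio_embeddings):
    """Build one audiovisual embedding per frame by concatenating the
    frame's video embedding with its audio embedding (assumed fusion)."""
    if len(video_embeddings) != len(audio_embeddings):
        raise ValueError("expected one audio embedding per video frame")
    return [list(v) + list(a)
            for v, a in zip(video_embeddings, audio_embeddings)]

def mean_pool(embeddings):
    """Average a sequence of equal-length embeddings into a single vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

# Hypothetical toy input: 2 frames, video dim 2, audio dim 1.
video = [[1.0, 2.0], [3.0, 4.0]]
audio = [[0.5], [1.5]]

fused = fuse_audiovisual(video, audio)   # [[1.0, 2.0, 0.5], [3.0, 4.0, 1.5]]
summary = mean_pool(fused)               # [2.0, 3.0, 1.0]
```

A downstream characterization step could consume either the per-frame `fused` sequence or the pooled `summary`; which one the disclosure uses is not stated in the abstract.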
-
Publication No.: US20250118060A1
Publication Date: 2025-04-10
Application No.: US18900473
Filing Date: 2024-09-27
Applicant: Google LLC
Inventor: Mingyan Gao, Tao Zhu, Hui Miao, Ye Jin, Bibang Liu, Qiao Zhang, Jeffrey Daniel Forrester
Abstract: Methods and systems for media trend identification of content sharing platforms are provided herein. A set of audiovisual embeddings that represent audiovisual features of a media item is obtained. A set of textual embeddings that represent textual features of the media item is obtained. The obtained set of audiovisual embeddings and the obtained set of textual embeddings are provided as an input to an artificial intelligence (AI) model trained to predict whether a respective media item is associated with one or more media trends of a platform based on given embeddings for the media item. One or more outputs of the AI model are obtained. A determination is made, based on the one or more outputs of the AI model, whether the media item is associated with the one or more media trends of the platform.
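The flow described above (embeddings in, model outputs out, then a determination) might be sketched as follows. The abstract does not specify the model; the independent logistic scorers, the hand-picked weights, the 0.5 threshold, and the names `predict_trend_scores` and `associated_trends` are all hypothetical.

```python
import math

def predict_trend_scores(av_embedding, text_embedding, weights, biases):
    """Score each trend with an independent logistic unit over the
    concatenated audiovisual and textual embeddings (assumed model)."""
    features = list(av_embedding) + list(text_embedding)
    scores = []
    for w, b in zip(weights, biases):
        logit = sum(wi * xi for wi, xi in zip(w, features)) + b
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores

def associated_trends(scores, threshold=0.5):
    """Return indices of trends whose score meets the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Hypothetical toy input: pooled embeddings of size 2 each, 2 candidate trends.
av = [1.0, 0.0]
text = [0.0, 1.0]
weights = [[2.0, 0.0, 0.0, 2.0], [-2.0, 0.0, 0.0, -2.0]]
biases = [0.0, 0.0]

scores = predict_trend_scores(av, text, weights, biases)
trends = associated_trends(scores)   # [0]
```

In practice the weights would be learned from labeled media items rather than set by hand.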
-