Publication No.: US20250053753A1
Publication Date: 2025-02-13
Application No.: US18448508
Filing Date: 2023-08-11
Applicant: Google LLC
Inventor: Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Luise Schmid
IPC: G06F40/40, G06T7/246, G06V10/22, G06V10/774, G06V10/776, G06V20/40
Abstract: Provided are a new task and model for dense video object captioning—detecting, tracking, and captioning trajectories of all objects in a video. This task unifies spatial and temporal understanding of the video, and requires fine-grained language description. Example implementations of the proposed model for dense video object captioning can be trained end-to-end and can include different models for spatial localization, tracking, and captioning. As such, some example implementations of the present disclosure can train the proposed model with a mixture of disjoint tasks, and leverage diverse, large-scale datasets which supervise different parts of an example proposed model. This results in noteworthy zero-shot performance.
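The abstract's idea of training on "a mixture of disjoint tasks", where each dataset supervises only part of the model, can be illustrated with a minimal sketch. This is not the implementation disclosed in the application: the dataset names, the per-component loss values, and the function `mixed_task_loss` are hypothetical and chosen only for illustration. The sketch assumes one loss per component (detection, tracking, captioning) and sums only the losses for which the current dataset provides labels.

```python
from typing import Dict, Tuple

# Component names follow the abstract: spatial localization (detection),
# tracking, and captioning.
HEADS: Tuple[str, ...] = ("detection", "tracking", "captioning")

# Hypothetical datasets (assumption, not from the source); each one
# supervises only a subset of the components.
DATASET_SUPERVISION: Dict[str, Tuple[str, ...]] = {
    "image_detection_set": ("detection",),
    "image_region_caption_set": ("detection", "captioning"),
    "video_tracking_set": ("detection", "tracking"),
}


def mixed_task_loss(dataset: str, per_head_loss: Dict[str, float]) -> float:
    """Sum only the losses of components that `dataset` has labels for."""
    supervised = DATASET_SUPERVISION[dataset]
    return sum(per_head_loss[head] for head in HEADS if head in supervised)


# Usage: a batch from the hypothetical tracking dataset ignores the
# captioning loss, because that dataset carries no caption labels.
losses = {"detection": 1.0, "tracking": 2.0, "captioning": 4.0}
tracking_batch_loss = mixed_task_loss("video_tracking_set", losses)
```

Under this assumption, components without labels in a given batch receive no training signal from that batch, which is one simple way that disjoint datasets could jointly supervise a single model.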