Dense Video Object Captioning from Disjoint Vision

    Publication No.: US20250053753A1

    Publication Date: 2025-02-13

    Application No.: US18448508

    Filing Date: 2023-08-11

    Applicant: Google LLC

    Abstract: Provided are a new task and model for dense video object captioning—detecting, tracking, and captioning trajectories of all objects in a video. This task unifies spatial and temporal understanding of the video, and requires fine-grained language description. Example implementations of the proposed model for dense video object captioning can be trained end-to-end and can include different models for spatial localization, tracking, and captioning. As such, some example implementations of the present disclosure can train the proposed model with a mixture of disjoint tasks, and leverage diverse, large-scale datasets which supervise different parts of an example proposed model. This results in noteworthy zero-shot performance.
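
To illustrate training with a mixture of disjoint tasks, the following is a minimal sketch, assuming each training batch comes from a single dataset that annotates only some of the sub-tasks (spatial localization, tracking, captioning). The dataset names, the `SUPERVISION` table, and the `combined_loss` helper are hypothetical and do not come from the patent; the sketch only shows how loss terms without supervision might be left out for a given batch.

```python
from typing import Dict, Set

# Hypothetical datasets and the sub-tasks each one annotates (assumptions for
# illustration; the abstract does not name datasets or loss terms).
SUPERVISION: Dict[str, Set[str]] = {
    "detection_only": {"detection"},
    "image_captioning": {"detection", "captioning"},
    "video_tracking": {"detection", "tracking"},
}


def combined_loss(dataset: str, losses: Dict[str, float]) -> float:
    """Sum only the loss terms that the batch's source dataset can supervise."""
    supervised = SUPERVISION[dataset]
    return sum(value for task, value in losses.items() if task in supervised)


# Per-task loss values as a model might report them for one batch (made-up numbers).
losses = {"detection": 0.5, "tracking": 0.25, "captioning": 1.0}

for name in SUPERVISION:
    print(name, combined_loss(name, losses))
```

Under this assumption, each dataset updates only the parts of the model it can supervise, while the combined model can still be run end-to-end on all three sub-tasks at inference time.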
