VIDEO CAPTIONING GENERATION SYSTEM AND METHOD

    Publication No.: US20240380949A1

    Publication Date: 2024-11-14

    Application No.: US18314019

    Filing Date: 2023-05-08

    Applicant: Lemon Inc.

    Abstract: A system and a method are provided that include a processor executing a caption generation program to receive an input video, sample and extract video frames from the input video, extract video embeddings and audio embeddings from the video frames, including local video tokens and local audio tokens, respectively, input the local video tokens and the local audio tokens into at least one transformer layer of a cross-modal encoder to generate multi-modal embeddings, and generate video captions based on the multi-modal embeddings using a caption decoder.
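
    The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Every function name, the frame layout (a dict with "pixels" and "audio" vectors), the identity "embedding", the single-head attention with no learned weights, and the toy vocabulary are assumptions made only to show the order of the steps: sample frames, extract local video and audio tokens, fuse them in one attention layer, then decode a caption.

```python
import math

def sample_frames(frames, num_samples):
    """Uniformly sample num_samples frames (assumed sampling strategy)."""
    step = len(frames) / num_samples
    return [frames[int(i * step)] for i in range(num_samples)]

def extract_local_tokens(frames):
    """Hypothetical embedding step: reuse raw per-frame vectors as tokens.
    A real system would use learned video and audio encoders here."""
    video_tokens = [list(f["pixels"]) for f in frames]
    audio_tokens = [list(f["audio"]) for f in frames]
    return video_tokens, audio_tokens

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity
    projections, a toy stand-in for one transformer layer."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return outputs

def cross_modal_encode(video_tokens, audio_tokens):
    """Concatenate both token sequences so each token attends across
    modalities, yielding multi-modal embeddings."""
    return self_attention(video_tokens + audio_tokens)

def decode_caption(embeddings, vocab):
    """Placeholder decoder: mean-pool the embeddings and pick the phrase
    whose index matches the largest pooled dimension."""
    dim = len(embeddings[0])
    pooled = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    best = max(range(dim), key=lambda d: pooled[d])
    return vocab[best]

# Synthetic 10-frame "video"; values are made up for illustration.
video = [{"pixels": [i / 10, 1 - i / 10, 0.5], "audio": [0.1, 0.2, i / 10]}
         for i in range(10)]
vocab = ["a dog runs", "a person speaks", "music plays"]  # toy vocabulary

frames = sample_frames(video, 4)
video_tokens, audio_tokens = extract_local_tokens(frames)
multi_modal = cross_modal_encode(video_tokens, audio_tokens)
caption = decode_caption(multi_modal, vocab)
print(caption)
```

    Concatenating the two token sequences before attention is one simple way to let video and audio tokens interact. The abstract does not say how the cross-modal encoder combines them, so this choice is an assumption.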
