VIDEO CAPTIONING GENERATION SYSTEM AND METHOD

    Publication No.: US20240380949A1

    Publication Date: 2024-11-14

    Application No.: US18314019

    Filing Date: 2023-05-08

    Applicant: Lemon Inc.

    Abstract: A system and a method are provided that include a processor executing a caption generation program to receive an input video, sample video frames from the input video, extract video frames from the input video, extract video embeddings and audio embeddings from the video frames, including local video tokens and local audio tokens, respectively, input the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings, and generate video captions based on the multi-modal embeddings using a caption decoder.
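The fusion step in this abstract, where local video tokens and local audio tokens enter a transformer layer together, can be illustrated with a minimal sketch. The abstract does not specify the layer's internals, so everything below is an assumption for illustration: the function names `softmax`, `self_attention`, and `fuse_modalities` are hypothetical, and the attention is single-head with no learned projections, residual connections, or feed-forward sublayer. It only shows how concatenating the two token sequences lets each token attend across modalities.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projection matrices), which keeps the sketch short.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    output = []
    for query in tokens:
        # Similarity of this token to every token, video and audio alike.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens gives the mixed representation.
        mixed = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        output.append(mixed)
    return output


def fuse_modalities(video_tokens, audio_tokens):
    """Concatenate both token sequences and attend over the joint sequence."""
    joint = list(video_tokens) + list(audio_tokens)
    return self_attention(joint)


# Toy inputs: two video tokens and one audio token, all of dimension 2.
multi_modal = fuse_modalities([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
print(len(multi_modal), len(multi_modal[0]))
```

Each output vector is a weighted average of all input tokens, so a video token's output carries audio information and vice versa. A caption decoder, as named in the abstract, would then consume these multi-modal embeddings.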

    VIDEO PROCESSING METHOD AND DEVICE

    Publication No.: US20250104423A1

    Publication Date: 2025-03-27

    Application No.: US18725683

    Filing Date: 2022-12-27

    Applicant: Lemon Inc.

    Abstract: Provided in the embodiments of the present disclosure are a video processing method and device. The video processing method includes: determining a target image to be processed in a video; performing semantic segmentation on the target image through a convolutional neural network to obtain a first feature map, wherein the first feature map comprises a feature map corresponding to at least one semantic class; determining a target image region corresponding to the at least one semantic class in the target image according to the first feature map; wherein the at least one semantic class comprises an object-in-hand, and a training image adopted by the convolutional neural network in a training process is marked with an image region corresponding to the at least one semantic class.
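The step of determining a target image region from the first feature map can be sketched as follows. The abstract does not say how the region is derived, so this sketch assumes a per-pixel argmax over the per-class score maps, followed by a bounding box around the winning pixels. The class list `CLASSES` and the function names `region_mask` and `bounding_box` are hypothetical; only the "object-in-hand" class comes from the abstract.

```python
# Hypothetical class order; the abstract names only "object-in-hand".
CLASSES = ["background", "hand", "object_in_hand"]


def region_mask(feature_map, class_index):
    """Return a binary mask of pixels where the given class has the top score.

    feature_map[c][y][x] is the score of class c at pixel (x, y).
    """
    num_classes = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            scores = [feature_map[c][y][x] for c in range(num_classes)]
            best = max(range(num_classes), key=scores.__getitem__)
            row.append(1 if best == class_index else 0)
        mask.append(row)
    return mask


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the masked pixels, or None if empty."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 2x3 feature map with one score map per class.
feature_map = [
    [[0.90, 0.10, 0.10], [0.80, 0.20, 0.10]],  # background
    [[0.05, 0.20, 0.10], [0.10, 0.10, 0.20]],  # hand
    [[0.05, 0.70, 0.80], [0.10, 0.70, 0.70]],  # object_in_hand
]
mask = region_mask(feature_map, CLASSES.index("object_in_hand"))
print(mask, bounding_box(mask))
```

In practice the feature map would come from the convolutional neural network described in the abstract and would usually be resized to the input resolution before the region is read off.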
