-
Publication No.: US20240380949A1
Publication Date: 2024-11-14
Application No.: US18314019
Filing Date: 2023-05-08
Applicant: Lemon Inc.
Inventor: Linjie YANG , Heng WANG , Yuhan SHEN , Longyin WEN , Haichao YU
IPC: H04N21/488 , H04N21/2389 , H04N21/84
Abstract: A system and a method are provided that include a processor executing a caption generation program to receive an input video, sample video frames from the input video, extract video frames from the input video, extract video embeddings and audio embeddings from the video frames, including local video tokens and local audio tokens, respectively, input the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings, and generate video captions based on the multi-modal embeddings using a caption decoder.
-
2.
Publication No.: US20240233350A1
Publication Date: 2024-07-11
Application No.: US18408967
Filing Date: 2024-01-10
Applicant: Lemon Inc. , Beijing Zitiao Network Technology Co., Ltd.
Inventor: Xiaojie JIN , Fan MA , Jiashi FENG , Heng WANG , Jingjia HUANG
IPC: G06V10/80 , G06F40/284 , G06V10/774 , G06V20/40
CPC classification number: G06V10/806 , G06F40/284 , G06V10/774 , G06V20/46
Abstract: The embodiments of the disclosure provide a processing method, apparatus, electronic device and non-transitory computer-readable storage medium for multimodal data, wherein the method includes: obtaining data to be processed of an original modality; determining result data of a target modality corresponding to the data to be processed by processing the data to be processed with a target processing model; wherein the target processing model comprises a multimodal submodel, and the pre-training task of the multimodal submodel includes a task of locating local data that matches second modal data from first modal data; wherein when the first modal data belongs to the original modality, the second modal data belongs to the target modality; when the first modal data belongs to the target modality, the second modal data belongs to the original modality.
-