-
Publication Number: US20240380949A1
Publication Date: 2024-11-14
Application Number: US18314019
Filing Date: 2023-05-08
Applicant: Lemon Inc.
Inventor: Linjie YANG , Heng WANG , Yuhan SHEN , Longyin WEN , Haichao YU
IPC: H04N21/488 , H04N21/2389 , H04N21/84
Abstract: A system and a method are provided that include a processor executing a caption generation program to receive an input video, sample video frames from the input video, extract video frames from the input video, extract video embeddings and audio embeddings from the video frames, including local video tokens and local audio tokens, respectively, input the local video tokens and the local audio tokens into at least a transformer layer of a cross-modal encoder to generate multi-modal embeddings, and generate video captions based on the multi-modal embeddings using a caption decoder.
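The pipeline in this abstract — local video tokens and local audio tokens fed jointly through a transformer layer to produce multi-modal embeddings — can be illustrated with a minimal numpy sketch. This is an illustrative toy, not the patented implementation: the token counts, embedding size, single-head attention, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_layer(video_tokens, audio_tokens, rng):
    """Toy single transformer layer: joint self-attention over the
    concatenated local video and audio tokens, so each output embedding
    can attend across both modalities."""
    tokens = np.concatenate([video_tokens, audio_tokens], axis=0)  # (Tv+Ta, d)
    d = tokens.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))      # (Tv+Ta, Tv+Ta) attention map
    return tokens + attn @ v                  # residual: multi-modal embeddings

rng = np.random.default_rng(0)
video_tokens = rng.standard_normal((16, 32))  # 16 local video tokens, dim 32
audio_tokens = rng.standard_normal((8, 32))   # 8 local audio tokens, dim 32
mm = cross_modal_layer(video_tokens, audio_tokens, rng)
print(mm.shape)  # (24, 32): one multi-modal embedding per input token
```

In the described system, embeddings like `mm` would then condition a caption decoder; here the sketch stops at the fused representation.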
-
Publication Number: US20230237764A1
Publication Date: 2023-07-27
Application Number: US17581423
Filing Date: 2022-01-21
Applicant: Lemon Inc.
Inventor: Linjie YANG , Yiming CUI , Ding LIU
CPC classification number: G06V10/513 , G06V10/764 , G06T7/70 , G06V10/94 , G06V10/87 , G06V10/82 , G06T2207/30242 , G06T2207/20081 , G06T2207/20084
Abstract: Described are examples for detecting objects in an image on a device including setting, based on a condition, a number of sparse proposals to use in performing object detection in the image, performing object detection in the image based on providing the sparse proposals as input to an object detection process to infer object location and classification of one or more objects in the image, and indicating, to an application and based on an output of the object detection process, the object location and classification of the one or more objects.
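The key idea here — setting the number of sparse proposals based on a runtime condition before running detection — can be sketched as follows. The condition variables, thresholds, and the stubbed detector are all hypothetical stand-ins, not the claimed method.

```python
def choose_num_proposals(battery_low: bool, on_gpu: bool) -> int:
    """Hypothetical policy: fewer sparse proposals when resources are
    constrained, more when a GPU is available (numbers are illustrative)."""
    if battery_low:
        return 50
    return 300 if on_gpu else 100

def detect(image, num_proposals):
    """Stub for a sparse-proposal detector: each proposal is refined into
    an inferred (location, class, score) for an object in the image."""
    return [("box", "class", 0.9)] * num_proposals  # placeholder output

n = choose_num_proposals(battery_low=False, on_gpu=True)
detections = detect(image=None, num_proposals=n)
print(n, len(detections))  # 300 300
```

The application receiving `detections` would then consume the object locations and classifications, as the abstract describes.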
-
Publication Number: US20230044969A1
Publication Date: 2023-02-09
Application Number: US17396055
Filing Date: 2021-08-06
Applicant: Lemon Inc.
Inventor: Linjie YANG , Peter LIN , Imran SALEEMI
Abstract: The present disclosure describes techniques for improving video matting. The techniques comprise extracting features from each frame of a video by an encoder of a model, wherein the video comprises a plurality of frames; incorporating, by a decoder of the model, into any particular frame temporal information extracted from one or more frames previous to the particular frame, wherein the particular frame and the one or more previous frames are among the plurality of frames of the video, and the decoder is a recurrent decoder; and generating a representation of a foreground object included in the particular frame by the model, wherein the model is trained using a segmentation dataset and a matting dataset.
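The recurrent structure described above — a decoder that carries temporal state from earlier frames into the current frame's matte prediction — can be sketched in a few lines. Everything here is a toy stand-in: the encoder, the blend-based recurrence, and the sigmoid alpha head are illustrative assumptions, not the patented model.

```python
import numpy as np

def encoder(frame):
    """Stand-in feature extractor (the real model uses a CNN encoder)."""
    return frame.mean(axis=-1, keepdims=True)  # toy per-pixel feature

def recurrent_decode(features, state, blend=0.8):
    """Toy recurrent decoder: mix the current frame's features with a
    running state so previous frames inform the current alpha matte."""
    state = blend * features + (1 - blend) * state
    alpha = 1.0 / (1.0 + np.exp(-state))  # sigmoid: per-pixel alpha in (0, 1)
    return alpha, state

frames = np.random.default_rng(1).random((4, 8, 8, 3))  # T=4 small RGB frames
state = np.zeros((8, 8, 1))
mattes = []
for f in frames:
    alpha, state = recurrent_decode(encoder(f), state)
    mattes.append(alpha)
print(len(mattes), mattes[0].shape)  # 4 (8, 8, 1)
```

Each `alpha` map plays the role of the foreground representation for its frame; the recurrent `state` is what distinguishes this from per-frame matting.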
-