-
Publication Number: US20220398450A1
Publication Date: 2022-12-15
Application Number: US17348246
Filing Date: 2021-06-15
Applicant: Lemon Inc.
Inventor: Xiaojie JIN , Daquan Zhou , Xiaochen Lian , Linjie Yang , Jiashi Feng
Abstract: A super-network comprising a plurality of layers may be generated, where each layer comprises cells with different structures. A predetermined number of cells may be selected from each layer. A plurality of cells may be generated from the selected cells using a local mutation model, wherein the local mutation model comprises a mutation window for removing redundant edges from each selected cell. Performance of the plurality of cells may be evaluated using a differentiable fitness scoring function, and a subset of cells may be selected based on the evaluation results. The operations of generating a plurality of cells using the local mutation model, evaluating performance of the plurality of cells using the differentiable fitness scoring function, and selecting the subset of cells based on the evaluation results may be performed iteratively until the super-network converges. After the super-network converges, a search space for each layer may be generated from a predetermined top number of cells with the largest fitness scores.
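The abstract describes an iterative mutate-evaluate-select loop over candidate cells. The Python sketch below illustrates the shape of that loop under stated assumptions: the edge-set cell encoding, the window-based edge removal, the toy fitness function, and the convergence test are all hypothetical stand-ins, since the patent text does not publish an implementation.

```python
import random

def local_mutation(cell, window_size=2):
    """Remove edges that fall inside a randomly placed mutation window.

    A cell is modeled as a frozenset of (src, dst) node pairs; the
    window-based edge removal is one plausible reading of the patent's
    'mutation window for removing redundant edges', not its actual rule.
    """
    nodes = sorted({n for edge in cell for n in edge})
    if len(nodes) <= window_size:
        return cell
    start = random.randrange(len(nodes) - window_size + 1)
    window = set(nodes[start:start + window_size])
    return frozenset(e for e in cell if not set(e) <= window)

def fitness(cell):
    """Placeholder for the differentiable fitness scoring function;
    here a toy score that mildly prefers sparser cells."""
    return -len(cell) + random.random()

def search_layer(initial_cells, num_selected=4, top_k=2,
                 max_iters=100, tol=1e-3):
    """Iterate mutate -> evaluate -> select until the best score stops
    improving (a crude stand-in for super-network convergence), then
    return the top-k cells as the layer's search space."""
    selected = list(initial_cells)[:num_selected]
    best_score = float("-inf")
    for _ in range(max_iters):
        candidates = {local_mutation(c) for c in selected}   # mutate
        scores = {c: fitness(c) for c in candidates}         # evaluate
        ranked = sorted(candidates, key=scores.get, reverse=True)
        selected = ranked[:num_selected]                     # select
        if abs(scores[ranked[0]] - best_score) < tol:
            break
        best_score = scores[ranked[0]]
    return ranked[:top_k]

# Toy usage: four random cells over 5 nodes.
random.seed(0)
cells = [frozenset((i, j) for i in range(5) for j in range(i + 1, 5)
                   if random.random() < 0.6) for _ in range(4)]
print(search_layer(cells))
```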
-
Publication Number: US20240177486A1
Publication Date: 2024-05-30
Application Number: US18057691
Filing Date: 2022-11-21
Applicant: Lemon Inc. , Beijing Zitiao Network Technology Co., Ltd.
Inventor: Xiaojie JIN , Sen PEI
IPC: G06V20/40 , G06V10/62 , G06V10/774 , G06V10/776 , G06V10/80 , G06V10/82
CPC classification number: G06V20/46 , G06V10/62 , G06V10/774 , G06V10/776 , G06V10/806 , G06V10/82 , G06V20/41
Abstract: Systems and methods are provided that include a processor executing a video-classifying program to receive an input video, sample video frames from the input video, extract frame-wise spatial features from the video frames using a convolutional neural network, extract a frame-wise temporal feature for each video frame, aggregate the frame-wise spatial features and the frame-wise temporal feature for each video frame to provide temporal context to the frame-wise spatial features, input the aggregated frame-wise spatial and temporal features for each frame into a transformer encoder to obtain temporal-aware feature representations of the video frames, input the feature representations into a feedforward network model to obtain feedforward-transformed features, obtain a parameter by inputting each feedforward-transformed feature and a set of highlight prototypes into a function that compares the feedforward-transformed features to the set of highlight prototypes, and classify the video frames as highlights based on the obtained parameter.
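As a rough illustration of the frame-level pipeline the abstract describes, here is a minimal PyTorch sketch. All module sizes, the learned per-frame temporal embedding, the cosine-similarity prototype comparison, and the 0.5 threshold are assumptions made for the example, not the patent's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighlightClassifier(nn.Module):
    """Sketch of the frame-highlight pipeline from the abstract; every
    architectural choice here is an illustrative assumption."""

    def __init__(self, feat_dim=256, num_prototypes=8, max_frames=512):
        super().__init__()
        # Tiny CNN standing in for the frame-wise spatial feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Learned per-frame temporal feature (one plausible choice).
        self.temporal = nn.Embedding(max_frames, feat_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.ffn = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        # Set of highlight prototypes, learned jointly with the model.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        spatial = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        # Aggregate spatial features with the temporal feature per frame.
        x = spatial + self.temporal(torch.arange(t, device=frames.device))
        x = self.encoder(x)                        # temporal-aware features
        x = self.ffn(x)                            # feedforward-transformed
        # Parameter: max cosine similarity to any highlight prototype.
        sim = F.cosine_similarity(
            x.unsqueeze(2), self.prototypes.view(1, 1, -1, x.size(-1)), dim=-1)
        return sim.max(dim=-1).values              # (B, T) per-frame scores

model = HighlightClassifier()
scores = model(torch.randn(1, 16, 3, 64, 64))      # 16 sampled frames
highlights = scores > 0.5                          # threshold is an assumption
```

Thresholding the per-frame prototype-similarity scores yields the frame-level highlight classification the abstract describes; a real system would learn the prototypes and decision rule from labeled highlight data.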
-