METHOD OF PROCESSING MULTIMODAL RETRIEVAL TASKS, AND AN APPARATUS FOR THE SAME

    Publication No.: US20230237089A1

    Publication date: 2023-07-27

    Application No.: US18099711

    Filing date: 2023-01-20

    CPC classification number: G06F16/538 G06F16/2455

    Abstract: A method for multimodal content retrieval may include: receiving a search query corresponding to a request for content; aggregating word features extracted from the search query based on a first set of learned weights; aggregating region features extracted from each of a plurality of images, based on a second set of learned weights, independently of the word features; computing a similarity score between the aggregated word features and the aggregated region features for each of the plurality of images; selecting candidate images from the plurality of images based on the similarity scores between each of the plurality of images and the search query; and selecting at least one final image from the candidate images as a response to the search query, based on attended similarity scores of the candidate images with respect to the search query.
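
The two-stage retrieval described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: the softmax pooling in `aggregate`, the cosine similarity, the word-to-region cross-attention in `attended_score`, and all function and parameter names (`w_text`, `w_image`, `k`) are assumptions chosen for the example. The abstract only states that features are aggregated with learned weights, that candidates are chosen by similarity, and that the final image is chosen by attended similarity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def aggregate(features, weight_vec):
    """Pool feature vectors into one vector (assumed: softmax over learned scores)."""
    scores = softmax([dot(f, weight_vec) for f in features])
    dim = len(features[0])
    return [sum(s * f[i] for s, f in zip(scores, features)) for i in range(dim)]

def attended_score(words, regions):
    """Assumed attended similarity: each word attends over the image regions."""
    total = 0.0
    for w in words:
        attn = softmax([dot(w, r) for r in regions])
        context = [sum(a * r[i] for a, r in zip(attn, regions)) for i in range(len(w))]
        total += cosine(w, context)
    return total / len(words)

def retrieve(words, images, w_text, w_image, k=2):
    # Stage 1: text and images are pooled independently, so image vectors
    # could be precomputed offline; keep the k most similar images.
    query_vec = aggregate(words, w_text)
    coarse = {name: cosine(query_vec, aggregate(regions, w_image))
              for name, regions in images.items()}
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:k]
    # Stage 2: re-rank only the candidates with the costlier attended score.
    return max(candidates, key=lambda name: attended_score(words, images[name]))
```

Because the first stage never mixes word and region features, the more expensive attended comparison runs on only `k` candidates rather than on every image.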

    METHOD OF PROCESSING MULTIMODAL TASKS, AND AN APPARATUS FOR THE SAME

    Publication No.: US20230259779A1

    Publication date: 2023-08-17

    Application No.: US17981024

    Filing date: 2022-11-04

    Inventor: Ning YE Zhiming HU

    CPC classification number: G06N3/084 G06N3/0445 G06N3/063 G06F16/738

    Abstract: An electronic device may obtain a query from a user input; obtain a sequence of frames of one or more input videos; select frames from the sequence of frames of the one or more input videos, via a sampler neural network configured to extract features from the sequence of frames that are input to the sampler neural network, determine temporal dependencies between the extracted features, and determine an action of selecting or skipping for each of the sequence of frames; and identify a video that matches the query via a multimodal neural network configured to receive the selected frames and the query, and output the video that matches the query, among the one or more input videos, wherein the sampler neural network and the multimodal neural network are jointly trained based on an aggregated loss that combines an accuracy loss that represents an accuracy of determining the video that matches the query, and an efficiency loss that reflects a proportion of frames being passed to the multimodal neural network.
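
The aggregated training loss in this abstract can be illustrated with a small sketch. The abstract says only that an accuracy loss and an efficiency loss are combined; the cross-entropy form of the accuracy term, the weighting factor `lam`, the fixed selection `threshold`, and all function names below are assumptions made for the example.

```python
import math

def accuracy_loss(video_scores, target_index):
    """Assumed form: cross-entropy of the matching video under a softmax."""
    m = max(video_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in video_scores))
    return log_z - video_scores[target_index]

def efficiency_loss(keep_probs):
    """Expected proportion of frames passed to the multimodal network."""
    return sum(keep_probs) / len(keep_probs)

def aggregated_loss(video_scores, target_index, keep_probs, lam=0.5):
    # lam is a hypothetical trade-off weight between accuracy and frame budget.
    return accuracy_loss(video_scores, target_index) + lam * efficiency_loss(keep_probs)

def select_frames(frames, keep_probs, threshold=0.5):
    """Select-or-skip action per frame (assumed: threshold on keep probability)."""
    return [f for f, p in zip(frames, keep_probs) if p >= threshold]
```

With the accuracy term held fixed, a sampler that keeps fewer frames yields a lower aggregated loss, which is the pressure that lets joint training trade accuracy against compute.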

    IMAGE AND VIDEO CLASSIFICATION
    Invention application

    Publication No.: US20250086936A1

    Publication date: 2025-03-13

    Application No.: US18670139

    Filing date: 2024-05-21

    Abstract: Provided are a system, a method, and a device for determining a classification of an image. According to embodiments, the method may include: determining, by a patch sampler model, selection probabilities of a first plurality of patches included in an image; selecting, by the patch sampler model, a second plurality of patches from among the first plurality of patches of the image based on the selection probabilities; and determining a classification of the image by processing the second plurality of patches through an encoder; wherein the patch sampler model may be trained based on a sampling loss which indicates a difference between the selection probabilities and attention scores of the image obtained via the encoder.
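
The patch selection and sampling loss in this abstract can be sketched as below. The abstract does not say how the "difference" between selection probabilities and attention scores is measured, nor how patches are picked from the probabilities; the KL divergence in `sampling_loss` and the top-k rule in `select_patches` are assumptions for illustration, as are all names.

```python
import math

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def select_patches(selection_probs, k):
    """Assumed rule: keep the indices of the k most probable patches, in image order."""
    ranked = sorted(range(len(selection_probs)),
                    key=lambda i: selection_probs[i], reverse=True)
    return sorted(ranked[:k])

def sampling_loss(selection_probs, attention_scores, eps=1e-12):
    """Assumed form: KL(attention || selection) over normalized patch distributions."""
    p = normalize(attention_scores)   # target: where the encoder attends
    q = normalize(selection_probs)    # prediction: what the sampler would keep
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

Under this assumption the loss is zero when the sampler's probabilities match the encoder's attention and grows as the sampler favours patches the encoder ignores.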
