-
Publication No.: US20230005284A1
Publication Date: 2023-01-05
Application No.: US17943458
Application Date: 2022-09-13
Inventor: Feng HE , Qi WANG , Hu YANG , Shuai CHEN , Zhifan FENG , Chunguang CHAI
IPC: G06V30/19 , G06F16/583
Abstract: A computer-implemented method is provided. The method includes: obtaining a sample text and a sample image corresponding to the sample text; labeling a true semantic tag for the sample text according to a first preset rule; obtaining a text feature representation of the sample text and a predicted semantic tag output by a text coding sub-model; obtaining an image feature representation of the sample image output by an image coding sub-model; calculating a first loss based on the true semantic tag and the predicted semantic tag; calculating a contrast loss based on the text feature representation of the sample text and the image feature representation of the sample image; adjusting parameters of the text coding sub-model based on the first loss and the contrast loss; and adjusting parameters of the image coding sub-model based on the contrast loss.
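A minimal sketch of the training step described above, assuming PyTorch-style sub-models and an InfoNCE-style formulation of the contrast loss; all function and variable names here are illustrative, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def training_step(text_encoder, image_encoder, classifier_head,
                  text_batch, image_batch, true_tags, temperature=0.07):
    """One update over a batch of paired samples (illustrative sketch)."""
    text_feat = text_encoder(text_batch)      # (B, D) text feature representation
    image_feat = image_encoder(image_batch)   # (B, D) image feature representation

    # First loss: predicted semantic tag vs. the labeled true semantic tag.
    tag_logits = classifier_head(text_feat)
    first_loss = F.cross_entropy(tag_logits, true_tags)

    # Contrast loss: matched text/image pairs are pulled together and
    # mismatched pairs pushed apart (InfoNCE-style, an assumption here).
    text_n = F.normalize(text_feat, dim=-1)
    image_n = F.normalize(image_feat, dim=-1)
    logits = text_n @ image_n.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrast_loss = (F.cross_entropy(logits, targets) +
                     F.cross_entropy(logits.t(), targets)) / 2

    # The text coding sub-model receives gradients from both losses; the
    # image coding sub-model only from the contrast loss, since the first
    # loss has no gradient path into it.
    (first_loss + contrast_loss).backward()
    return first_loss.item(), contrast_loss.item()
```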
-
Publication No.: US20230010160A1
Publication Date: 2023-01-12
Application No.: US17945415
Application Date: 2022-09-15
Inventor: Shuai CHEN , Qi WANG , Hu YANG , Feng HE , Zhifan FENG , Chunguang CHAI , Yong ZHU
Abstract: Disclosed are a method for processing multimodal data using a neural network, a device, and a medium, relating to the field of artificial intelligence and, in particular, to multimodal data processing, video classification, and deep learning. The neural network includes: an input subnetwork configured to receive the multimodal data to output respective first features of a plurality of modalities; a plurality of cross-modal feature subnetworks, each of which is configured to receive respective first features of two corresponding modalities to output a cross-modal feature corresponding to the two modalities; a plurality of cross-modal fusion subnetworks, each of which is configured to receive at least one cross-modal feature corresponding to a corresponding target modality and other modalities to output a second feature of the target modality; and an output subnetwork configured to receive respective second features of the plurality of modalities to output a processing result of the multimodal data.
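A compact sketch of the described topology for three modalities, assuming simple linear layers for every subnetwork; the layer choices, dimensions, and class name are assumptions for illustration only:

```python
import itertools
import torch
import torch.nn as nn

class CrossModalNet(nn.Module):
    """Input subnetwork -> cross-modal feature subnetworks (one per modality
    pair) -> cross-modal fusion subnetworks (one per target modality) ->
    output subnetwork. Illustrative sketch only."""
    def __init__(self, in_dims, dim=256, num_classes=10):
        super().__init__()
        self.modalities = list(in_dims)
        self.pairs = list(itertools.combinations(self.modalities, 2))
        self.inputs = nn.ModuleDict({m: nn.Linear(d, dim) for m, d in in_dims.items()})
        self.cross = nn.ModuleDict({f"{a}_{b}": nn.Linear(2 * dim, dim) for a, b in self.pairs})
        n_rel = len(self.modalities) - 1   # pairs involving a given target modality
        self.fusion = nn.ModuleDict({m: nn.Linear(n_rel * dim, dim) for m in self.modalities})
        self.output = nn.Linear(len(self.modalities) * dim, num_classes)

    def forward(self, data):               # data: {modality_name: tensor of shape (B, in_dim)}
        first = {m: torch.relu(self.inputs[m](data[m])) for m in self.modalities}
        cross = {(a, b): torch.relu(self.cross[f"{a}_{b}"](torch.cat([first[a], first[b]], -1)))
                 for a, b in self.pairs}
        second = {m: torch.relu(self.fusion[m](
                      torch.cat([v for k, v in cross.items() if m in k], -1)))
                  for m in self.modalities}
        return self.output(torch.cat([second[m] for m in self.modalities], -1))

# Example: net = CrossModalNet({"text": 768, "vision": 2048, "audio": 128})
```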
-
Publication No.: US20220284246A1
Publication Date: 2022-09-08
Application No.: US17502385
Application Date: 2021-10-15
Inventor: Feng HE , Qi WANG , Zhifan FENG , Hu YANG , Chunguang CHAI
IPC: G06K9/62
Abstract: The present disclosure discloses a method for training a cross-modal retrieval model, an electronic device and a storage medium, and relates to the field of computer technologies, and particularly to the field of artificial intelligence technologies, such as knowledge graph technologies, computer vision technologies, deep learning technologies, or the like. The method for training a cross-modal retrieval model includes: determining similarity of a cross-modal sample pair according to the cross-modal sample pair, the cross-modal sample pair including a sample of a first modality and a sample of a second modality, and the first modality being different from the second modality; determining a soft margin based on the similarity, and determining a soft margin loss function based on the soft margin; and determining a total loss function based on the soft margin loss function, and training a cross-modal retrieval model according to the total loss function.
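As a rough illustration of a soft margin derived from pair similarity, here is one plausible formulation; the exact way the patent computes the similarity, the margin, and the total loss is not specified in the abstract, so this is an assumption:

```python
import torch.nn.functional as F

def soft_margin_ranking_loss(anchor, positive, negative,
                             base_margin=0.2, scale=0.2):
    """Ranking loss whose margin is not a fixed constant but is derived
    from the similarity of the cross-modal (anchor, positive) sample pair.
    Illustrative only; the concrete margin formula is an assumption."""
    anchor = F.normalize(anchor, dim=-1)      # e.g. text features
    positive = F.normalize(positive, dim=-1)  # matching image features
    negative = F.normalize(negative, dim=-1)  # non-matching image features

    sim_pos = (anchor * positive).sum(dim=-1)    # similarity of the sample pair
    sim_neg = (anchor * negative).sum(dim=-1)
    soft_margin = base_margin + scale * sim_pos  # margin adapts to pair similarity
    soft_margin_loss = F.relu(soft_margin - sim_pos + sim_neg).mean()
    return soft_margin_loss

# A total loss would typically add this term to other objectives, e.g.
# total = soft_margin_ranking_loss(t, i_pos, i_neg) + other_loss
```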
-
Publication No.: US20230115737A1
Publication Date: 2023-04-13
Application No.: US18080432
Application Date: 2022-12-13
Inventor: Shuai CHEN , Qi WANG , Zhifan FENG , Chunguang CHAI , Yong ZHU
IPC: G06F16/483 , G06F16/43 , G06F18/25 , G06F18/22 , G06N5/02
Abstract: A method of processing multimedia data, a device, and a medium are provided, relating to the field of artificial intelligence technology, in particular to the fields of knowledge graph and deep learning. The method of processing the multimedia data includes: recognizing the multimedia data so as to obtain at least one piece of key information of the multimedia data; querying a predetermined knowledge base according to the at least one piece of key information, so as to determine a multimedia name associated with the at least one piece of key information and an association degree between the multimedia name and the at least one piece of key information; and determining, in the multimedia name, a name of the multimedia data based on a similarity between alternative multimedia data for the multimedia name and the multimedia data, in response to the association degree being less than a first threshold value.
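A hypothetical end-to-end flow of this method in plain Python; `recognize`, `similarity`, the knowledge-base layout, and the function name are placeholders assumed for illustration, not the patent's API:

```python
def resolve_multimedia_name(media, knowledge_base, recognize, similarity,
                            first_threshold=0.8):
    """Illustrative sketch of: recognize -> query knowledge base ->
    fall back to content similarity when the association degree is low."""
    key_infos = recognize(media)   # e.g. title text, faces, audio keywords

    # Each piece of key information maps to candidate entries of the form
    # (multimedia_name, association_degree, alternative_media_list).
    candidates = [entry for info in key_infos
                  for entry in knowledge_base.get(info, [])]
    if not candidates:
        return None

    name, assoc, _ = max(candidates, key=lambda c: c[1])
    if assoc >= first_threshold:
        return name    # association degree alone is decisive

    # Association degree below the first threshold: pick the name whose
    # alternative multimedia data is most similar to the media itself.
    scored = [(max(similarity(media, alt) for alt in alts), cand_name)
              for cand_name, _, alts in candidates if alts]
    return max(scored, default=(0.0, None))[1]
```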
-
Publication No.: US20220284218A1
Publication Date: 2022-09-08
Application No.: US17502173
Application Date: 2021-10-15
Inventor: Hu YANG , Feng HE , Qi WANG , Zhifan FENG , Chunguang CHAI , Yong ZHU
Abstract: The present disclosure discloses a video classification method, an electronic device and a storage medium, and relates to the field of computer technologies, and particularly to the field of artificial intelligence technologies, such as knowledge graph technologies, computer vision technologies, deep learning technologies, or the like. The video classification method includes: extracting a keyword from a video according to multi-modal information of the video; acquiring background knowledge corresponding to the keyword, and determining a text to be recognized according to the keyword and the background knowledge; and classifying the text to be recognized to obtain a class of the video.
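Sketched as a plain pipeline, assuming placeholder helpers for keyword extraction, background-knowledge lookup (e.g. from a knowledge graph), and the text classifier; none of these names come from the patent:

```python
def classify_video(video, extract_keywords, fetch_background, text_classifier):
    """Illustrative pipeline for keyword -> background knowledge -> text
    classification; helper functions are assumed placeholders."""
    # 1. Keywords from the video's multi-modal information (title, speech, OCR, ...).
    keywords = extract_keywords(video)

    # 2. Background knowledge corresponding to each keyword.
    background = [fetch_background(kw) for kw in keywords]

    # 3. The text to be recognized combines keywords and their background knowledge.
    text_to_recognize = " ".join(keywords + [b for b in background if b])

    # 4. Classifying the text yields the class of the video.
    return text_classifier(text_to_recognize)
```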
-
Publication No.: US20220027634A1
Publication Date: 2022-01-27
Application No.: US17450158
Application Date: 2021-10-06
Inventor: Qi WANG , Zhifan FENG , Hu YANG , Chunguang CHAI
Abstract: A video processing method, an electronic device and a storage medium are provided, relating to the field of artificial intelligence and, in particular, to the fields of deep learning, model training, knowledge graph, video processing, and the like. The method includes: acquiring a plurality of first video frames, and performing fine-grained splitting on the plurality of first video frames to obtain a plurality of second video frames; performing feature encoding on the plurality of second video frames according to multi-mode information related to the plurality of second video frames, to obtain feature fusion information for characterizing fusion of the multi-mode information; and performing similarity matching on the plurality of second video frames according to the feature fusion information, and obtaining a target video according to a result of the similarity matching.
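A rough sketch of one way the splitting, fused feature encoding, and similarity matching could fit together, assuming cosine similarity between consecutive fused features and treating the longest matched segment as the target video; the helper functions and that grouping rule are assumptions, not the patent's definitions:

```python
import numpy as np

def build_target_video(first_frames, split_fine, encode_multimodal,
                       sim_threshold=0.9):
    """Illustrative sketch; `split_fine` and `encode_multimodal` stand in
    for the fine-grained splitting and multi-mode feature encoding steps."""
    # Fine-grained splitting of the first video frames into second video frames.
    second_frames = [f for frame in first_frames for f in split_fine(frame)]
    if not second_frames:
        return []

    # Feature encoding that fuses the multi-mode information (visual, audio,
    # text, ...) of each second frame into one normalized feature vector.
    fused = np.stack([encode_multimodal(f) for f in second_frames])
    fused = fused / np.linalg.norm(fused, axis=1, keepdims=True)

    # Similarity matching: consecutive second frames whose fused features are
    # similar enough are grouped together; the longest group is returned here.
    segments, current = [], [second_frames[0]]
    for i in range(1, len(second_frames)):
        if float(fused[i] @ fused[i - 1]) >= sim_threshold:
            current.append(second_frames[i])
        else:
            segments.append(current)
            current = [second_frames[i]]
    segments.append(current)
    return max(segments, key=len)
```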
-