-
Publication No.: US20230409899A1
Publication Date: 2023-12-21
Application No.: US17845753
Filing Date: 2022-06-21
Applicant: Google LLC
Inventors: Michael Sahngwon Ryoo, Anthony Jacob Piergiovanni, Anelia Angelova, Anurag Arnab, Mostafa Dehghani
IPC Classification: G06N3/08
CPC Classification: G06N3/08
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing a network input using a computer vision neural network with learned tokenization.
-
Publication No.: US20240346824A1
Publication Date: 2024-10-17
Application No.: US18634794
Filing Date: 2024-04-12
Applicant: Google LLC
Inventors: Alexey Alexeevich Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lucic, Cordelia Luise Schmid, Anurag Arnab
IPC Classification: G06V20/40, G06T7/73, G06V10/62, G06V10/764, G06V10/77, G06V10/774, G06V10/776, G06V10/82
CPC Classification: G06V20/46, G06T7/73, G06V10/62, G06V10/764, G06V10/7715, G06V10/774, G06V10/776, G06V10/82, G06T2207/10016, G06T2207/20081, G06T2207/20084
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing action localization on an input video. In particular, a system maintains a set of query vectors and uses the input video and the set of query vectors to generate an action localization output for the input video. The action localization output includes, for each of one or more agents depicted in the video, data specifying, for each of one or more video frames in the video, a respective bounding box in the video frame that depicts the agent and a respective action from a set of actions that is being performed by the agent in the video frame.
-
Publication No.: US12112538B2
Publication Date: 2024-10-08
Application No.: US17370522
Filing Date: 2021-07-08
Applicant: Google LLC
Inventors: Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of video tokens from the video data, the plurality of video tokens comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of video tokens as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
-
Publication No.: US20230017072A1
Publication Date: 2023-01-19
Application No.: US17370522
Filing Date: 2021-07-08
Applicant: Google LLC
Inventors: Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of video tokens from the video data, the plurality of video tokens comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of video tokens as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
-
Publication No.: US20240127794A1
Publication Date: 2024-04-18
Application No.: US17957291
Filing Date: 2022-09-30
Applicant: Google LLC
CPC Classification: G10L15/063, G10L15/24, G10L15/26
Abstract: Systems and methods for performing captioning for image or video data are described herein. The method can include receiving unlabeled multimedia data, and outputting, from a machine learning model, one or more captions for the multimedia data. Training the machine learning model to create these outputs can include inputting a subset of video frames and a first utterance into the machine learning model, using the machine learning model to predict a predicted utterance based on the subset of video frames and the first utterance, and updating one or more parameters of the machine learning model based on a loss function that compares the predicted utterance with the second utterance.
-
Publication No.: US20230177384A1
Publication Date: 2023-06-08
Application No.: US17545526
Filing Date: 2021-12-08
Applicant: Google LLC
Inventors: Arsha Nagrani, Shan Yang, Anurag Arnab, Chen Sun, Cordelia Luise Schmid
Abstract: Example embodiments according to aspects of the present disclosure provide an example computer-implemented method for multimodal data processing with improved cross-modal attention. The example method includes inputting a multimodal sequence to an example machine-learned model. The example model includes a first modal processing stream receiving a first modal portion of the multimodal sequence and a second modal processing stream receiving a second modal portion of the multimodal sequence. The example model includes fusing the first modal processing stream and the second modal processing stream across one or more fusion layers of the machine-learned model through a plurality of cross-modal context encodings. The example method includes outputting an inference based at least in part on the plurality of cross-modal context encodings.