-
Publication number: US20240428586A1
Publication date: 2024-12-26
Application number: US18827088
Filing date: 2024-09-06
Applicant: Google LLC
Inventor: Anurag Arnab , Mostafa Dehghani , Georg Heigold , Chen Sun , Mario Lucic , Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of spatiotemporal representations from the video data, the plurality of spatiotemporal representations comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of spatiotemporal representations as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
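The abstract above describes extracting spatiotemporal representations from video frames before passing them to a transformer encoder. A minimal sketch of one common way to do this, splitting the video into non-overlapping spatio-temporal "tubelets" and flattening each into a token vector, is shown below; the function name, tubelet sizes, and use of NumPy are illustrative assumptions, not details from the filing:

```python
import numpy as np

def extract_tubelet_tokens(video, t=2, p=4):
    """Split a video of shape (T, H, W, C) into non-overlapping
    spatio-temporal tubelets of shape (t, p, p, C) and flatten each
    tubelet into a single token vector of length t * p * p * C."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "dims must divide evenly"
    tokens = (
        video
        # separate tubelet-grid axes from within-tubelet axes
        .reshape(T // t, t, H // p, p, W // p, p, C)
        # bring the three tubelet-grid axes to the front
        .transpose(0, 2, 4, 1, 3, 5, 6)
        # one row per tubelet, flattened
        .reshape(-1, t * p * p * C)
    )
    return tokens
```

Each resulting row would then be linearly projected and fed to the transformer encoder as one input token.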
-
Publication number: US20240346824A1
Publication date: 2024-10-17
Application number: US18634794
Filing date: 2024-04-12
Applicant: Google LLC
Inventor: Alexey Alexeevich Gritsenko , Xuehan Xiong , Josip Djolonga , Mostafa Dehghani , Chen Sun , Mario Lucic , Cordelia Luise Schmid , Anurag Arnab
IPC: G06V20/40 , G06T7/73 , G06V10/62 , G06V10/764 , G06V10/77 , G06V10/774 , G06V10/776 , G06V10/82
CPC classification number: G06V20/46 , G06T7/73 , G06V10/62 , G06V10/764 , G06V10/7715 , G06V10/774 , G06V10/776 , G06V10/82 , G06T2207/10016 , G06T2207/20081 , G06T2207/20084
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing action localization on an input video. In particular, a system maintains a set of query vectors and uses the input video and the set of query vectors to generate an action localization output for the input video. The action localization output includes, for each of one or more agents depicted in the video, data specifying, for each of one or more video frames in the video, a respective bounding box in the video frame that depicts the agent and a respective action from a set of actions that is being performed by the agent in the video frame.
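The action localization output described above pairs, for each agent and each frame, a bounding box with an action label. A small sketch of that output data structure, assembled from decoded query vectors, is given below; the class names, the dictionary-based query format, and the normalized-coordinate convention are hypothetical choices for illustration, not from the filing:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max) in normalized [0, 1] image coordinates
BBox = Tuple[float, float, float, float]

@dataclass
class AgentTrack:
    """Per-agent localization output: one box and one action per frame."""
    agent_id: int
    boxes: Dict[int, BBox]    # frame index -> bounding box for this agent
    actions: Dict[int, str]   # frame index -> action label for this agent

def build_output(decoded_queries: List[dict]) -> List[AgentTrack]:
    """Assemble decoded query vectors (one per detected agent, in a
    hypothetical dict format) into the per-agent, per-frame structure."""
    tracks = []
    for i, q in enumerate(decoded_queries):
        tracks.append(AgentTrack(agent_id=i,
                                 boxes=dict(q["boxes"]),
                                 actions=dict(q["actions"])))
    return tracks
```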
-
Publication number: US20250131694A1
Publication date: 2025-04-24
Application number: US18688257
Filing date: 2021-09-09
Applicant: Google LLC
Inventor: Ahmet Iscen , Jack Louis Valmadre , Anurag Arnab , Cordelia Luise Schmid
IPC: G06V10/774 , G06V10/74 , G06V10/764 , G06V10/77 , G06V10/776 , G06V10/82 , G06V20/70
Abstract: Systems and methods for classification-model training can use feature-representation neighbors to mitigate overfitting to label noise during training. The systems and methods disclosed herein can utilize neighbor consistency regularization to train a classification model with or without noisy labels. The systems and methods can include a combined loss function with both a supervised learning loss and a neighbor consistency regularization loss.
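The combined loss described above pairs a supervised term with a consistency term that pulls each example's prediction toward those of its feature-space neighbors. A toy sketch in plain Python follows; the exact form of the consistency term (a KL divergence against a similarity-weighted average of neighbor predictions) and the `alpha` mixing weight are illustrative assumptions, not the filing's formulation:

```python
import math

def cross_entropy(probs, label):
    """Supervised loss: negative log-probability of the labeled class."""
    return -math.log(probs[label])

def kl_div(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ncr_loss(probs, label, neighbor_probs, neighbor_sims, alpha=0.5):
    """Combined loss: (1 - alpha) * supervised CE on the (possibly noisy)
    label, plus alpha * KL pulling the prediction toward the
    similarity-weighted average of its neighbors' predictions."""
    z = sum(neighbor_sims)
    target = [sum(s * p[k] for s, p in zip(neighbor_sims, neighbor_probs)) / z
              for k in range(len(probs))]
    return (1 - alpha) * cross_entropy(probs, label) + alpha * kl_div(target, probs)
```

When an example's prediction already agrees with its neighbors, the consistency term vanishes and only the (down-weighted) supervised term remains, which is what limits the influence of an isolated noisy label.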
-
Publication number: US20240371164A1
Publication date: 2024-11-07
Application number: US18652703
Filing date: 2024-05-01
Applicant: Google LLC
Inventor: Shen Yan , Xuehan Xiong , Arsha Nagrani , Anurag Arnab , David Alexander Ross , Cordelia Schmid
IPC: G06V20/40 , G06V10/774 , G06V10/80
Abstract: Methods and systems for video localization using artificial intelligence are provided herein. A set of video embeddings representing features of one or more video frames of a media item and a set of textual embeddings corresponding to an event associated with the media item are obtained. Fused video-textual data is generated based on the set of video embeddings and the set of textual embeddings. The fused video-textual data indicates features of the video frames of the media item and textual data pertaining to the media item. The fused video-textual data is provided as an input to an artificial intelligence (AI) model trained to perform multiple video localization tasks with respect to media items of a platform. One or more outputs of the AI model are obtained. A segment of the media item that depicts the event is determined based on the one or more outputs of the AI model.
-
Publication number: US20230409899A1
Publication date: 2023-12-21
Application number: US17845753
Filing date: 2022-06-21
Applicant: Google LLC
Inventor: Michael Sahngwon Ryoo , Anthony Jacob Piergiovanni , Anelia Angelova , Anurag Arnab , Mostafa Dehghani
IPC: G06N3/08
CPC classification number: G06N3/08
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing a network input using a computer vision neural network with learned tokenization.
-
Publication number: US20240403636A1
Publication date: 2024-12-05
Application number: US18697257
Filing date: 2022-10-05
Applicant: GOOGLE LLC
Inventor: Valerii Likhosherstov , Mostafa Dehghani , Anurag Arnab , Krzysztof Marcin Choromanski , Mario Lucic , Yi Tay
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for executing and training a multi-modal, multi-task self-attention neural network.
-
Publication number: US20240127794A1
Publication date: 2024-04-18
Application number: US17957291
Filing date: 2022-09-30
Applicant: Google LLC
Inventor: Hongsuck Seo , Arsha Nagrani , Anurag Arnab , Cordelia Luise Schmid
CPC classification number: G10L15/063 , G10L15/24 , G10L15/26
Abstract: Systems and methods for performing captioning for image or video data are described herein. The method can include receiving unlabeled multimedia data, and outputting, from a machine learning model, one or more captions for the multimedia data. Training the machine learning model to create these outputs can include inputting a subset of video frames and a first utterance into the machine learning model, using the machine learning model to predict a predicted utterance based on the subset of video frames and the first utterance, and updating one or more parameters of the machine learning model based on a loss function that compares the predicted utterance with a second utterance.
-
Publication number: US20230177384A1
Publication date: 2023-06-08
Application number: US17545526
Filing date: 2021-12-08
Applicant: Google LLC
Inventor: Arsha Nagrani , Shan Yang , Anurag Arnab , Chen Sun , Cordelia Luise Schmid
Abstract: Example embodiments according to aspects of the present disclosure provide an example computer-implemented method for multimodal data processing with improved cross-modal attention. The example method includes inputting a multimodal sequence to an example machine-learned model. The example model includes a first modal processing stream receiving a first modal portion of the multimodal sequence and a second modal processing stream receiving a second modal portion of the multimodal sequence. The example method includes fusing the first modal processing stream and the second modal processing stream across one or more fusion layers of the machine-learned model through a plurality of cross-modal context encodings. The example method includes outputting an inference based at least in part on the plurality of cross-modal context encodings.
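One way to realize the cross-modal context encodings described above is to route all cross-modal exchange through a small set of shared "bottleneck" tokens, so neither stream attends directly to the other's full token sequence. The sketch below shows one such fusion layer using simplified single-head attention in NumPy; the bottleneck design, function names, and omission of projections, residuals, and normalization are all simplifying assumptions for illustration, not the filing's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Simplified single-head attention: keys and values share one matrix,
    and learned query/key/value projections are omitted."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

def bottleneck_fusion_layer(audio, video, bottleneck):
    """One fusion layer: each modality attends over its own tokens plus the
    shared bottleneck tokens, and the bottleneck is updated from both
    modalities, so cross-modal context flows only through the bottleneck."""
    audio_out = attend(audio, np.concatenate([audio, bottleneck]))
    video_out = attend(video, np.concatenate([video, bottleneck]))
    bottleneck_out = attend(bottleneck,
                            np.concatenate([audio, video, bottleneck]))
    return audio_out, video_out, bottleneck_out
```

Because the bottleneck is much smaller than either token sequence, it forces each fusion layer to compress cross-modal context rather than exchange it wholesale.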
-
Publication number: US20250053753A1
Publication date: 2025-02-13
Application number: US18448508
Filing date: 2023-08-11
Applicant: Google LLC
Inventor: Xingyi Zhou , Anurag Arnab , Chen Sun , Cordelia Luise Schmid
IPC: G06F40/40 , G06T7/246 , G06V10/22 , G06V10/774 , G06V10/776 , G06V20/40
Abstract: Provided are a new task and model for dense video object captioning—detecting, tracking, and captioning trajectories of all objects in a video. This task unifies spatial and temporal understanding of the video, and requires fine-grained language description. Example implementations of the proposed model for dense video object captioning can be trained end-to-end and can include different models for spatial localization, tracking, and captioning. As such, some example implementations of the present disclosure can train the proposed model with a mixture of disjoint tasks, and leverage diverse, large-scale datasets which supervise different parts of an example proposed model. This results in noteworthy zero-shot performance.
-
Publication number: US20240428587A1
Publication date: 2024-12-26
Application number: US18827133
Filing date: 2024-09-06
Applicant: Google LLC
Inventor: Anurag Arnab , Mostafa Dehghani , Georg Heigold , Chen Sun , Mario Lucic , Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of video tokens from the video data, the plurality of video tokens comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of video tokens as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
-