-
Publication No.: US20240403636A1
Publication Date: 2024-12-05
Application No.: US18697257
Filing Date: 2022-10-05
Applicant: GOOGLE LLC
Inventor: Valerii Likhosherstov, Mostafa Dehghani, Anurag Arnab, Krzysztof Marcin Choromanski, Mario Lucic, Yi Tay
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for executing and training a multi-modal, multi-task self-attention neural network.
-
Publication No.: US20240428587A1
Publication Date: 2024-12-26
Application No.: US18827133
Filing Date: 2024-09-06
Applicant: Google LLC
Inventor: Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of video tokens from the video data, the plurality of video tokens comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of video tokens as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
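The abstract does not say how the video tokens are extracted. As an illustration only, the sketch below assumes one possible scheme: cutting the video into non-overlapping spatiotemporal blocks ("tubelets") and flattening each block into one token. The function name `extract_tubelet_tokens`, the single-channel nested-list video layout, and the block sizes are hypothetical and not taken from the patent. A real model would typically also apply a learned projection to each token, which is omitted here.

```python
from typing import List

# Assumed layout: video[t][y][x], one channel, for brevity.
Video = List[List[List[float]]]


def extract_tubelet_tokens(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Split a T x H x W video into non-overlapping pt x ph x pw tubelets.

    Each tubelet is flattened into one token, so every token covers a small
    spatial window (ph x pw) over several consecutive frames (pt).
    """
    t_len, height, width = len(video), len(video[0]), len(video[0][0])
    if t_len % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the tubelet size")

    tokens: List[List[float]] = []
    for t0 in range(0, t_len, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Flatten the block in (time, row, column) order.
                token = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                tokens.append(token)
    return tokens
```

For a 4-frame, 4 x 4 video and 2 x 2 x 2 tubelets, this yields 8 tokens of 8 values each; the resulting token sequence is what a transformer encoder would take as input.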
-
Publication No.: US12112538B2
Publication Date: 2024-10-08
Application No.: US17370522
Filing Date: 2021-07-08
Applicant: Google LLC
Inventor: Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of video tokens from the video data, the plurality of video tokens comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of video tokens as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
-
Publication No.: US20230017072A1
Publication Date: 2023-01-19
Application No.: US17370522
Filing Date: 2021-07-08
Applicant: Google LLC
Inventor: Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of video tokens from the video data, the plurality of video tokens comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of video tokens as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
-
Publication No.: US20220375211A1
Publication Date: 2022-11-24
Application No.: US17737507
Filing Date: 2022-05-05
Applicant: Google LLC
Inventor: Ilya Tolstikhin, Neil Matthew Tinmouth Houlsby, Alexander Kolesnikov, Lucas Klaus Beyer, Alexey Dosovitskiy, Mario Lucic, Xiaohua Zhai, Thomas Unterthiner, Daniel M. Keysers, Jakob D. Uszkoreit, Yin Ching Jessica Yung, Andreas Steiner
IPC: G06V10/82, G06V10/764, G06N3/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing images using mixer neural networks. One of the methods includes obtaining one or more images comprising a plurality of pixels; determining, for each image of the one or more images, a plurality of image patches of the image, wherein each image patch comprises a different subset of the pixels of the image; processing, for each image of the one or more images, the corresponding plurality of image patches to generate an input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches; and processing the input sequences using a neural network to generate a network output that characterizes the one or more images, wherein the neural network comprises one or more mixer neural network layers.
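The patch and input-sequence steps in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: square, non-overlapping patches, a single-channel image stored as nested lists, and one input element per patch obtained by flattening. The name `image_to_patch_sequence` is hypothetical, and the mixer layers themselves are not modeled because the abstract does not describe them.

```python
from typing import List

# Assumed layout: image[y][x], one channel, for brevity.
Image = List[List[float]]


def image_to_patch_sequence(image: Image, patch: int) -> List[List[float]]:
    """Cut an H x W image into non-overlapping patch x patch tiles.

    Each tile holds a different subset of the pixels and becomes one input
    element; tiles are emitted in row-major order to form the sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")

    sequence: List[List[float]] = []
    for y0 in range(0, height, patch):
        for x0 in range(0, width, patch):
            sequence.append(
                [image[y][x] for y in range(y0, y0 + patch) for x in range(x0, x0 + patch)]
            )
    return sequence
```

With a 4 x 4 image and `patch=2`, the result is a sequence of four input elements with four pixel values each, and every pixel appears in exactly one element.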
-
Publication No.: US20240428586A1
Publication Date: 2024-12-26
Application No.: US18827088
Filing Date: 2024-09-06
Applicant: Google LLC
Inventor: Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Luise Schmid
Abstract: A computer-implemented method for classifying video data with improved accuracy includes obtaining, by a computing system comprising one or more computing devices, video data comprising a plurality of video frames; extracting, by the computing system, a plurality of spatiotemporal representations from the video data, the plurality of spatiotemporal representations comprising a representation of spatiotemporal information in the video data; providing, by the computing system, the plurality of spatiotemporal representations as input to a video understanding model, the video understanding model comprising a video transformer encoder model; and receiving, by the computing system, a classification output from the video understanding model.
-
Publication No.: US20240346824A1
Publication Date: 2024-10-17
Application No.: US18634794
Filing Date: 2024-04-12
Applicant: Google LLC
Inventor: Alexey Alexeevich Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lucic, Cordelia Luise Schmid, Anurag Arnab
IPC: G06V20/40, G06T7/73, G06V10/62, G06V10/764, G06V10/77, G06V10/774, G06V10/776, G06V10/82
CPC classification number: G06V20/46, G06T7/73, G06V10/62, G06V10/764, G06V10/7715, G06V10/774, G06V10/776, G06V10/82, G06T2207/10016, G06T2207/20081, G06T2207/20084
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing action localization on an input video. In particular, a system maintains a set of query vectors and uses the input video and the set of query vectors to generate an action localization output for the input video. The action localization output includes, for each of one or more agents depicted in the video, data specifying, for each of one or more video frames in the video, a respective bounding box in the video frame that depicts the agent and a respective action from a set of actions that is being performed by the agent in the video frame.
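One way to picture the action localization output described in the abstract is as a per-agent list of per-frame records, each holding a bounding box and an action. The sketch below is illustrative only: the class names, the `(x_min, y_min, x_max, y_max)` box layout, and the helper `actions_by_frame` are assumptions, not structures defined in the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
BoundingBox = Tuple[float, float, float, float]


@dataclass
class FrameDetection:
    frame_index: int   # which video frame this record refers to
    box: BoundingBox   # where the agent appears in that frame
    action: str        # action performed by the agent in that frame


@dataclass
class AgentTrack:
    agent_id: int
    detections: List[FrameDetection]


def actions_by_frame(tracks: List[AgentTrack]) -> Dict[int, List[Tuple[int, str]]]:
    """Group (agent_id, action) pairs by frame index."""
    grouped: Dict[int, List[Tuple[int, str]]] = {}
    for track in tracks:
        for det in track.detections:
            grouped.setdefault(det.frame_index, []).append((track.agent_id, det.action))
    return grouped
```

Keeping one track per agent mirrors the "for each agent, for each frame" wording of the abstract, while the helper shows how the same data can be regrouped by frame for display or evaluation.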
-
Publication No.: US20240169662A1
Publication Date: 2024-05-23
Application No.: US18517190
Filing Date: 2023-11-22
Applicant: Google LLC
Inventor: Seyed Mohammad Mehdi Sajjadi, Klaus Greff, Etienne François Régis Pot, Daniel Christopher Duckworth, Mario Lucic, Aravindh Mahendran, Thomas Kipf
CPC classification number: G06T15/205, B25J9/1697, G06T7/73, G06T2207/20081, G06T2207/20084
Abstract: An example method includes obtaining, by a computing system, one or more source images of a scene; obtaining, by the computing system, a query associated with a target view of the scene, wherein at least a portion of the query is parameterized in a latent pose space; and generating, by the computing system and using a machine-learned image view synthesis model, an output image of the scene associated with the target view.