-
Publication Number: US20250124708A1
Publication Date: 2025-04-17
Application Number: US18694604
Application Date: 2023-12-08
Applicant: Google LLC
Inventor: Shen Yan , Tao Zhu , Zirui Wang , Yuan Cao , Jiahui Yu
IPC: G06V20/40 , G06F16/583
Abstract: Provided is an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning, and video question-answering. Some example implementations include a model which can be referred to as VideoCoCa. Example implementations reuse a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, aspects of the present disclosure leverage the finding that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks.
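As an illustration of the "flattened frame embeddings" idea described in this abstract, the sketch below (PyTorch; the tensor shapes, pooler sizes, and AttentionalPooler class are illustrative assumptions, not the claimed implementation) concatenates per-frame token embeddings along the time axis and feeds them unchanged to a contrastive pooler and a generative pooler.

# A minimal sketch, assuming per-frame token embeddings from a frozen image
# encoder; shapes and pooler sizes are illustrative, not the patented design.
import torch
import torch.nn as nn


class AttentionalPooler(nn.Module):
    """Learned queries cross-attend over a set of input tokens."""

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled  # (batch, num_queries, dim)


def flatten_frame_embeddings(frame_tokens: torch.Tensor) -> torch.Tensor:
    # frame_tokens: (batch, num_frames, tokens_per_frame, dim), each frame
    # encoded independently by the pretrained image encoder.
    b, t, n, d = frame_tokens.shape
    return frame_tokens.reshape(b, t * n, d)  # one long token sequence


# Example: 4 frames x 196 patch tokens, 768-dim embeddings.
frames = torch.randn(2, 4, 196, 768)
flat = flatten_frame_embeddings(frames)
contrastive_pooler = AttentionalPooler(768, num_queries=1)   # one video embedding
generative_pooler = AttentionalPooler(768, num_queries=256)  # context for the text decoder
video_embedding = contrastive_pooler(flat)   # (2, 1, 768)
decoder_context = generative_pooler(flat)    # (2, 256, 768)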
-
Publication Number: US20230196105A1
Publication Date: 2023-06-22
Application Number: US18082934
Application Date: 2022-12-16
Applicant: Google LLC
Inventor: Zirui Wang , Wei Yu , Orhan Firat , Yuan Cao
IPC: G06N3/08
CPC classification number: G06N3/08
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating labeled training data using a pre-trained language model neural network. In particular, the language model neural network can generate the text input in a new labeled training example from an input sequence that includes (i) one or more context inputs and (ii) a text label that identifies the ground truth category for the new labeled training example.
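As a hedged illustration of the data-generation recipe this abstract describes, the sketch below builds an input sequence from context examples plus a target ground-truth label and asks a language model to complete the matching text input; the prompt format and the generate callable are assumptions, not the claimed interface.

# A minimal sketch under an assumed prompt format: the prompt carries context
# inputs plus the desired label, and the LM produces the paired text input.
from typing import Callable, List, Tuple


def build_prompt(context_examples: List[Tuple[str, str]], target_label: str) -> str:
    lines = []
    for text, label in context_examples:
        lines.append(f"Label: {label}\nText: {text}\n")
    lines.append(f"Label: {target_label}\nText:")  # the LM completes the text input
    return "\n".join(lines)


def make_labeled_example(generate: Callable[[str], str],
                         context_examples: List[Tuple[str, str]],
                         target_label: str) -> Tuple[str, str]:
    prompt = build_prompt(context_examples, target_label)
    generated_text = generate(prompt).strip()
    # The generated text is paired with the label that conditioned it,
    # yielding a new (input, label) training example.
    return generated_text, target_label


def fake_lm(prompt: str) -> str:
    # Placeholder generator used only to exercise the helper.
    return " The plot was gripping and the acting superb."


example = make_labeled_example(
    fake_lm,
    context_examples=[("Terrible pacing and a weak script.", "negative")],
    target_label="positive",
)
print(example)  # ('The plot was gripping and the acting superb.', 'positive')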
-
Publication Number: US20230351149A1
Publication Date: 2023-11-02
Application Number: US18141340
Application Date: 2023-04-28
Applicant: Google LLC
Inventor: Jiahui Yu , Zirui Wang , Vijay Vasudevan , Ho Man Yeung , Seyed Mojtaba Seyedhosseini Tarzjani , Yonghui Wu
IPC: G06N3/04
CPC classification number: G06N3/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing multi-modal inputs using contrastive captioning neural networks.
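For orientation, the sketch below shows the two objectives a contrastive captioner conventionally combines: an image-text contrastive loss over pooled embeddings and a token-level captioning loss from a text decoder. The shapes, temperature, and loss weighting are illustrative assumptions; the encoders and decoder are omitted and this is not the claimed architecture.

# A minimal sketch of the combined contrastive + captioning objective.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # image_emb, text_emb: (batch, dim); matched pairs share the same row index.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def captioning_loss(decoder_logits: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
    # decoder_logits: (batch, seq_len, vocab); caption_ids: (batch, seq_len).
    return F.cross_entropy(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                           caption_ids.reshape(-1))


# Combined objective with an assumed weighting on the captioning term.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
logits, tokens = torch.randn(8, 20, 32000), torch.randint(0, 32000, (8, 20))
loss = contrastive_loss(img, txt) + 2.0 * captioning_loss(logits, tokens)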
-
Publication Number: US20230281400A1
Publication Date: 2023-09-07
Application Number: US17685774
Application Date: 2022-03-03
Applicant: Google LLC
Inventor: Zirui Wang , Jiahui Yu , Yuan Cao , Wei Yu , Zihang Dai
IPC: G06F40/58 , G06F40/284 , G06V30/10 , G06V10/766
CPC classification number: G06F40/58 , G06F40/284 , G06V10/766 , G06V30/10
Abstract: Example embodiments of the present disclosure relate to systems and methods for pretraining image-processing models on weakly-supervised image-text pairs. The pretraining can include receiving a training sequence for the machine-learned image-processing model. The training sequence can include text tokens and image tokens. A prefix sequence can contain the image tokens. A remainder sequence can include a remainder set of the text tokens. The pretraining can include determining, using the prefix sequence as an input to the machine-learned image-processing model, an objective based on recovery of the remainder sequence. The pretraining can include updating one or more learnable parameters of the machine-learned image-processing model based on the objective.
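A minimal sketch of the pretraining objective described above: the image tokens (and, implicitly, any prefix text tokens) form the conditioning prefix, and the loss is computed only on recovering the remainder set of text tokens. The ImageTextModel stub, tensor shapes, and split point are assumptions used to make the sketch runnable, not the claimed model.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageTextModel(nn.Module):
    """Toy stand-in: embeds tokens, runs a Transformer layer, predicts text.
    No causal mask is applied; this exists only to make the sketch runnable."""

    def __init__(self, vocab: int = 32000, dim: int = 256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.backbone = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, image_tokens, prefix_text, decoder_input):
        # Concatenate image embeddings, prefix text, and shifted remainder text.
        seq = torch.cat([image_tokens,
                         self.text_embed(prefix_text),
                         self.text_embed(decoder_input)], dim=1)
        hidden = self.backbone(seq)
        # Predict only the positions corresponding to the remainder text.
        return self.head(hidden[:, -decoder_input.size(1):])


def prefix_lm_step(model, image_tokens, text_tokens, prefix_text_len):
    prefix_text = text_tokens[:, :prefix_text_len]   # part of the prefix sequence
    remainder = text_tokens[:, prefix_text_len:]     # must be recovered
    logits = model(image_tokens, prefix_text, remainder[:, :-1])  # teacher forcing
    targets = remainder[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


model = ImageTextModel()
loss = prefix_lm_step(model,
                      image_tokens=torch.randn(2, 16, 256),
                      text_tokens=torch.randint(0, 32000, (2, 12)),
                      prefix_text_len=4)
loss.backward()  # gradients update the model's learnable parameters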
-
Publication Number: US20240404238A1
Publication Date: 2024-12-05
Application Number: US18698997
Application Date: 2022-10-05
Applicant: Google LLC
Inventor: Jiahui Yu , Vijay Vasudevan , Alexander Yeong-Shiuh Ku , Yonghui Wu , Jason Michael Baldridge , Yuanzhong Xu , Jing Yu Koh , Thang Minh Luong , Gunjan Baid , Zirui Wang , Han Zhang , Xin Li
IPC: G06V10/28 , G06F40/284 , G06V10/764 , G06V10/766 , G06V10/82
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pre-training a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
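As a sketch of the autoregressive stage of vector-quantized image modeling described here, the code below models raster-ordered discrete image-token ids with a next-token prediction loss; the tokenizer that produces the ids and the transformer over the code vocabulary are assumed placeholders, not the claimed ViT-VQGAN or Transformer model.

# A minimal sketch, assuming discrete image-token ids already produced by a
# pre-trained VQGAN-style tokenizer (not built here), in raster scan order.
import torch
import torch.nn.functional as F


def image_token_lm_loss(transformer, codes: torch.Tensor) -> torch.Tensor:
    # codes: (batch, num_tokens) discrete ids, e.g. 32x32 tokens per image;
    # `transformer` is any decoder-only model over the image-code vocabulary.
    inputs, targets = codes[:, :-1], codes[:, 1:]       # shift for next-token prediction
    logits = transformer(inputs)                        # (batch, num_tokens - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


# Shape check with a toy stand-in that returns random logits over 8192 codes.
toy_transformer = lambda ids: torch.randn(ids.size(0), ids.size(1), 8192)
loss = image_token_lm_loss(toy_transformer, torch.randint(0, 8192, (4, 1024)))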
-
Publication Number: US20240112088A1
Publication Date: 2024-04-04
Application Number: US18520083
Application Date: 2023-11-27
Applicant: Google LLC
Inventor: Jiahui Yu , Xin Li , Han Zhang , Vijay Vasudevan , Alexander Yeong-Shiuh Ku , Jason Michael Baldridge , Yuanzhong Xu , Jing Yu Koh , Thang Minh Luong , Gunjan Baid , Zirui Wang , Yonghui Wu
IPC: G06N20/00
CPC classification number: G06N20/00
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pretraining a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
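Complementing the sketch above, the code below illustrates a generic vector-quantization step of the kind a VQGAN-style tokenizer performs: each encoder output vector is replaced by its nearest codebook entry, with a straight-through estimator so gradients reach the encoder. The codebook size, code dimension, and l2-normalized lookup are illustrative assumptions, not the claimed codebook-handling improvements.

# A minimal sketch of codebook lookup with a straight-through gradient.
import torch
import torch.nn.functional as F


def quantize(z: torch.Tensor, codebook: torch.Tensor):
    # z: (batch, num_tokens, dim) encoder outputs; codebook: (vocab, dim).
    z_n = F.normalize(z, dim=-1)
    cb_n = F.normalize(codebook, dim=-1)
    # Nearest codebook entry by cosine similarity (l2 distance on the unit sphere).
    ids = torch.argmax(z_n @ cb_n.t(), dim=-1)          # (batch, num_tokens)
    z_q = codebook[ids]                                 # quantized vectors
    z_q = z + (z_q - z).detach()                        # straight-through gradient
    return z_q, ids


codebook = torch.randn(8192, 32)          # small code dimension, assumed here
encoder_out = torch.randn(4, 256, 32)
quantized, token_ids = quantize(encoder_out, codebook)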