VIDEO-TEXT MODELING WITH ZERO-SHOT TRANSFER FROM CONTRASTIVE CAPTIONERS

    公开(公告)号:US20250124708A1

    公开(公告)日:2025-04-17

    申请号:US18694604

    申请日:2023-12-08

    Applicant: Google LLC

    Abstract: Provided is an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. Some example implementations include a model which can be referred to as VideoCoCa. Example implementations reuse a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with little or minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, aspects of the present disclosure leverage findings that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to “flattened frame embeddings”, yielding a strong zero-shot transfer baseline for many video-text tasks.

    GENERATING LABELED TRAINING DATA USING A PRE-TRAINED LANGUAGE MODEL NEURAL NETWORK

    公开(公告)号:US20230196105A1

    公开(公告)日:2023-06-22

    申请号:US18082934

    申请日:2022-12-16

    Applicant: Google LLC

    CPC classification number: G06N3/08

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating labeled training data using a pre-trained language model neural network. In particular, the language model neural network can generate the text input in a new labeled training example from an input sequence that includes (i) one or more context inputs and (ii) a text label that identifies the ground truth category for the new labeled training example.

    Systems and Methods for Pretraining Image Processing Models

    公开(公告)号:US20230281400A1

    公开(公告)日:2023-09-07

    申请号:US17685774

    申请日:2022-03-03

    Applicant: Google LLC

    CPC classification number: G06F40/58 G06F40/284 G06V10/766 G06V30/10

    Abstract: Example embodiments of the present disclosure relate to systems and methods for pretraining image-processing models on weakly-supervised image-text pairs. The pretraining can include receiving a training sequence for the machine-learned image-processing model. The training sequence can include text tokens and image tokens. A prefix sequence can contain the image tokens. A remainder sequence can include a remainder set of the text tokens. The pretraining can include determining, using the prefix sequence as an input to the machine-learned image-processing model, an objective based on recovery of the remainder sequence. The pretraining can include updating one or more learnable parameters of the machine-learned image-processing model based on the objective.

    Vector-Quantized Image Modeling
    6.
    发明公开

    公开(公告)号:US20240112088A1

    公开(公告)日:2024-04-04

    申请号:US18520083

    申请日:2023-11-27

    Applicant: Google LLC

    CPC classification number: G06N20/00

    Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pretraining a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.

Patent Agency Ranking