-
Publication No.: US20220375211A1
Publication Date: 2022-11-24
Application No.: US17737507
Application Date: 2022-05-05
Applicant: Google LLC
Inventor: Ilya Tolstikhin, Neil Matthew Tinmouth Houlsby, Alexander Kolesnikov, Lucas Klaus Beyer, Alexey Dosovitskiy, Mario Lucic, Xiaohua Zhai, Thomas Unterthiner, Daniel M. Keysers, Jakob D. Uszkoreit, Yin Ching Jessica Yung, Andreas Steiner
IPC: G06V10/82, G06V10/764, G06N3/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing images using mixer neural networks. One of the methods includes obtaining one or more images comprising a plurality of pixels; determining, for each image of the one or more images, a plurality of image patches of the image, wherein each image patch comprises a different subset of the pixels of the image; processing, for each image of the one or more images, the corresponding plurality of image patches to generate an input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches; and processing the input sequences using a neural network to generate a network output that characterizes the one or more images, wherein the neural network comprises one or more mixer neural network layers.
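The patch-and-sequence steps named in the abstract can be illustrated with a small sketch. The abstract does not define the internals of a mixer layer, so everything below is an assumption made for illustration: the function names (`extract_patches`, `mix_positions`, `mix_channels`) are hypothetical, patches are taken as non-overlapping square tiles flattened row by row, and each mixing step is reduced to a single linear map with no nonlinearity, normalization, or skip connection.

```python
from typing import List

Matrix = List[List[float]]


def extract_patches(image: Matrix, patch: int) -> Matrix:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row, so the result is an input sequence
    with one element (vector) per image patch.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            sequence.append([
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return sequence


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def mix_positions(sequence: Matrix, weights: Matrix) -> Matrix:
    """Assumed mixing across input positions (weights: positions x positions)."""
    return matmul(weights, sequence)


def mix_channels(sequence: Matrix, weights: Matrix) -> Matrix:
    """Assumed mixing across channels (weights: channels x channels)."""
    return matmul(sequence, weights)


# Usage: a 4 x 4 image with pixel values 0..15, split into four 2 x 2 patches.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
sequence = extract_patches(image, patch=2)

# Averaging weights: every output position becomes the mean of all patches.
average = [[0.25] * 4 for _ in range(4)]
mixed = mix_positions(sequence, average)
```

In this sketch each patch covers a different subset of the pixels, and the input sequence has one element per patch, matching the wording of the abstract. A trained network would learn the mixing weights rather than use the fixed averaging matrix shown here.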
-
Publication No.: US20240153256A1
Publication Date: 2024-05-09
Application No.: US18051106
Application Date: 2022-10-31
Applicant: Google LLC
Inventor: Daniel Keysers, Xiaohua Zhai, Xiao Wang, Lucas Beyer, Basil Mustafa, Andreas Steiner, Alexander Kolesnikov
IPC: G06V10/778
CPC classification number: G06V10/778
Abstract: A method may include obtaining a pretrained image encoder and a training sample comprising a training image and a training text string corresponding to the training image. The method may also include initializing a text encoder in an untrained state, determining, using the pretrained image encoder and based on the training image, a first latent representation of the training image, and determining, using the text encoder and based on the training text string, a second latent representation of the training text string. The method may further include determining a loss value based on the first latent representation and the second latent representation, updating, based on the loss value, one or more parameters of the text encoder while holding fixed parameters of the pretrained image encoder, and outputting the text encoder in a trained state.
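The training procedure in the abstract (a fixed pretrained image encoder, a text encoder that starts untrained, and updates applied only to the text encoder) can be sketched as follows. The abstract does not state the form of the loss or of either encoder, so the choices here are assumptions for illustration only: both encoders are tiny linear maps, the loss is the squared distance between the two latent representations, and all names and weight values are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical fixed weights standing in for a pretrained image encoder.
IMAGE_ENCODER_WEIGHTS: Matrix = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]


def linear(weights: Matrix, features: Vector) -> Vector:
    """Apply a linear map (rows of weights) to a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def loss_and_grad(params: Matrix, image: Vector,
                  text: Vector) -> Tuple[float, Matrix]:
    """Squared-distance loss and its gradient w.r.t. text encoder params only."""
    image_latent = linear(IMAGE_ENCODER_WEIGHTS, image)  # first latent, frozen
    text_latent = linear(params, text)                   # second latent
    diff = [t - i for t, i in zip(text_latent, image_latent)]
    loss = sum(d * d for d in diff)
    grad = [[2.0 * d * x for x in text] for d in diff]
    return loss, grad


def train_text_encoder(image: Vector, text: Vector, steps: int = 50,
                       lr: float = 0.1) -> Tuple[Matrix, List[float]]:
    """Gradient descent on the text encoder; the image encoder is never touched."""
    # Untrained initial state (zeros here so the run is deterministic).
    params = [[0.0] * len(text) for _ in IMAGE_ENCODER_WEIGHTS]
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(params, image, text)
        losses.append(loss)
        params = [[w - lr * g for w, g in zip(w_row, g_row)]
                  for w_row, g_row in zip(params, grad)]
    return params, losses


# Usage: one training sample, with the image and the text string both
# represented as small hypothetical feature vectors.
trained_params, losses = train_text_encoder(
    image=[1.0, 2.0, 3.0], text=[1.0, 0.0, 1.0])
```

Only `params` changes during training; `IMAGE_ENCODER_WEIGHTS` is read but never written, which mirrors "holding fixed parameters of the pretrained image encoder". A practical system would likely train on batches of many image and text pairs with a different loss, which this sketch does not attempt to reproduce.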
-