Patent search: ap:("Google LLC") AND inv:"Wei-Cheng Kuo"

1.

Invention application
Visual Transformers with Sparse Application of Video Kernels (Status: Granted)

Publication No.: US20250005924A1

Publication date: 2025-01-02

Application No.: US18577051

Filing date: 2023-11-22

Applicant: Google LLC

Inventors: Anthony J. Piergiovanni, Wei-Cheng Kuo, Anelia Angelova

IPC: G06V20/40, G06V10/776, G06V10/82

Abstract: Provided are machine-learned models for performing video processing with improved efficiency. In particular, the machine-learned model can perform the sparse application of one or more video kernels to a set of video data to generate video tokens that can, for example, be provided as input to a visual transformer. Thus, example implementations of the present disclosure are directed to an approach which can turn a visual transformer (e.g., a ViT encoder) into an efficient video model. Furthermore, example implementations described herein can seamlessly work with both image and video inputs. Specifically, by sparsely sampling the inputs, the model is able to do training and inference from both inputs. The proposed model is easily scalable and can optionally be adapted to large-scale pre-trained visual transformers without requiring full finetuning.
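The sparse application of a video kernel described in this abstract can be pictured with a small, dependency-free sketch. It is an illustration only, not the patent's claimed implementation: the function name `sparse_tube_tokens`, the single-channel scalar tokens, and the stride values are all assumptions made for this example. The idea shown is that a 3D kernel applied with strides larger than its own size visits only a few space-time locations, so far fewer tokens reach the transformer than with a dense, non-overlapping tiling.

```python
def sparse_tube_tokens(video, kernel, stride):
    """Apply one 3D kernel at strided positions of a video (hypothetical helper).

    video:  nested list indexed [T][H][W] of floats (one channel, for brevity)
    kernel: nested list indexed [kt][kh][kw] of floats
    stride: (st, sh, sw); strides larger than the kernel skip regions entirely
    Returns a flat list with one scalar token per sampled location.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    st, sh, sw = stride
    tokens = []
    for t0 in range(0, T - kt + 1, st):
        for y0 in range(0, H - kh + 1, sh):
            for x0 in range(0, W - kw + 1, sw):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t0 + dt][y0 + dy][x0 + dx] * kernel[dt][dy][dx]
                tokens.append(acc)
    return tokens


# Toy 8-frame, 8x8 video and a 2x2x2 kernel (all values are placeholders).
video = [[[1.0] * 8 for _ in range(8)] for _ in range(8)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]

dense = sparse_tube_tokens(video, kernel, stride=(2, 2, 2))   # non-overlapping tiling
sparse = sparse_tube_tokens(video, kernel, stride=(4, 4, 4))  # sparse sampling
print(len(dense), len(sparse))  # 64 tokens vs. 8 tokens

# A single image can be treated as a one-frame video with a kernel of temporal size 1.
image = [video[0]]
image_kernel = [kernel[0]]
image_tokens = sparse_tube_tokens(image, image_kernel, stride=(1, 4, 4))
```

In this toy setting the sparse stride yields 8 tokens where the dense tiling yields 64, and the same function handles the one-frame image case, which mirrors the abstract's point about working with both image and video inputs.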

2.

Invention publication
Multi-Modal Machine Learning Models with Improved Computational Efficiency Via Adaptive Tokenization and Fusion (Status: Under examination, published)

Publication No.: US20230394306A1

Publication date: 2023-12-07

Application No.: US18328464

Filing date: 2023-06-02

Applicant: Google LLC

Inventors: Anthony J. Piergiovanni, Wei-Cheng Kuo, Anelia Angelova

IPC: G06N3/08, G06N3/0464, G06N3/048, G06N3/0455

CPC classification number: G06N3/08, G06N3/0464, G06N3/048, G06N3/0455

Abstract: Provided is an efficient multi-modal processing model. The multi-modal processing model can process input data from multiple different domains to generate a prediction for a multi-modal processing task. A machine-learned multi-modal processing model can include an adaptive tokenization layer that is configured to adaptively tokenize features generated from the multi-modal inputs into sets of tokens. Specifically, the tokens may have a smaller data size relative to the features from the inputs, which reduces the number of processing operations performed overall and thereby improves the efficiency of the model.
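One way to picture the adaptive tokenization step is attention-style pooling, where each output token is a softmax-weighted average of the input features. The sketch below is an assumption chosen for illustration and is not taken from the patent: the function `adaptive_tokenize`, the `score_weights` parameters (standing in for learned weights), and the dot-product scoring are all hypothetical. It shows only the property the abstract relies on, namely that N feature vectors are condensed into K < N tokens.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_tokenize(features, score_weights):
    """Condense N feature vectors into K tokens (hypothetical sketch).

    features:      list of N vectors, each of dimension D
    score_weights: list of K vectors of dimension D; stand-ins for learned
                   parameters, one per output token
    Returns K tokens of dimension D. Each token is a weighted average of the
    features, with weights that depend on the features themselves.
    """
    dim = len(features[0])
    tokens = []
    for w in score_weights:
        # Score every feature against this token's weight vector.
        scores = [sum(f_d * w_d for f_d, w_d in zip(f, w)) for f in features]
        attn = softmax(scores)
        tokens.append([sum(a * f[d] for a, f in zip(attn, features)) for d in range(dim)])
    return tokens


# Six 2-D features (e.g. concatenated from two modalities) reduced to two tokens.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [1.0, 1.0], [2.0, 2.0]]
score_weights = [[1.0, 0.0], [0.0, 1.0]]  # placeholder values, not trained
tokens = adaptive_tokenize(features, score_weights)
print(len(features), "->", len(tokens))
```

Because later layers operate on K tokens instead of N features, any per-token cost downstream shrinks accordingly, which is the efficiency argument the abstract makes.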

3.

Invention publication
Localization of Objects Encoded in Image Data in Accordance with Natural Language Queries (Status: Under examination, published)

Publication No.: US20240289981A1

Publication date: 2024-08-29

Application No.: US18173557

Filing date: 2023-02-23

Applicant: Google LLC

Inventors: Wei-Cheng Kuo, Fred Bertsch, Wei Li, Anthony J. Piergiovanni, Mohammad Taghi Saffar, Anelia Angelova

IPC: G06T7/73, G06F40/126, G06F40/40, G06V10/77, G06V10/80

CPC classification number: G06T7/73, G06F40/126, G06F40/40, G06V10/7715, G06V10/806

Abstract: Generally, the disclosure is directed to generalized object location, where the located object is in accordance with a natural language (NL) query. More specifically, the embodiments include a unified generalized visual localization architecture. The architecture achieves enhanced performance on the following three tasks: referring expression comprehension, object localization, and object detection. The embodiments employ machine-learned NL models and/or image models. The architecture is enabled to understand and answer natural localization questions about an image, to output multiple boxes, to provide no output if the object is not present (e.g., a null result), and to solve general detection tasks.
