Publication No.: US20240362902A1
Publication Date: 2024-10-31
Application No.: US18140055
Filing Date: 2023-04-27
Applicant: ModiFace Inc.
Inventors: Cong WEI, Brendan DUKE, Ruowei JIANG
CPC Classifications: G06V10/82 , G06T7/246 , G06T11/60 , G06V10/764 , G06V20/20 , G06V40/161 , G06T2207/20081 , G06T2207/20084 , G06T2207/30201
Abstract: Vision Transformers (ViT) have shown competitive performance compared to convolutional neural networks (CNNs), though often at high computational cost. Methods, systems and techniques herein learn instance-dependent attention patterns, utilizing a lightweight connectivity predictor module to estimate a connectivity score for each pair of tokens. Intuitively, two tokens have high connectivity scores if their features are relevant either spatially or semantically. As each token attends to only a small number of other tokens, the binarized connectivity masks are naturally very sparse, providing an opportunity to accelerate the network via sparse computations. Equipped with the learned unstructured attention pattern, sparse attention ViT produces a superior Pareto-optimal trade-off between FLOPs and top-1 accuracy on ImageNet compared to token sparsity (48%–69% FLOPs reduction of MHSA; accuracy drop within 0.4%). Combining attention and token sparsity reduces ViT FLOPs by over 60%.
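The abstract's mechanism, scoring token pairs with a lightweight predictor, binarizing the scores into a sparse mask, then restricting attention to the masked pairs, can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the single low-rank projection `Wp` standing in for the connectivity predictor module, the top-k binarization rule, and all dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(X, Wq, Wk, Wv, Wp, k=4):
    """Self-attention where each token attends only to its k highest-scoring
    partners under a learned connectivity score.

    Wp is a hypothetical low-rank projection standing in for the patent's
    lightweight connectivity predictor module; the top-k rule is one simple
    way to binarize the scores into a sparse mask.
    """
    N, d = X.shape
    # Cheap pairwise connectivity scores from low-dimensional projections.
    P = X @ Wp                         # (N, r) with r << d
    scores = P @ P.T                   # (N, N) connectivity scores
    # Binarize: each token keeps its k highest-scoring partners.
    idx = np.argsort(-scores, axis=1)[:, :k]
    mask = np.zeros((N, N), dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    # Standard scaled dot-product attention, restricted to the sparse mask.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = (Q @ K.T) / np.sqrt(K.shape[1])
    logits[~mask] = -np.inf            # masked pairs get zero attention weight
    return softmax(logits, axis=1) @ V
```

For illustration this materializes the dense (N, N) mask; the speedup the abstract claims would come from running the masked attention with actual sparse kernels so that masked pairs are never computed.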