EfficientViT: Memory efficient vision transformer with cascaded group attention

X Liu, H Peng, N Zheng, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers have shown great success due to their high model capabilities.
However, their remarkable performance is accompanied by heavy computation costs, which …

Faster segment anything: Towards lightweight SAM for mobile applications

C Zhang, D Han, Y Qiao, JU Kim, SH Bae… - arXiv preprint arXiv …, 2023 - arxiv.org
Segment anything model (SAM) is a prompt-guided vision foundation model for cutting out
the object of interest from its background. Since Meta research team released the SA project …

EfficientSAM: Leveraged masked image pretraining for efficient segment anything

Y Xiong, B Varadarajan, L Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Segment Anything Model (SAM) has emerged as a powerful tool for numerous
vision applications. A key component that drives the impressive performance for zero-shot …

Distilling large vision-language model with out-of-distribution generalizability

X Li, Y Fang, M Liu, Z Ling, Z Tu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Large vision-language models have achieved outstanding performance, but their size and
computational requirements make their deployment on resource-constrained devices and …

Exploring lightweight hierarchical vision transformers for efficient visual tracking

B Kang, X Chen, D Wang, H Peng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Transformer-based visual trackers have demonstrated significant progress owing to their
superior modeling capabilities. However, existing trackers are hampered by low speed …

A survey on transformer compression

Y Tang, Y Wang, J Guo, Z Tu, K Han, H Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large models based on the Transformer architecture play increasingly vital roles in artificial
intelligence, particularly within the realms of natural language processing (NLP) and …

TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance

K Wu, H Peng, Z Zhou, B Xiao, M Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques …

A survey of the vision transformers and their CNN-transformer based variants

A Khan, Z Rauf, A Sohail, AR Khan, H Asif… - Artificial Intelligence …, 2023 - Springer
Vision transformers have become popular as a possible substitute to convolutional neural
networks (CNNs) for a variety of computer vision applications. These transformers, with their …

Logit standardization in knowledge distillation

S Sun, W Ren, J Li, R Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Knowledge distillation involves transferring soft labels from a teacher to a student
using a shared temperature-based softmax function. However, the assumption of a shared …

DiffRate: Differentiable compression rate for efficient vision transformers

M Chen, W Shao, P Xu, M Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Token compression aims to speed up large-scale vision transformers (e.g., ViTs) by pruning
(dropping) or merging tokens. It is an important but challenging task. Although recent …