EfficientViT: Memory efficient vision transformer with cascaded group attention
Vision transformers have shown great success due to their high model capabilities.
However, their remarkable performance is accompanied by heavy computation costs, which …
Faster Segment Anything: Towards lightweight SAM for mobile applications
The Segment Anything Model (SAM) is a prompt-guided vision foundation model for cutting out
the object of interest from its background. Since the Meta research team released the SA project …
EfficientSAM: Leveraged masked image pretraining for efficient segment anything
Segment Anything Model (SAM) has emerged as a powerful tool for numerous
vision applications. A key component that drives the impressive performance for zero-shot …
Distilling large vision-language model with out-of-distribution generalizability
Large vision-language models have achieved outstanding performance, but their size and
computational requirements make their deployment on resource-constrained devices and …
Exploring lightweight hierarchical vision transformers for efficient visual tracking
Transformer-based visual trackers have demonstrated significant progress owing to their
superior modeling capabilities. However, existing trackers are hampered by low speed …
A survey on transformer compression
Large models based on the Transformer architecture play increasingly vital roles in artificial
intelligence, particularly within the realms of natural language processing (NLP) and …
TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance
In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-
scale language-image pre-trained models. The method introduces two core techniques …
A survey of the vision transformers and their CNN-transformer based variants
Vision transformers have become popular as a possible substitute for convolutional neural
networks (CNNs) for a variety of computer vision applications. These transformers, with their …
Logit standardization in knowledge distillation
Knowledge distillation involves transferring soft labels from a teacher to a student
using a shared temperature-based softmax function. However, the assumption of a shared …
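The temperature-based softmax transfer that this entry describes can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation; the z-score `standardize` helper stands in for the logit-standardization idea in the title, and all function names here are my own.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax with the usual max-subtraction for stability.
    z = z / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def standardize(z, eps=1e-8):
    # Z-score standardization of logits (zero mean, unit std) before softening,
    # standing in for the paper's logit-standardization preprocessing.
    return (z - z.mean()) / (z.std() + eps)

def kd_kl(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) between temperature-softened distributions,
    # the quantity a distillation loss would minimize.
    p = softmax(standardize(teacher_logits), T)
    q = softmax(standardize(student_logits), T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

t = np.array([4.0, 1.0, 0.5])   # teacher logits (toy values)
s = np.array([3.0, 1.5, 0.2])   # student logits (toy values)
loss = kd_kl(t, s)              # non-negative; zero iff distributions match
```

Standardizing both logit vectors removes the scale mismatch that a single shared temperature would otherwise have to absorb.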
DiffRate: Differentiable compression rate for efficient vision transformers
Token compression aims to speed up large-scale vision transformers (e.g., ViTs) by pruning
(dropping) or merging tokens. It is an important but challenging task. Although recent …
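The pruning half of token compression mentioned in this entry can be sketched as follows. This is a minimal NumPy illustration, not DiffRate's differentiable mechanism; the function name and the source of the importance scores are assumptions.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    # tokens: (N, D) token embeddings; scores: (N,) per-token importance
    # (e.g. derived from attention weights). Keeps the top keep_ratio
    # fraction of tokens by score, preserving their original order.
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    idx = np.argsort(scores)[::-1][:n_keep]  # indices of highest scores
    return tokens[np.sort(idx)]

x = np.arange(12.0).reshape(4, 3)    # 4 tokens, 3-dim embeddings
s = np.array([0.1, 0.9, 0.5, 0.2])   # toy importance scores
kept = prune_tokens(x, s, keep_ratio=0.5)  # keeps tokens 1 and 2
```

Merging, the other option the abstract names, would instead combine similar tokens (e.g. by averaging nearest neighbors) rather than discarding them.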