A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT

C Zhou, Q Li, C Li, J Yu, Y Liu, G Wang… - International Journal of …, 2024 - Springer
Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is …

A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities

Y Song, T Wang, P Cai, SK Mondal… - ACM Computing Surveys, 2023 - dl.acm.org
Few-shot learning (FSL) has emerged as an effective learning method and shows great
potential. Despite recent creative work on tackling FSL tasks, learning valid information …

DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

MaPLe: Multi-modal prompt learning

MU Khattak, H Rasheed, M Maaz… - Proceedings of the …, 2023 - openaccess.thecvf.com
Pre-trained vision-language (VL) models such as CLIP have shown excellent generalization
ability to downstream tasks. However, they are sensitive to the choice of input text prompts …
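
The prompt sensitivity this snippet refers to is easy to reproduce. Below is a minimal sketch, assuming OpenAI's open-source `clip` package and a hypothetical local image file `cat.jpg`, showing how the same image's zero-shot class probabilities shift with the wording of the prompt template:

```python
# Minimal sketch of CLIP's sensitivity to hand-written prompt templates.
# Assumes OpenAI's `clip` package (pip install git+https://github.com/openai/CLIP.git)
# and a hypothetical local image file.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # hypothetical file
templates = ["a photo of a {}", "a blurry photo of a {}", "{}"]
classes = ["cat", "dog"]

with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    for t in templates:
        text = clip.tokenize([t.format(c) for c in classes]).to(device)
        txt_feat = model.encode_text(text)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        probs = (100 * img_feat @ txt_feat.T).softmax(dim=-1)
        print(t, probs.tolist())  # probabilities shift with the template wording
```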

EVA-CLIP: Improved training techniques for CLIP at scale

Q Sun, Y Fang, L Wu, X Wang, Y Cao - arXiv preprint arXiv:2303.15389, 2023 - arxiv.org
Contrastive language-image pre-training, CLIP for short, has gained increasing attention for
its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models …
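
For reference, the core objective behind CLIP-style contrastive language-image pre-training (EVA-CLIP refines the training recipe around this loss rather than replacing it) can be sketched in a few lines of PyTorch; the feature tensors here are random stand-ins:

```python
# Symmetric contrastive (InfoNCE) objective used in CLIP-style pre-training.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat, txt_feat, temperature=0.07):
    # Normalize embeddings, compute all-pairs cosine similarities,
    # and treat matching (i, i) image-text pairs as the positives.
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(len(img))
    # Symmetric cross-entropy: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```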

EVA-02: A visual representation for neon genesis

Y Fang, Q Sun, X Wang, T Huang, X Wang… - Image and Vision …, 2024 - Elsevier
We launch EVA-02, a next-generation Transformer-based visual representation pre-trained
to reconstruct strong and robust language-aligned vision features via masked image …
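
A rough sketch of the masked feature reconstruction idea the snippet describes: mask image patches and train a student to regress a frozen language-aligned teacher's features at the masked positions. Both models below are random stand-ins, and the cosine regression loss is one plausible choice:

```python
# Masked-feature-reconstruction sketch with stand-in tensors: a linear
# layer plays the ViT student; random features play the frozen
# language-aligned (CLIP-style) teacher.
import torch
import torch.nn.functional as F

num_patches, dim = 196, 768
student = torch.nn.Linear(dim, dim)               # stand-in for a ViT student
teacher_feats = torch.randn(1, num_patches, dim)  # frozen teacher features
patch_embeds = torch.randn(1, num_patches, dim)   # student input patch embeddings

mask = torch.rand(1, num_patches) < 0.4           # mask ~40% of patches
pred = student(patch_embeds)                      # student predictions, all patches

# Cosine-similarity regression on the masked positions only.
loss = 1 - F.cosine_similarity(pred[mask], teacher_feats[mask], dim=-1).mean()
print(loss.item())
```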

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural
network (DNN) training, and they usually train a DNN for each single visual recognition task …

Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners

R Zhang, X Hu, B Li, S Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual recognition in low-data regimes requires deep neural networks to learn generalized
representations from limited training samples. Recently, CLIP-based methods have shown …

Your diffusion model is secretly a zero-shot classifier

AC Li, M Prabhudesai, S Duggal… - Proceedings of the …, 2023 - openaccess.thecvf.com
The recent wave of large-scale text-to-image diffusion models has dramatically increased
our text-based image generation abilities. These models can generate realistic images for a …
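
The classification trick behind this title can be sketched as follows: add noise to an image, ask the diffusion model to denoise it under each candidate class prompt, and pick the class with the lowest average denoising error. The `eps_model` and `encode_text` below are hypothetical random stand-ins for a real text-conditional diffusion model and its text encoder:

```python
# Zero-shot classification via conditional denoising error, with dummy
# stand-ins so the sketch runs end to end.
import torch

def encode_text(prompt):               # hypothetical text encoder stand-in
    return torch.randn(1, 64)

def eps_model(x_t, t, cond):           # hypothetical noise-prediction model
    return torch.randn_like(x_t)

betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1 - betas, dim=0)

def add_noise(x, noise, t):
    # Standard DDPM forward process: x_t = sqrt(a_bar)*x + sqrt(1-a_bar)*noise
    return alpha_bar[t].sqrt() * x + (1 - alpha_bar[t]).sqrt() * noise

def diffusion_classify(x, class_prompts, n_trials=8):
    errors = torch.zeros(len(class_prompts))
    for _ in range(n_trials):
        t = torch.randint(0, 1000, (1,))   # random diffusion timestep
        noise = torch.randn_like(x)
        x_t = add_noise(x, noise, t)
        for i, prompt in enumerate(class_prompts):
            pred = eps_model(x_t, t, encode_text(prompt))
            errors[i] += ((pred - noise) ** 2).mean()
    return errors.argmin()                 # lowest average denoising error wins

print(diffusion_classify(torch.randn(1, 3, 64, 64), ["a cat", "a dog"]))
```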

In-context impersonation reveals large language models' strengths and biases

L Salewski, S Alaniz, I Rio-Torto… - Advances in neural …, 2023 - proceedings.neurips.cc
In everyday conversations, humans can take on different roles and adapt their vocabulary to
their chosen roles. We explore whether LLMs can take on, that is, impersonate, different roles …