InternImage: Exploring large-scale vision foundation models with deformable convolutions

J Qiu, L Li, J Sun, J Peng, P Shi… - IEEE Journal of …, 2023 - ieeexplore.ieee.org

Large AI models, or foundation models, are models recently emerging with massive scales
both parameter-wise and data-wise, the magnitudes of which can reach beyond billions …

Cited by 78 · Related articles · All 6 versions

On the challenges and perspectives of foundation models for medical image analysis

S Zhang, D Metaxas - Medical Image Analysis, 2023 - Elsevier

This article discusses the opportunities, applications and future directions of large-scale
pretrained models, i.e., foundation models, which promise to significantly improve the …

Cited by 48 · Related articles · All 6 versions

DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org

The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

Cited by 1024 · Related articles · All 11 versions

Reproducible scaling laws for contrastive language-image learning

M Cherti, R Beaumont, R Wightman… - Proceedings of the …, 2023 - openaccess.thecvf.com

Scaling up neural networks has led to remarkable performance across a wide range of
tasks. Moreover, performance often follows reliable scaling laws as a function of training set …

Cited by 398 · Related articles · All 6 versions

VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks

W Wang, Z Chen, X Chen, J Wu… - Advances in …, 2024 - proceedings.neurips.cc

Large language models (LLMs) have notably accelerated progress towards artificial general
intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing …

Cited by 263 · Related articles · All 6 versions

VideoChat: Chat-centric video understanding

KC Li, Y He, Y Wang, Y Li, W Wang, P Luo… - arXiv preprint arXiv …, 2023 - arxiv.org

In this paper, we initiate an attempt of developing an end-to-end chat-centric video
understanding system, coined as VideoChat. It integrates video foundation models and …

Cited by 315 · Related articles · All 4 versions

Vision Mamba: Efficient visual representation learning with bidirectional state space model

L Zhu, B Liao, Q Zhang, X Wang, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the
Mamba deep learning model, have shown great potential for long sequence modeling …

Cited by 295 · Related articles · All 5 versions

DETRs with collaborative hybrid assignments training

Z Zong, G Song, Y Liu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

In this paper, we provide the observation that too few queries assigned as positive samples
in DETR with one-to-one set matching leads to sparse supervision on the encoder's output …

Cited by 183 · Related articles · All 5 versions

BEVFormer v2: Adapting modern image backbones to bird's-eye-view recognition via perspective supervision

C Yang, Y Chen, H Tian, C Tao, X Zhu… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present a novel bird's-eye-view (BEV) detector with perspective supervision, which
converges faster and better suits modern image backbones. Existing state-of-the-art BEV …

Cited by 166 · Related articles · All 9 versions

EVA-02: A visual representation for Neon Genesis

Y Fang, Q Sun, X Wang, T Huang, X Wang… - Image and Vision …, 2024 - Elsevier

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained
to reconstruct strong and robust language-aligned vision features via masked image …

Cited by 132 · Related articles · All 3 versions