Fixing the train-test resolution discrepancy

L Chen, S Li, Q Bai, J Yang, S Jiang, Y Miao - Remote Sensing, 2021 - mdpi.com

Image classification has always been a hot research direction in the world, and the
emergence of deep learning has promoted the development of this field. Convolutional …

被引用次数：377 相关文章所有 5 个版本

[PDF] acm.org

Avoiding overfitting: A survey on regularization methods for convolutional neural networks

CFGD Santos, JP Papa - ACM Computing Surveys (CSUR), 2022 - dl.acm.org

Several image processing tasks, such as image classification and object detection, have
been significantly improved using Convolutional Neural Networks (CNN). Like ResNet and …

被引用次数：167 相关文章所有 7 个版本

[PDF] arxiv.org

Dinov2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org

The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

被引用次数：1001 相关文章所有 11 个版本

[PDF] arxiv.org

Pali: A jointly-scaled multilingual language-image model

X Chen, X Wang, S Changpinyo… - arXiv preprint arXiv …, 2022 - arxiv.org

Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …

被引用次数：480 相关文章所有 6 个版本

[PDF] neurips.cc

Flamingo: a visual language model for few-shot learning

JB Alayrac, J Donahue, P Luc… - Advances in neural …, 2022 - proceedings.neurips.cc

Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …

被引用次数：2410 相关文章所有 7 个版本

[PDF] mlr.press

Pix2struct: Screenshot parsing as pretraining for visual language understanding

K Lee, M Joshi, IR Turc, H Hu, F Liu… - International …, 2023 - proceedings.mlr.press

Visually-situated language is ubiquitous—sources range from textbooks with diagrams to
web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to …

被引用次数：147 相关文章所有 7 个版本

[PDF] arxiv.org

Deit iii: Revenge of the vit

H Touvron, M Cord, H Jégou - European conference on computer vision, 2022 - Springer

Abstract A Vision Transformer (ViT) is a simple neural architecture amenable to serve
several computer vision tasks. It has limited built-in architectural priors, in contrast to more …

被引用次数：289 相关文章所有 8 个版本

[PDF] thecvf.com

Masked autoencoders are scalable vision learners

K He, X Chen, S Xie, Y Li, P Dollár… - Proceedings of the …, 2022 - openaccess.thecvf.com

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners
for computer vision. Our MAE approach is simple: we mask random patches of the input …

被引用次数：6027 相关文章所有 11 个版本

Towards a general-purpose foundation model for computational pathology

RJ Chen, T Ding, MY Lu, DFK Williamson, G Jaume… - Nature Medicine, 2024 - nature.com

Quantitative evaluation of tissue images is crucial for computational pathology (CPath) tasks,
requiring the objective characterization of histopathological entities from whole-slide images …

被引用次数：48 相关文章所有 3 个版本

[PDF] arxiv.org

Resnet strikes back: An improved training procedure in timm

R Wightman, H Touvron, H Jégou - arXiv preprint arXiv:2110.00476, 2021 - arxiv.org

The influential Residual Networks designed by He et al. remain the gold-standard
architecture in numerous scientific publications. They typically serve as the default …

被引用次数：442 相关文章所有 3 个版本