Sora: A review on background, technology, limitations, and opportunities of large vision models

Y Liu, K Zhang, Y Li, Z Yan, C Gao, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The
model is trained to generate videos of realistic or imaginative scenes from text instructions …

DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

Scaling vision transformers to 22 billion parameters

M Dehghani, J Djolonga, B Mustafa… - International …, 2023 - proceedings.mlr.press
The scaling of Transformers has driven breakthrough capabilities for language models. At
present, the largest large language models (LLMs) contain upwards of 100B parameters …

Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution

M Dehghani, B Mustafa, J Djolonga… - Advances in …, 2024 - proceedings.neurips.cc
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution
before processing them with computer vision models has not yet been successfully …

RangeViT: Towards vision transformers for 3D semantic segmentation in autonomous driving

A Ando, S Gidaris, A Bursuc, G Puy… - Proceedings of the …, 2023 - openaccess.thecvf.com
Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via
range projection, is an effective and popular approach. These projection-based methods …

Which tokens to use? Investigating token reduction in vision transformers

JB Haurum, S Escalera, GW Taylor… - Proceedings of the …, 2023 - openaccess.thecvf.com
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs
more efficient by removing redundant information in the processed tokens. While different …

Getting ViT in shape: Scaling laws for compute-optimal model design

IM Alabdulmohsin, X Zhai… - Advances in Neural …, 2024 - proceedings.neurips.cc
Scaling laws have been recently employed to derive compute-optimal model size (number
of parameters) for a given compute duration. We advance and refine such methods to infer …

PlainMamba: Improving non-hierarchical Mamba in visual recognition

C Yang, Z Chen, M Espinosa, L Ericsson… - arXiv preprint arXiv …, 2024 - arxiv.org
We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for
general visual recognition. The recent Mamba model has shown how SSMs can be highly …

BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Y Song, Q Zhou, X Li, DP Fan… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper, we address the challenge of image resolution variation for the Segment
Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits a performance …

A novel day-ahead regional and probabilistic wind power forecasting framework using deep CNNs and conformalized regression forests

J Jonkers, DN Avendano, G Van Wallendael… - Applied Energy, 2024 - Elsevier
Regional forecasting is crucial for a balanced energy delivery system and for achieving the
global transition to clean energy. However, regional wind forecasting is challenging due to …