Sora: A review on background, technology, limitations, and opportunities of large vision models

Y Liu, K Zhang, Y Li, Z Yan, C Gao, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The
model is trained to generate videos of realistic or imaginative scenes from text instructions …

A survey of the vision transformers and their CNN-transformer based variants

A Khan, Z Rauf, A Sohail, AR Khan, H Asif… - Artificial Intelligence …, 2023 - Springer
Vision transformers have become popular as a possible substitute for convolutional neural
networks (CNNs) for a variety of computer vision applications. These transformers, with their …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord, V Sanh - arXiv preprint arXiv …, 2024 - arxiv.org
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Y Song, Q Zhou, X Li, DP Fan… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper, we address the challenge of image resolution variation for the Segment
Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits a performance …

CogVideoX: Text-to-video diffusion models with an expert transformer

Z Yang, J Teng, W Zheng, M Ding, S Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating
videos based on text prompts. To efficiently model video data, we propose to leverage a 3D …

FiT: Flexible vision transformer for diffusion model

Z Lu, Z Wang, D Huang, C Wu, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such
as Diffusion Transformers, often face challenges when processing image resolutions outside …

Aurora: A foundation model of the atmosphere

C Bodnar, WP Bruinsma, A Lucic, M Stanley… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep learning foundation models are revolutionizing many facets of science by leveraging
vast amounts of data to learn general-purpose representations that can be adapted to tackle …

Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

H Pouransari, CL Li, JHR Chang, PKA Vasu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length
token sequences. These datasets are created by randomly concatenating documents of …

Win-Win: Training High-Resolution Vision Transformers from Two Windows

V Leroy, J Revaud, T Lucas, P Weinzaepfel - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers have become the standard in state-of-the-art vision architectures, achieving
impressive performance on both image-level and dense pixelwise tasks. However, training …