Sora: A review on background, technology, limitations, and opportunities of large vision models

Y Liu, K Zhang, Y Li, Z Yan, C Gao, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The
model is trained to generate videos of realistic or imaginative scenes from text instructions …

A survey of the vision transformers and their CNN-transformer based variants

A Khan, Z Rauf, A Sohail, AR Khan, H Asif… - Artificial Intelligence …, 2023 - Springer
Vision transformers have become popular as a possible substitute for convolutional neural
networks (CNNs) for a variety of computer vision applications. These transformers, with their …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord, V Sanh - arXiv preprint arXiv …, 2024 - arxiv.org
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Y Song, Q Zhou, X Li, DP Fan… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper, we address the challenge of image resolution variation for the Segment
Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits a performance …

CogVideoX: Text-to-video diffusion models with an expert transformer

Z Yang, J Teng, W Zheng, M Ding, S Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating
videos based on text prompts. To efficiently model video data, we propose to leverage a 3D …

FiT: Flexible vision transformer for diffusion model

Z Lu, Z Wang, D Huang, C Wu, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such
as Diffusion Transformers, often face challenges when processing image resolutions outside …

Aurora: A foundation model of the atmosphere

C Bodnar, WP Bruinsma, A Lucic, M Stanley… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep learning foundation models are revolutionizing many facets of science by leveraging
vast amounts of data to learn general-purpose representations that can be adapted to tackle …

Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

H Pouransari, CL Li, JHR Chang, PKA Vasu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length
token sequences. These datasets are created by randomly concatenating documents of …

Win-Win: Training High-Resolution Vision Transformers from Two Windows

V Leroy, J Revaud, T Lucas, P Weinzaepfel - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers have become the standard in state-of-the-art vision architectures, achieving
impressive performance on both image-level and dense pixelwise tasks. However, training …