Sora: A review on background, technology, limitations, and opportunities of large vision models
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The
model is trained to generate videos of realistic or imaginative scenes from text instructions …
A survey of the vision transformers and their CNN-transformer based variants
Vision transformers have become popular as a possible substitute for convolutional neural
networks (CNNs) for a variety of computer vision applications. These transformers, with their …
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
What matters when building vision-language models?
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …
BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
In this paper, we address the challenge of image resolution variation for the Segment
Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits a performance …
CogVideoX: Text-to-video diffusion models with an expert transformer
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating
videos based on text prompts. To efficiently model video data, we propose to leverage a 3D …
FiT: Flexible vision transformer for diffusion model
Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such
as Diffusion Transformers, often face challenges when processing image resolutions outside …
Aurora: A foundation model of the atmosphere
Deep learning foundation models are revolutionizing many facets of science by leveraging
vast amounts of data to learn general-purpose representations that can be adapted to tackle …
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length
token sequences. These datasets are created by randomly concatenating documents of …
Win-Win: Training High-Resolution Vision Transformers from Two Windows
Transformers have become the standard in state-of-the-art vision architectures, achieving
impressive performance on both image-level and dense pixelwise tasks. However, training …