Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Free3D: Consistent novel view synthesis without 3D representation

C Zheng, A Vedaldi - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
We introduce Free3D, a simple accurate method for monocular open-set novel view
synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for …

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Online clustered codebook

C Zheng, A Vedaldi - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is
increasingly used in representation learning. However, optimizing the codevectors in …

Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization

M Huang, Z Mao, Z Chen… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Existing vector quantization (VQ) based autoregressive models follow a two-stage
generation paradigm that first learns a codebook to encode images as discrete codes, and …

Kandinsky: an improved text-to-image synthesis with image prior and latent diffusion

A Razzhigaev, A Shakhmatov, A Maltseva… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-image generation is a significant domain in modern computer vision and has
achieved substantial improvements through the evolution of generative architectures …

Maskbit: Embedding-free image generation via bit tokens

M Weber, L Yu, Q Yu, X Deng, X Shen… - arXiv preprint arXiv …, 2024 - arxiv.org
Masked transformer models for class-conditional image generation have become a
compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN …

Visual autoregressive modeling: Scalable image generation via next-scale prediction

K Tian, Y Jiang, Z Yuan, B Peng, L Wang - arXiv preprint arXiv:2404.02905, 2024 - arxiv.org
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "…

Not all image regions matter: Masked vector quantization for autoregressive image generation

M Huang, Z Mao, Q Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Existing autoregressive models follow the two-stage generation paradigm that first learns a
codebook in the latent space for image reconstruction and then completes the image …