Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …
Emu3: Next-token prediction is all you need
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
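A minimal sketch of the objective this line of work centers on: every modality is mapped to discrete tokens in one shared vocabulary and a single transformer is trained with a plain next-token cross-entropy loss. The tokenizer, vocabulary layout, and model interface below are hypothetical placeholders, not Emu3's actual implementation.

```python
# Hypothetical sketch: next-token prediction over a mixed text/image token stream.
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """token_ids: (batch, seq_len) integer tokens covering text AND image content,
    drawn from one shared vocabulary (an assumption for illustration)."""
    inputs = token_ids[:, :-1]           # predict position t+1 from positions <= t
    targets = token_ids[:, 1:]
    logits = model(inputs)               # (batch, seq_len-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```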
Free3D: Consistent novel view synthesis without 3D representation
We introduce Free3D, a simple, accurate method for monocular open-set novel view
synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for …
Janus: Decoupling visual encoding for unified multimodal understanding and generation
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …
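The abstract's key idea is to decouple visual encoding into separate pathways for understanding and for generation rather than forcing one encoder to serve both. The sketch below is illustrative only; the module names and interfaces are assumptions, not Janus's actual components.

```python
# Illustrative decoupled visual frontend: one encoder feeds understanding,
# a VQ-style tokenizer produces discrete codes for generation, and both
# attach to a shared language-model backbone of width llm_dim.
import torch.nn as nn

class DecoupledVisualFrontend(nn.Module):
    def __init__(self, understanding_encoder, generation_tokenizer, llm_dim):
        super().__init__()
        self.und_enc = understanding_encoder        # e.g. a CLIP/SigLIP-style encoder (assumption)
        self.gen_tok = generation_tokenizer         # e.g. a VQ image tokenizer (assumption)
        self.und_proj = nn.Linear(self.und_enc.out_dim, llm_dim)
        self.gen_embed = nn.Embedding(self.gen_tok.codebook_size, llm_dim)

    def forward(self, image, task):
        if task == "understanding":
            return self.und_proj(self.und_enc(image))   # continuous features for comprehension
        codes = self.gen_tok.encode(image)              # discrete codes for generation
        return self.gen_embed(codes)                    # embeddings fed to the shared backbone
```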
Online clustered codebook
Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is
increasingly used in representation learning. However, optimizing the codevectors in …
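For context on the codevector-optimization problem the snippet raises, here is a generic toy version of vector quantisation with a simple "revive unused codevectors" step. It is not the clustering update proposed in the paper, only a minimal baseline sketch of the mechanism being improved.

```python
# Minimal VQ assignment plus a naive dead-code reinitialization (illustrative only).
import torch

def quantize(z, codebook):
    """z: (N, D) encoder features; codebook: (K, D). Returns quantized z and code ids."""
    d = torch.cdist(z, codebook)                   # (N, K) pairwise distances
    ids = d.argmin(dim=1)
    z_q = codebook[ids]
    z_q = z + (z_q - z).detach()                   # straight-through gradient
    return z_q, ids

def revive_dead_codes(codebook, ids, z):
    """Reinitialize codevectors that received no assignments with random encoder features."""
    used = torch.bincount(ids, minlength=codebook.size(0)) > 0
    dead = (~used).nonzero(as_tuple=True)[0]
    if len(dead) > 0:
        picks = torch.randint(0, z.size(0), (len(dead),))
        codebook.data[dead] = z[picks].detach()
    return codebook
```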
Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization
Existing vector quantization (VQ) based autoregressive models follow a two-stage
generation paradigm that first learns a codebook to encode images as discrete codes, and …
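The two-stage paradigm the snippet refers to can be summarized as: stage one trains a VQ tokenizer that turns an image into a grid of code indices, stage two trains an autoregressive prior over those indices. The sketch below uses hypothetical components and does not reproduce the paper's dynamic, variable-length quantization.

```python
# Generic two-stage VQ + autoregressive pipeline (components are placeholders).
import torch
import torch.nn.functional as F

def stage1_tokenize(vq_tokenizer, image):
    """Stage 1: a trained VQ tokenizer maps an image to discrete code indices."""
    with torch.no_grad():
        codes = vq_tokenizer.encode(image)        # e.g. (batch, 16, 16) int64 (assumption)
    return codes.flatten(1)                       # (batch, 256) code sequence

def stage2_ar_loss(prior, codes, bos_id):
    """Stage 2: an autoregressive prior models the code sequence left to right."""
    bos = torch.full((codes.size(0), 1), bos_id, dtype=codes.dtype, device=codes.device)
    inp = torch.cat([bos, codes[:, :-1]], dim=1)
    logits = prior(inp)                           # (batch, seq, codebook_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), codes.reshape(-1))
```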
Kandinsky: an improved text-to-image synthesis with image prior and latent diffusion
Text-to-image generation is a significant domain in modern computer vision and has
achieved substantial improvements through the evolution of generative architectures …
MaskBit: Embedding-free image generation via bit tokens
Masked transformer models for class-conditional image generation have become a
compelling alternative to diffusion models. Typically comprising two stages, an initial VQGAN …
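To make the "masked transformer" alternative concrete, here is a generic MaskGIT-style iterative decoding loop: start fully masked, predict all positions in parallel, keep the most confident tokens, and remask the rest on a shrinking schedule. MaskBit's bit-token representation and exact schedule are not reproduced here; this is only a sketch of the decoding pattern.

```python
# Generic masked-token parallel decoding loop (illustrative, not MaskBit's code).
import math
import torch

def masked_decode(model, seq_len, mask_id, steps=8, device="cpu"):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens)                          # (1, seq_len, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)         # confidence and argmax token
        still_masked = tokens == mask_id
        # already-decoded positions get infinite confidence so they are never remasked
        conf = torch.where(still_masked, conf, torch.full_like(conf, float("inf")))
        # cosine schedule: fraction of positions left masked after this step
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        n_keep_masked = int(frac * seq_len)
        tokens = torch.where(still_masked, pred, tokens)   # fill masked slots with predictions
        remask = conf.argsort(dim=-1)[:, :n_keep_masked]   # least-confident positions
        tokens.scatter_(1, remask, mask_id)
    return tokens
```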
Visual autoregressive modeling: Scalable image generation via next-scale prediction
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or …
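A minimal sketch of the coarse-to-fine idea: instead of predicting one token at a time, each step predicts an entire token map at the next resolution, conditioned on all coarser maps. The model interface, embedding, and scale list below are placeholders, not VAR's actual tokenizer or transformer.

```python
# Coarse-to-fine "next-scale" generation loop (illustrative placeholders throughout).
import torch

def generate_coarse_to_fine(model, embed, scales=(1, 2, 4, 8, 16)):
    """scales: side lengths of the token maps, ordered coarse to fine (assumption)."""
    context = []                                     # embedded tokens of all coarser scales
    maps = []
    for side in scales:
        cond = torch.cat(context, dim=1) if context else None
        logits = model(cond, target_side=side)       # (1, side*side, codebook_size)
        token_map = logits.argmax(-1)                # greedy decoding for simplicity
        maps.append(token_map.view(1, side, side))
        context.append(embed(token_map))             # condition the next, finer step
    return maps                                      # the finest map would go to a VQ decoder
```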
Not all image regions matter: Masked vector quantization for autoregressive image generation
Existing autoregressive models follow the two-stage generation paradigm that first learns a
codebook in the latent space for image reconstruction and then completes the image …
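The title's premise, that not all image regions deserve a code, can be illustrated with a toy region-selection step: score each position with some learned importance module and replace low-importance codes with a mask code. The scoring input and keep ratio here are hypothetical and do not reflect the paper's actual masking module.

```python
# Toy region-adaptive code masking (hypothetical scoring; illustrative only).
import torch

def select_important_codes(codes, importance, keep_ratio=0.5, mask_id=0):
    """codes: (B, H, W) code indices; importance: (B, H, W) scores from some learned module.
    Positions outside the top keep_ratio fraction are replaced with a mask code."""
    b, h, w = codes.shape
    flat_scores = importance.view(b, -1)
    k = max(1, int(keep_ratio * h * w))
    topk = flat_scores.topk(k, dim=1).indices
    keep = torch.zeros_like(flat_scores, dtype=torch.bool).scatter_(1, topk, True)
    flat_codes = codes.view(b, -1)
    masked = torch.where(keep, flat_codes, torch.full_like(flat_codes, mask_id))
    return masked.view(b, h, w)
```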