MoVQ: Modulating quantized vectors for high-fidelity image generation

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Cited by 95 Related articles All 3 versions

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Cited by 56 Related articles All 3 versions

Free3D: Consistent novel view synthesis without 3D representation

C Zheng, A Vedaldi - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com

We introduce Free3D, a simple, accurate method for monocular open-set novel view
synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for …

Cited by 24 Related articles All 6 versions

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Cited by 19 Related articles All 2 versions

Online clustered codebook

C Zheng, A Vedaldi - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is
increasingly used in representation learning. However, optimizing the codevectors in …

Cited by 35 Related articles All 11 versions

Towards accurate image coding: Improved autoregressive image generation with dynamic vector quantization

M Huang, Z Mao, Z Chen… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Existing vector quantization (VQ) based autoregressive models follow a two-stage
generation paradigm that first learns a codebook to encode images as discrete codes, and …

Cited by 30 Related articles All 5 versions

Kandinsky: an improved text-to-image synthesis with image prior and latent diffusion

A Razzhigaev, A Shakhmatov, A Maltseva… - arXiv preprint arXiv …, 2023 - arxiv.org

Text-to-image generation is a significant domain in modern computer vision and has
achieved substantial improvements through the evolution of generative architectures …

Cited by 54 Related articles All 3 versions

MaskBit: Embedding-free image generation via bit tokens

M Weber, L Yu, Q Yu, X Deng, X Shen… - arXiv preprint arXiv …, 2024 - arxiv.org

Masked transformer models for class-conditional image generation have become a
compelling alternative to diffusion models. Typically comprising two stages: an initial VQGAN …

Cited by 13 Related articles All 4 versions

Visual autoregressive modeling: Scalable image generation via next-scale prediction

K Tian, Y Jiang, Z Yuan, B Peng, L Wang - arXiv preprint arXiv:2404.02905, 2024 - arxiv.org

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "…

Cited by 124 Related articles All 3 versions

Not all image regions matter: Masked vector quantization for autoregressive image generation

M Huang, Z Mao, Q Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Existing autoregressive models follow the two-stage generation paradigm that first learns a
codebook in the latent space for image reconstruction and then completes the image …

Cited by 18 Related articles All 6 versions