Vector-quantized image modeling with improved VQGAN

F Zhan, Y Yu, R Wu, J Zhang, S Lu, L Liu… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org

As information exists in various modalities in the real world, effective interaction and fusion
among multimodal information play a key role in the creation and perception of multimodal …

Cited by 199 Related articles All 11 versions

A review of synthetic image data and its use in computer vision

K Man, J Chahl - Journal of Imaging, 2022 - mdpi.com

Development of computer vision algorithms using convolutional neural networks and deep
learning has necessitated ever greater amounts of annotated and labelled data to produce …

Cited by 50 Related articles All 7 versions

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

JZ Wu, Y Ge, X Wang, SW Lei, Y Gu… - Proceedings of the …, 2023 - openaccess.thecvf.com

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale
video datasets to train a text-to-video (T2V) generator. Despite their promising results, such …

Cited by 460 Related articles All 4 versions

Make-a-video: Text-to-video generation without text-video data

U Singer, A Polyak, T Hayes, X Yin, J An… - arXiv preprint arXiv …, 2022 - arxiv.org

We propose Make-A-Video, an approach for directly translating the tremendous recent
progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple …

Cited by 857 Related articles All 3 versions

Scaling autoregressive models for content-rich text-to-image generation

J Yu, Y Xu, JY Koh, T Luong, G Baid, Z Wang… - arXiv preprint arXiv …, 2022 - 3dvar.com

We present the Pathways [1] Autoregressive Text-to-Image (Parti) model, which
generates high-fidelity photorealistic images and supports content-rich synthesis involving …

Cited by 827 Related articles All 5 versions

Muse: Text-to-image generation via masked generative transformers

H Chang, H Zhang, J Barber, AJ Maschinot… - arXiv preprint arXiv …, 2023 - arxiv.org

We present Muse, a text-to-image Transformer model that achieves state-of-the-art image
generation performance while being significantly more efficient than diffusion or …

Cited by 353 Related articles All 6 versions

Photorealistic text-to-image diffusion models with deep language understanding

C Saharia, W Chan, S Saxena, L Li… - Advances in neural …, 2022 - proceedings.neurips.cc

We present Imagen, a text-to-image diffusion model with an unprecedented degree of
photorealism and a deep level of language understanding. Imagen builds on the power of …

Cited by 4163 Related articles All 11 versions

Audiolm: a language modeling approach to audio generation

Z Borsos, R Marinier, D Vincent… - … ACM transactions on …, 2023 - ieeexplore.ieee.org

We introduce AudioLM, a framework for high-quality audio generation with long-term
consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts …

Cited by 397 Related articles All 5 versions

Phenaki: Variable length video generation from open domain textual descriptions

R Villegas, M Babaeizadeh, PJ Kindermans… - International …, 2022 - openreview.net

We present Phenaki, a model capable of realistic video synthesis given a sequence of
textual prompts. Generating videos from text is particularly challenging due to the …

Cited by 274 Related articles All 5 versions

Sequential modeling enables scalable learning for large vision models

Y Bai, X Geng, K Mangalam, A Bar… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce a novel sequential modeling approach which enables learning a Large Vision
Model (LVM) without making use of any linguistic data. To do this we define a common …

Cited by 65 Related articles All 3 versions