Multimodal image synthesis and editing: A survey and taxonomy

F Zhan, Y Yu, R Wu, J Zhang, S Lu, L Liu… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
As information exists in various modalities in real world, effective interaction and fusion
among multimodal information plays a key role for the creation and perception of multimodal …

A review of synthetic image data and its use in computer vision

K Man, J Chahl - Journal of Imaging, 2022 - mdpi.com
Development of computer vision algorithms using convolutional neural networks and deep
learning has necessitated ever greater amounts of annotated and labelled data to produce …

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

JZ Wu, Y Ge, X Wang, SW Lei, Y Gu… - Proceedings of the …, 2023 - openaccess.thecvf.com
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale
video datasets to train a text-to-video (T2V) generator. Despite their promising results, such …

Make-a-video: Text-to-video generation without text-video data

U Singer, A Polyak, T Hayes, X Yin, J An… - arXiv preprint arXiv …, 2022 - arxiv.org
We propose Make-A-Video--an approach for directly translating the tremendous recent
progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple …

[PDF][PDF] Scaling autoregressive models for content-rich text-to-image generation

J Yu, Y Xu, JY Koh, T Luong, G Baid, Z Wang… - arXiv preprint arXiv …, 2022 - 3dvar.com
Abstract We present the Pathways [1] Autoregressive Text-to-Image (Parti) model, which
generates high-fidelity photorealistic images and supports content-rich synthesis involving …

Muse: Text-to-image generation via masked generative transformers

H Chang, H Zhang, J Barber, AJ Maschinot… - arXiv preprint arXiv …, 2023 - arxiv.org
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image
generation performance while being significantly more efficient than diffusion or …

Photorealistic text-to-image diffusion models with deep language understanding

C Saharia, W Chan, S Saxena, L Li… - Advances in neural …, 2022 - proceedings.neurips.cc
We present Imagen, a text-to-image diffusion model with an unprecedented degree of
photorealism and a deep level of language understanding. Imagen builds on the power of …

Audiolm: a language modeling approach to audio generation

Z Borsos, R Marinier, D Vincent… - … ACM transactions on …, 2023 - ieeexplore.ieee.org
We introduce AudioLM, a framework for high-quality audio generation with long-term
consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts …

Phenaki: Variable length video generation from open domain textual descriptions

R Villegas, M Babaeizadeh, PJ Kindermans… - International …, 2022 - openreview.net
We present Phenaki, a model capable of realistic video synthesis given a sequence of
textual prompts. Generating videos from text is particularly challenging due to the …

Sequential modeling enables scalable learning for large vision models

Y Bai, X Geng, K Mangalam, A Bar… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce a novel sequential modeling approach which enables learning a Large Vision
Model (LVM) without making use of any linguistic data. To do this we define a common …