Multimodal image synthesis and editing: A survey and taxonomy
As information exists in various modalities in real world, effective interaction and fusion
among multimodal information plays a key role for the creation and perception of multimodal …
Adversarial text-to-image synthesis: A review
With the advent of generative adversarial networks, synthesizing images from text
descriptions has recently become an active research area. It is a flexible and intuitive way for …
Vector quantized diffusion model for text-to-image synthesis
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation.
This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent …
De-fake: Detection and attribution of fake images generated by text-to-image generation models
Text-to-image generation models that generate images based on prompt descriptions have
attracted an increasing amount of attention during the past few months. Despite their …
Cross-modal contrastive learning for text-to-image generation
The output of text-to-image synthesis systems should be coherent, clear, photo-realistic
scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal …
Toward verifiable and reproducible human evaluation for text-to-image generation
Human evaluation is critical for validating the performance of text-to-image generative
models, as this highly cognitive process requires deep comprehension of text and images …
Multimodal intelligence: Representation learning, information fusion, and applications
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …
Learning disentangled representations in the imaging domain
Disentangled representation learning has been proposed as an approach to learning
general representations even in the absence of, or with limited, supervision. A good general …
Improved vector quantized diffusion models
Vector quantized diffusion (VQ-Diffusion) is a powerful generative model for text-to-image
synthesis, but sometimes can still generate low-quality samples or weakly correlated images …
Neural architecture search with a lightweight transformer for text-to-image synthesis
Despite the cross-modal text-to-image synthesis task has achieved great success, most of
the latest works in this field are based on the network architectures proposed by …