LayoutGPT: Compositional visual planning and generation with large language models
Attaining a high degree of user controllability in visual generation often requires intricate,
fine-grained inputs like layouts. However, such inputs impose a substantial burden on users …
MagicBrush: A manually annotated dataset for instruction-guided image editing
Text-guided image editing is widely needed in daily life, ranging from personal use to
professional applications such as Photoshop. However, existing methods are either zero …
Training-free structured diffusion guidance for compositional text-to-image synthesis
Large-scale diffusion models have achieved state-of-the-art results on text-to-image
synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we …
Counterfactual VQA: A cause-effect look at language bias
Recent VQA models may tend to rely on language bias as a shortcut and thus fail to
sufficiently learn the multi-modal knowledge from both vision and language. In this paper …
Talk-to-edit: Fine-grained facial editing via dialog
Facial editing is an important task in vision and graphics with numerous applications.
However, existing works are incapable of delivering a continuous and fine-grained editing …
Guiding instruction-based image editing via multimodal large language models
Instruction-based image editing improves the controllability and flexibility of image
manipulation via natural commands without elaborate descriptions or regional masks …
Tell me what happened: Unifying text-guided video completion via multimodal masked video generation
Generating a video given the first several static frames is challenging as it anticipates
reasonable future frames with temporal coherence. Besides video prediction, the ability to …
Language-driven artistic style transfer
Despite having promising results, style transfer, which requires preparing style images in
advance, may result in a lack of creativity and accessibility. Following human instruction, on …
Talk-to-edit: Fine-grained 2D and 3D facial editing via dialog
Facial editing manipulates the facial attributes of a given face image. Nowadays, with the
development of generative models, users can easily generate 2D and 3D facial images with …
Iterative multi-granular image editing using diffusion models
Recent advances in text-guided image synthesis have dramatically changed how creative
professionals generate artistic and aesthetically pleasing visual assets. To fully support such …