ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

M Li, T Yang, H Kuang, J Wu, Z Wang, X Xiao… - … on Computer Vision, 2025 - Springer
To enhance the controllability of text-to-image diffusion models, existing efforts like
ControlNet incorporated image-based conditional controls. In this paper, we reveal that …

DragAPart: Learning a part-level motion prior for articulated objects

R Li, C Zheng, C Rupprecht, A Vedaldi - European Conference on …, 2025 - Springer
We introduce DragAPart, a method that, given an image and a set of drags as input,
generates a new image of the same object that responds to the action of the drags …

Diffusion models for monocular depth estimation: Overcoming challenging conditions

F Tosi, PZ Ramirez, M Poggi - European Conference on Computer Vision, 2025 - Springer
We present a novel approach designed to address the complexities posed by challenging,
out-of-distribution data in the single-image depth estimation task. Starting with images that …

SmartControl: Enhancing ControlNet for handling rough visual conditions

X Liu, Y Wei, M Liu, X Lin, P Ren, X Xie… - European Conference on …, 2025 - Springer
Recent text-to-image generation methods such as ControlNet have achieved remarkable
success in controlling image layouts, where the images generated by the default model are …

Controllable generation with text-to-image diffusion models: A survey

P Cao, F Zhou, Q Song, L Yang - arXiv preprint arXiv:2403.04279, 2024 - arxiv.org
In the rapidly advancing realm of visual generation, diffusion models have revolutionized the
landscape, marking a significant shift in capabilities with their impressive text-guided …

Multi-modal generative AI: Multi-modal LLM, diffusion and beyond

H Chen, X Wang, Y Zhou, B Huang, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal generative AI has received increasing attention in both academia and industry.
Particularly, two dominant families of techniques are: i) The multi-modal large language …

AnyControl: Create your artwork with versatile control on text-to-image generation

Y Sun, Y Liu, Y Tang, W Pei, K Chen - European Conference on Computer …, 2025 - Springer
The field of text-to-image (T2I) generation has made significant progress in recent years,
largely driven by advancements in diffusion models. Linguistic control enables effective …

When ControlNet Meets Inexplicit Masks: A Case Study of ControlNet on its Contour-following Ability

W Xuan, Y Xu, S Zhao, C Wang, J Liu, B Du… - Proceedings of the 32nd …, 2024 - dl.acm.org
ControlNet excels at creating content that closely matches precise contours in user-provided
masks. However, when these masks contain noise, as a frequent occurrence with non …

BootPIG: Bootstrapping zero-shot personalized image generation capabilities in pretrained diffusion models

S Purushwalkam, A Gokul, S Joty, N Naik - arXiv preprint arXiv …, 2024 - arxiv.org
Recent text-to-image generation models have demonstrated incredible success in
generating images that faithfully follow input prompts. However, the requirement of using …

A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

X Shuai, H Ding, X Ma, R Tu, YG Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Image editing aims to modify a given synthetic or real image to meet users' specific
requirements. It has been widely studied in recent years as a promising and challenging field of …