Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation...

X Chen, L Huang, Y Liu, Y Shen… - Proceedings of the …, 2024 - openaccess.thecvf.com

This work presents AnyDoor, a diffusion-based image generator with the power to teleport
target objects to new scenes at user-specified locations with desired shapes. Instead of …

Cited by 86 · Related articles · All 3 versions

[PDF] thecvf.com

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions, which current multimodal systems largely struggle to imitate. In this work …

Cited by 77 · Related articles · All 3 versions

[PDF] thecvf.com

Instantbooth: Personalized text-to-image generation without test-time finetuning

J Shi, W Xiong, Z Lin, HJ Jung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Recent advances in personalized image generation have enabled pre-trained text-to-image
models to learn new concepts from specific image sets. However, these methods often …

Cited by 126 · Related articles · All 3 versions

[PDF] arxiv.org

Dreamllm: Synergistic multimodal comprehension and creation

R Dong, C Han, Y Peng, Z Qi, Z Ge, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper presents DreamLLM, a learning framework that first achieves versatile
Multimodal Large Language Models (MLLMs) empowered with frequently overlooked …

Cited by 59 · Related articles · All 4 versions

[PDF] thecvf.com

Style aligned image generation via shared attention

A Hertz, A Voynov, S Fruchter… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large-scale Text-to-Image (T2I) models have rapidly gained prominence across
creative fields, generating visually compelling outputs from textual prompts. However …

Cited by 23 · Related articles · All 4 versions

[PDF] thecvf.com

Alpha-clip: A clip model focusing on wherever you want

Z Sun, Y Fang, T Wu, P Zhang, Y Zang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Contrastive Language-Image Pre-training (CLIP) plays an essential role in
extracting valuable content information from images across diverse tasks. It aligns textual …

Cited by 17 · Related articles · All 3 versions

[PDF] arxiv.org

Lavis: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org

We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

Cited by 73 · Related articles · All 4 versions

[PDF] thecvf.com

Gpt4point: A unified framework for point-language understanding and generation

Z Qi, Y Fang, Z Sun, X Wu, T Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal Large Language Models (MLLMs) have excelled in 2D image-text
comprehension and image generation, but their understanding of the 3D world is notably …

Cited by 10 · Related articles · All 3 versions

[PDF] thecvf.com

Videobooth: Diffusion-based video generation with image prompts

Y Jiang, T Wu, S Yang, C Si, D Lin… - Proceedings of the …, 2024 - openaccess.thecvf.com

Text-driven video generation has progressed rapidly. However, merely using text prompts
is not enough to depict the desired subject appearance that accurately aligns with users' …

Cited by 13 · Related articles · All 4 versions

[PDF] arxiv.org

Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning

J Ma, J Liang, C Chen, H Lu - arXiv preprint arXiv:2307.11410, 2023 - arxiv.org

Recent progress in personalized image generation using diffusion models has been
significant. However, development in the area of open-domain and non-fine-tuning …

Cited by 50 · Related articles · All 3 versions