Revision: Rendering tools enable spatial fidelity in vision-language models

A Chatterjee, Y Luo, T Gokhale, Y Yang… - European Conference on …, 2025 - Springer
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been
adopted in solutions for several computer vision and multimodal learning tasks. However, it …

ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining

R Shi, X Wei, C Wang, H Su - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We present ZeroRF, a novel per-scene optimization method addressing the challenge of
sparse view 360° reconstruction in neural field representations. Current breakthroughs …

MoMA: Multimodal LLM adapter for fast personalized image generation

K Song, Y Zhu, B Liu, Q Yan, A Elgammal… - European Conference on …, 2025 - Springer
In this paper, we present MoMA: an open-vocabulary, training-free personalized image
model that boasts flexible zero-shot capabilities. As foundational text-to-image models …

Contrasting deepfakes diffusion via contrastive learning and global-local similarities

L Baraldi, F Cocchi, M Cornia, A Nicolosi… - … on Computer Vision, 2025 - Springer
Discerning between authentic content and that generated by advanced AI methods has
become increasingly challenging. While previous research primarily addresses the …

ECLIPSE: A resource-efficient text-to-image prior for image generations

M Patel, C Kim, S Cheng, C Baral… - Proceedings of the …, 2024 - openaccess.thecvf.com
Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., DALL-E-2),
achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks at …

Kiki or bouba? Sound symbolism in vision-and-language models

M Alper, H Averbuch-Elor - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Although the mapping between sound and meaning in human language is assumed to be
largely arbitrary, research in cognitive science has shown that there are non-trivial …

A Practitioner's Guide to Continual Multimodal Pretraining

K Roth, V Udandarao, S Dziadzio, A Prabhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal foundation models serve numerous applications at the intersection of vision and
language. Still, despite being pretrained on extensive data, they become outdated over time …

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

P Marcos-Manchón, R Alcover-Couso… - Proceedings of the …, 2024 - openaccess.thecvf.com
Diffusion models represent a new paradigm in text-to-image generation. Beyond generating
high-quality images from text prompts, models such as Stable Diffusion have been …

T-Stitch: Accelerating sampling in pre-trained diffusion models with trajectory stitching

Z Pan, B Zhuang, DA Huang, W Nie, Z Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Sampling from diffusion probabilistic models (DPMs) is often expensive for high-quality
image generation and typically requires many steps with a large model. In this paper, we …