Revision: Rendering tools enable spatial fidelity in vision-language models

A Chatterjee, Y Luo, T Gokhale, Y Yang… - European Conference on …, 2025 - Springer
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been
adopted in solutions for several computer vision and multimodal learning tasks. However, it …

ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining

R Shi, X Wei, C Wang, H Su - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We present ZeroRF, a novel per-scene optimization method addressing the challenge of
sparse view 360° reconstruction in neural field representations. Current breakthroughs …

MoMA: Multimodal LLM adapter for fast personalized image generation

K Song, Y Zhu, B Liu, Q Yan, A Elgammal… - European Conference on …, 2025 - Springer
In this paper, we present MoMA: an open-vocabulary, training-free personalized image
model that boasts flexible zero-shot capabilities. As foundational text-to-image models …

Contrasting deepfakes diffusion via contrastive learning and global-local similarities

L Baraldi, F Cocchi, M Cornia, A Nicolosi… - … on Computer Vision, 2025 - Springer
Discerning between authentic content and that generated by advanced AI methods has
become increasingly challenging. While previous research primarily addresses the …

ECLIPSE: A resource-efficient text-to-image prior for image generations

M Patel, C Kim, S Cheng, C Baral… - Proceedings of the …, 2024 - openaccess.thecvf.com
Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., DALL-E-2),
achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks at …

Kiki or bouba? Sound symbolism in vision-and-language models

M Alper, H Averbuch-Elor - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Although the mapping between sound and meaning in human language is assumed to be
largely arbitrary, research in cognitive science has shown that there are non-trivial …

A Practitioner's Guide to Continual Multimodal Pretraining

K Roth, V Udandarao, S Dziadzio, A Prabhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal foundation models serve numerous applications at the intersection of vision and
language. Still, despite being pretrained on extensive data, they become outdated over time …

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

P Marcos-Manchón, R Alcover-Couso… - Proceedings of the …, 2024 - openaccess.thecvf.com
Diffusion models represent a new paradigm in text-to-image generation. Beyond generating
high-quality images from text prompts, models such as Stable Diffusion have been …

T-Stitch: Accelerating sampling in pre-trained diffusion models with trajectory stitching

Z Pan, B Zhuang, DA Huang, W Nie, Z Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Sampling from diffusion probabilistic models (DPMs) is often expensive for high-quality
image generation and typically requires many steps with a large model. In this paper, we …