REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been
adopted in solutions for several computer vision and multimodal learning tasks. However, it …
ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining
We present ZeroRF, a novel per-scene optimization method addressing the challenge of
sparse view 360° reconstruction in neural field representations. Current breakthroughs like …
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
In this paper, we present MoMA: an open-vocabulary, training-free personalized image
model that boasts flexible zero-shot capabilities. As foundational text-to-image models …
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities
Discerning between authentic content and that generated by advanced AI methods has
become increasingly challenging. While previous research primarily addresses the …
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., DALL-E-2),
achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks at …
Kiki or Bouba? Sound Symbolism in Vision-and-Language Models
M Alper, H Averbuch-Elor - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Although the mapping between sound and meaning in human language is assumed to be
largely arbitrary, research in cognitive science has shown that there are non-trivial …
A Practitioner's Guide to Continual Multimodal Pretraining
Multimodal foundation models serve numerous applications at the intersection of vision and
language. Still, despite being pretrained on extensive data, they become outdated over time …
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
P Marcos-Manchón, R Alcover-Couso… - Proceedings of the …, 2024 - openaccess.thecvf.com
Diffusion models represent a new paradigm in text-to-image generation. Beyond generating
high-quality images from text prompts, models such as Stable Diffusion have been …
T-Stitch: Accelerating Sampling in Pre-trained Diffusion Models with Trajectory Stitching
Sampling from diffusion probabilistic models (DPMs) is often expensive for high-quality
image generation and typically requires many steps with a large model. In this paper, we …