MyVLM: Personalizing VLMs for user-specific queries

Y Alaluf, E Richardson, S Tulyakov, K Aberman… - … on Computer Vision, 2025 - Springer
Recent large-scale vision-language models (VLMs) have demonstrated remarkable
capabilities in understanding and generating textual descriptions for visual content …

LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval

Z Yang, D Xue, S Qian, W Dong, C Xu - Proceedings of the 47th …, 2024 - dl.acm.org
Zero-Shot Composed Image Retrieval (ZS-CIR) has garnered increasing interest in recent
years, which aims to retrieve a target image based on a query composed of a reference …

E5-V: Universal embeddings with multimodal large language models

T Jiang, M Song, Z Zhang, H Huang, W Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have shown promising advancements in
general visual and language understanding. However, the representation of multimodal …

Large multimodal agents: A survey

J Xie, Z Chen, R Zhang, X Wan, G Li - arXiv preprint arXiv:2402.15116, 2024 - arxiv.org
Large language models (LLMs) have achieved superior performance in powering text-
based AI agents, endowing them with decision-making and reasoning abilities akin to …

MagicLens: Self-supervised image retrieval with open-ended instructions

K Zhang, Y Luan, H Hu, K Lee, S Qiao, W Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Image retrieval, i.e., finding desired images given a reference image, inherently
encompasses rich, multi-faceted search intents that are difficult to capture solely using …

Unraveling cross-modality knowledge conflicts in large vision-language models

T Zhu, Q Liu, F Wang, Z Tu, M Chen - arXiv preprint arXiv:2410.03659, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for
capturing and reasoning over multimodal inputs. However, these models are prone to …

EgoCVR: An egocentric benchmark for fine-grained composed video retrieval

T Hummel, S Karthik, MI Georgescu, Z Akata - arXiv preprint arXiv …, 2024 - Springer
In Composed Video Retrieval, a video and a textual description which modifies the video
content are provided as inputs to the model. The aim is to retrieve the relevant video with the …

Multi-source spatial knowledge understanding for immersive visual text-to-speech

S He, R Liu, H Li - arXiv preprint arXiv:2410.14101, 2024 - arxiv.org
Visual Text-to-Speech (VTTS) aims to take the spatial environmental image as the prompt to
synthesize the reverberation speech for the spoken content. Previous research focused on …

Denoise-I2W: Mapping images to denoising words for accurate zero-shot composed image retrieval

Y Tang, J Yu, K Gai, J Zhuang, G Gou, G Xiong… - arXiv preprint arXiv …, 2024 - arxiv.org
Zero-Shot Composed Image Retrieval (ZS-CIR) supports diverse tasks with a broad range of
visual content manipulation intentions that can be related to domain, scene, object, and …

A Survey of Multimodal Composite Editing and Retrieval

S Li, F Huang, L Zhang - arXiv preprint arXiv:2409.05405, 2024 - arxiv.org
In the real world, where information is abundant and diverse across different modalities,
understanding and utilizing various data types to improve retrieval systems is a key focus of …