MyVLM: Personalizing VLMs for user-specific queries
Recent large-scale vision-language models (VLMs) have demonstrated remarkable
capabilities in understanding and generating textual descriptions for visual content …
LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval
Zero-Shot Composed Image Retrieval (ZS-CIR) has garnered increasing interest in recent
years, which aims to retrieve a target image based on a query composed of a reference …
E5-V: Universal embeddings with multimodal large language models
Multimodal large language models (MLLMs) have shown promising advancements in
general visual and language understanding. However, the representation of multimodal …
Large multimodal agents: A survey
Large language models (LLMs) have achieved superior performance in powering text-
based AI agents, endowing them with decision-making and reasoning abilities akin to …
MagicLens: Self-supervised image retrieval with open-ended instructions
Image retrieval, i.e., finding desired images given a reference image, inherently
encompasses rich, multi-faceted search intents that are difficult to capture solely using …
Unraveling cross-modality knowledge conflicts in large vision-language models
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for
capturing and reasoning over multimodal inputs. However, these models are prone to …
EgoCVR: An egocentric benchmark for fine-grained composed video retrieval
In Composed Video Retrieval, a video and a textual description which modifies the video
content are provided as inputs to the model. The aim is to retrieve the relevant video with the …
Multi-source spatial knowledge understanding for immersive visual text-to-speech
Visual Text-to-Speech (VTTS) aims to take the spatial environmental image as the prompt to
synthesize the reverberation speech for the spoken content. Previous research focused on …
Denoise-I2W: Mapping images to denoising words for accurate zero-shot composed image retrieval
Zero-Shot Composed Image Retrieval (ZS-CIR) supports diverse tasks with a broad range of
visual content manipulation intentions that can be related to domain, scene, object, and …
A Survey of Multimodal Composite Editing and Retrieval
In the real world, where information is abundant and diverse across different modalities,
understanding and utilizing various data types to improve retrieval systems is a key focus of …