EchoSight: Advancing Visual-Language Models with Wiki Knowledge
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions
about images using extensive background knowledge. Despite significant advancements …
Automated Multi-level Preference for MLLMs
Current multimodal Large Language Models (MLLMs) suffer from "hallucination",
occasionally generating responses that are not grounded in the input images. To tackle this …
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
While large pre-trained visual-language models have shown promising results on traditional
visual question answering benchmarks, it is still challenging for them to answer complex …
Large Language Models Know What Is Key Visual Entity: An LLM-Assisted Multimodal Retrieval for VQA
Visual question answering (VQA) tasks, often performed by visual language models (VLMs),
face challenges with long-tail knowledge. Recent retrieval-augmented VQA (RA-VQA) …
Unified Generative and Discriminative Training for Multi-modal Large Language Models
In recent times, Vision-Language Models (VLMs) have been trained under two predominant
paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) …
Unified Text-to-Image Generation and Retrieval
How humans can efficiently and effectively acquire images has always been a perennial
question. A typical solution is text-to-image retrieval from an existing database given the text …
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
Y Liu, P Chen, J Cai, X Jiang, Y Hu, J Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid advancement of multimodal information retrieval, increasingly complex
retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine …
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle
multimodal inputs, combining text and image data. They have recently garnered attention …
Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering
Multimodal Large Language Models (MLLMs) have demonstrated great zero-shot
performance on visual question answering (VQA). However, when it comes to knowledge …
Augmenting Multi-modal Question Answering Systems with Retrieval Methods
W Lin - 2024 - repository.cam.ac.uk
The quest to develop artificial intelligence systems capable of handling intricate tasks has
propelled the prominence of deep learning, particularly since 2016, when neural network …