EchoSight: Advancing Visual-Language Models with Wiki Knowledge

Y Yan, W Xie - arXiv preprint arXiv:2407.12735, 2024 - arxiv.org
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions
about images using extensive background knowledge. Despite significant advancements …

Automated Multi-level Preference for MLLMs

M Zhang, W Wu, Y Lu, Y Song, K Rong, H Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
Current multimodal Large Language Models (MLLMs) suffer from "hallucination",
occasionally generating responses that are not grounded in the input images. To tackle this …

Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

D Hao, Q Wang, L Guo, J Jiang… - Proceedings of the 2024 …, 2024 - aclanthology.org
While large pre-trained visual-language models have shown promising results on traditional
visual question answering benchmarks, it is still challenging for them to answer complex …

Large Language Models Know What is Key Visual Entity: An LLM-assisted Multimodal Retrieval for VQA

P Jian, D Yu, J Zhang - Proceedings of the 2024 Conference on …, 2024 - aclanthology.org
Visual question answering (VQA) tasks, often performed by visual language models (VLMs),
face challenges with long-tail knowledge. Recent retrieval-augmented VQA (RA-VQA) …

Unified Generative and Discriminative Training for Multi-modal Large Language Models

W Chow, J Li, Q Yu, K Pan, H Fei, Z Ge, S Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent times, Vision-Language Models (VLMs) have been trained under two predominant
paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) …

Unified Text-to-Image Generation and Retrieval

L Qu, H Li, T Wang, W Wang, Y Li, L Nie… - arXiv preprint arXiv …, 2024 - arxiv.org
How humans can efficiently and effectively acquire images is a perennial
question. A typical solution is text-to-image retrieval from an existing database given the text …

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Y Liu, P Chen, J Cai, X Jiang, Y Hu, J Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid advancement of multimodal information retrieval, increasingly complex
retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine …

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

F Cocchi, N Moratelli, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle
multimodal inputs, combining text and image data. They have recently garnered attention …

Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

W Weng, J Zhu, H Zhang, X Meng, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated great zero-shot
performance on visual question answering (VQA). However, when it comes to knowledge …

Augmenting Multi-modal Question Answering Systems with Retrieval Methods

W Lin - 2024 - repository.cam.ac.uk
The quest to develop artificial intelligence systems capable of handling intricate tasks has
propelled the prominence of deep learning, particularly since 2016, when neural network …