HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

T Guan, F Liu, X Wu, R Xian, Z Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce "HallusionBench," a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

T Guan, F Liu, X Wu, R Xian, Z Li, X Liu… - arXiv preprint arXiv …, 2023 - researchgate.net
Large language models (LLMs), after being aligned with vision models and integrated into
vision-language models (VLMs), can bring impressive improvement in image reasoning …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

Have we built machines that think like people?

LMS Buschoff, E Akata, M Bethge, E Schulz - arXiv preprint arXiv …, 2023 - arxiv.org
A chief goal of artificial intelligence is to build machines that think like people. Yet it has
been argued that deep neural network architectures fail to accomplish this. Researchers …

VGA: Vision GUI Assistant-Minimizing Hallucinations through Image-Centric Fine-Tuning

M Ziyang, Y Dai, Z Gong, S Guo… - Findings of the …, 2024 - aclanthology.org
Large Vision-Language Models (VLMs) have already been applied to the
understanding of Graphical User Interfaces (GUIs) and have achieved notable results …

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

HS Shahgir, KS Sayeed, A Bhattacharjee… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of Vision Language Models (VLM) has allowed researchers to investigate the
visual understanding of a neural network using natural language. Beyond object …

Navigating the risks: A survey of security, privacy, and ethics threats in LLM-based agents

Y Gan, Y Yang, Z Ma, P He, R Zeng, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
With the continuous development of large language models (LLMs), transformer-based
models have made groundbreaking advances in numerous natural language processing …

Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities

Z Zhang, F Hu, J Lee, F Shi, P Kordjamshidi… - arXiv preprint arXiv …, 2024 - arxiv.org
Spatial expressions in situated communication can be ambiguous, as their meanings vary
depending on the frames of reference (FoR) adopted by speakers and listeners. While …

Evaluating Vision-Language Models on Bistable Images

A Panagopoulou, C Melkin… - arXiv preprint arXiv …, 2024 - arxiv.org
Bistable images, also known as ambiguous or reversible images, present visual stimuli that
can be seen in two distinct interpretations, though not simultaneously by the observer. In this …

Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding

Z Shao, C Luo, Z Zhu, H Xing, Z Yu, Q Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have shown impressive capabilities in
document understanding, a rapidly growing research area with significant industrial demand …