HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

T Guan, F Liu, X Wu, R Xian, Z Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce "HallusionBench," a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

T Guan, F Liu, X Wu, R Xian, Z Li, X Liu… - arXiv preprint arXiv …, 2023 - researchgate.net
Large language models (LLMs), after being aligned with vision models and integrated into
vision-language models (VLMs), can bring impressive improvement in image reasoning …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

Have we built machines that think like people?

LMS Buschoff, E Akata, M Bethge, E Schulz - arXiv preprint arXiv …, 2023 - arxiv.org
A chief goal of artificial intelligence is to build machines that think like people. Yet it has
been argued that deep neural network architectures fail to accomplish this. Researchers …

VGA: Vision GUI Assistant-Minimizing Hallucinations through Image-Centric Fine-Tuning

M Ziyang, Y Dai, Z Gong, S Guo… - Findings of the …, 2024 - aclanthology.org
Large Vision-Language Models (VLMs) have already been applied to the
understanding of Graphical User Interfaces (GUIs) and have achieved notable results …

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

HS Shahgir, KS Sayeed, A Bhattacharjee… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of Vision Language Models (VLM) has allowed researchers to investigate the
visual understanding of a neural network using natural language. Beyond object …

Navigating the risks: A survey of security, privacy, and ethics threats in LLM-based agents

Y Gan, Y Yang, Z Ma, P He, R Zeng, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
With the continuous development of large language models (LLMs), transformer-based
models have made groundbreaking advances in numerous natural language processing …

Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities

Z Zhang, F Hu, J Lee, F Shi, P Kordjamshidi… - arXiv preprint arXiv …, 2024 - arxiv.org
Spatial expressions in situated communication can be ambiguous, as their meanings vary
depending on the frames of reference (FoR) adopted by speakers and listeners. While …

Evaluating Vision-Language Models on Bistable Images

A Panagopoulou, C Melkin… - arXiv preprint arXiv …, 2024 - arxiv.org
Bistable images, also known as ambiguous or reversible images, present visual stimuli that
can be seen in two distinct interpretations, though not simultaneously by the observer. In this …

Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding

Z Shao, C Luo, Z Zhu, H Xing, Z Yu, Q Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have shown impressive capabilities in
document understanding, a rapidly growing research area with significant industrial demand …