HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

Q Yu, J Li, L Wei, L Pang, W Ye, B Qin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

H Tu, C Cui, Z Wang, Y Zhou, B Zhao, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different
from prior studies, we shift our focus from evaluating standard performance to introducing a …

LLaMA-Adapter: Efficient Fine-tuning of Large Language Models with Zero-initialized Attention

R Zhang, J Han, C Liu, A Zhou, P Lu… - The Twelfth …, 2024 - openreview.net
With the rising tide of large language models (LLMs), there has been a growing interest in
developing general-purpose instruction-following models, e.g., ChatGPT. To this end, we …

Sight Beyond Text: Multi-modal Training Enhances LLMs in Truthfulness and Ethics

H Tu, B Zhao, C Wei, C Xie - arXiv preprint arXiv:2309.07120, 2023 - arxiv.org
Multi-modal large language models (MLLMs) are trained based on large language models
(LLMs), with an enhanced capability to comprehend multi-modal inputs and generate textual …

Causal Inference with Latent Variables: Recent Advances and Future Prospectives

Y Zhu, Y He, J Ma, M Hu, S Li, J Li - Proceedings of the 30th ACM …, 2024 - dl.acm.org
Causality lays the foundation for the trajectory of our world. Causal inference (CI), which
aims to infer intrinsic causal relations among variables of interest, has emerged as a crucial …

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models

Y Li, W Tian, Y Jiao, J Chen, YG Jiang - arXiv preprint arXiv:2404.12966, 2024 - arxiv.org
Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making
presuppositions based on established facts and extrapolating potential outcomes. Existing …

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

B Zhao, Y Zong, L Zhang, T Hospedales - arXiv preprint arXiv:2406.12742, 2024 - arxiv.org
The advancement of large language models (LLMs) has significantly broadened the scope
of applications in natural language processing, with multi-modal LLMs extending these …

Enhancing Multimodal Understanding With LIUS: A Novel Framework for Visual Question Answering in Digital Marketing

C Song - Journal of Organizational and End User Computing …, 2024 - igi-global.com
VQA (visual question answering) is the task of enabling a computer to generate accurate
textual answers based on given images and related questions. It integrates computer vision …

How Many Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

H Tu, C Cui, Z Wang, Y Zhou, B Zhao, J Han, W Zhou… - ecva.net
This work focuses on benchmarking the capabilities of vision large language models
(VLLMs) in visual reasoning. Different from prior studies, we shift our focus from evaluating …