Robustness of Structured Data Extraction from In-Plane Rotated Documents Using Multi-Modal Large Language Models (LLM)

A Biswas, W Talukdar - Journal of Artificial Intelligence Research, 2024 - arxiv.org
Multi-modal large language models (LLMs) have shown remarkable performance in various
natural language processing tasks, including data extraction from documents. However, the …

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

A Miyai, J Yang, J Zhang, Y Ming, Y Lin, Q Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine
learning systems and has shaped the field of OOD detection. Meanwhile, several other …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Deep learning for the harmonization of structural MRI scans: a survey

S Abbasi, H Lan, J Choupan, N Sheikh-Bahaei… - BioMedical Engineering …, 2024 - Springer
Medical imaging datasets for research are frequently collected from multiple imaging centers
using different scanners, protocols, and settings. These variations affect data consistency …

MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era

J Nie, G Zhang, W An, YP Tan, AC Kot, S Lu - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the recent advancements in Multi-modal Large Language Models (MLLMs),
understanding inter-object relations, i.e., interactions or associations between distinct objects …

Reefknot: A comprehensive benchmark for relation hallucination evaluation, analysis and mitigation in multimodal large language models

K Zheng, J Chen, Y Yan, X Zou, X Hu - arXiv preprint arXiv:2408.09429, 2024 - arxiv.org
Hallucination issues persistently plague current multimodal large language models
(MLLMs). While existing research primarily focuses on object-level or attribute-level …

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

J Chen, D Yang, T Wu, Y Jiang, X Hou, M Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications,
including medical visual question answering and imaging report generation. While these …

Mitigating Object Hallucination via Data Augmented Contrastive Tuning

P Sarkar, S Ebrahimi, A Etemad, A Beirami… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite their remarkable progress, Multimodal Large Language Models (MLLMs) tend to
hallucinate factually inaccurate information. In this work, we address object hallucinations in …

CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models

J Wu, X Li, T Yu, Y Wang, X Chen, J Gu, L Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning in multimodal large language models (MLLMs) aims to smoothly integrate
a backbone LLM with a pre-trained feature encoder for downstream tasks. The major …

Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

B Yan, J Zhang, Z Yuan, S Shan, X Chen - arXiv preprint arXiv:2406.17115, 2024 - arxiv.org
Despite the rapid progress and outstanding performance of Large Vision-Language Models
(LVLMs) in recent years, LVLMs have been plagued by the issue of hallucination, i.e., LVLMs …