LION: Empowering multimodal large language model with dual-level visual knowledge

G Chen, L Shen, R Shao, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability
to perceive and understand multi-modal signals. However, most of the existing MLLMs …

A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering

Y Li, L Wang, B Hu, X Chen, W Zhong, C Lyu… - arXiv preprint arXiv …, 2023 - arxiv.org
The emergence of multimodal large models (MLMs) has significantly advanced the field of
visual understanding, offering remarkable capabilities in the realm of visual question …

Vision-language instruction tuning: A review and analysis

C Li, Y Ge, D Li, Y Shan - arXiv preprint arXiv:2311.08172, 2023 - arxiv.org
Instruction tuning is an essential supervised training phase for Large Language Models
(LLMs), with the goal of enhancing LLMs' capacity to generalize instruction execution and …

Training multimedia event extraction with generated images and captions

Z Du, Y Li, X Guo, Y Sun, B Li - … of the 31st ACM International Conference …, 2023 - dl.acm.org
Contemporary news reporting increasingly features multimedia content, motivating research
on multimedia event extraction. However, the task lacks annotated multimodal training data …

SwitchGPT: Adapting large language models for non-text outputs

X Wang, B Zhuang, Q Wu - arXiv preprint arXiv:2309.07623, 2023 - arxiv.org
Large Language Models (LLMs), primarily trained on text-based datasets, exhibit
exceptional proficiencies in understanding and executing complex linguistic instructions via …

Visual instruction tuning towards general-purpose multimodal model: A survey

J Huang, J Zhang, K Jiang, H Qiu, S Lu - arXiv preprint arXiv:2312.16602, 2023 - arxiv.org
Traditional computer vision generally solves each single task independently by a dedicated
model with the task instruction implicitly designed in the model architecture, raising two …

Mitigating multilingual hallucination in large vision-language models

X Qu, M Song, W Wei, J Dong, Y Cheng - arXiv preprint arXiv:2408.00550, 2024 - arxiv.org
While Large Vision-Language Models (LVLMs) have exhibited remarkable capabilities
across a wide range of tasks, they suffer from hallucination problems, where models …

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

Y Li, X Chen, B Hu, M Zhang - arXiv preprint arXiv:2402.13546, 2024 - researchgate.net
Long video understanding is a significant and ongoing challenge in the intersection of
multimedia and artificial intelligence. Employing large language models (LLMs) for …

Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use

IE Toubal, A Avinash, NG Alldrin… - Proceedings of the …, 2024 - openaccess.thecvf.com
From content moderation to wildlife conservation, the number of applications that require
models to recognize nuanced or subjective visual concepts is growing. Traditionally …

Path to medical AGI: Unify domain-specific medical LLMs with the lowest cost

J Zhou, X Chen, X Gao - medRxiv, 2023 - medrxiv.org
Medical artificial general intelligence (AGI) is an emerging field that aims to develop systems
specifically designed for medical applications that possess the ability to understand, learn …