Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

MVBench: A comprehensive multi-modal video understanding benchmark

K Li, Y Wang, Y He, Y Li, Y Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

L Li, Y Yin, S Li, L Chen, P Wang, S Ren, M Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction tuning has significantly advanced large language models (LLMs) such as
ChatGPT, enabling them to align with human instructions across diverse tasks. However …

Can pre-trained vision and language models answer visual information-seeking questions?

Y Chen, H Hu, Y Luan, H Sun, S Changpinyo… - arXiv preprint arXiv …, 2023 - arxiv.org
Pre-trained vision and language models have demonstrated state-of-the-art capabilities over
existing tasks involving images and texts, including visual question answering. However, it …

InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its
progression has been hindered by challenges in comprehending fine-grained visual content …

VILA²: VILA Augmented VILA

Y Fang, L Zhu, Y Lu, Y Wang, P Molchanov… - arXiv preprint arXiv …, 2024 - arxiv.org
While visual language model architectures and training infrastructures advance rapidly, data
curation remains under-explored where quantity and quality become a bottleneck. Existing …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …