LLaVA-OneVision: Easy Visual Task Transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models

J Chen, Y Liu, D Li, X An, W Deng, Z Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced
instruction-following and reasoning capabilities, has significantly propelled the field of visual …

VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation

Z Zhou, M Shi, H Caesar - arXiv preprint arXiv:2311.16492, 2023 - arxiv.org
Panoptic Scene Graph Generation (PSG) aims to achieve comprehensive image
understanding by simultaneously segmenting objects and predicting relations among …

PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition

J Lu, Z Yang, Y Wang, X Liu, B Mac Namee… - arXiv preprint arXiv …, 2024 - arxiv.org
In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with
Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential …

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

J Lu, H Yu, Y Wang, Y Ye, J Tang, Z Yang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, many studies have demonstrated that exclusively incorporating OCR-derived text
and spatial layouts with large language models (LLMs) can be highly effective for document …

MMVQA: A Comprehensive Dataset for Investigating Multipage Multimodal Information Retrieval in PDF-based Visual Question Answering

Y Ding, K Ren, J Huang, S Luo, SC Han - 33rd International Joint …, 2024 - ijcai.org
Document Question Answering (QA) presents a challenge in understanding visually-rich
documents (VRD), particularly with lengthy textual content. Existing studies primarily …

PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering

Y Ding, K Ren, J Huang, S Luo, SC Han - arXiv preprint arXiv:2404.12720, 2024 - arxiv.org
Document Question Answering (QA) presents a challenge in understanding visually-rich
documents (VRD), particularly those dominated by lengthy textual content like research …

SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding

C Ding, X Liu, W Tang, J Li, X Wang, R Zhao… - Proceedings of the 2nd …, 2024 - dl.acm.org
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed
to enhance Visual Document Understanding (VDU) by generating high-quality, diverse …

Visually Rich Document Understanding and Intelligence

Y Ding - 2024 - ses.library.usyd.edu.au
Visually Rich Documents (VRDs) are potent carriers of multimodal information, widely used
in academia, finance, medicine, and marketing. Traditional approaches to extracting …