LLaVA-OneVision: Easy Visual Task Transfer
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …
Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced
instruction-following and reasoning capabilities, has significantly propelled the field of visual …
VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation
Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image
understanding by simultaneously segmenting objects and predicting relations among …
PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with
Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential …
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Recently, many studies have demonstrated that exclusively incorporating OCR-derived text
and spatial layouts with large language models (LLMs) can be highly effective for document …
MMVQA: A Comprehensive Dataset for Investigating Multipage Multimodal Information Retrieval in PDF-based Visual Question Answering
Document Question Answering (QA) presents a challenge in understanding visually-
rich documents (VRD), particularly with lengthy textual content. Existing studies primarily …
MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Document Question Answering (QA) presents a challenge in understanding visually-rich
documents (VRD), particularly those dominated by lengthy textual content like research …
SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed
to enhance Visual Document Understanding (VDU) by generating high-quality, diverse …
Visually Rich Document Understanding and Intelligence
Y Ding - 2024 - ses.library.usyd.edu.au
Visually Rich Documents (VRDs) are potent carriers of multimodal information widely used
in academia, finance, medical fields, and marketing. Traditional approaches to extracting …