Aya dataset: An open-access collection for multilingual instruction tuning

S Singh, F Vargus, D Dsouza, BF Karlsson… - arXiv preprint arXiv …, 2024 - arxiv.org
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many
recent achievements in the space of natural language processing (NLP) can be attributed to …

Qlarify: Recursively Expandable Abstracts for Directed Information Retrieval over Scientific Papers

R Fok, JC Chang, T August, AX Zhang… - arXiv preprint arXiv …, 2023 - talaugust.github.io
As scientific literature has grown exponentially, researchers often rely on paper triaging
strategies such as browsing abstracts before deciding to delve into a paper's full text …

DocFinQA: A long-context financial reasoning dataset

V Reddy, R Koncel-Kedziorski, VD Lai… - arXiv preprint arXiv …, 2024 - arxiv.org
For large language models (LLMs) to be effective in the financial domain--where each
decision can have a significant impact--it is necessary to investigate realistic tasks and data …

Anchor-based large language models

J Pang, F Ye, DF Wong, X He, W Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) predominantly employ decoder-only transformer
architectures, necessitating the retention of keys/values information for historical tokens to …

DocXChain: A powerful open-source toolchain for document parsing and beyond

C Yao - arXiv preprint arXiv:2310.12430, 2023 - arxiv.org
In this report, we introduce DocXChain, a powerful open-source toolchain for document
parsing, which is designed and developed to automatically convert the rich information …

DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

J Van Landeghem, S Maity, A Banerjee… - … on Document Analysis …, 2024 - Springer
This work explores knowledge distillation (KD) for visually-rich document (VRD) applications
such as document layout analysis (DLA) and document image classification (DIC). While …

TruthReader: Towards Trustworthy Document Assistant Chatbot with Reliable Attribution

D Li, X Hu, Z Sun, B Hu, S Ye, Z Shan… - Proceedings of the …, 2024 - aclanthology.org
Document assistant chatbots are empowered with extensive capabilities by Large Language
Models (LLMs) and have exhibited significant advancements. However, these systems may …

Qlarify: Recursively Expandable Abstracts for Dynamic Information Retrieval over Scientific Papers

R Fok, JC Chang, T August, AX Zhang… - Proceedings of the 37th …, 2024 - dl.acm.org
Navigating the vast scientific literature often starts with browsing a paper's abstract.
However, when a reader seeks additional information, not present in the abstract, they face …

DocHieNet: A Large and Diverse Dataset for Document Hierarchy Parsing

H Xing, C Cheng, F Gao, Z Shao, Z Yu… - Proceedings of the …, 2024 - aclanthology.org
Parsing documents from pixels, such as pictures and scanned PDFs, into hierarchical
structures is extensively demanded in the daily routines of data storage, retrieval and …

M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

YK Chia, L Cheng, HP Chan, C Liu, M Song… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability to understand and answer questions over documents can be useful in many
business and practical applications. However, documents often contain lengthy and diverse …