Survey of post-OCR processing approaches

TTH Nguyen, A Jatowt, M Coustaty… - ACM Computing Surveys …, 2021 - dl.acm.org
Optical character recognition (OCR) is one of the most popular techniques used for
converting printed documents into machine-readable ones. While OCR engines can do well …

Neural machine translation with BERT for post-OCR error detection and correction

TTH Nguyen, A Jatowt, NV Nguyen… - Proceedings of the …, 2020 - dl.acm.org
The quality of OCR has a direct impact on information access, and an indirect impact on the
performance of natural language processing applications, making fine-grained (e.g. …

End-to-end semi-supervised approach with modulated object queries for table detection in documents

I Ehsan, T Shehzadi, D Stricker, MZ Afzal - International Journal on …, 2024 - Springer
Table detection, a pivotal task in document analysis, aims to precisely recognize and locate
tables within document images. Although deep learning has shown remarkable progress in …

BART for post-correction of OCR newspaper text

E Soper, S Fujimoto, YY Yu - … of the Seventh Workshop on Noisy …, 2021 - aclanthology.org
Optical character recognition (OCR) from newspaper page images is susceptible to noise
due to degradation of old documents and variation in typesetting. In this report, we present a …

The Newspaper Navigator dataset: extracting and analyzing visual content from 16 million historic newspaper pages in Chronicling America

BCG Lee, J Mears, E Jakeway, M Ferriter… - arXiv preprint arXiv …, 2020 - arxiv.org
Chronicling America is a product of the National Digital Newspaper Program, a partnership
between the Library of Congress and the National Endowment for the Humanities to digitize …

Advancing post-OCR correction: A comparative study of synthetic data

S Guan, D Greene - arXiv preprint arXiv:2408.02253, 2024 - arxiv.org
This paper explores the application of synthetic data in the post-OCR domain on multiple
fronts by conducting experiments to assess the impact of data volume, augmentation, and …

Automated size-specific dose estimates using deep learning image processing

J Juszczyk, P Badura, J Czajkowska, A Wijata… - Medical Image …, 2021 - Elsevier
An automated vendor-independent system for dose monitoring in computed tomography
(CT) medical examinations involving ionizing radiation is presented in this paper. The …

Leveraging LLMs for post-OCR correction of historical newspapers

A Thomas, R Gaizauskas, H Lu - Proceedings of the Third …, 2024 - aclanthology.org
Poor OCR quality continues to be a major obstacle for humanities scholars seeking to make
use of digitised primary sources such as historical newspapers. Typical approaches to post …

Post-OCR document correction with large ensembles of character sequence-to-sequence models

JA Ramirez-Orta, E Xamena, A Maguitman… - Proceedings of the …, 2022 - ojs.aaai.org
In this paper, we propose a novel method to extend sequence-to-sequence models to
accurately process sequences much longer than the ones used during training while being …

A two-step approach for automatic OCR post-correction

R Schaefer, C Neudecker - Proceedings of the 4th Joint SIGHUM …, 2020 - aclanthology.org
The quality of Optical Character Recognition (OCR) is a key factor in the digitisation
of historical documents. OCR errors are a major obstacle for downstream tasks and have …