Survey of post-OCR processing approaches
Optical character recognition (OCR) is one of the most popular techniques used for
converting printed documents into machine-readable ones. While OCR engines can do well …
Neural machine translation with BERT for post-OCR error detection and correction
The quality of OCR has a direct impact on information access, and an indirect impact on the
performance of natural language processing applications, making fine-grained (eg …
End-to-end semi-supervised approach with modulated object queries for table detection in documents
Table detection, a pivotal task in document analysis, aims to precisely recognize and locate
tables within document images. Although deep learning has shown remarkable progress in …
BART for post-correction of OCR newspaper text
E Soper, S Fujimoto, YY Yu - … of the Seventh Workshop on Noisy …, 2021 - aclanthology.org
Optical character recognition (OCR) from newspaper page images is susceptible to noise
due to degradation of old documents and variation in typesetting. In this report, we present a …
The Newspaper Navigator dataset: extracting and analyzing visual content from 16 million historic newspaper pages in Chronicling America
BCG Lee, J Mears, E Jakeway, M Ferriter… - arXiv preprint arXiv …, 2020 - arxiv.org
Chronicling America is a product of the National Digital Newspaper Program, a partnership
between the Library of Congress and the National Endowment for the Humanities to digitize …
Advancing post-OCR correction: A comparative study of synthetic data
This paper explores the application of synthetic data in the post-OCR domain on multiple
fronts by conducting experiments to assess the impact of data volume, augmentation, and …
Automated size-specific dose estimates using deep learning image processing
An automated vendor-independent system for dose monitoring in computed tomography
(CT) medical examinations involving ionizing radiation is presented in this paper. The …
Leveraging LLMs for Post-OCR Correction of Historical Newspapers
Poor OCR quality continues to be a major obstacle for humanities scholars seeking to make
use of digitised primary sources such as historical newspapers. Typical approaches to post …
Post-OCR document correction with large ensembles of character sequence-to-sequence models
In this paper, we propose a novel method to extend sequence-to-sequence models to
accurately process sequences much longer than the ones used during training while being …
A two-step approach for automatic OCR post-correction
R Schaefer, C Neudecker - Proceedings of the 4th Joint SIGHUM …, 2020 - aclanthology.org
The quality of Optical Character Recognition (OCR) is a key factor in the digitisation
of historical documents. OCR errors are a major obstacle for downstream tasks and have …