Supervised OCR error detection and correction using statistical and neural machine translation...

TTH Nguyen, A Jatowt, M Coustaty… - ACM Computing Surveys …, 2021 - dl.acm.org

Optical character recognition (OCR) is one of the most popular techniques used for
converting printed documents into machine-readable ones. While OCR engines can do well …

Cited by 167 Related articles All 4 versions

[PDF] hal.science

Neural machine translation with BERT for post-OCR error detection and correction

TTH Nguyen, A Jatowt, NV Nguyen… - Proceedings of the …, 2020 - dl.acm.org

The quality of OCR has a direct impact on information access, and an indirect impact on the
performance of natural language processing applications, making fine-grained (e.g. …

Cited by 69 Related articles All 4 versions

[PDF] arxiv.org

End-to-end semi-supervised approach with modulated object queries for table detection in documents

I Ehsan, T Shehzadi, D Stricker, MZ Afzal - International Journal on …, 2024 - Springer

Table detection, a pivotal task in document analysis, aims to precisely recognize and locate
tables within document images. Although deep learning has shown remarkable progress in …

Cited by 4 Related articles All 3 versions

[PDF] aclanthology.org

BART for post-correction of OCR newspaper text

E Soper, S Fujimoto, YY Yu - … of the Seventh Workshop on Noisy …, 2021 - aclanthology.org

Optical character recognition (OCR) from newspaper page images is susceptible to noise
due to degradation of old documents and variation in typesetting. In this report, we present a …

Cited by 32 Related articles All 2 versions

[PDF] arxiv.org

The newspaper navigator dataset: extracting and analyzing visual content from 16 million historic newspaper pages in Chronicling America

BCG Lee, J Mears, E Jakeway, M Ferriter… - arXiv preprint arXiv …, 2020 - arxiv.org

Chronicling America is a product of the National Digital Newspaper Program, a partnership
between the Library of Congress and the National Endowment for the Humanities to digitize …

Cited by 43 Related articles All 4 versions

[PDF] arxiv.org

Advancing post-OCR correction: A comparative study of synthetic data

S Guan, D Greene - arXiv preprint arXiv:2408.02253, 2024 - arxiv.org

This paper explores the application of synthetic data in the post-OCR domain on multiple
fronts by conducting experiments to assess the impact of data volume, augmentation, and …

Cited by 3 Related articles All 5 versions

Automated size-specific dose estimates using deep learning image processing

J Juszczyk, P Badura, J Czajkowska, A Wijata… - Medical Image …, 2021 - Elsevier

An automated vendor-independent system for dose monitoring in computed tomography
(CT) medical examinations involving ionizing radiation is presented in this paper. The …

Cited by 26 Related articles All 3 versions

[PDF] aclanthology.org

Leveraging LLMs for Post-OCR Correction of Historical Newspapers

A Thomas, R Gaizauskas, H Lu - Proceedings of the Third …, 2024 - aclanthology.org

Poor OCR quality continues to be a major obstacle for humanities scholars seeking to make
use of digitised primary sources such as historical newspapers. Typical approaches to post …

Cited by 10 Related articles All 4 versions

[PDF] aaai.org

Post-OCR document correction with large ensembles of character sequence-to-sequence models

JA Ramirez-Orta, E Xamena, A Maguitman… - Proceedings of the …, 2022 - ojs.aaai.org

In this paper, we propose a novel method to extend sequence-to-sequence models to
accurately process sequences much longer than the ones used during training while being …

Cited by 20 Related articles All 8 versions

[PDF] aclanthology.org

A two-step approach for automatic OCR post-correction

R Schaefer, C Neudecker - Proceedings of the 4th Joint SIGHUM …, 2020 - aclanthology.org

The quality of Optical Character Recognition (OCR) is a key factor in the digitisation
of historical documents. OCR errors are a major obstacle for downstream tasks and have …

Cited by 31 Related articles All 2 versions