Survey of post-OCR processing approaches

TTH Nguyen, A Jatowt, M Coustaty… - ACM Computing Surveys …, 2021 - dl.acm.org
Optical character recognition (OCR) is one of the most popular techniques used for
converting printed documents into machine-readable ones. While OCR engines can do well …

Assessing the impact of OCR quality on downstream NLP tasks

D Van Strien, K Beelen, MC Ardanuy, K Hosseini… - 2020 - repository.cam.ac.uk
A growing volume of heritage data is being digitized and made available as text via optical
character recognition (OCR). Scholars and libraries are increasingly using OCR-generated …

Time-lined capturing & delivering of events with SVG & audio overlays: An interactive & versioned content delivery

PA Kishan, CH Sandeep, V Tirupathi… - AIP Conference …, 2022 - pubs.aip.org
Online transmission of video content of various higher resolution bandwidth definitions is
increasing network congestion and limiting the number of connections, content referring & …

Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians

S Oberbichler, E Boroş, A Doucet… - Journal of the …, 2022 - Wiley Online Library
This article considers the interdisciplinary opportunities and challenges of working with
digital cultural heritage, such as digitized historical newspapers, and proposes an integrated …

Neural machine translation with BERT for post-OCR error detection and correction

TTH Nguyen, A Jatowt, NV Nguyen… - Proceedings of the …, 2020 - dl.acm.org
The quality of OCR has a direct impact on information access, and an indirect impact on the
performance of natural language processing applications, making fine-grained (e.g. …

In-depth analysis of the impact of OCR errors on named entity recognition and linking

A Hamdi, EL Pontes, N Sidere, M Coustaty… - Natural Language …, 2023 - cambridge.org
Named entities (NEs) are among the most relevant type of information that can be used to
properly index digital documents and thus easily retrieve them. It has long been observed …

A comprehensive comparison of open-source libraries for handwritten text recognition in Norwegian

M Maarand, Y Beyer, A Kåsen, KT Fosseide… - … Workshop on Document …, 2022 - Springer
In this paper, we introduce an open database of historical handwritten documents fully
annotated in Norwegian, the first of its kind, allowing the development of handwritten text …

Assessing the impact of OCR noise on multilingual event detection over digitised documents

E Boros, NK Nguyen, G Lejeune, A Doucet - International Journal on …, 2022 - Springer
Event detection is a crucial task for natural language processing and it involves the
identification of instances of specified types of events in text and their classification into …

Digitization of Data from Invoice using OCR

VNSR Kamisetty, BS Chidvilas… - 2022 6th …, 2022 - ieeexplore.ieee.org
Optical Character Recognition (OCR) is a predominant aspect to transmute scanned images
and other visuals into text. Computer vision technology is extrapolated onto the system to …

The Newspaper Navigator dataset: extracting and analyzing visual content from 16 million historic newspaper pages in Chronicling America

BCG Lee, J Mears, E Jakeway, M Ferriter… - arXiv preprint arXiv …, 2020 - arxiv.org
Chronicling America is a product of the National Digital Newspaper Program, a partnership
between the Library of Congress and the National Endowment for the Humanities to digitize …