Survey of post-OCR processing approaches

TTH Nguyen, A Jatowt, M Coustaty… - ACM Computing Surveys …, 2021 - dl.acm.org
Optical character recognition (OCR) is one of the most popular techniques used for
converting printed documents into machine-readable ones. While OCR engines can do well …

Neural machine translation: A review

F Stahlberg - Journal of Artificial Intelligence Research, 2020 - jair.org
The field of machine translation (MT), the automatic translation of written text from one
natural language into another, has experienced a major paradigm shift in recent years …

Text processing like humans do: Visually attacking and shielding NLP systems

S Eger, GG Şahin, A Rücklé, JU Lee, C Schulz… - arXiv preprint arXiv …, 2019 - arxiv.org
Visual modifications to text are often used to obfuscate offensive comments in social media
(e.g., "!d10t") or as a writing style ("1337" in "leet speak"), among other scenarios. We …

Morphological inflection generation with hard monotonic attention

R Aharoni, Y Goldberg - arXiv preprint arXiv:1611.01487, 2016 - arxiv.org
We present a neural model for morphological inflection generation which employs a hard
attention mechanism, inspired by the nearly-monotonic alignment commonly found between …

Reducing sequence length by predicting edit spans with large language models

M Kaneko, N Okazaki - Proceedings of the 2023 Conference on …, 2023 - aclanthology.org
Large Language Models (LLMs) have demonstrated remarkable performance in
various tasks and gained significant attention. LLMs are also used for local sequence …

OCR post correction for endangered language texts

S Rijhwani, A Anastasopoulos, G Neubig - arXiv preprint arXiv …, 2020 - arxiv.org
There is little to no data available to build natural language processing models for most
endangered languages. However, textual data in these languages often exists in formats …

Supervised OCR error detection and correction using statistical and neural machine translation methods

C Amrhein, S Clematide - Journal for Language Technology and …, 2018 - zora.uzh.ch
For indexing the content of digitized historical texts, optical character recognition (OCR)
errors are a hampering problem. To explore the effectivity of new strategies for OCR post …

Multi-input attention for unsupervised OCR correction

R Dong, DA Smith - Proceedings of the 56th Annual Meeting of …, 2018 - aclanthology.org
We propose a novel approach to OCR post-correction that exploits repeated texts in large
corpora both as a source of noisy target outputs for unsupervised training and as a source of …

Neural OCR post-hoc correction of historical corpora

L Lyu, M Koutraki, M Krickl, B Fetahu - Transactions of the Association …, 2021 - direct.mit.edu
Optical character recognition (OCR) is crucial for a deeper access to historical collections.
OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new …

Lexically aware semi-supervised learning for OCR post-correction

S Rijhwani, D Rosenblum… - Transactions of the …, 2021 - direct.mit.edu
Much of the existing linguistic data in many languages of the world is locked away in
non-digitized books and documents. Optical character recognition (OCR) can be used to produce …