Survey of post-OCR processing approaches
Optical character recognition (OCR) is one of the most popular techniques used for
converting printed documents into machine-readable ones. While OCR engines can do well …
Neural machine translation: A review
F Stahlberg - Journal of Artificial Intelligence Research, 2020 - jair.org
The field of machine translation (MT), the automatic translation of written text from one
natural language into another, has experienced a major paradigm shift in recent years …
Text processing like humans do: Visually attacking and shielding NLP systems
Visual modifications to text are often used to obfuscate offensive comments in social media
(e.g., "!d10t") or as a writing style ("1337" in "leet speak"), among other scenarios. We …
Morphological inflection generation with hard monotonic attention
R Aharoni, Y Goldberg - arXiv preprint arXiv:1611.01487, 2016 - arxiv.org
We present a neural model for morphological inflection generation which employs a hard
attention mechanism, inspired by the nearly-monotonic alignment commonly found between …
Reducing sequence length by predicting edit spans with large language models
M Kaneko, N Okazaki - Proceedings of the 2023 Conference on …, 2023 - aclanthology.org
Large Language Models (LLMs) have demonstrated remarkable performance in
various tasks and gained significant attention. LLMs are also used for local sequence …
OCR post correction for endangered language texts
There is little to no data available to build natural language processing models for most
endangered languages. However, textual data in these languages often exists in formats …
Supervised OCR error detection and correction using statistical and neural machine translation methods
C Amrhein, S Clematide - Journal for Language Technology and …, 2018 - zora.uzh.ch
For indexing the content of digitized historical texts, optical character recognition (OCR)
errors are a hampering problem. To explore the effectivity of new strategies for OCR post …
Multi-input attention for unsupervised OCR correction
We propose a novel approach to OCR post-correction that exploits repeated texts in large
corpora both as a source of noisy target outputs for unsupervised training and as a source of …
Neural OCR post-hoc correction of historical corpora
Optical character recognition (OCR) is crucial for a deeper access to historical collections.
OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new …
Lexically aware semi-supervised learning for OCR post-correction
S Rijhwani, D Rosenblum… - Transactions of the …, 2021 - direct.mit.edu
Much of the existing linguistic data in many languages of the world is locked away in non-
digitized books and documents. Optical character recognition (OCR) can be used to produce …