Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation

O Kuparinen, A Miletić, Y Scherrer - Findings of the Association for …, 2023 - aclanthology.org
Text normalization methods have been commonly applied to historical language or user-
generated content, but less often to dialectal transcriptions. In this paper, we introduce …

Murreviikko - a dialectologically annotated and normalized dataset of Finnish tweets

O Kuparinen - Tenth Workshop on NLP for Similar Languages …, 2023 - aclanthology.org
This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been
dialectologically annotated and manually normalized to a standard form. The dataset can be …

Dialect representation learning with neural dialect-to-standard normalization

O Kuparinen, Y Scherrer - Tenth Workshop on NLP for Similar …, 2023 - aclanthology.org
Language label tokens are often used in multilingual neural language modeling
and sequence-to-sequence learning to enhance the performance of such models. An …

Automatic Normalisation of Middle French and its Impact on Productivity

R Rubino, S Coram-Mekkey, J Gerlach… - Proceedings of the …, 2024 - aclanthology.org
This paper presents a study on automatic normalisation of 16th century documents written in
Middle French. These documents present a large variety of wordforms which require …

Evaluating the Use of Generative LLMs for Intralingual Diachronic Translation of Middle-Polish Texts into Contemporary Polish

C Klamra, K Kryńska, M Ogrodniczuk - International Conference on Asian …, 2023 - Springer
This paper presents efforts towards creating a tool for translating texts from Middle Polish
into modern Polish. Archaic texts sourced from the CBDU digital library were translated into …

CorCoDial - Machine translation techniques for corpus-based computational dialectology

Y Scherrer, O Kuparinen, A Miletić - Proceedings of the 24th …, 2023 - aclanthology.org
This paper presents CorCoDial, a research project funded by the Academy of Finland
aiming to leverage machine translation technology for corpus-based computational …

Modeling Orthographic Variation in Occitan's Dialects

ZW Hopton, N Aepli - arXiv preprint arXiv:2404.19315, 2024 - arxiv.org
Effectively normalizing textual data poses a considerable challenge, especially for low-
resource languages lacking standardized writing systems. In this study, we fine-tuned a …

Normalizing without Modernizing: Keeping Historical Wordforms of Middle French while Reducing Spelling Variants

R Rubino, J Gerlach, J Mutal… - Findings of the …, 2024 - aclanthology.org
Conservation of historical documents benefits from computational methods by alleviating the
manual labor related to digitization and modernization of textual content. Languages usually …

Le projet FREEM: ressources, outils et enjeux pour l'étude du français d'Ancien Régime

S Gabay, PO Suarez, R Bawden, A Bartz… - TALN 2022-Traitement …, 2022 - hal.science
Despite their undeniable quality, the resources and tools available for analyzing
Ancien Régime French are no longer able to meet the challenges of research in …

A transformer-based standardisation system for Scottish Gaelic

J Huang, B Alex, M Bauer, DS Jasin… - … of SIGUL 2023: 2nd …, 2023 - research.ed.ac.uk
The transition from rule-based to neural-based architectures has made it more difficult for
low-resource languages like Scottish Gaelic to participate in modern language technologies …