Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Text normalization methods have been commonly applied to historical language or user-
generated content, but less often to dialectal transcriptions. In this paper, we introduce …
Murreviikko - A dialectologically annotated and normalized dataset of Finnish tweets
O Kuparinen - Tenth Workshop on NLP for Similar Languages …, 2023 - aclanthology.org
This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been
dialectologically annotated and manually normalized to a standard form. The dataset can be …
Dialect representation learning with neural dialect-to-standard normalization
O Kuparinen, Y Scherrer - Tenth Workshop on NLP for Similar …, 2023 - aclanthology.org
Language label tokens are often used in multilingual neural language modeling
and sequence-to-sequence learning to enhance the performance of such models. An …
Automatic Normalisation of Middle French and its Impact on Productivity
R Rubino, S Coram-Mekkey, J Gerlach… - Proceedings of the …, 2024 - aclanthology.org
This paper presents a study on automatic normalisation of 16th century documents written in
Middle French. These documents present a large variety of wordforms which require …
Evaluating the Use of Generative LLMs for Intralingual Diachronic Translation of Middle-Polish Texts into Contemporary Polish
C Klamra, K Kryńska, M Ogrodniczuk - International Conference on Asian …, 2023 - Springer
This paper presents efforts towards creating a tool for translating texts from Middle Polish
into modern Polish. Archaic texts sourced from the CBDU digital library were translated into …
CorCoDial - Machine translation techniques for corpus-based computational dialectology
This paper presents CorCoDial, a research project funded by the Academy of Finland
aiming to leverage machine translation technology for corpus-based computational …
Modeling Orthographic Variation in Occitan's Dialects
ZW Hopton, N Aepli - arXiv preprint arXiv:2404.19315, 2024 - arxiv.org
Effectively normalizing textual data poses a considerable challenge, especially for low-
resource languages lacking standardized writing systems. In this study, we fine-tuned a …
Normalizing without Modernizing: Keeping Historical Wordforms of Middle French while Reducing Spelling Variants
Conservation of historical documents benefits from computational methods by alleviating the
manual labor related to digitization and modernization of textual content. Languages usually …
Le projet FREEM: ressources, outils et enjeux pour l'étude du français d'Ancien Régime
Despite their undeniable quality, the resources and tools available for the analysis of
Ancien Régime French are no longer able to meet the challenges of research in …
A transformer-based standardisation system for Scottish Gaelic
J Huang, B Alex, M Bauer, DS Jasin… - … of SIGUL 2023: 2nd …, 2023 - research.ed.ac.uk
The transition from rule-based to neural-based architectures has made it more difficult for
low-resource languages like Scottish Gaelic to participate in modern language technologies …