Endangered Languages are not Low-Resourced!

M Hämäläinen - arXiv preprint arXiv:2103.09567, 2021 - arxiv.org
The term low-resourced has been tossed around in the field of natural language processing
to a degree that almost any language that is not English can be called "low-resourced"; …

When word embeddings become endangered

K Alnajjar - arXiv preprint arXiv:2103.13275, 2021 - arxiv.org
Big languages such as English and Finnish have many natural language processing (NLP)
resources and models, but this is not the case for low-resourced and endangered languages …

Sentiment analysis using aligned word embeddings for uralic languages

K Alnajjar, M Hämäläinen, J Rueter - arXiv preprint arXiv:2305.15380, 2023 - arxiv.org
In this paper, we present an approach for translating word embeddings from a majority
language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore …

Using graph-based methods to augment online dictionaries of endangered languages

K Alnajjar, M Hämäläinen… - Workshop on the …, 2022 - researchportal.helsinki.fi
Many endangered Uralic languages have multilingual machine readable dictionaries saved
in an XML format. However, the dictionaries cover translations very inconsistently between …

Modelling the Reduplicating Lushootseed Morphology with an FST and LSTM

J Rueter, M Hämäläinen, K Alnajjar - Proceedings of the Workshop …, 2023 - aclanthology.org
In this paper, we present an FST based approach for conducting morphological analysis,
lemmatization and generation of Lushootseed words. Furthermore, we use the FST to …

Prerequisites for shallow-transfer machine translation of Mordvin languages: Language documentation with a purpose

J Rueter, M Hämäläinen - 2021 - preprints.org
This paper presents the current lexical, morphological, syntactic and rule-based machine
translation work for Erzya and Moksha that can and should be used in the development of a …

Lexd: A finite-state lexicon compiler for non-suffixational morphologies

D Swanson, N Howell - Multilingual Facilitation, 2021 - pdfs.semanticscholar.org
This paper presents lexd, a lexicon compiler for languages with non-suffixational
morphology, which is intended to be faster and easier to use than existing solutions while …

Working Towards Digital Documentation of Uralic Languages With Open-Source Tools and Modern NLP Methods

M Hämäläinen, J Rueter, K Alnajjar… - Proceedings of the Big …, 2023 - aclanthology.org
We present our work towards building an infrastructure for documenting endangered
languages with the focus on Uralic languages in particular. Our infrastructure consists of …

DAG: Dictionary-Augmented Generation for Disambiguation of Sentences in Endangered Uralic Languages using ChatGPT

M Hämäläinen - arXiv preprint arXiv:2411.01531, 2024 - arxiv.org
We showcase that ChatGPT can be used to disambiguate lemmas in two endangered
languages ChatGPT is not proficient in, namely Erzya and Skolt Sami. We augment our …

On Erzya and Moksha Corpora and Analyzer Development, ERME-PSLA 1950s

J Rueter, O Erina, N Kabaeva - Proceedings of the 9th …, 2024 - aclanthology.org
This paper describes materials and annotation facilitation pertinent to the «Erzya-Moksha
Electronic Resources and Linguistic Diversity» (EMERALD) project. It addresses work …