ByT5: Towards a token-free future with pre-trained byte-to-byte models

L Xue, A Barua, N Constant, R Al-Rfou… - Transactions of the …, 2022 - direct.mit.edu
Most widely used pre-trained language models operate on sequences of tokens
corresponding to word or subword units. By comparison, token-free models that operate …

Language varieties of Italy: Technology challenges and opportunities

A Ramponi - Transactions of the Association for Computational …, 2024 - direct.mit.edu
Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe, which
implicitly encodes local knowledge, cultural traditions, artistic expressions, and history of its …

Systematic Inequalities in Language Technology Performance across the World's Languages

D Blasi, A Anastasopoulos, G Neubig - arXiv preprint arXiv:2110.06733, 2021 - arxiv.org
Natural language processing (NLP) systems have become a central technology in
communication, education, medicine, artificial intelligence, and many other domains of …

State-of-the-art generalisation research in NLP: a taxonomy and review

D Hupkes, M Giulianelli, V Dankers, M Artetxe… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to generalise well is one of the primary desiderata of natural language
processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is …

UniMorph 4.0: universal morphology

K Batsuren, O Goldman, S Khalifa, N Habash… - arXiv preprint arXiv …, 2022 - arxiv.org
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-
coverage instantiated normalized morphological inflection tables for hundreds of diverse …

Findings of the WMT shared task on machine translation using terminologies

MMI Alam, I Kvapilíková… - Proceedings of the …, 2021 - aclanthology.org
Language domains that require very careful use of terminology are abundant and
reflect a significant part of the translation industry. In this work we introduce a benchmark for …

IGT2P: From interlinear glossed texts to paradigms

S Moeller, L Liu, C Yang, K Kann… - Proceedings of the 2020 …, 2020 - aclanthology.org
An intermediate step in the linguistic analysis of an under-documented language is to find
and organize inflected forms that are attested in natural speech. From this data, linguists …

Can a transformer pass the wug test? Tuning copying bias in neural morphological inflection models

L Liu, M Hulden - arXiv preprint arXiv:2104.06483, 2021 - arxiv.org
Deep learning sequence models have been successfully applied to the task of
morphological inflection. The results of the SIGMORPHON shared tasks in the past several …

Morphological inflection: A reality check

J Kodner, S Payne, S Khalifa, Z Liu - arXiv preprint arXiv:2305.15637, 2023 - arxiv.org
Morphological inflection is a popular task in sub-word NLP with both practical and cognitive
applications. For years now, state-of-the-art systems have reported high, but also highly …

Ensemble self-training for low-resource languages: Grapheme-to-phoneme conversion and morphological inflection

X Yu, NT Vu, J Kuhn - … of the 17th SIGMORPHON Workshop on …, 2020 - aclanthology.org
We present an iterative data augmentation framework, which trains and searches for an
optimal ensemble and simultaneously annotates new training data in a self-training style …