Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Why don't people use character-level machine translation?

J Libovický, H Schmid, A Fraser - arXiv preprint arXiv:2110.08191, 2021 - arxiv.org
We present a literature and empirical survey that critically assesses the state of the art in
character-level modeling for machine translation (MT). Despite evidence in the literature that …

A reverse positional encoding multi-head attention-based neural machine translation model for Arabic dialects

LH Baniata, S Kang, IKE Ampomah - Mathematics, 2022 - mdpi.com
Languages whose grammatical structure allows free word order, such as Arabic
dialects, are considered a challenge for neural machine translation (NMT) models because …

On sparsifying encoder outputs in sequence-to-sequence models

B Zhang, I Titov, R Sennrich - arXiv preprint arXiv:2004.11854, 2020 - arxiv.org
Sequence-to-sequence models usually transfer all encoder outputs to the decoder for
generation. In this work, by contrast, we hypothesize that these encoder outputs can be …

When is char better than subword: A systematic study of segmentation algorithms for neural machine translation

J Li, Y Shen, S Huang, X Dai… - Proceedings of the 59th …, 2021 - aclanthology.org
Subword segmentation algorithms have been a de facto choice when building neural
machine translation systems. However, most of them need to learn a segmentation model …

Local byte fusion for neural machine translation

MN Sreedhar, X Wan, Y Cheng, J Hu - arXiv preprint arXiv:2205.11490, 2022 - arxiv.org
Subword tokenization schemes are the dominant technique used in current NLP models.
However, such schemes can be rigid and tokenizers built on one corpus do not adapt well to …

Evaluating morphological generalisation in machine translation by distribution-based compositionality assessment

A Moisio, M Creutz, M Kurimo - … of the 24th Nordic Conference on …, 2023 - aclanthology.org
Compositional generalisation refers to the ability to understand and generate a potentially
infinite number of novel meanings using a finite group of known primitives and a set of rules …

The boundaries of meaning: a case study in neural machine translation

Y Balashov - Inquiry, 2022 - Taylor & Francis
The success of deep learning in natural language processing raises intriguing questions
about the nature of linguistic meaning and ways in which it can be processed by natural and …

The LMU Munich systems for the WMT21 unsupervised and very low-resource translation task

J Libovický, A Fraser - Proceedings of the Sixth Conference on …, 2021 - aclanthology.org
We present our submissions to the WMT21 shared task in Unsupervised and Very Low
Resource machine translation between German and Upper Sorbian, German and Lower …

Towards efficient universal neural machine translation

B Zhang - 2022 - era.ed.ac.uk
Humans benefit from communication but suffer from language barriers. Machine translation
(MT) aims to overcome such barriers by automatically transforming information from one …