Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers

V Hofmann, H Schuetze, JB Pierrehumbert - 2022 - ora.ox.ac.uk
We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to
improve the tokenization of pretrained language models (PLMs). FLOTA uses the …

How can NLP help revitalize endangered languages? A case study and roadmap for the Cherokee language

S Zhang, B Frey, M Bansal - arXiv preprint arXiv:2204.11909, 2022 - arxiv.org
More than 43% of the languages spoken in the world are endangered, and language loss
currently occurs at an accelerated rate because of globalization and neocolonialism. Saving …

A survey on text classification: Practical perspectives on the Italian language

A Gasparetto, A Zangari, M Marcuzzo, A Albarelli - PLOS ONE, 2022 - journals.plos.org
Text Classification methods have been improving at an unparalleled speed in the last
decade thanks to the success brought about by deep learning. Historically, state-of-the-art …

Beyond characters: Subword-level morpheme segmentation

B Peters, AFT Martins - … of the 19th SIGMORPHON Workshop on …, 2022 - aclanthology.org
This paper presents DeepSPIN's submissions to the SIGMORPHON 2022 Shared Task on
Morpheme Segmentation. We make three submissions, all to the word-level subtask. First …

DivEMT: Neural machine translation post-editing effort across typologically diverse languages

G Sarti, A Bisazza, AG Arenas, A Toral - arXiv preprint arXiv:2205.12215, 2022 - arxiv.org
We introduce DivEMT, the first publicly available post-editing study of Neural Machine
Translation (NMT) over a typologically diverse set of target languages. Using a strictly …

Languages through the looking glass of BPE compression

X Gutierrez-Vasques, C Bentz… - Computational …, 2023 - direct.mit.edu
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …

Quantifying synthesis and fusion and their impact on machine translation

A Oncevay, D Ataman, N Van Berkel, B Haddow… - arXiv preprint arXiv …, 2022 - arxiv.org
Theoretical work in morphological typology offers the possibility of measuring morphological
diversity on a continuous scale. However, literature in Natural Language Processing (NLP) …

Impact of subword pooling strategy on cross-lingual event detection

S Agarwal, S Fincke, C Jenkins, S Miller… - arXiv preprint arXiv …, 2023 - arxiv.org
Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly
advanced the state-of-the-art for zero-shot cross-lingual information extraction. These …

Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies

MWD Marco, M Huck, A Fraser - arXiv preprint arXiv:2203.13550, 2022 - arxiv.org
Morphologically rich languages pose difficulties to machine translation. Machine translation
engines that rely on statistical learning from parallel training data, such as state-of-the-art …