Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …
An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers
We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to
improve the tokenization of pretrained language models (PLMs). FLOTA uses the …
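For illustration, the sketch below captures the "few longest tokens" idea in a minimal way: greedily cover a word with the longest vocabulary substrings it contains. The toy vocabulary, the per-branch budget, and the function names are assumptions made for this example; the paper's actual FLOTA procedure (including its handling of WordPiece-style continuation markers) may differ.

```python
def flota_like_tokenize(word, vocab, max_pieces=3):
    """Greedily cover `word` with its longest vocabulary substrings.

    A loose illustration of the 'few longest tokens' idea, not the
    authors' implementation.
    """
    pieces = []  # (absolute_start, piece)

    def longest_match(span):
        # Return (start, piece) for the longest vocab entry inside span, or None.
        for length in range(len(span), 0, -1):
            for start in range(len(span) - length + 1):
                cand = span[start:start + length]
                if cand in vocab:
                    return start, cand
        return None

    def recurse(span, offset, budget):
        if not span or budget <= 0:
            return
        match = longest_match(span)
        if match is None:
            return
        start, piece = match
        pieces.append((offset + start, piece))
        # Recurse on the material left of and right of the match.
        # Simplification: the piece budget is not shared across branches.
        recurse(span[:start], offset, budget - 1)
        recurse(span[start + len(piece):], offset + start + len(piece), budget - 1)

    recurse(word, 0, max_pieces)
    return [piece for _, piece in sorted(pieces)]


toy_vocab = {"under", "estimate", "s", "un", "der"}
print(flota_like_tokenize("underestimates", toy_vocab))  # ['under', 'estimate', 's']
```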
How can NLP help revitalize endangered languages? A case study and roadmap for the Cherokee language
More than 43% of the languages spoken in the world are endangered, and language loss
currently occurs at an accelerated rate because of globalization and neocolonialism. Saving …
A survey on text classification: Practical perspectives on the Italian language
Text Classification methods have been improving at an unparalleled speed in the last
decade thanks to the success brought about by deep learning. Historically, state-of-the-art …
Beyond characters: Subword-level morpheme segmentation
B Peters, AFT Martins - … of the 19th SIGMORPHON Workshop on …, 2022 - aclanthology.org
This paper presents DeepSPIN's submissions to the SIGMORPHON 2022 Shared Task on
Morpheme Segmentation. We make three submissions, all to the word-level subtask. First …
DivEMT: Neural machine translation post-editing effort across typologically diverse languages
We introduce DivEMT, the first publicly available post-editing study of Neural Machine
Translation (NMT) over a typologically diverse set of target languages. Using a strictly …
Languages through the looking glass of BPE compression
X Gutierrez-Vasques, C Bentz… - Computational …, 2023 - direct.mit.edu
Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …
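As a reference point for readers unfamiliar with BPE, the sketch below learns merge operations from word frequencies in the standard Sennrich-style way. It is a generic illustration on an invented toy corpus, not the authors' code; production toolkits (e.g. subword-nmt, SentencePiece) add end-of-word markers, normalization, and efficiency optimizations.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: count} dictionary."""
    # Start with each word as a tuple of single characters.
    corpus = {tuple(w): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the corpus.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges


print(learn_bpe({"lower": 5, "low": 7, "newest": 6, "widest": 3}, 5))
```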
Quantifying synthesis and fusion and their impact on machine translation
Theoretical work in morphological typology offers the possibility of measuring morphological
diversity on a continuous scale. However, literature in Natural Language Processing (NLP) …
Impact of subword pooling strategy on cross-lingual event detection
S Agarwal, S Fincke, C Jenkins, S Miller… - arXiv preprint arXiv …, 2023 - arxiv.org
Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly
advanced the state-of-the-art for zero-shot cross-lingual information extraction. These …
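The "subword pooling" question is how to collapse the several subword vectors a tokenizer produces for one word into a single word representation. The snippet below sketches three common strategies (first, last, mean) on made-up vectors; the function and data are assumptions for illustration, not the paper's implementation.

```python
def pool_subwords(subword_vectors, strategy="first"):
    """Combine a word's subword vectors (equal-length lists of floats) into one vector."""
    if strategy == "first":
        return subword_vectors[0]          # keep the first subword's vector
    if strategy == "last":
        return subword_vectors[-1]         # keep the last subword's vector
    if strategy == "mean":
        dim = len(subword_vectors[0])      # elementwise average over all pieces
        return [sum(v[i] for v in subword_vectors) / len(subword_vectors)
                for i in range(dim)]
    raise ValueError(f"unknown pooling strategy: {strategy}")


# Example: a word split into three subword pieces with 4-dimensional vectors.
pieces = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8],
          [0.9, 1.0, 1.1, 1.2]]
print(pool_subwords(pieces, "mean"))
```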
Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies
Morphologically rich languages pose difficulties to machine translation. Machine translation
engines that rely on statistical learning from parallel training data, such as state-of-the-art …