How suitable are subword segmentation strategies for translating non-concatenative morphology?

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org

What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Cited by 101

An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers

V Hofmann, H Schuetze, JB Pierrehumbert - 2022 - ora.ox.ac.uk

We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to
improve the tokenization of pretrained language models (PLMs). FLOTA uses the …

Cited by 37

How can NLP help revitalize endangered languages? A case study and roadmap for the Cherokee language

S Zhang, B Frey, M Bansal - arXiv preprint arXiv:2204.11909, 2022 - arxiv.org

More than 43% of the languages spoken in the world are endangered, and language loss
currently occurs at an accelerated rate because of globalization and neocolonialism. Saving …

Cited by 28

A survey on text classification: Practical perspectives on the Italian language

A Gasparetto, A Zangari, M Marcuzzo, A Albarelli - PLOS ONE, 2022 - journals.plos.org

Text Classification methods have been improving at an unparalleled speed in the last
decade thanks to the success brought about by deep learning. Historically, state-of-the-art …

Cited by 9

Beyond characters: Subword-level morpheme segmentation

B Peters, AFT Martins - … of the 19th SIGMORPHON Workshop on …, 2022 - aclanthology.org

This paper presents DeepSPIN's submissions to the SIGMORPHON 2022 Shared Task on
Morpheme Segmentation. We make three submissions, all to the word-level subtask. First …

Cited by 10

DivEMT: Neural machine translation post-editing effort across typologically diverse languages

G Sarti, A Bisazza, AG Arenas, A Toral - arXiv preprint arXiv:2205.12215, 2022 - arxiv.org

We introduce DivEMT, the first publicly available post-editing study of Neural Machine
Translation (NMT) over a typologically diverse set of target languages. Using a strictly …

Cited by 11

Languages through the looking glass of BPE compression

X Gutierrez-Vasques, C Bentz… - Computational …, 2023 - direct.mit.edu

Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It
uncovers redundant patterns for compressing the data, and hence alleviates the sparsity …

Cited by 14

Quantifying synthesis and fusion and their impact on machine translation

A Oncevay, D Ataman, N Van Berkel, B Haddow… - arXiv preprint arXiv …, 2022 - arxiv.org

Theoretical work in morphological typology offers the possibility of measuring morphological
diversity on a continuous scale. However, literature in Natural Language Processing (NLP) …

Cited by 6

Impact of subword pooling strategy on cross-lingual event detection

S Agarwal, S Fincke, C Jenkins, S Miller… - arXiv preprint arXiv …, 2023 - arxiv.org

Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly
advanced the state-of-the-art for zero-shot cross-lingual information extraction. These …

Cited by 3

Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies

MWD Marco, M Huck, A Fraser - arXiv preprint arXiv:2203.13550, 2022 - arxiv.org

Morphologically rich languages pose difficulties to machine translation. Machine translation
engines that rely on statistical learning from parallel training data, such as state-of-the-art …

Cited by 1