From characters to words to in between: Do we capture morphology?

TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages

JH Clark, E Choi, M Collins, D Garrette… - Transactions of the …, 2020 - direct.mit.edu

Confidently making progress on multilingual modeling requires challenging, trustworthy
evaluations. We present TyDi QA—a question answering dataset covering 11 typologically …

Cited by 563 · Related articles · All 14 versions

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

JH Clark, D Garrette, I Turc, J Wieting - Transactions of the Association …, 2022 - direct.mit.edu

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet
nearly all commonly used models still require an explicit tokenization step. While recent …

Cited by 232 · Related articles · All 9 versions

Systematic mappings of sound to meaning: A theoretical review

DA Haslett, ZG Cai - Psychonomic Bulletin & Review, 2024 - Springer

The form of a word sometimes conveys semantic information. For example, the iconic word
gurgle sounds like what it means, and busy is easy to identify as an English adjective …

Cited by 11 · Related articles · All 4 versions

Big code != big vocabulary: Open-vocabulary models for source code

RM Karampatsis, H Babii, R Robbes, C Sutton… - Proceedings of the …, 2020 - dl.acm.org

Statistical language modeling techniques have successfully been applied to large source
code corpora, yielding a variety of new software development tools, such as tools for code …

Cited by 259 · Related articles · All 13 versions

Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold

S Ruder, I Vulić, A Søgaard - arXiv preprint arXiv:2206.09755, 2022 - arxiv.org

The prototypical NLP experiment trains a standard architecture on labeled English data and
optimizes for accuracy, without accounting for other dimensions such as fairness …

Cited by 36 · Related articles · All 6 versions

Challenges of language technologies for the indigenous languages of the Americas

M Mager, X Gutierrez-Vasques, G Sierra… - arXiv preprint arXiv …, 2018 - arxiv.org

Indigenous languages of the American continent are highly diverse. However, they have
received little attention from the technological perspective. In this paper, we review the …

Cited by 126 · Related articles · All 3 versions

Multi-SimLex: A large-scale evaluation of multilingual and crosslingual lexical semantic similarity

I Vulić, S Baker, EM Ponti, U Petti, I Leviant… - Computational …, 2020 - direct.mit.edu

We introduce Multi-SimLex, a large-scale lexical resource and evaluation
benchmark covering data sets for 12 typologically diverse languages, including major …

Cited by 88 · Related articles · All 18 versions

A call for more rigor in unsupervised cross-lingual learning

M Artetxe, S Ruder, D Yogatama, G Labaka… - arXiv preprint arXiv …, 2020 - arxiv.org

We review motivations, definition, approaches, and methodology for unsupervised cross-
lingual learning and call for a more rigorous position in each of them. An existing rationale …

Cited by 74 · Related articles · All 4 versions

Context sensitive neural lemmatization with Lematus

T Bergmanis, S Goldwater - … of the 2018 Conference of the North …, 2018 - aclanthology.org

The main motivation for developing context-sensitive lemmatizers is to improve performance
on unseen and ambiguous words. Yet previous systems have not carefully evaluated …

Cited by 109 · Related articles · All 3 versions

Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural machine translation

M Maimaiti, Y Liu, H Luan, M Sun - Tsinghua Science and …, 2021 - ieeexplore.ieee.org

Most State-Of-The-Art (SOTA) Neural Machine Translation (NMT) systems today achieve
outstanding results based only on large parallel corpora. The large-scale parallel corpora for …

Cited by 50 · Related articles · All 4 versions