TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

JH Clark, E Choi, M Collins, D Garrette… - Transactions of the …, 2020 - direct.mit.edu
Confidently making progress on multilingual modeling requires challenging, trustworthy
evaluations. We present TyDi QA—a question answering dataset covering 11 typologically …

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

JH Clark, D Garrette, I Turc, J Wieting - Transactions of the Association …, 2022 - direct.mit.edu
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet
nearly all commonly used models still require an explicit tokenization step. While recent …

Systematic mappings of sound to meaning: A theoretical review

DA Haslett, ZG Cai - Psychonomic Bulletin & Review, 2024 - Springer
The form of a word sometimes conveys semantic information. For example, the iconic word
gurgle sounds like what it means, and busy is easy to identify as an English adjective …

Big code != big vocabulary: Open-vocabulary models for source code

RM Karampatsis, H Babii, R Robbes, C Sutton… - Proceedings of the …, 2020 - dl.acm.org
Statistical language modeling techniques have successfully been applied to large source
code corpora, yielding a variety of new software development tools, such as tools for code …

Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold

S Ruder, I Vulić, A Søgaard - arXiv preprint arXiv:2206.09755, 2022 - arxiv.org
The prototypical NLP experiment trains a standard architecture on labeled English data and
optimizes for accuracy, without accounting for other dimensions such as fairness …

Challenges of language technologies for the indigenous languages of the Americas

M Mager, X Gutierrez-Vasques, G Sierra… - arXiv preprint arXiv …, 2018 - arxiv.org
Indigenous languages of the American continent are highly diverse. However, they have
received little attention from the technological perspective. In this paper, we review the …

Multi-SimLex: A large-scale evaluation of multilingual and crosslingual lexical semantic similarity

I Vulić, S Baker, EM Ponti, U Petti, I Leviant… - Computational …, 2020 - direct.mit.edu
We introduce Multi-SimLex, a large-scale lexical resource and evaluation
benchmark covering data sets for 12 typologically diverse languages, including major …

A call for more rigor in unsupervised cross-lingual learning

M Artetxe, S Ruder, D Yogatama, G Labaka… - arXiv preprint arXiv …, 2020 - arxiv.org
We review motivations, definition, approaches, and methodology for unsupervised cross-
lingual learning and call for a more rigorous position in each of them. An existing rationale …

Context sensitive neural lemmatization with Lematus

T Bergmanis, S Goldwater - … of the 2018 Conference of the North …, 2018 - aclanthology.org
The main motivation for developing context-sensitive lemmatizers is to improve performance
on unseen and ambiguous words. Yet previous systems have not carefully evaluated …

Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural machine translation

M Maimaiti, Y Liu, H Luan, M Sun - Tsinghua Science and …, 2021 - ieeexplore.ieee.org
Most State-Of-The-Art (SOTA) Neural Machine Translation (NMT) systems today achieve
outstanding results based only on large parallel corpora. The large-scale parallel corpora for …