TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Confidently making progress on multilingual modeling requires challenging, trustworthy
evaluations. We present TyDi QA—a question answering dataset covering 11 typologically …
Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet
nearly all commonly used models still require an explicit tokenization step. While recent …
Systematic mappings of sound to meaning: A theoretical review
DA Haslett, ZG Cai - Psychonomic Bulletin & Review, 2024 - Springer
The form of a word sometimes conveys semantic information. For example, the iconic word
gurgle sounds like what it means, and busy is easy to identify as an English adjective …
Big code != big vocabulary: Open-vocabulary models for source code
Statistical language modeling techniques have successfully been applied to large source
code corpora, yielding a variety of new software development tools, such as tools for code …
Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold
The prototypical NLP experiment trains a standard architecture on labeled English data and
optimizes for accuracy, without accounting for other dimensions such as fairness …
Challenges of language technologies for the indigenous languages of the Americas
Indigenous languages of the American continent are highly diverse. However, they have
received little attention from the technological perspective. In this paper, we review the …
Multi-simlex: A large-scale evaluation of multilingual and crosslingual lexical semantic similarity
We introduce Multi-SimLex, a large-scale lexical resource and evaluation
benchmark covering data sets for 12 typologically diverse languages, including major …
A call for more rigor in unsupervised cross-lingual learning
We review motivations, definition, approaches, and methodology for unsupervised cross-
lingual learning and call for a more rigorous position in each of them. An existing rationale …
Context sensitive neural lemmatization with Lematus
T Bergmanis, S Goldwater - … of the 2018 Conference of the North …, 2018 - aclanthology.org
The main motivation for developing context-sensitive lemmatizers is to improve performance
on unseen and ambiguous words. Yet previous systems have not carefully evaluated …
Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural machine translation
Most State-Of-The-Art (SOTA) Neural Machine Translation (NMT) systems today achieve
outstanding results based only on large parallel corpora. The large-scale parallel corpora for …