Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Impact of pretraining term frequencies on few-shot reasoning

Y Razeghi, RL Logan IV, M Gardner… - arXiv preprint arXiv …, 2022 - arxiv.org
Pretrained Language Models (LMs) have demonstrated ability to perform numerical
reasoning by extrapolating from a few examples in few-shot settings. However, the extent to …

Mind the gap: Assessing temporal generalization in neural language models

A Lazaridou, A Kuncoro… - Advances in …, 2021 - proceedings.neurips.cc
Our world is open-ended, non-stationary, and constantly evolving; thus what we talk about
and how we talk about it change over time. This inherent dynamic nature of language …

State-of-the-art generalisation research in NLP: a taxonomy and review

D Hupkes, M Giulianelli, V Dankers, M Artetxe… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to generalise well is one of the primary desiderata of natural language
processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is …

MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

I Chalkidis, M Fergadiotis, I Androutsopoulos - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal
documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 …

Randomness in neural network training: Characterizing the impact of tooling

D Zhuang, X Zhang, S Song… - Proceedings of Machine …, 2022 - proceedings.mlsys.org
The quest for determinism in machine learning has disproportionately focused on
characterizing the impact of noise introduced by algorithmic design choices. In this work, we …

Temporal effects on pre-trained models for language processing tasks

O Agarwal, A Nenkova - Transactions of the Association for …, 2022 - direct.mit.edu
Keeping the performance of language technologies optimal as time passes is of great
practical interest. We study temporal effects on model performance on downstream …

SemEval-2021 task 1: Lexical complexity prediction

M Shardlow, R Evans, GH Paetzold… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical
Complexity Prediction. We provided participants with an augmented version of the CompLex …

FairLex: A multilingual benchmark for evaluating fairness in legal text processing

I Chalkidis, T Pasini, S Zhang, L Tomada… - arXiv preprint arXiv …, 2022 - arxiv.org
We present a benchmark suite of four datasets for evaluating the fairness of pre-trained
language models and the techniques used to fine-tune them for downstream tasks. Our …

Swiss-judgment-prediction: A multilingual legal judgment prediction benchmark

J Niklaus, I Chalkidis, M Stürmer - arXiv preprint arXiv:2110.00806, 2021 - arxiv.org
In many jurisdictions, the excessive workload of courts leads to high delays. Suitable
predictive AI models can assist legal professionals in their work, and thus enhance and …