Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
Impact of pretraining term frequencies on few-shot reasoning
Pretrained Language Models (LMs) have demonstrated ability to perform numerical
reasoning by extrapolating from a few examples in few-shot settings. However, the extent to …
Mind the gap: Assessing temporal generalization in neural language models
A Lazaridou, A Kuncoro… - Advances in …, 2021 - proceedings.neurips.cc
Our world is open-ended, non-stationary, and constantly evolving; thus what we talk about
and how we talk about it change over time. This inherent dynamic nature of language …
State-of-the-art generalisation research in NLP: a taxonomy and review
The ability to generalise well is one of the primary desiderata of natural language
processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is …
MultiEURLEX--A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal
documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 …
Randomness in neural network training: Characterizing the impact of tooling
The quest for determinism in machine learning has disproportionately focused on
characterizing the impact of noise introduced by algorithmic design choices. In this work, we …
Temporal effects on pre-trained models for language processing tasks
Keeping the performance of language technologies optimal as time passes is of great
practical interest. We study temporal effects on model performance on downstream …
Semeval-2021 task 1: Lexical complexity prediction
This paper presents the results and main findings of SemEval-2021 Task 1: Lexical
Complexity Prediction. We provided participants with an augmented version of the CompLex …
Fairlex: A multilingual benchmark for evaluating fairness in legal text processing
We present a benchmark suite of four datasets for evaluating the fairness of pre-trained
language models and the techniques used to fine-tune them for downstream tasks. Our …
Swiss-judgment-prediction: A multilingual legal judgment prediction benchmark
In many jurisdictions, the excessive workload of courts leads to high delays. Suitable
predictive AI models can assist legal professionals in their work, and thus enhance and …