Holistic evaluation of language models

P Liang, R Bommasani, T Lee, D Tsipras… - arXiv preprint arXiv …, 2022 - arxiv.org
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …

Dynabench: Rethinking benchmarking in NLP

D Kiela, M Bartolo, Y Nie, D Kaushik, A Geiger… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …

Can large language models transform computational social science?

C Ziems, W Held, O Shaikh, J Chen, Z Zhang… - Computational …, 2024 - direct.mit.edu
Large language models (LLMs) are capable of successfully performing many language
processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify …

RealTime QA: what's the answer right now?

J Kasai, K Sakaguchi, R Le Bras… - Advances in …, 2024 - proceedings.neurips.cc
We introduce RealTime QA, a dynamic question answering (QA) platform that announces
questions and evaluates systems on a regular basis (weekly in this version). RealTime QA …

Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations

L Yuan, Y Chen, G Cui, H Gao, F Zou… - Advances in …, 2023 - proceedings.neurips.cc
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of
NLP. We find that the distribution shift settings in previous studies commonly lack adequate …

Adaptive testing and debugging of NLP models

MT Ribeiro, S Lundberg - Proceedings of the 60th Annual Meeting …, 2022 - aclanthology.org
Current approaches to testing and debugging NLP models rely on highly variable human
creativity and extensive labor, or only work for a very restrictive class of bugs. We present …

The GEM benchmark: Natural language generation, its evaluation and metrics

S Gehrmann, T Adewumi, K Aggarwal… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce GEM, a living benchmark for natural language Generation (NLG), its
Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving …

Mind the gap: Assessing temporal generalization in neural language models

A Lazaridou, A Kuncoro… - Advances in …, 2021 - proceedings.neurips.cc
Our world is open-ended, non-stationary, and constantly evolving; thus what we talk about
and how we talk about it change over time. This inherent dynamic nature of language …

[PDF] Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora

A Warstadt, A Mueller, L Choshen… - … of the BabyLM …, 2023 - research-collection.ethz.ch
Children can acquire language from less than 100 million words of input. Large language
models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data …

SituatedQA: Incorporating extra-linguistic contexts into QA

MJQ Zhang, E Choi - arXiv preprint arXiv:2109.06157, 2021 - arxiv.org
Answers to the same question may change depending on the extra-linguistic contexts (when
and where the question was asked). To study this challenge, we introduce SituatedQA, an …