Holistic evaluation of language models

P Liang, R Bommasani, T Lee, D Tsipras… - arXiv preprint arXiv …, 2022 - arxiv.org
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …

Dynabench: Rethinking benchmarking in NLP

D Kiela, M Bartolo, Y Nie, D Kaushik, A Geiger… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …

Can large language models transform computational social science?

C Ziems, W Held, O Shaikh, J Chen, Z Zhang… - Computational …, 2024 - direct.mit.edu
Large language models (LLMs) are capable of successfully performing many language
processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify …

RealTime QA: what's the answer right now?

J Kasai, K Sakaguchi, R Le Bras… - Advances in …, 2024 - proceedings.neurips.cc
We introduce RealTime QA, a dynamic question answering (QA) platform that announces
questions and evaluates systems on a regular basis (weekly in this version). RealTime QA …

Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations

L Yuan, Y Chen, G Cui, H Gao, F Zou… - Advances in …, 2023 - proceedings.neurips.cc
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of
NLP. We find that the distribution shift settings in previous studies commonly lack adequate …

Adaptive testing and debugging of NLP models

MT Ribeiro, S Lundberg - Proceedings of the 60th Annual Meeting …, 2022 - aclanthology.org
Current approaches to testing and debugging NLP models rely on highly variable human
creativity and extensive labor, or only work for a very restrictive class of bugs. We present …

The GEM benchmark: Natural language generation, its evaluation and metrics

S Gehrmann, T Adewumi, K Aggarwal… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce GEM, a living benchmark for natural language Generation (NLG), its
Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving …

Mind the gap: Assessing temporal generalization in neural language models

A Lazaridou, A Kuncoro… - Advances in …, 2021 - proceedings.neurips.cc
Our world is open-ended, non-stationary, and constantly evolving; thus what we talk about
and how we talk about it change over time. This inherent dynamic nature of language …

[PDF] Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora

A Warstadt, A Mueller, L Choshen… - … of the BabyLM …, 2023 - research-collection.ethz.ch
Children can acquire language from less than 100 million words of input. Large language
models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data …

SituatedQA: Incorporating extra-linguistic contexts into QA

MJQ Zhang, E Choi - arXiv preprint arXiv:2109.06157, 2021 - arxiv.org
Answers to the same question may change depending on the extra-linguistic contexts (when
and where the question was asked). To study this challenge, we introduce SituatedQA, an …