Post-hoc interpretability for neural NLP: A survey
Neural networks for NLP are becoming increasingly complex and widespread, and there is
growing concern about whether these models are responsible to use. Explaining models helps to address …
QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …
Holistic evaluation of language models
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …
Training-free structured diffusion guidance for compositional text-to-image synthesis
Large-scale diffusion models have achieved state-of-the-art results on text-to-image
synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we …
Are NLP models really able to solve simple math word problems?
The problem of designing NLP solvers for math word problems (MWP) has seen sustained
research activity and steady gains in test accuracy. Since existing solvers achieve high …
Dynabench: Rethinking benchmarking in NLP
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …
Towards a unified multi-dimensional evaluator for text generation
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural
Language Generation (NLG), i.e., evaluating the generated text from multiple explainable …
Causal inference in natural language processing: Estimation, prediction, interpretation and beyond
A fundamental goal of scientific research is to learn about causal relationships. However,
despite its critical role in the life and social sciences, causality has not had the same …
Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks
The impressive performance of recent language models across a wide range of tasks
suggests that they possess a degree of abstract reasoning skills. Are these skills general …