QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …

The debate over understanding in AI's large language models

M Mitchell, DC Krakauer - Proceedings of the National …, 2023 - National Acad Sciences
We survey a current, heated debate in the artificial intelligence (AI) research community on
whether large pretrained language models can be said to understand language—and the …

Impact of pretraining term frequencies on few-shot reasoning

Y Razeghi, RL Logan IV, M Gardner… - arXiv preprint arXiv …, 2022 - arxiv.org
Pretrained Language Models (LMs) have demonstrated the ability to perform numerical
reasoning by extrapolating from a few examples in few-shot settings. However, the extent to …

Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations

L Yuan, Y Chen, G Cui, H Gao, F Zou… - Advances in …, 2023 - proceedings.neurips.cc
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of
NLP. We find that the distribution shift settings in previous studies commonly lack adequate …

WANLI: Worker and AI collaboration for natural language inference dataset creation

A Liu, S Swayamdipta, NA Smith, Y Choi - arXiv preprint arXiv:2201.05955, 2022 - arxiv.org
A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often
rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We …

Generating data to mitigate spurious correlations in natural language inference datasets

Y Wu, M Gardner, P Stenetorp, P Dasigi - arXiv preprint arXiv:2203.12942, 2022 - arxiv.org
Natural language processing models often exploit spurious correlations between task-
independent features and labels in datasets to perform well only within the distributions they …

Towards trustworthy LLMs: a review on debiasing and dehallucinating in large language models

Z Lin, S Guan, W Zhang, H Zhang, Y Li… - Artificial Intelligence …, 2024 - Springer
Recently, large language models (LLMs) have attracted considerable attention due to their
remarkable capabilities. However, LLMs' generation of biased or hallucinatory content …

Tailor: Generating and perturbing text with semantic controls

A Ross, T Wu, H Peng, ME Peters… - arXiv preprint arXiv …, 2021 - arxiv.org
Controlled text perturbation is useful for evaluating and improving model generalizability.
However, current techniques rely on training a model for every target perturbation, which is …

Changing the world by changing the data

A Rogers - arXiv preprint arXiv:2105.13947, 2021 - arxiv.org
The NLP community is currently investing far more research and resources into the development of
deep learning models than into training data. While we have made a lot of progress, it is now …

The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning

J Hullman, S Kapoor, P Nanayakkara… - Proceedings of the …, 2022 - dl.acm.org
Arguments that machine learning (ML) is facing a reproducibility and replication crisis
suggest that some published claims in research cannot be taken at face value. Concerns …