Large language models are not fair evaluators

P Wang, L Li, L Chen, Z Cai, D Zhu, B Lin… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large
language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of …

Recent advances in natural language inference: A survey of benchmarks, resources, and approaches

S Storks, Q Gao, JY Chai - arXiv preprint arXiv:1904.01172, 2019 - arxiv.org
In the NLP community, recent years have seen a surge of research activities that address
machines' ability to perform deep language understanding which goes beyond what is …

Calibrate before use: Improving few-shot performance of language models

Z Zhao, E Wallace, S Feng, D Klein… - … on machine learning, 2021 - proceedings.mlr.press
GPT-3 can perform numerous tasks when provided a natural language prompt that contains
a few training examples. We show that this type of few-shot learning can be unstable: the …

Symbolic knowledge distillation: from general language models to commonsense models

P West, C Bhagavatula, J Hessel, JD Hwang… - arXiv preprint arXiv …, 2021 - arxiv.org
The common practice for training commonsense models has gone
from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense …

Automatic story generation: Challenges and attempts

A Alabdulkarim, S Li, X Peng - arXiv preprint arXiv:2102.12634, 2021 - arxiv.org
The scope of this survey paper is to explore the challenges in automatic story generation.
We hope to contribute in the following ways: 1. Explore how previous research in story …

CommonsenseQA: A question answering challenge targeting commonsense knowledge

A Talmor, J Herzig, N Lourie, J Berant - arXiv preprint arXiv:1811.00937, 2018 - arxiv.org
When answering a question, people often draw upon their rich world knowledge in addition
to the particular context. Recent work has focused primarily on answering questions given …

From recognition to cognition: Visual commonsense reasoning

R Zellers, Y Bisk, A Farhadi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …

Don't take the easy way out: Ensemble based methods for avoiding known dataset biases

C Clark, M Yatskar, L Zettlemoyer - arXiv preprint arXiv:1909.03683, 2019 - arxiv.org
State-of-the-art models often make use of superficial patterns in the data that do not
generalize well to out-of-domain or adversarial settings. For example, textual entailment …

SWAG: A large-scale adversarial dataset for grounded commonsense inference

R Zellers, Y Bisk, R Schwartz, Y Choi - arXiv preprint arXiv:1808.05326, 2018 - arxiv.org
Given a partial description like "she opened the hood of the car," humans can reason about
the situation and anticipate what might come next ("then, she examined the engine"). In this …

Annotation artifacts in natural language inference data

S Gururangan, S Swayamdipta, O Levy… - arXiv preprint arXiv …, 2018 - arxiv.org
Large-scale datasets for natural language inference are created by presenting crowd
workers with a sentence (premise), and asking them to generate three new sentences …