A survey of evaluation metrics used for NLG systems

AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys (CSUR …, 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …

Human evaluation of automatically generated text: Current trends and best practice guidelines

C van der Lee, A Gatt, E van Miltenburg… - Computer Speech & …, 2021 - Elsevier
Currently, there is little agreement as to how Natural Language Generation (NLG) systems
should be evaluated, with a particularly high degree of variation in the way that human …

Can large language models be an alternative to human evaluations?

CH Chiang, H Lee - arXiv preprint arXiv:2305.01937, 2023 - arxiv.org
Human evaluation is indispensable and inevitable for assessing the quality of texts
generated by machine learning models or written by humans. However, human evaluation is …

All that's 'human' is not gold: Evaluating human evaluation of generated text

E Clark, T August, S Serrano, N Haduong… - arXiv preprint arXiv …, 2021 - arxiv.org
Human evaluations are typically considered the gold standard in natural language
generation, but as models' fluency improves, how well can evaluators detect and judge …

Evaluation of text generation: A survey

A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …

Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension

Y Xu, D Wang, M Yu, D Ritchie, B Yao, T Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
Question answering (QA) is a fundamental means to facilitate assessment and training of
narrative comprehension skills for both machines and young children, yet there is scarcity of …

Assessing the quality of student-generated short answer questions using GPT-3

S Moore, HA Nguyen, N Bier, T Domadia… - European conference on …, 2022 - Springer
Generating short answer questions is a popular form of learnersourcing with benefits for
both the students' higher-order thinking and the instructors' collection of assessment items …

We need to consider disagreement in evaluation

V Basile, M Fell, T Fornaciari, D Hovy, S Paun… - Proceedings of the 1st …, 2021 - iris.unito.it
Where have we been, and where are we going? It is easier to talk about the past than the
future. These days, benchmarks evolve more bottom up (such as papers with code). There …

Assessing the quality of multiple-choice questions using GPT-4 and rule-based methods

S Moore, HA Nguyen, T Chen, J Stamper - European Conference on …, 2023 - Springer
Multiple-choice questions with item-writing flaws can negatively impact student learning and
skew analytics. These flaws are often present in student-generated questions, making it …

Counseling-style reflection generation using generative pretrained transformers with augmented context

S Shen, C Welch, R Mihalcea… - Proceedings of the 21th …, 2020 - aclanthology.org
We introduce a counseling dialogue system that seeks to assist counselors while they are
learning and refining their counseling skills. The system generates counselors' reflections …