A survey of evaluation metrics used for NLG systems
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …
Human evaluation of automatically generated text: Current trends and best practice guidelines
Currently, there is little agreement as to how Natural Language Generation (NLG) systems
should be evaluated, with a particularly high degree of variation in the way that human …
Can large language models be an alternative to human evaluations?
Human evaluation is indispensable and inevitable for assessing the quality of texts
generated by machine learning models or written by humans. However, human evaluation is …
All that's 'human' is not gold: Evaluating human evaluation of generated text
Human evaluations are typically considered the gold standard in natural language
generation, but as models' fluency improves, how well can evaluators detect and judge …
Evaluation of text generation: A survey
A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …
Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension
Question answering (QA) is a fundamental means to facilitate assessment and training of
narrative comprehension skills for both machines and young children, yet there is scarcity of …
Assessing the quality of student-generated short answer questions using GPT-3
Generating short answer questions is a popular form of learnersourcing with benefits for
both the students' higher-order thinking and the instructors' collection of assessment items …
We need to consider disagreement in evaluation
Where have we been, and where are we going? It is easier to talk about the past than the
future. These days, benchmarks evolve more bottom up (such as papers with code). There …
Assessing the quality of multiple-choice questions using GPT-4 and rule-based methods
Multiple-choice questions with item-writing flaws can negatively impact student learning and
skew analytics. These flaws are often present in student-generated questions, making it …
Counseling-style reflection generation using generative pretrained transformers with augmented context
We introduce a counseling dialogue system that seeks to assist counselors while they are
learning and refining their counseling skills. The system generates counselors' reflections …