Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
Holistic evaluation of language models
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …
Benchmarking large language models for news summarization
Large language models (LLMs) have shown promise for automatic summarization but the
reasons behind their successes are poorly understood. By conducting a human evaluation …
FActScore: Fine-grained atomic evaluation of factual precision in long form text generation
Evaluating the factuality of long-form text generated by large language models (LMs) is non-
trivial because (1) generations often contain a mixture of supported and unsupported pieces …
News summarization and evaluation in the era of GPT-3
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …
FacTool: Factuality Detection in Generative AI--A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
The emergence of generative pre-trained models has facilitated the synthesis of high-quality
text, but it has also posed challenges in identifying factual errors in the generated text. In …
A critical evaluation of evaluations for long-form question answering
Long-form question answering (LFQA) enables answering a wide range of questions, but its
flexibility poses enormous challenges for evaluation. We perform the first targeted study of …
Large language model alignment: A survey
Recent years have witnessed remarkable progress made in large language models (LLMs).
Such advancements, while garnering significant attention, have concurrently elicited various …
AlignScore: Evaluating factual consistency with a unified alignment function
Many text generation applications require the generated text to be factually consistent with
input information. Automatic evaluation of factual consistency is challenging. Previous work …
LongEval: Guidelines for human evaluation of faithfulness in long-form summarization
While human evaluation remains best practice for accurately judging the faithfulness of
automatically-generated summaries, few solutions exist to address the increased difficulty …