Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S. Gehrmann, E. Clark, T. Sellam. Journal of Artificial Intelligence Research, 2023. jair.org
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become …
A survey of evaluation metrics used for NLG systems
In the last few years, a large number of automatic evaluation metrics have been proposed for evaluating Natural Language Generation (NLG) systems. The rapid development and …
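Many of the metrics such surveys catalogue are surface-overlap scores that are straightforward to reproduce. As a minimal illustration, BLEU can be computed with the sacrebleu package (one common implementation, chosen here as an assumption about tooling rather than anything the survey prescribes):

```python
import sacrebleu

hypotheses = ["The cat sat on the mat."]
# One list per reference stream, each parallel to the hypotheses.
references = [["There is a cat on the mat."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale
```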
FActScore: Fine-grained atomic evaluation of factual precision in long form text generation
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces …
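The quantity behind FActScore is a precision over atomic facts: the share of a generation's atomic claims that a knowledge source supports. A minimal sketch of that idea, where `extract_atomic_facts` and `is_supported` are hypothetical placeholders for the paper's LM-based decomposition and retrieval-based verification:

```python
def factual_precision(generation, knowledge_source,
                      extract_atomic_facts, is_supported):
    """FActScore-style precision: fraction of atomic facts the source supports.

    `extract_atomic_facts` and `is_supported` are hypothetical stand-ins for
    the paper's LM-based fact decomposition and retrieval-based verification.
    """
    facts = extract_atomic_facts(generation)
    if not facts:
        return 0.0
    supported = sum(is_supported(fact, knowledge_source) for fact in facts)
    return supported / len(facts)
```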
BARTScore: Evaluating generated text as text generation
A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate …
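BARTScore's central idea is to treat evaluation itself as conditional generation: a hypothesis is scored by how likely a pretrained BART model is to generate it from the reference (or source). A minimal sketch of that idea, assuming the Hugging Face transformers library and the facebook/bart-large-cnn checkpoint:

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def bart_score(src: str, hyp: str) -> float:
    """Average log-likelihood of generating `hyp` conditioned on `src`."""
    enc = tok(src, return_tensors="pt", truncation=True)
    dec = tok(hyp, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask,
                    labels=dec.input_ids)
    # `out.loss` is the mean token cross-entropy; negate it so that
    # higher scores mean the hypothesis is more likely.
    return -out.loss.item()
```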
COMET-22: Unbabel-IST 2022 submission for the metrics shared task
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …
COMET: A neural framework for MT evaluation
We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human …
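Both COMET papers are accompanied by the unbabel-comet package. Assuming that package and the released wmt22-comet-da checkpoint associated with the COMET-22 submission, scoring a translation against its source and reference looks roughly like:

```python
from comet import download_model, load_from_checkpoint

# Fetch and load the released checkpoint (an assumption about the
# current package API; versions may differ).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Dem Feuer konnte Einhalt geboten werden",
    "mt": "The fire could be stopped",
    "ref": "They were able to control the fire.",
}]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores, output.system_score)
```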
BLEURT: Learning robust metrics for text generation
Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …
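BLEURT ships as a standalone package with downloadable checkpoints. Assuming the google-research bleurt package and a locally unpacked checkpoint such as BLEURT-20, usage looks roughly like:

```python
from bleurt import score

# Assumes the BLEURT-20 checkpoint has been downloaded and unpacked
# into a local directory of the same name.
scorer = score.BleurtScorer("BLEURT-20")
scores = scorer.score(
    references=["The cat sat on the mat."],
    candidates=["A cat was sitting on the mat."],
)
print(scores)  # one learned quality score per candidate
```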
Findings of the 2019 conference on machine translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine …
To ship or not to ship: An extensive evaluation of automatic metrics for machine translation
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic …
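The comparison this paper examines is system-level: does a metric's ranking of two systems agree with the human ranking? A rough sketch of pairwise accuracy in that spirit (the tie handling here is a simplification, not the paper's exact protocol):

```python
import itertools

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs where metric and human rankings agree.

    Human ties are skipped in this sketch; the paper's exact treatment
    of ties differs.
    """
    agree, total = 0, 0
    for i, j in itertools.combinations(range(len(metric_scores)), 2):
        h = human_scores[i] - human_scores[j]
        if h == 0:
            continue
        m = metric_scores[i] - metric_scores[j]
        total += 1
        agree += int((m > 0) == (h > 0))
    return agree / total if total else 0.0
```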
Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation …
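Metrics shared tasks of this kind ultimately rank metrics by how well their scores correlate with human judgments at the system or segment level. As a minimal illustration, Kendall's tau over hypothetical system-level scores with SciPy:

```python
from scipy.stats import kendalltau

# Hypothetical system-level scores: one entry per MT system.
metric_scores = [71.2, 68.4, 70.1, 65.9]
human_scores = [0.12, -0.03, 0.08, -0.11]

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```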