Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
Prompting PaLM for translation: Assessing strategies and performance
Large language models (LLMs) that have been trained on multilingual but not parallel text
exhibit a remarkable ability to translate between languages. We probe this ability in an in …
Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT21 News Translation …
LongEval: Guidelines for human evaluation of faithfulness in long-form summarization
While human evaluation remains best practice for accurately judging the faithfulness of
automatically-generated summaries, few solutions exist to address the increased difficulty …
The MultiBERTs: BERT reproductions for robustness analysis
Experiments with pre-trained models such as BERT are often based on a single checkpoint.
While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular …
High quality rather than high model probability: Minimum Bayes risk decoding with neural metrics
In Neural Machine Translation, it is typically assumed that the sentence with the
highest estimated probability should also be the translation with the highest quality as …
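The MBR idea sketched in this abstract can be illustrated in a few lines: instead of returning the most probable hypothesis, score each candidate by its expected utility against the other candidates and return the best. A minimal sketch, using a toy token-overlap utility where a neural metric (e.g., BLEURT or COMET) would stand in practice; all names here are illustrative:

```python
def utility(hyp: str, ref: str) -> float:
    # Toy utility: Jaccard overlap of token sets.
    # In practice a neural metric (BLEURT, COMET, ...) replaces this.
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

def mbr_decode(candidates: list[str]) -> str:
    # Minimum Bayes Risk decoding over a candidate pool: each candidate
    # is scored by its average utility against all other candidates
    # (a Monte Carlo estimate of expected utility), and the best wins.
    best, best_score = None, float("-inf")
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best
```

Note that, unlike MAP decoding, the model's probabilities are only used to draw the candidate pool; the final choice is driven entirely by the utility function.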
Towards question-answering as an automatic metric for evaluating the content quality of a summary
A desirable property of a reference-based evaluation metric that measures the content
quality of a summary is that it should estimate how much information that summary has in …
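QA-based content metrics of this kind typically generate questions from the reference, answer them against the candidate summary, and score each answer with token-level F1. The question-generation and question-answering models are out of scope here; a minimal sketch of just the answer-scoring step (the `answer_f1` name is illustrative, not from the paper):

```python
def answer_f1(pred: str, gold: str) -> float:
    # Token-level F1 between a predicted answer and a gold answer,
    # the usual per-question score in QA-based content evaluation.
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

The per-question F1 scores are then averaged over all generated questions to yield the summary-level content score.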
xCOMET: Transparent machine translation evaluation through fine-grained error detection
Widely used learned metrics for machine translation evaluation, such as COMET and
BLEURT, estimate the quality of a translation hypothesis by providing a single sentence …
LENS: A learnable evaluation metric for text simplification
M Maddela, Y Dou, D Heineman, W Xu - arXiv preprint arXiv:2212.09739, 2022 - arxiv.org
Training learnable metrics using modern language models has recently emerged as a
promising method for the automatic evaluation of machine translation. However, existing …
Automatic text evaluation through the lens of Wasserstein barycenters
A new metric, BaryScore, to evaluate text generation based on deep contextualized
embeddings (e.g., BERT, RoBERTa, ELMo) is introduced. This metric is motivated by a new …
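The optimal-transport distance underlying this family of metrics can be illustrated in one dimension, where the 1-Wasserstein distance between two equal-size empirical samples reduces to the mean absolute difference of their sorted values. BaryScore itself works with barycenters of multi-layer BERT embedding distributions; the sketch below covers only the 1D special case:

```python
import numpy as np

def wasserstein_1d(a, b) -> float:
    # 1-Wasserstein distance between two equal-size 1D samples:
    # optimal transport pairs the i-th smallest value of a with the
    # i-th smallest value of b, so the distance is the mean absolute
    # difference after sorting both samples.
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    assert a.shape == b.shape, "samples must be the same size"
    return float(np.abs(a - b).mean())
```

For higher-dimensional embedding distributions, as in BaryScore, no such closed form exists and the transport plan must be computed numerically (e.g., with Sinkhorn iterations).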