Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Prompting PaLM for translation: Assessing strategies and performance

D Vilar, M Freitag, C Cherry, J Luo, V Ratnakar… - arXiv preprint arXiv …, 2022 - arxiv.org
Large language models (LLMs) that have been trained on multilingual but not parallel text
exhibit a remarkable ability to translate between languages. We probe this ability in an in …
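As a rough illustration of the setup such studies probe, here is a minimal sketch of few-shot translation prompting, assuming a simple English-to-German template; the example pairs and format are illustrative only, not the prompts evaluated in the paper:

```python
# Hypothetical few-shot translation prompt builder. The template and the
# demonstration pairs are assumptions for illustration only.
def build_prompt(shots, source):
    blocks = [f"English: {en}\nGerman: {de}" for en, de in shots]
    blocks.append(f"English: {source}\nGerman:")
    return "\n\n".join(blocks)

shots = [("Hello.", "Hallo."), ("How are you?", "Wie geht es dir?")]
print(build_prompt(shots, "Good morning."))
# The LLM is expected to continue the pattern and emit its translation
# after the final "German:" cue.
```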

Results of the WMT21 Metrics Shared Task: Evaluating metrics with expert-based human evaluations on TED and news domain

M Freitag, R Rei, N Mathur, C Lo… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT21 News Translation …
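Metrics shared tasks of this kind meta-evaluate metrics by how well their scores correlate with human judgments over the same outputs; a minimal sketch of that computation, using made-up toy scores:

```python
# Hedged sketch of metric meta-evaluation: correlate automatic metric
# scores with human quality judgments. All numbers are invented.
from scipy.stats import kendalltau

human_scores  = [0.2, 0.5, 0.9, 0.4, 0.7]   # e.g., human quality ratings
metric_scores = [0.1, 0.6, 0.8, 0.3, 0.9]   # candidate metric's outputs
tau, _ = kendalltau(human_scores, metric_scores)
print(f"segment-level Kendall tau = {tau:.2f}")
```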

LongEval: Guidelines for human evaluation of faithfulness in long-form summarization

K Krishna, E Bransom, B Kuehl, M Iyyer… - arXiv preprint arXiv …, 2023 - arxiv.org
While human evaluation remains best practice for accurately judging the faithfulness of
automatically-generated summaries, few solutions exist to address the increased difficulty …

The MultiBERTs: BERT reproductions for robustness analysis

T Sellam, S Yadlowsky, J Wei, N Saphra… - arXiv preprint arXiv …, 2021 - arxiv.org
Experiments with pre-trained models such as BERT are often based on a single checkpoint.
While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular …

High quality rather than high model probability: Minimum Bayes risk decoding with neural metrics

M Freitag, D Grangier, Q Tan, B Liang - Transactions of the …, 2022 - direct.mit.edu
In Neural Machine Translation, it is typically assumed that the sentence with the
highest estimated probability should also be the translation with the highest quality as …
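Since the title names the technique, a minimal sketch of Minimum Bayes Risk decoding may help: rather than returning the highest-probability sample, it returns the sample with the highest expected utility against the other samples, which act as pseudo-references. The word-overlap utility below is a toy stand-in for a neural metric such as BLEURT or COMET, not the paper's setup:

```python
# Minimal MBR decoding sketch over a pool of model samples.
def mbr_decode(candidates, utility):
    best, best_score = None, float("-inf")
    for hyp in candidates:
        # Expected utility of `hyp`, using the other samples as references.
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

def overlap(hyp, ref):  # toy utility: Jaccard overlap of word sets
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

samples = ["the cat sat", "a cat sat down", "the cat sat down"]
print(mbr_decode(samples, overlap))  # -> "the cat sat down"
```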

Towards question-answering as an automatic metric for evaluating the content quality of a summary

D Deutsch, T Bedrax-Weiss, D Roth - Transactions of the Association …, 2021 - direct.mit.edu
A desirable property of a reference-based evaluation metric that measures the content
quality of a summary is that it should estimate how much information that summary has in …
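The underlying recipe can be sketched as a two-model pipeline: generate questions (with gold answers) from the reference, answer them against the candidate summary, and score the agreement. `generate_questions` and `answer` below are hypothetical stand-ins for a question-generation model and a QA model, and exact-match scoring is a simplification of the paper's approach:

```python
# Hedged sketch of QA-based content evaluation for summaries.
def qa_content_score(candidate, reference, generate_questions, answer):
    qa_pairs = generate_questions(reference)  # -> [(question, gold_answer), ...]
    if not qa_pairs:
        return 0.0
    hits = sum(answer(q, context=candidate) == gold for q, gold in qa_pairs)
    return hits / len(qa_pairs)

# Toy usage with trivial stand-ins for the two models:
gen = lambda ref: [("Who sat down?", "the cat")]
ans = lambda q, context: "the cat" if "cat" in context else "unknown"
print(qa_content_score("the cat sat on the mat", "a cat sat down", gen, ans))
```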

xCOMET: Transparent machine translation evaluation through fine-grained error detection

NM Guerreiro, R Rei, D van Stigt, L Coheur… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used learned metrics for machine translation evaluation, such as COMET and
BLEURT, estimate the quality of a translation hypothesis by providing a single sentence …

LENS: A learnable evaluation metric for text simplification

M Maddela, Y Dou, D Heineman, W Xu - arXiv preprint arXiv:2212.09739, 2022 - arxiv.org
Training learnable metrics using modern language models has recently emerged as a
promising method for the automatic evaluation of machine translation. However, existing …

Automatic text evaluation through the lens of Wasserstein barycenters

P Colombo, G Staerman, C Clavel… - arXiv preprint arXiv …, 2021 - arxiv.org
A new metric, BaryScore, to evaluate text generation based on deep contextualized
embeddings (e.g., BERT, RoBERTa, ELMo) is introduced. This metric is motivated by a new …
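As a rough illustration of the idea, the sketch below compares two texts via the optimal-transport (Wasserstein) distance between their token-embedding clouds. Random vectors stand in for the contextual embeddings, and equal-size clouds with uniform weights reduce the transport problem to an assignment; this toy version does not reproduce the paper's barycenter-based formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_point_clouds(x, y):
    """Exact 1-Wasserstein distance between two equal-size, uniformly
    weighted point clouds (token-embedding matrices of shape [n, d])."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal token matching
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
hyp_emb = rng.normal(size=(5, 8))  # stand-in for candidate token embeddings
ref_emb = rng.normal(size=(5, 8))  # stand-in for reference token embeddings
print(wasserstein_point_clouds(hyp_emb, ref_emb))
```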