Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Prompting PaLM for translation: Assessing strategies and performance

D Vilar, M Freitag, C Cherry, J Luo, V Ratnakar… - arXiv preprint arXiv …, 2022 - arxiv.org
Large language models (LLMs) that have been trained on multilingual but not parallel text
exhibit a remarkable ability to translate between languages. We probe this ability in an in …
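As a rough illustration of the setup such studies probe, here is a minimal sketch of few-shot translation prompting, assuming a simple English-to-German template; the example pairs and format are illustrative only, not the prompts evaluated in the paper:

```python
# Hypothetical few-shot translation prompt builder. The template and the
# demonstration pairs are assumptions for illustration only.
def build_prompt(shots, source):
    blocks = [f"English: {en}\nGerman: {de}" for en, de in shots]
    blocks.append(f"English: {source}\nGerman:")
    return "\n\n".join(blocks)

shots = [("Hello.", "Hallo."), ("How are you?", "Wie geht es dir?")]
print(build_prompt(shots, "Good morning."))
# The LLM is expected to continue the pattern and emit its translation
# after the final "German:" cue.
```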

Results of the WMT21 Metrics Shared Task: Evaluating metrics with expert-based human evaluations on TED and news domain

M Freitag, R Rei, N Mathur, C Lo… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT21 News Translation …
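Metrics shared tasks of this kind meta-evaluate metrics by how well their scores correlate with human judgments over the same outputs; a minimal sketch of that computation, using made-up toy scores:

```python
# Hedged sketch of metric meta-evaluation: correlate automatic metric
# scores with human quality judgments. All numbers are invented.
from scipy.stats import kendalltau

human_scores  = [0.2, 0.5, 0.9, 0.4, 0.7]   # e.g., human quality ratings
metric_scores = [0.1, 0.6, 0.8, 0.3, 0.9]   # candidate metric's outputs
tau, _ = kendalltau(human_scores, metric_scores)
print(f"segment-level Kendall tau = {tau:.2f}")
```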

LongEval: Guidelines for human evaluation of faithfulness in long-form summarization

K Krishna, E Bransom, B Kuehl, M Iyyer… - arXiv preprint arXiv …, 2023 - arxiv.org
While human evaluation remains best practice for accurately judging the faithfulness of
automatically-generated summaries, few solutions exist to address the increased difficulty …

The MultiBERTs: BERT reproductions for robustness analysis

T Sellam, S Yadlowsky, J Wei, N Saphra… - arXiv preprint arXiv …, 2021 - arxiv.org
Experiments with pre-trained models such as BERT are often based on a single checkpoint.
While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular …

High quality rather than high model probability: Minimum Bayes risk decoding with neural metrics

M Freitag, D Grangier, Q Tan, B Liang - Transactions of the …, 2022 - direct.mit.edu
In Neural Machine Translation, it is typically assumed that the sentence with the
highest estimated probability should also be the translation with the highest quality as …
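Since the title names the technique, a minimal sketch of Minimum Bayes Risk decoding may help: rather than returning the highest-probability sample, it returns the sample with the highest expected utility against the other samples, which act as pseudo-references. The word-overlap utility below is a toy stand-in for a neural metric such as BLEURT or COMET, not the paper's setup:

```python
# Minimal MBR decoding sketch over a pool of model samples.
def mbr_decode(candidates, utility):
    best, best_score = None, float("-inf")
    for hyp in candidates:
        # Expected utility of `hyp`, using the other samples as references.
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

def overlap(hyp, ref):  # toy utility: Jaccard overlap of word sets
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

samples = ["the cat sat", "a cat sat down", "the cat sat down"]
print(mbr_decode(samples, overlap))  # -> "the cat sat down"
```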

Towards question-answering as an automatic metric for evaluating the content quality of a summary

D Deutsch, T Bedrax-Weiss, D Roth - Transactions of the Association …, 2021 - direct.mit.edu
A desirable property of a reference-based evaluation metric that measures the content
quality of a summary is that it should estimate how much information that summary has in …
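The underlying recipe can be sketched as a two-model pipeline: generate questions (with gold answers) from the reference, answer them against the candidate summary, and score the agreement. `generate_questions` and `answer` below are hypothetical stand-ins for a question-generation model and a QA model, and exact-match scoring is a simplification of the paper's approach:

```python
# Hedged sketch of QA-based content evaluation for summaries.
def qa_content_score(candidate, reference, generate_questions, answer):
    qa_pairs = generate_questions(reference)  # -> [(question, gold_answer), ...]
    if not qa_pairs:
        return 0.0
    hits = sum(answer(q, context=candidate) == gold for q, gold in qa_pairs)
    return hits / len(qa_pairs)

# Toy usage with trivial stand-ins for the two models:
gen = lambda ref: [("Who sat down?", "the cat")]
ans = lambda q, context: "the cat" if "cat" in context else "unknown"
print(qa_content_score("the cat sat on the mat", "a cat sat down", gen, ans))
```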

xCOMET: Transparent machine translation evaluation through fine-grained error detection

NM Guerreiro, R Rei, D van Stigt, L Coheur… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used learned metrics for machine translation evaluation, such as COMET and
BLEURT, estimate the quality of a translation hypothesis by providing a single sentence …

LENS: A learnable evaluation metric for text simplification

M Maddela, Y Dou, D Heineman, W Xu - arXiv preprint arXiv:2212.09739, 2022 - arxiv.org
Training learnable metrics using modern language models has recently emerged as a
promising method for the automatic evaluation of machine translation. However, existing …

Automatic text evaluation through the lens of Wasserstein barycenters

P Colombo, G Staerman, C Clavel… - arXiv preprint arXiv …, 2021 - arxiv.org
A new metric, BaryScore, to evaluate text generation based on deep contextualized
embeddings (e.g., BERT, RoBERTa, ELMo) is introduced. This metric is motivated by a new …
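As a rough illustration of the idea, the sketch below compares two texts via the optimal-transport (Wasserstein) distance between their token-embedding clouds. Random vectors stand in for the contextual embeddings, and equal-size clouds with uniform weights reduce the transport problem to an assignment; this toy version does not reproduce the paper's barycenter-based formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_point_clouds(x, y):
    """Exact 1-Wasserstein distance between two equal-size, uniformly
    weighted point clouds (token-embedding matrices of shape [n, d])."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal token matching
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
hyp_emb = rng.normal(size=(5, 8))  # stand-in for candidate token embeddings
ref_emb = rng.normal(size=(5, 8))  # stand-in for reference token embeddings
print(wasserstein_point_clouds(hyp_emb, ref_emb))
```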