Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Abstract: Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

A survey of evaluation metrics used for NLG systems

AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …

FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

S Min, K Krishna, X Lyu, M Lewis, W Yih… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factuality of long-form text generated by large language models (LMs) is
non-trivial because (1) generations often contain a mixture of supported and unsupported pieces …

BARTScore: Evaluating generated text as text generation

W Yuan, G Neubig, P Liu - Advances in Neural Information …, 2021 - proceedings.neurips.cc
A wide variety of NLP applications, such as machine translation, summarization, and dialog,
involve text generation. One major challenge for these applications is how to evaluate …

COMET-22: Unbabel-IST 2022 submission for the metrics shared task

R Rei, JGC De Souza, D Alves, C Zerva… - Proceedings of the …, 2022 - aclanthology.org
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics
Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …

COMET: A neural framework for MT evaluation

R Rei, C Stewart, AC Farinha, A Lavie - arXiv preprint arXiv:2009.09025, 2020 - arxiv.org
We present COMET, a neural framework for training multilingual machine translation
evaluation models which obtains new state-of-the-art levels of correlation with human …

BLEURT: Learning robust metrics for text generation

T Sellam, D Das, AP Parikh - arXiv preprint arXiv:2004.04696, 2020 - arxiv.org
Text generation has made significant advances in the last few years. Yet, evaluation metrics
have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …

Findings of the 2019 conference on machine translation (WMT19)

L Barrault, O Bojar, MR Costa-Jussa, C Federmann… - 2019 - zora.uzh.ch
This paper presents the results of the premier shared task organized alongside the
Conference on Machine Translation (WMT) 2019. Participants were asked to build machine …

To ship or not to ship: An extensive evaluation of automatic metrics for machine translation

T Kocmi, C Federmann, R Grundkiewicz… - arXiv preprint arXiv …, 2021 - arxiv.org
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of
one machine translation system's quality over another. The community choice of automatic …

Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain

M Freitag, R Rei, N Mathur, C Lo… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT21 News Translation …