COMET-22: Unbabel-IST 2022 submission for the metrics shared task

R Rei, JGC De Souza, D Alves, C Zerva… - Proceedings of the …, 2022 - aclanthology.org
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics
Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …
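
As a point of reference, the COMET-22 checkpoint is distributed through the open-source COMET toolkit. Below is a minimal scoring sketch, assuming the unbabel-comet Python package and the Unbabel/wmt22-comet-da checkpoint name; it is an illustration of typical usage, not code from the paper.

    # Minimal sketch (assumptions: unbabel-comet package, wmt22-comet-da checkpoint).
    from comet import download_model, load_from_checkpoint

    model_path = download_model("Unbabel/wmt22-comet-da")  # fetch the released checkpoint
    model = load_from_checkpoint(model_path)

    data = [{
        "src": "Der Hund bellt.",     # source segment
        "mt": "The dog is barking.",  # system translation to score
        "ref": "The dog barks.",      # human reference
    }]

    # predict() returns per-segment scores and a corpus-level system score.
    output = model.predict(data, batch_size=8, gpus=0)
    print(output.scores, output.system_score)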

Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain

M Freitag, R Rei, N Mathur, C Lo… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT21 News Translation …

Large language models effectively leverage document-level context for literary translation, but critical errors persist

M Karpinska, M Iyyer - arXiv preprint arXiv:2304.03245, 2023 - arxiv.org
Large language models (LLMs) are competitive with the state of the art on a wide range of
sentence-level translation datasets. However, their ability to translate paragraphs and …

Quality-aware decoding for neural machine translation

P Fernandes, A Farinhas, R Rei, JGC de Souza… - arXiv preprint arXiv …, 2022 - arxiv.org
Despite the progress in machine translation quality estimation and evaluation in recent
years, decoding in neural machine translation (NMT) is mostly oblivious to this and centers …

High quality rather than high model probability: Minimum Bayes risk decoding with neural metrics

M Freitag, D Grangier, Q Tan, B Liang - Transactions of the …, 2022 - direct.mit.edu
In Neural Machine Translation, it is typically assumed that the sentence with the
highest estimated probability should also be the translation with the highest quality as …
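
For context, the minimum Bayes risk (MBR) objective in the title replaces the usual maximum-probability (MAP) selection over a candidate set $\mathcal{H}$ with an expected-utility selection under a metric $u$ (for example a neural metric). This is the standard sampling-based formulation, stated here only as background, not as a detail taken from the paper:

    y_{\mathrm{MAP}} = \arg\max_{y \in \mathcal{H}} p(y \mid x),
    \qquad
    y_{\mathrm{MBR}} = \arg\max_{y \in \mathcal{H}} \frac{1}{|\mathcal{H}|} \sum_{y' \in \mathcal{H}} u(y, y')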

Codescore: Evaluating code generation by learning code execution

Y Dong, J Ding, X Jiang, G Li, Z Li, Z Jin - arXiv preprint arXiv:2301.09043, 2023 - arxiv.org
A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation,
which is an important research field in NLP and software engineering. Prevailing match …

Identifying weaknesses in machine translation metrics through minimum Bayes risk decoding: A case study for COMET

C Amrhein, R Sennrich - arXiv preprint arXiv:2202.05148, 2022 - arxiv.org
Neural metrics have achieved impressive correlation with human judgements in the
evaluation of machine translation systems, but before we can safely optimise towards such …

Findings of the WMT 2023 shared task on quality estimation

F Blain, C Zerva, R Rei, NM Guerreiro… - Proceedings of the …, 2023 - aclanthology.org
We report the results of the WMT 2023 shared task on Quality Estimation, in which the
challenge is to predict the quality of the output of neural machine translation systems at the …

GEMBA-MQM: Detecting translation quality error spans with GPT-4

T Kocmi, C Federmann - arXiv preprint arXiv:2310.13988, 2023 - arxiv.org
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect
translation quality errors, specifically for the quality estimation setting without the need for …

On the evaluation metrics for paraphrase generation

L Shen, L Liu, H Jiang, S Shi - arXiv preprint arXiv:2202.08479, 2022 - arxiv.org
In this paper we revisit automatic metrics for paraphrase evaluation and obtain two findings
that disobey conventional wisdom: (1) Reference-free metrics achieve better performance …