Towards explainable evaluation metrics for machine translation

C Leiter, P Lertvittayakumjorn, M Fomicheva… - Journal of Machine …, 2024 - jmlr.org
Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for
machine translation (for example, COMET or BERTScore) are based on black-box large …

IndicMT eval: A dataset to meta-evaluate machine translation metrics for Indian languages

T Dixit, V Nagarajan, A Kunchukuttan… - Proceedings of the …, 2023 - aclanthology.org
The rapid growth of machine translation (MT) systems necessitates meta-evaluations of
evaluation metrics to enable selection of those that best reflect MT quality. Unfortunately …

Extrinsic evaluation of machine translation metrics

N Moghe, T Sherborne, M Steedman… - arXiv preprint arXiv …, 2022 - arxiv.org
Automatic machine translation (MT) metrics are widely used to distinguish the translation
quality of MT systems across relatively large test sets (system-level …

Reranking for natural language generation from logical forms: A study based on large language models

L Haroutunian, Z Li, L Galescu, P Cohen… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated impressive capabilities in natural
language generation. However, their output quality can be inconsistent, posing challenges …

Navigating the metrics maze: Reconciling score magnitudes and accuracies

T Kocmi, V Zouhar, C Federmann, M Post - arXiv preprint arXiv …, 2024 - arxiv.org
Ten years ago a single metric, BLEU, governed progress in machine translation research.
For better or worse, there is no such consensus today, and consequently it is difficult for …

Metric score landscape challenge (MSLC23): Understanding metrics' performance on a wider landscape of translation quality

C Lo, S Larkin, R Knowles - … of the Eighth Conference on Machine …, 2023 - aclanthology.org
Abstract The Metric Score Landscape Challenge (MSLC23) dataset aims to gain insight into
metric scores on a broader/wider landscape of machine translation (MT) quality. It provides a …

ACES: Translation accuracy challenge sets at WMT 2023

C Amrhein, N Moghe, L Guillou - arXiv preprint arXiv:2311.01153, 2023 - arxiv.org
We benchmark the performance of segment-level metrics submitted to WMT 2023 using the
ACES Challenge Set (Amrhein et al., 2022). The challenge set consists of 36K examples …

MT-Ranker: Reference-free machine translation evaluation by inter-system ranking

IM Moosa, R Zhang, W Yin - arXiv preprint arXiv:2401.17099, 2024 - arxiv.org
Traditionally, Machine Translation (MT) Evaluation has been treated as a regression
problem, producing an absolute translation-quality score. This approach has two limitations …

Towards fine-grained information: Identifying the type and location of translation errors

K Bao, Y Wan, D Liu, B Yang, W Lei, X He… - arXiv preprint arXiv …, 2023 - arxiv.org
Fine-grained information on translation errors is helpful for the translation evaluation
community. Existing approaches cannot jointly consider error position and type …

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

T Glushkova, C Zerva, AFT Martins - arXiv preprint arXiv:2305.19144, 2023 - arxiv.org
Although neural-based machine translation evaluation metrics, such as COMET or BLEURT,
have achieved strong correlations with human judgements, they are sometimes unreliable in …