Towards explainable evaluation metrics for machine translation

C Leiter, P Lertvittayakumjorn, M Fomicheva… - Journal of Machine …, 2024 - jmlr.org
Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for
machine translation (for example, COMET or BERTScore) are based on black-box large …

IndicMT eval: A dataset to meta-evaluate machine translation metrics for Indian languages

T Dixit, V Nagarajan, A Kunchukuttan… - Proceedings of the …, 2023 - aclanthology.org
The rapid growth of machine translation (MT) systems necessitates meta-evaluations of
evaluation metrics to enable selection of those that best reflect MT quality. Unfortunately …

Extrinsic evaluation of machine translation metrics

N Moghe, T Sherborne, M Steedman… - arXiv preprint arXiv …, 2022 - arxiv.org
Automatic machine translation (MT) metrics are widely used to distinguish the translation
quality of MT systems across relatively large test sets (system-level …

Reranking for natural language generation from logical forms: A study based on large language models

L Haroutunian, Z Li, L Galescu, P Cohen… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated impressive capabilities in natural
language generation. However, their output quality can be inconsistent, posing challenges …

Navigating the metrics maze: Reconciling score magnitudes and accuracies

T Kocmi, V Zouhar, C Federmann, M Post - arXiv preprint arXiv …, 2024 - arxiv.org
Ten years ago a single metric, BLEU, governed progress in machine translation research.
For better or worse, there is no such consensus today, and consequently it is difficult for …

Metric score landscape challenge (MSLC23): Understanding metrics' performance on a wider landscape of translation quality

C Lo, S Larkin, R Knowles - … of the Eighth Conference on Machine …, 2023 - aclanthology.org
Abstract The Metric Score Landscape Challenge (MSLC23) dataset aims to gain insight into
metric scores on a broader/wider landscape of machine translation (MT) quality. It provides a …

ACES: Translation accuracy challenge sets at WMT 2023

C Amrhein, N Moghe, L Guillou - arXiv preprint arXiv:2311.01153, 2023 - arxiv.org
We benchmark the performance of segment-level metrics submitted to WMT 2023 using the
ACES Challenge Set (Amrhein et al., 2022). The challenge set consists of 36K examples …

MT-Ranker: Reference-free machine translation evaluation by inter-system ranking

IM Moosa, R Zhang, W Yin - arXiv preprint arXiv:2401.17099, 2024 - arxiv.org
Traditionally, Machine Translation (MT) Evaluation has been treated as a regression
problem, producing an absolute translation-quality score. This approach has two limitations …

Towards fine-grained information: Identifying the type and location of translation errors

K Bao, Y Wan, D Liu, B Yang, W Lei, X He… - arXiv preprint arXiv …, 2023 - arxiv.org
Fine-grained information on translation errors is helpful for the translation evaluation
community. Existing approaches cannot jointly consider error position and type …

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

T Glushkova, C Zerva, AFT Martins - arXiv preprint arXiv:2305.19144, 2023 - arxiv.org
Although neural-based machine translation evaluation metrics, such as COMET or BLEURT,
have achieved strong correlations with human judgements, they are sometimes unreliable in …