INSTRUCTSCORE: Explainable Text Generation Evaluation with Fine-grained Feedback

W Xu, D Wang, L Pan, Z Song, M Freitag… - arXiv preprint arXiv …, 2023 - arxiv.org
Automatically evaluating the quality of language generation is critical. Although recent
learned metrics show high correlation with human judgement, these metrics cannot explain …

GEMBA-MQM: Detecting translation quality error spans with GPT-4

T Kocmi, C Federmann - arXiv preprint arXiv:2310.13988, 2023 - arxiv.org
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect
translation quality errors, specifically for the quality estimation setting without the need for …

MBR and QE finetuning: Training-time distillation of the best and most expensive decoding methods

M Finkelstein, M Freitag - arXiv preprint arXiv:2309.10966, 2023 - arxiv.org
Recent research in decoding methods for Natural Language Generation (NLG) tasks has
shown that the traditional beam search and greedy decoding algorithms are not optimal …

Metric Score Landscape Challenge (MSLC23): Understanding metrics' performance on a wider landscape of translation quality

C Lo, S Larkin, R Knowles - … of the Eighth Conference on Machine …, 2023 - aclanthology.org
The Metric Score Landscape Challenge (MSLC23) dataset aims to gain insight into
metric scores on a wider landscape of machine translation (MT) quality. It provides a …

Overview of the second shared task on automatic minuting (AutoMin) at INLG 2023

T Ghosal, O Bojar, M Hledíková, T Kocmi… - Proceedings of the …, 2023 - aclanthology.org
In this article, we report the findings of the second shared task on Automatic Minuting
(AutoMin) held as a Generation Challenge at the 16th International Natural Language …

Beyond correlation: Making sense of the score differences of new MT evaluation metrics

C Lo, R Knowles, C Goutte - … Summit XIX, Vol. 1: Research Track, 2023 - aclanthology.org
While many new automatic metrics for machine translation evaluation have been proposed
in recent years, BLEU scores are still used as the primary metric in the vast majority of MT …

Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References

T Tang, H Lu, YE Jiang, H Huang, D Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Most research on natural language generation (NLG) relies on evaluation benchmarks
with limited references per sample, which may result in poor correlations with human …