INSTRUCTSCORE: Explainable Text Generation Evaluation with Fine-grained Feedback

W Xu, D Wang, L Pan, Z Song, M Freitag… - arXiv preprint arXiv …, 2023 - arxiv.org
Automatically evaluating the quality of language generation is critical. Although recent
learned metrics show high correlation with human judgement, these metrics cannot explain …

GEMBA-MQM: Detecting translation quality error spans with GPT-4

T Kocmi, C Federmann - arXiv preprint arXiv:2310.13988, 2023 - arxiv.org
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect
translation quality errors, specifically for the quality estimation setting without the need for …

MBR and QE finetuning: Training-time distillation of the best and most expensive decoding methods

M Finkelstein, M Freitag - arXiv preprint arXiv:2309.10966, 2023 - arxiv.org
Recent research in decoding methods for Natural Language Generation (NLG) tasks has
shown that the traditional beam search and greedy decoding algorithms are not optimal …

Metric Score Landscape Challenge (MSLC23): Understanding metrics' performance on a wider landscape of translation quality

C Lo, S Larkin, R Knowles - … of the Eighth Conference on Machine …, 2023 - aclanthology.org
The Metric Score Landscape Challenge (MSLC23) dataset aims to gain insight into
metric scores on a wider landscape of machine translation (MT) quality. It provides a …

Overview of the second shared task on automatic minuting (AutoMin) at INLG 2023

T Ghosal, O Bojar, M Hledíková, T Kocmi… - Proceedings of the …, 2023 - aclanthology.org
In this article, we report the findings of the second shared task on Automatic Minuting
(AutoMin) held as a Generation Challenge at the 16th International Natural Language …

Beyond correlation: Making sense of the score differences of new MT evaluation metrics

C Lo, R Knowles, C Goutte - … Summit XIX, Vol. 1: Research Track, 2023 - aclanthology.org
While many new automatic metrics for machine translation evaluation have been proposed
in recent years, BLEU scores are still used as the primary metric in the vast majority of MT …

Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References

T Tang, H Lu, YE Jiang, H Huang, D Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Most research on natural language generation (NLG) relies on evaluation benchmarks
with limited references per sample, which may result in poor correlations with human …