INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback
Automatically evaluating the quality of language generation is critical. Although recent
learned metrics show high correlation with human judgement, these metrics cannot explain …
GEMBA-MQM: Detecting translation quality error spans with GPT-4
T Kocmi, C Federmann - arXiv preprint arXiv:2310.13988, 2023 - arxiv.org
This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect
translation quality errors, specifically for the quality estimation setting without the need for …
MBR and QE finetuning: Training-time distillation of the best and most expensive decoding methods
M Finkelstein, M Freitag - arXiv preprint arXiv:2309.10966, 2023 - arxiv.org
Recent research in decoding methods for Natural Language Generation (NLG) tasks has
shown that the traditional beam search and greedy decoding algorithms are not optimal …
Metric score landscape challenge (MSLC23): Understanding metrics' performance on a wider landscape of translation quality
The Metric Score Landscape Challenge (MSLC23) dataset aims to gain insight into
metric scores on a broader/wider landscape of machine translation (MT) quality. It provides a …
Overview of the second shared task on Automatic Minuting (AutoMin) at INLG 2023
In this article, we report the findings of the second shared task on Automatic Minuting
(AutoMin) held as a Generation Challenge at the 16th International Natural Language …
Beyond correlation: Making sense of the score differences of new MT evaluation metrics
While many new automatic metrics for machine translation evaluation have been proposed
in recent years, BLEU scores are still used as the primary metric in the vast majority of MT …
Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References
Most research about natural language generation (NLG) relies on evaluation benchmarks
with limited references for a sample, which may result in poor correlations with human …