Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Abstract: Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

A survey of evaluation metrics used for NLG systems

AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys (CSUR …, 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …

Efficient methods for natural language processing: A survey

M Treviso, JU Lee, T Ji, B Aken, Q Cao… - Transactions of the …, 2023 - direct.mit.edu
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …

MENLI: Robust evaluation metrics from natural language inference

Y Chen, S Eger - Transactions of the Association for Computational …, 2023 - direct.mit.edu
Recently proposed BERT-based evaluation metrics for text generation perform well on
standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information …

DiscoScore: Evaluating text generation with BERT and discourse coherence

W Zhao, M Strube, S Eger - arXiv preprint arXiv:2201.11176, 2022 - arxiv.org
Recently, there has been a growing interest in designing text generation systems from a
discourse coherence perspective, e.g., modeling the interdependence between sentences …

DEMETR: Diagnosing evaluation metrics for translation

M Karpinska, N Raj, K Thai, Y Song, A Gupta… - arXiv preprint arXiv …, 2022 - arxiv.org
While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their
limitations, their computations are transparent: the BLEU score assigned to a particular …

Towards explainable evaluation metrics for machine translation

C Leiter, P Lertvittayakumjorn, M Fomicheva… - Journal of Machine …, 2024 - jmlr.org
Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for
machine translation (for example, COMET or BERTScore) are based on black-box large …

Multi-Objective Hyperparameter Optimization--An Overview

F Karl, T Pielok, J Moosbauer, F Pfisterer… - arXiv preprint arXiv …, 2022 - arxiv.org
Hyperparameter optimization constitutes a large part of typical modern machine learning
workflows. This arises from the fact that machine learning methods and corresponding …

IndicMT Eval: A dataset to meta-evaluate machine translation metrics for Indian languages

T Dixit, V Nagarajan, A Kunchukuttan… - Proceedings of the …, 2023 - aclanthology.org
The rapid growth of machine translation (MT) systems necessitates meta-evaluations of
evaluation metrics to enable selection of those that best reflect MT quality. Unfortunately …

BLEURT has universal translations: An analysis of automatic metrics by minimum risk training

Y Yan, T Wang, C Zhao, S Huang, J Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Automatic metrics play a crucial role in machine translation. Despite the widespread use of n-
gram-based metrics, there has been a recent surge in the development of pre-trained model …