Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
A survey of evaluation metrics used for NLG systems
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …
Efficient methods for natural language processing: A survey
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …
MENLI: Robust evaluation metrics from natural language inference
Recently proposed BERT-based evaluation metrics for text generation perform well on
standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information …
DiscoScore: Evaluating text generation with BERT and discourse coherence
Recently, there has been a growing interest in designing text generation systems from a
discourse coherence perspective, e.g., modeling the interdependence between sentences …
DEMETR: Diagnosing evaluation metrics for translation
While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their
limitations, their computations are transparent: the BLEU score assigned to a particular …
Towards explainable evaluation metrics for machine translation
Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for
machine translation (for example, COMET or BERTScore) are based on black-box large …
Multi-Objective Hyperparameter Optimization: An Overview
Hyperparameter optimization constitutes a large part of typical modern machine learning
workflows. This arises from the fact that machine learning methods and corresponding …
IndicMT eval: A dataset to meta-evaluate machine translation metrics for Indian languages
The rapid growth of machine translation (MT) systems necessitates meta-evaluations of
evaluation metrics to enable selection of those that best reflect MT quality. Unfortunately …
BLEURT has universal translations: An analysis of automatic metrics by minimum risk training
Automatic metrics play a crucial role in machine translation. Despite the widespread use of n-
gram-based metrics, there has been a recent surge in the development of pre-trained model …