A survey of evaluation metrics used for NLG systems
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …
Evaluation of text generation: A survey
A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …
BERTScore: Evaluating text generation with BERT
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to
common metrics, BERTScore computes a similarity score for each token in the candidate …
Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges
This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT19 News Translation …
Automatic machine translation evaluation in many languages via zero-shot paraphrasing
B Thompson, M Post - arXiv preprint arXiv:2004.14564, 2020 - arxiv.org
We frame the task of machine translation evaluation as one of scoring machine translation
output with a sequence-to-sequence paraphraser, conditioned on a human reference. We …
Biologically inspired design concept generation using generative pre-trained transformers
Biological systems in nature have evolved for millions of years to adapt and survive the
environment. Many features they developed can be inspirational and beneficial for solving …
Automatic text evaluation through the lens of Wasserstein barycenters
A new metric, BaryScore, to evaluate text generation based on deep contextualized
embeddings (e.g., BERT, RoBERTa, ELMo) is introduced. This metric is motivated by a new …
Toward human-like evaluation for natural language generation with error analysis
The state-of-the-art language model-based automatic metrics, e.g., BARTScore, benefiting
from large-scale contextualized pre-training, have been successfully used in a wide range of …
InfoLM: A new metric to evaluate summarization & data2text generation
Assessing the quality of natural language generation (NLG) systems through human
annotation is very expensive. Additionally, human annotation campaigns are time …
Perturbation CheckLists for evaluating NLG evaluation metrics
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment
of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall …