A survey of evaluation metrics used for NLG systems

AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …

Evaluation of text generation: A survey

A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …

BERTScore: Evaluating text generation with BERT

T Zhang, V Kishore, F Wu, KQ Weinberger… - arXiv preprint arXiv …, 2019 - arxiv.org
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to
common metrics, BERTScore computes a similarity score for each token in the candidate …
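
For orientation, the core of BERTScore is a greedy cosine matching between contextual token embeddings of the candidate and the reference, aggregated into precision, recall, and F1. The sketch below shows only that aggregation step, with random vectors standing in for real BERT embeddings; the helper name bertscore_like is ours, for illustration.

# Minimal sketch of the greedy-matching aggregation behind BERTScore.
# Assumes contextual token embeddings were already computed (e.g., with BERT);
# random vectors are used here purely to make the snippet runnable.
import numpy as np

def bertscore_like(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """cand_emb: (n_cand, d), ref_emb: (n_ref, d) L2-normalised token embeddings."""
    sim = cand_emb @ ref_emb.T                 # pairwise cosine similarities
    precision = sim.max(axis=1).mean()         # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()            # each reference token -> best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def normed(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
cand = normed(rng.normal(size=(7, 768)))       # 7 candidate tokens, 768-dim embeddings
ref = normed(rng.normal(size=(9, 768)))        # 9 reference tokens
print(bertscore_like(cand, ref))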

Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges

Q Ma, JTZ Wei, O Bojar, Y Graham - 2019 - doras.dcu.ie
This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked
to score the outputs of the translation systems competing in the WMT19 News Translation …

Automatic machine translation evaluation in many languages via zero-shot paraphrasing

B Thompson, M Post - arXiv preprint arXiv:2004.14564, 2020 - arxiv.org
We frame the task of machine translation evaluation as one of scoring machine translation
output with a sequence-to-sequence paraphraser, conditioned on a human reference. We …
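
The scoring idea lends itself to a short sketch: teacher-force the system output through a sequence-to-sequence model conditioned on the human reference and read off the per-token log-likelihood. The checkpoint and helper below (t5-small, paraphrase_score) are stand-ins chosen for illustration, not the authors' multilingual paraphraser.

# Minimal sketch: score a hypothesis by the (length-normalised) log-probability a
# seq2seq model assigns to it when conditioned on the reference. The t5-small
# checkpoint is only a placeholder for a trained paraphraser.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval()

def paraphrase_score(reference: str, hypothesis: str) -> float:
    enc = tok(reference, return_tensors="pt")
    labels = tok(hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**enc, labels=labels)      # teacher-forced decoding of the hypothesis
    return -out.loss.item()                    # mean per-token log-likelihood

print(paraphrase_score("The cat sat on the mat.", "A cat was sitting on the mat."))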

Biologically inspired design concept generation using generative pre-trained transformers

Q Zhu, X Zhang, J Luo - Journal of Mechanical …, 2023 - asmedigitalcollection.asme.org
Biological systems in nature have evolved for millions of years to adapt to and survive their
environment. Many features they developed can be inspirational and beneficial for solving …

Automatic text evaluation through the lens of Wasserstein barycenters

P Colombo, G Staerman, C Clavel… - arXiv preprint arXiv …, 2021 - arxiv.org
A new metric, BaryScore, to evaluate text generation based on deep contextualized
embeddings (e.g., BERT, RoBERTa, ELMo) is introduced. This metric is motivated by a new …
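
As a rough illustration of the optimal-transport framing suggested by the title, the sketch below compares candidate and reference token embeddings as discrete distributions under a Wasserstein cost. The POT-based code and random embeddings are our assumption, not the paper's implementation.

# Minimal sketch: treat candidate and reference as uniform distributions over their
# contextual token embeddings and compare them with an exact optimal-transport cost.
# Random vectors stand in for real BERT embeddings.
import numpy as np
import ot  # pip install POT

rng = np.random.default_rng(0)
cand_emb = rng.normal(size=(7, 768))   # 7 candidate-token embeddings (stand-ins)
ref_emb = rng.normal(size=(9, 768))    # 9 reference-token embeddings (stand-ins)

a = ot.unif(len(cand_emb))             # uniform weight on each candidate token
b = ot.unif(len(ref_emb))              # uniform weight on each reference token
M = ot.dist(cand_emb, ref_emb)         # pairwise squared-Euclidean cost matrix
w_cost = ot.emd2(a, b, M)              # exact optimal-transport (Wasserstein) cost
print(w_cost)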

Toward human-like evaluation for natural language generation with error analysis

Q Lu, L Ding, L Xie, K Zhang, DF Wong… - arXiv preprint arXiv …, 2022 - arxiv.org
The state-of-the-art language model-based automatic metrics, e.g., BARTScore, benefiting
from large-scale contextualized pre-training, have been successfully used in a wide range of …

InfoLM: A new metric to evaluate summarization & data2text generation

PJA Colombo, C Clavel, P Piantanida - Proceedings of the AAAI …, 2022 - ojs.aaai.org
Assessing the quality of natural language generation (NLG) systems through human
annotation is very expensive. Additionally, human annotation campaigns are time …

Perturbation CheckLists for evaluating NLG evaluation metrics

AB Sai, T Dixit, DY Sheth, S Mohan… - arXiv preprint arXiv …, 2021 - arxiv.org
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment
of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall …