Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
A survey of evaluation metrics used for NLG systems
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …
Efficient methods for natural language processing: A survey
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …
MENLI: Robust evaluation metrics from natural language inference
Recently proposed BERT-based evaluation metrics for text generation perform well on
standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information …
DiscoScore: Evaluating text generation with BERT and discourse coherence
Recently, there has been a growing interest in designing text generation systems from a
discourse coherence perspective, e.g., modeling the interdependence between sentences …
DEMETR: Diagnosing evaluation metrics for translation
While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their
limitations, their computations are transparent: the BLEU score assigned to a particular …
Towards explainable evaluation metrics for machine translation
Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for
machine translation (for example, COMET or BERTScore) are based on black-box large …
Multi-Objective Hyperparameter Optimization: An Overview
Hyperparameter optimization constitutes a large part of typical modern machine learning
workflows. This arises from the fact that machine learning methods and corresponding …
IndicMT eval: A dataset to meta-evaluate machine translation metrics for Indian languages
The rapid growth of machine translation (MT) systems necessitates meta-evaluations of
evaluation metrics to enable selection of those that best reflect MT quality. Unfortunately …
BLEURT has universal translations: An analysis of automatic metrics by minimum risk training
Automatic metrics play a crucial role in machine translation. Despite the widespread use of n-
gram-based metrics, there has been a recent surge in the development of pre-trained model …