On the limitations of reference-free evaluations of generated text

Efficient methods for natural language processing: A survey

M Treviso, JU Lee, T Ji, B Aken, Q Cao… - Transactions of the …, 2023 - direct.mit.edu

Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …

Cited by 81 Related articles All 10 versions

MENLI: Robust evaluation metrics from natural language inference

Y Chen, S Eger - Transactions of the Association for Computational …, 2023 - direct.mit.edu

Recently proposed BERT-based evaluation metrics for text generation perform well on
standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information …

Cited by 32 Related articles All 8 versions

LLMs as narcissistic evaluators: When ego inflates evaluation scores

Y Liu, NS Moosavi, C Lin - arXiv preprint arXiv:2311.09766, 2023 - arxiv.org

Automatic evaluation of generated textual content presents an ongoing challenge within the
field of NLP. Given the impressive capabilities of modern language models (LMs) across …

Cited by 12 Related articles All 3 versions

RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue

Z Shi, W Sun, S Zhang, Z Zhang, P Ren… - arXiv preprint arXiv …, 2023 - arxiv.org

Evaluating open-domain dialogue systems is challenging for reasons such as the one-to-many
problem, i.e., many appropriate responses other than just the golden response. As of …

Cited by 7 Related articles All 7 versions

CLEME: Debiasing multi-reference evaluation for grammatical error correction

J Ye, Y Li, Q Zhou, Y Li, S Ma, HT Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org

Evaluating the performance of Grammatical Error Correction (GEC) systems is a challenging
task due to its subjectivity. Designing an evaluation metric that is as objective as possible is …

Cited by 14 Related articles All 4 versions

Aligning neural machine translation models: Human feedback in training and inference

MM Ramos, P Fernandes, A Farinhas… - arXiv preprint arXiv …, 2023 - arxiv.org

Reinforcement learning from human feedback (RLHF) is a recent technique to improve the
quality of the text generated by a language model, making it closer to what humans would …

Cited by 5 Related articles

Evaluation metrics on text summarization: comprehensive survey

E Davoodijam, M Alambardar Meybodi - Knowledge and Information …, 2024 - Springer

Automatic text summarization is the process of shortening a large document into a summary
text that preserves the main concepts and key points of the original document. Due to the …

Large language models are inconsistent and biased evaluators

R Stureborg, D Alikaniotis, Y Suhara - arXiv preprint arXiv:2405.01724, 2024 - arxiv.org

The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible,
reference-free metrics for various tasks, making LLM evaluators common tools in NLP …

Cited by 13 Related articles All 2 versions

Recent Advances in Generative AI and Large Language Models: Current Status, Challenges, and Perspectives

DH Hagos, R Battle, DB Rawat - IEEE Transactions on Artificial …, 2024 - ieeexplore.ieee.org

The emergence of Generative Artificial Intelligence (AI) and Large Language Models (LLMs)
has marked a new era of Natural Language Processing (NLP), introducing unprecedented …

ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications

S Takeshita, T Green, I Reinig, K Eckert… - arXiv preprint arXiv …, 2024 - arxiv.org

Extensive efforts in the past have been directed toward the development of summarization
datasets. However, a predominant number of these resources have been (semi) …

Cited by 4 Related articles All 3 versions