Evaluating large language models: A comprehensive survey

Z Guo, R Jin, C Liu, Y Huang, D Shi, L Yu, Y Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated remarkable capabilities across a broad
spectrum of tasks. They have attracted significant attention and been deployed in numerous …

News summarization and evaluation in the era of GPT-3

T Goyal, JJ Li, G Durrett - arXiv preprint arXiv:2209.12356, 2022 - arxiv.org
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …

Evaluating large language models on medical evidence summarization

L Tang, Z Sun, B Idnay, JG Nestor, A Soroush… - NPJ digital …, 2023 - nature.com
Recent advances in large language models (LLMs) have demonstrated remarkable
successes in zero- and few-shot performance on various downstream tasks, paving the way …

FELM: Benchmarking factuality evaluation of large language models

Y Zhao, J Zhang, I Chern, S Gao… - Advances in Neural …, 2024 - proceedings.neurips.cc
Assessing factuality of text generated by large language models (LLMs) is an emerging yet
crucial research area, aimed at alerting users to potential errors and guiding the …

What you see is what you read? Improving text-image alignment evaluation

M Yarom, Y Bitton, S Changpinyo… - Advances in …, 2024 - proceedings.neurips.cc
Automatically determining whether a text and a corresponding image are semantically
aligned is a significant challenge for vision-language models, with applications in generative …

Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs

S Balloccu, P Schmidtová, M Lango… - arXiv preprint arXiv …, 2024 - arxiv.org
Natural Language Processing (NLP) research is increasingly focusing on the use of Large
Language Models (LLMs), with some of the most popular ones being either fully or partially …

Reading subtext: Evaluating large language models on short story summarization with writers

M Subbiah, S Zhang, LB Chilton… - Transactions of the …, 2024 - direct.mit.edu
We evaluate recent Large Language Models (LLMs) on the challenging task of
summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled …

mFACE: Multilingual summarization with factual consistency evaluation

R Aharoni, S Narayan, J Maynez, J Herzig… - arXiv preprint arXiv …, 2022 - arxiv.org
Abstractive summarization has enjoyed renewed interest in recent years, thanks to
pre-trained language models and the availability of large-scale datasets. Despite promising …

FactKB: Generalizable factuality evaluation using language models enhanced with factual knowledge

S Feng, V Balachandran, Y Bai, Y Tsvetkov - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factual consistency of automatically generated summaries is essential for the
progress and adoption of reliable summarization systems. Despite recent advances, existing …

SUMMEDITS: measuring LLM ability at factual reasoning through the lens of summarization

P Laban, W Kryściński, D Agarwal… - Proceedings of the …, 2023 - aclanthology.org
With the recent appearance of LLMs in practical settings, having methods that can effectively
detect factual inconsistencies is crucial to reduce the propagation of misinformation and …