Evaluating large language models: A comprehensive survey
Large language models (LLMs) have demonstrated remarkable capabilities across a broad
spectrum of tasks. They have attracted significant attention and been deployed in numerous …
News summarization and evaluation in the era of GPT-3
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …
Evaluating large language models on medical evidence summarization
Recent advances in large language models (LLMs) have demonstrated remarkable
successes in zero- and few-shot performance on various downstream tasks, paving the way …
FELM: Benchmarking factuality evaluation of large language models
Assessing factuality of text generated by large language models (LLMs) is an emerging yet
crucial research area, aimed at alerting users to potential errors and guiding the …
What you see is what you read? Improving text-image alignment evaluation
Automatically determining whether a text and a corresponding image are semantically
aligned is a significant challenge for vision-language models, with applications in generative …
Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs
Natural Language Processing (NLP) research is increasingly focusing on the use of Large
Language Models (LLMs), with some of the most popular ones being either fully or partially …
Reading subtext: Evaluating large language models on short story summarization with writers
M Subbiah, S Zhang, LB Chilton… - Transactions of the …, 2024 - direct.mit.edu
We evaluate recent Large Language Models (LLMs) on the challenging task of
summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled …
mFACE: Multilingual summarization with factual consistency evaluation
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-
trained language models and the availability of large-scale datasets. Despite promising …
FactKB: Generalizable factuality evaluation using language models enhanced with factual knowledge
Evaluating the factual consistency of automatically generated summaries is essential for the
progress and adoption of reliable summarization systems. Despite recent advances, existing …
SUMMEDITS: measuring LLM ability at factual reasoning through the lens of summarization
With the recent appearance of LLMs in practical settings, having methods that can effectively
detect factual inconsistencies is crucial to reduce the propagation of misinformation and …