Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Abstract Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

S Min, K Krishna, X Lyu, M Lewis, W Yih… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factuality of long-form text generated by large language models (LMs) is non-
trivial because (1) generations often contain a mixture of supported and unsupported pieces …

News summarization and evaluation in the era of GPT-3

T Goyal, JJ Li, G Durrett - arXiv preprint arXiv:2209.12356, 2022 - arxiv.org
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …

TRUE: Re-evaluating factual consistency evaluation

O Honovich, R Aharoni, J Herzig, H Taitelbaum… - arXiv preprint arXiv …, 2022 - arxiv.org
Grounded text generation systems often generate text that contains factual inconsistencies,
hindering their real-world applicability. Automatic factual consistency evaluation may help …

Efficient methods for natural language processing: A survey

M Treviso, JU Lee, T Ji, B van Aken, Q Cao… - Transactions of the …, 2023 - direct.mit.edu
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …

QAFactEval: Improved QA-based factual consistency evaluation for summarization

AR Fabbri, CS Wu, W Liu, C Xiong - arXiv preprint arXiv:2112.08542, 2021 - arxiv.org
Factual consistency is an essential quality of text summarization models in practical settings.
Existing work in evaluating this dimension can be broadly categorized into two lines of …

Improving faithfulness in abstractive summarization with contrast candidate generation and selection

S Chen, F Zhang, K Sone, D Roth - arXiv preprint arXiv:2104.09061, 2021 - arxiv.org
Despite significant progress in neural abstractive summarization, recent studies have shown
that the current models are prone to generating summaries that are unfaithful to the original …

mFACE: Multilingual summarization with factual consistency evaluation

R Aharoni, S Narayan, J Maynez, J Herzig… - arXiv preprint arXiv …, 2022 - arxiv.org
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-
trained language models and the availability of large-scale datasets. Despite promising …

Zero-shot faithful factual error correction

KH Huang, HP Chan, H Ji - arXiv preprint arXiv:2305.07982, 2023 - arxiv.org
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge
bases and preventing hallucinations in sequence-to-sequence models. Drawing on humans' …

MENLI: Robust evaluation metrics from natural language inference

Y Chen, S Eger - Transactions of the Association for Computational …, 2023 - direct.mit.edu
Recently proposed BERT-based evaluation metrics for text generation perform well on
standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information …