Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Abstract Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Holistic evaluation of language models

P Liang, R Bommasani, T Lee, D Tsipras… - arXiv preprint arXiv …, 2022 - arxiv.org
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …

Benchmarking large language models for news summarization

T Zhang, F Ladhak, E Durmus, P Liang… - Transactions of the …, 2024 - direct.mit.edu
Large language models (LLMs) have shown promise for automatic summarization but the
reasons behind their successes are poorly understood. By conducting a human evaluation …

FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

S Min, K Krishna, X Lyu, M Lewis, W Yih… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factuality of long-form text generated by large language models (LMs) is non-
trivial because (1) generations often contain a mixture of supported and unsupported pieces …

News summarization and evaluation in the era of GPT-3

T Goyal, JJ Li, G Durrett - arXiv preprint arXiv:2209.12356, 2022 - arxiv.org
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …

FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios

I Chern, S Chern, S Chen, W Yuan, K Feng… - arXiv preprint arXiv …, 2023 - arxiv.org
The emergence of generative pre-trained models has facilitated the synthesis of high-quality
text, but it has also posed challenges in identifying factual errors in the generated text. In …

A critical evaluation of evaluations for long-form question answering

F Xu, Y Song, M Iyyer, E Choi - arXiv preprint arXiv:2305.18201, 2023 - arxiv.org
Long-form question answering (LFQA) enables answering a wide range of questions, but its
flexibility poses enormous challenges for evaluation. We perform the first targeted study of …

Large language model alignment: A survey

T Shen, R Jin, Y Huang, C Liu, W Dong, Z Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent years have witnessed remarkable progress made in large language models (LLMs).
Such advancements, while garnering significant attention, have concurrently elicited various …

AlignScore: Evaluating factual consistency with a unified alignment function

Y Zha, Y Yang, R Li, Z Hu - arXiv preprint arXiv:2305.16739, 2023 - arxiv.org
Many text generation applications require the generated text to be factually consistent with
input information. Automatic evaluation of factual consistency is challenging. Previous work …

LongEval: Guidelines for human evaluation of faithfulness in long-form summarization

K Krishna, E Bransom, B Kuehl, M Iyyer… - arXiv preprint arXiv …, 2023 - arxiv.org
While human evaluation remains best practice for accurately judging the faithfulness of
automatically-generated summaries, few solutions exist to address the increased difficulty …