Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu, Q Zhang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …

Position: TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu… - International …, 2024 - proceedings.mlr.press
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …

Zero-shot faithful factual error correction

KH Huang, HP Chan, H Ji - arXiv preprint arXiv:2305.07982, 2023 - arxiv.org
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge
bases and preventing hallucinations in sequence-to-sequence models. Drawing on humans' …

FactKB: Generalizable factuality evaluation using language models enhanced with factual knowledge

S Feng, V Balachandran, Y Bai, Y Tsvetkov - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factual consistency of automatically generated summaries is essential for the
progress and adoption of reliable summarization systems. Despite recent advances, existing …

A meta-evaluation of faithfulness metrics for long-form hospital-course summarization

G Adams, J Zucker, N Elhadad - Machine Learning for …, 2023 - proceedings.mlr.press
Long-form clinical summarization of hospital admissions has real-world significance
because of its potential to help both clinicians and patients. The factual consistency of …

Faithfulness-aware decoding strategies for abstractive summarization

D Wan, M Liu, K McKeown, M Dreyer… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite significant progress in understanding and improving faithfulness in abstractive
summarization, the question of how decoding strategies affect faithfulness is less studied …

How Far are We from Robust Long Abstractive Summarization?

HY Koh, J Ju, H Zhang, M Liu, S Pan - arXiv preprint arXiv:2210.16732, 2022 - arxiv.org
Abstractive summarization has made tremendous progress in recent years. In this work, we
perform fine-grained human annotations to evaluate long document abstractive …

Interpretable automatic fine-grained inconsistency detection in text summarization

HP Chan, Q Zeng, H Ji - arXiv preprint arXiv:2305.14548, 2023 - arxiv.org
Existing factual consistency evaluation approaches for text summarization provide binary
predictions and limited insights into the weakness of summarization systems. Therefore, we …

Evaluate AMR graph similarity via self-supervised learning

Z Shou, F Lin - Proceedings of the 61st Annual Meeting of the …, 2023 - aclanthology.org
In work on AMR (Abstract Meaning Representation), similarity metrics are crucial as they are
used to evaluate AMR systems such as AMR parsers. Current AMR metrics are all based on …