Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
TrustLLM: Trustworthiness in large language models
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …
Position: TrustLLM: Trustworthiness in large language models
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …
Zero-shot faithful factual error correction
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge
bases and preventing hallucinations in sequence-to-sequence models. Drawing on humans' …
FactKB: Generalizable factuality evaluation using language models enhanced with factual knowledge
Evaluating the factual consistency of automatically generated summaries is essential for the
progress and adoption of reliable summarization systems. Despite recent advances, existing …
A meta-evaluation of faithfulness metrics for long-form hospital-course summarization
Long-form clinical summarization of hospital admissions has real-world significance
because of its potential to help both clinicians and patients. The factual consistency of …
Faithfulness-aware decoding strategies for abstractive summarization
Despite significant progress in understanding and improving faithfulness in abstractive
summarization, the question of how decoding strategies affect faithfulness is less studied …
How Far are We from Robust Long Abstractive Summarization?
Abstractive summarization has made tremendous progress in recent years. In this work, we
perform fine-grained human annotations to evaluate long document abstractive …
Interpretable automatic fine-grained inconsistency detection in text summarization
Existing factual consistency evaluation approaches for text summarization provide binary
predictions and limited insights into the weakness of summarization systems. Therefore, we …
Evaluate AMR graph similarity via self-supervised learning
In work on AMR (Abstract Meaning Representation), similarity metrics are crucial as they are
used to evaluate AMR systems such as AMR parsers. Current AMR metrics are all based on …