Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Abstract: Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
FActScore: Fine-grained atomic evaluation of factual precision in long form text generation
Evaluating the factuality of long-form text generated by large language models (LMs) is non-
trivial because (1) generations often contain a mixture of supported and unsupported pieces …
News summarization and evaluation in the era of GPT-3
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …
TRUE: Re-evaluating factual consistency evaluation
Grounded text generation systems often generate text that contains factual inconsistencies,
hindering their real-world applicability. Automatic factual consistency evaluation may help …
Efficient methods for natural language processing: A survey
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …
QAFactEval: Improved QA-based factual consistency evaluation for summarization
Factual consistency is an essential quality of text summarization models in practical settings.
Existing work in evaluating this dimension can be broadly categorized into two lines of …
Improving faithfulness in abstractive summarization with contrast candidate generation and selection
Despite significant progress in neural abstractive summarization, recent studies have shown
that the current models are prone to generating summaries that are unfaithful to the original …
mFACE: Multilingual summarization with factual consistency evaluation
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-
trained language models and the availability of large-scale datasets. Despite promising …
Zero-shot faithful factual error correction
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge
bases and preventing hallucinations in sequence-to-sequence models. Drawing on humans' …
MENLI: Robust evaluation metrics from natural language inference
Recently proposed BERT-based evaluation metrics for text generation perform well on
standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information …