QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …

Benchmarking foundation models with language-model-as-an-examiner

Y Bai, J Ying, Y Cao, X Lv, Y He… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
Numerous benchmarks have been established to assess the performance of foundation
models on open-ended question answering, which serves as a comprehensive test of a …

Evaluating open-domain question answering in the era of large language models

E Kamalloo, N Dziri, CLA Clarke, D Rafiei - arXiv preprint arXiv …, 2023 - arxiv.org
Lexical matching remains the de facto evaluation method for open-domain question
answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate …
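
For context, "lexical matching" in this literature usually means SQuAD-style exact match: normalize the prediction and each gold answer, then compare the strings verbatim. A minimal Python sketch of that scoring (function names are illustrative, not from the cited paper):

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Prediction counts as correct if it lexically matches any gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

# A correct paraphrase scores 0 under this metric -- the failure mode
# the entry above describes.
print(exact_match("JFK", ["John F. Kennedy"]))  # False
```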

QAFactEval: Improved QA-based factual consistency evaluation for summarization

AR Fabbri, CS Wu, W Liu, C Xiong - arXiv preprint arXiv:2112.08542, 2021 - arxiv.org
Factual consistency is an essential quality of text summarization models in practical settings.
Existing work in evaluating this dimension can be broadly categorized into two lines of …

CrossFit: A few-shot learning challenge for cross-task generalization in NLP

Q Ye, BY Lin, X Ren - arXiv preprint arXiv:2104.08835, 2021 - arxiv.org
Humans can learn a new language task efficiently with only a few examples, by leveraging
their knowledge obtained when learning prior tasks. In this paper, we explore whether and …

A critical evaluation of evaluations for long-form question answering

F Xu, Y Song, M Iyyer, E Choi - arXiv preprint arXiv:2305.18201, 2023 - arxiv.org
Long-form question answering (LFQA) enables answering a wide range of questions, but its
flexibility poses enormous challenges for evaluation. We perform the first targeted study of …

Faithfulness in natural language generation: A systematic survey of analysis, evaluation and optimization methods

W Li, W Wu, M Chen, J Liu, X Xiao, H Wu - arXiv preprint arXiv:2203.05227, 2022 - arxiv.org
Natural Language Generation (NLG) has made great progress in recent years due to the
development of deep learning techniques such as pre-trained language models. This …

Neural natural language generation: A survey on multilinguality, multimodality, controllability and learning

E Erdem, M Kuyu, S Yagcioglu, A Frank… - Journal of Artificial Intelligence Research, 2022 - jair.org
Developing artificial learning systems that can understand and generate natural language
has been one of the long-standing goals of artificial intelligence. Recent decades have …

Towards question-answering as an automatic metric for evaluating the content quality of a summary

D Deutsch, T Bedrax-Weiss, D Roth - Transactions of the Association for Computational Linguistics, 2021 - direct.mit.edu
A desirable property of a reference-based evaluation metric that measures the content
quality of a summary is that it should estimate how much information that summary has in …
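
The metric family this entry studies works by asking questions derived from the reference and checking whether the candidate summary answers them the same way. A structural sketch only; `generate_qa_pairs` and `answer` are hypothetical stand-ins for the learned question-generation and QA models such systems use, not the paper's API:

```python
from typing import Callable

def qa_content_score(
    reference: str,
    summary: str,
    generate_qa_pairs: Callable[[str], list[tuple[str, str]]],
    answer: Callable[[str, str], str],
) -> float:
    """Fraction of reference-derived questions the summary answers correctly.

    Hypothetical skeleton: real systems use learned QG/QA models and
    softer answer comparison than exact string equality.
    """
    qa_pairs = generate_qa_pairs(reference)  # (question, gold answer) pairs
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer(q, summary).strip().lower() == gold.strip().lower()
        for q, gold in qa_pairs
    )
    return correct / len(qa_pairs)
```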

Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation

J Bulian, C Buck, W Gajewski, B Boerschinger… - arXiv preprint arXiv …, 2022 - arxiv.org
The predictions of question answering (QA) systems are typically evaluated against
manually annotated finite sets of one or more answers. This leads to a coverage limitation …
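
For concreteness, the token-level scoring this entry moves beyond is usually bag-of-tokens F1 against the closest gold answer, as in SQuAD evaluation. A minimal self-contained sketch under the same normalization assumptions as above; function names are illustrative:

```python
import re
import string
from collections import Counter

def _normalize(text: str) -> str:
    # Same SQuAD-style normalization as the earlier sketch.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(re.sub(r"\b(a|an|the)\b", " ", text).split())

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and one gold answer."""
    pred_tokens = _normalize(prediction).split()
    gold_tokens = _normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, gold_answers: list[str]) -> float:
    """Score against the closest gold answer in the annotated set."""
    return max(token_f1(prediction, g) for g in gold_answers)
```

A semantically equivalent answer phrased with different tokens still scores poorly here, which is the coverage limitation the entry above points to.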