QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …

Benchmarking foundation models with language-model-as-an-examiner

Y Bai, J Ying, Y Cao, X Lv, Y He… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
Numerous benchmarks have been established to assess the performance of foundation
models on open-ended question answering, which serves as a comprehensive test of a …

Evaluating open-domain question answering in the era of large language models

E Kamalloo, N Dziri, CLA Clarke, D Rafiei - arXiv preprint arXiv …, 2023 - arxiv.org
Lexical matching remains the de facto evaluation method for open-domain question
answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate …
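
For context, "lexical matching" in this literature usually means SQuAD-style exact match: normalize the prediction and each gold answer, then compare the strings verbatim. A minimal Python sketch of that scoring (function names are illustrative, not from the cited paper):

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Prediction counts as correct if it lexically matches any gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

# A correct paraphrase scores 0 under this metric -- the failure mode
# the entry above describes.
print(exact_match("JFK", ["John F. Kennedy"]))  # False
```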

QAFactEval: Improved QA-based factual consistency evaluation for summarization

AR Fabbri, CS Wu, W Liu, C Xiong - arXiv preprint arXiv:2112.08542, 2021 - arxiv.org
Factual consistency is an essential quality of text summarization models in practical settings.
Existing work in evaluating this dimension can be broadly categorized into two lines of …

CrossFit: A few-shot learning challenge for cross-task generalization in NLP

Q Ye, BY Lin, X Ren - arXiv preprint arXiv:2104.08835, 2021 - arxiv.org
Humans can learn a new language task efficiently with only a few examples, by leveraging
their knowledge obtained when learning prior tasks. In this paper, we explore whether and …

A critical evaluation of evaluations for long-form question answering

F Xu, Y Song, M Iyyer, E Choi - arXiv preprint arXiv:2305.18201, 2023 - arxiv.org
Long-form question answering (LFQA) enables answering a wide range of questions, but its
flexibility poses enormous challenges for evaluation. We perform the first targeted study of …

Faithfulness in natural language generation: A systematic survey of analysis, evaluation and optimization methods

W Li, W Wu, M Chen, J Liu, X Xiao, H Wu - arXiv preprint arXiv:2203.05227, 2022 - arxiv.org
Natural Language Generation (NLG) has made great progress in recent years due to the
development of deep learning techniques such as pre-trained language models. This …

Neural natural language generation: A survey on multilinguality, multimodality, controllability and learning

E Erdem, M Kuyu, S Yagcioglu, A Frank… - Journal of Artificial Intelligence Research, 2022 - jair.org
Developing artificial learning systems that can understand and generate natural language
has been one of the long-standing goals of artificial intelligence. Recent decades have …

Towards question-answering as an automatic metric for evaluating the content quality of a summary

D Deutsch, T Bedrax-Weiss, D Roth - Transactions of the Association for Computational Linguistics, 2021 - direct.mit.edu
A desirable property of a reference-based evaluation metric that measures the content
quality of a summary is that it should estimate how much information that summary has in …
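
The metric family this entry studies works by asking questions derived from the reference and checking whether the candidate summary answers them the same way. A structural sketch only; `generate_qa_pairs` and `answer` are hypothetical stand-ins for the learned question-generation and QA models such systems use, not the paper's API:

```python
from typing import Callable

def qa_content_score(
    reference: str,
    summary: str,
    generate_qa_pairs: Callable[[str], list[tuple[str, str]]],
    answer: Callable[[str, str], str],
) -> float:
    """Fraction of reference-derived questions the summary answers correctly.

    Hypothetical skeleton: real systems use learned QG/QA models and
    softer answer comparison than exact string equality.
    """
    qa_pairs = generate_qa_pairs(reference)  # (question, gold answer) pairs
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer(q, summary).strip().lower() == gold.strip().lower()
        for q, gold in qa_pairs
    )
    return correct / len(qa_pairs)
```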

Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation

J Bulian, C Buck, W Gajewski, B Boerschinger… - arXiv preprint arXiv …, 2022 - arxiv.org
The predictions of question answering (QA) systems are typically evaluated against
manually annotated finite sets of one or more answers. This leads to a coverage limitation …
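
For concreteness, the token-level scoring this entry moves beyond is usually bag-of-tokens F1 against the closest gold answer, as in SQuAD evaluation. A minimal self-contained sketch under the same normalization assumptions as above; function names are illustrative:

```python
import re
import string
from collections import Counter

def _normalize(text: str) -> str:
    # Same SQuAD-style normalization as the earlier sketch.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(re.sub(r"\b(a|an|the)\b", " ", text).split())

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and one gold answer."""
    pred_tokens = _normalize(prediction).split()
    gold_tokens = _normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, gold_answers: list[str]) -> float:
    """Score against the closest gold answer in the annotated set."""
    return max(token_f1(prediction, g) for g in gold_answers)
```

A semantically equivalent answer phrased with different tokens still scores poorly here, which is the coverage limitation the entry above points to.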