Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
Language generation models can cause harm: So what can we do about it? An actionable survey
Recent advances in the capacity of large language models to generate human-like text have
resulted in their increased adoption in user-facing settings. In parallel, these improvements …
NL-Augmenter: A framework for task-sensitive natural language augmentation
Data augmentation is an important component in the robustness evaluation of models in
natural language processing (NLP) and in enhancing the diversity of the data they are …
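For a sense of what such augmentations look like, here is a minimal sketch of a typo-style perturbation in the spirit of the transformations NL-Augmenter collects (this is not the NL-Augmenter API; the function below is a hypothetical stand-in):

```python
import random

def swap_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters inside words to simulate typos, a common
    robustness-evaluation augmentation (illustrative only)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(swap_typos("Data augmentation improves robustness evaluation."))
```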
Bring your own data! Self-supervised evaluation for large language models
With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse
domains, measuring language model behavior on realistic data is imperative. For example …
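The core idea can be illustrated with a minimal sketch: apply a meaning-preserving transformation to unlabeled text and measure how much the model's score shifts, with no gold labels required. The `perplexity` helper below is a hypothetical stand-in for a real language-model call, not the paper's implementation:

```python
def perplexity(text: str) -> float:
    # Placeholder for a real language-model scoring call (assumption);
    # a deterministic toy value keeps the sketch self-contained.
    return 1.0 + (sum(map(ord, text)) % 50) / 10

def sensitivity(original: str, transform) -> float:
    # Relative change in the score under a meaning-preserving edit;
    # a robust model should change little.
    p0, p1 = perplexity(original), perplexity(transform(original))
    return abs(p1 - p0) / p0

corpus = ["The model was deployed in production last week."]
uppercase = lambda s: s.upper()  # one simple invariance probe
print(sum(sensitivity(doc, uppercase) for doc in corpus) / len(corpus))
```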
Quantifying social biases using templates is unreliable
Recently, there has been an increase in efforts to understand how large language models
(LLMs) propagate and amplify social biases. Several works have utilized templates for …
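A minimal sketch of template-based probing shows why this can be fragile: the measured gap depends on the exact template wording. `lm_score` below is a hypothetical stand-in for a real model scorer such as pseudo-log-likelihood:

```python
TEMPLATES = [
    "The {group} person worked as a doctor.",
    "A {group} individual was employed as a doctor.",  # paraphrase of the same probe
]
GROUPS = ["young", "old"]

def lm_score(sentence: str) -> float:
    # Placeholder for a real LM scorer (assumption); a deterministic
    # toy value keeps the sketch runnable.
    return (sum(map(ord, sentence)) % 100) / 100

for template in TEMPLATES:
    scores = {g: lm_score(template.format(group=g)) for g in GROUPS}
    gap = scores[GROUPS[0]] - scores[GROUPS[1]]
    print(f"{template!r}: bias gap = {gap:+.2f}")
# The instability the paper quantifies shows up as gaps that vary with
# the paraphrase even though the probe's meaning is unchanged.
```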
Why only micro-F1? Class weighting of measures for relation classification
Relation classification models are conventionally evaluated using only a single measure,
e.g., micro-F1, macro-F1, or AUC. In this work, we analyze weighting schemes, such as micro …
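To see why the choice of measure matters, the following sketch (toy predictions, scikit-learn) shows micro- and macro-F1 diverging on an imbalanced label distribution of the kind common in relation classification:

```python
from sklearn.metrics import f1_score

# Toy relation-classification output: class 0 dominates, class 2 is rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]  # rare classes mostly missed

# Micro-F1 pools all decisions, so the majority class dominates the score.
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))  # 0.70
# Macro-F1 averages per-class F1, so failures on rare classes weigh heavily.
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.49
```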
ReCode: Robustness evaluation of code generation models
Code generation models have achieved impressive performance. However, they tend to be
brittle as slight edits to a prompt could lead to very different generations; these robustness …
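The robustness check itself is simple to sketch: perturb the prompt in a meaning-preserving way and test whether the generation still passes the same tests. `generate` below is a hypothetical stand-in for a real code model, not the ReCode harness:

```python
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
perturbed = prompt.replace("Return the sum of", "Compute the total of")

def generate(p: str) -> str:
    # Placeholder: a real model would complete the function body.
    return p + "    return a + b\n"

def passes_tests(src: str) -> bool:
    ns: dict = {}
    exec(src, ns)  # run the candidate completion
    return ns["add"](2, 3) == 5

robust = passes_tests(generate(prompt)) and passes_tests(generate(perturbed))
print("robust under this perturbation:", robust)
```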
GEMv2: Multilingual NLG benchmarking in a single line of code
Evaluation in machine learning is usually informed by past choices, for example which
datasets or metrics to use. This standardization enables comparison on an equal footing …
Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition
Recent efforts to create challenge benchmarks that test the abilities of natural language
understanding models have largely depended on human annotations. In this work, we …
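A toy sketch of the idea: represent a question's decomposition as explicit steps, perturb one step, and update the gold answer accordingly (the representation below is a simplification for illustration, not the paper's format):

```python
steps = ["find the population of each city",
         "select the city with the MAX population"]
answer = "Tokyo"

# Perturb: flip the aggregation in the final step, which changes the answer.
perturbed_steps = [s.replace("MAX", "MIN") for s in steps]
perturbed_answer = "Vatican City"  # new gold answer implied by the flipped step

print(" -> ".join(perturbed_steps), "| answer:", perturbed_answer)
```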
Measuring the measuring tools: An automatic evaluation of semantic metrics for text corpora
The ability to compare the semantic similarity between text corpora is important in a variety
of natural language processing applications. However, standard methods for evaluating …
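One simple corpus-level metric of the kind such a meta-evaluation covers is cosine similarity between mean document embeddings; in the sketch below, `embed` is a hypothetical stand-in for a real sentence encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(doc: str) -> np.ndarray:
    # Placeholder: a real system would use a sentence encoder here;
    # random vectors keep the sketch runnable.
    return rng.normal(size=16)

def corpus_vec(corpus: list[str]) -> np.ndarray:
    # Represent a corpus by the mean of its document embeddings.
    return np.mean([embed(d) for d in corpus], axis=0)

def corpus_similarity(a: list[str], b: list[str]) -> float:
    va, vb = corpus_vec(a), corpus_vec(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(corpus_similarity(["a news article", "a blog post"], ["two", "tweets"]))
```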