Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Language generation models can cause harm: So what can we do about it? An actionable survey

S Kumar, V Balachandran, L Njoo… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent advances in the capacity of large language models to generate human-like text have
resulted in their increased adoption in user-facing settings. In parallel, these improvements …

NL-Augmenter: A framework for task-sensitive natural language augmentation

KD Dhole, V Gangal, S Gehrmann, A Gupta, Z Li… - arXiv preprint arXiv …, 2021 - arxiv.org
Data augmentation is an important component in the robustness evaluation of models in
natural language processing (NLP) and in enhancing the diversity of the data they are …
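
As a rough illustration of the kind of perturbation such a framework collects, the sketch below applies a simple word-order noise transformation to a sentence; it is a hypothetical example, not NL-Augmenter's actual API.

```python
import random

def swap_adjacent_words(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly swap adjacent word pairs to create a noisy but readable variant."""
    rng = random.Random(seed)
    words = text.split()
    i = 0
    while i < len(words) - 1:
        if rng.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return " ".join(words)

print(swap_adjacent_words("Data augmentation improves robustness evaluation in NLP", rate=0.5))
```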

Bring your own data! Self-supervised evaluation for large language models

N Jain, K Saifullah, Y Wen, J Kirchenbauer… - arXiv preprint arXiv …, 2023 - arxiv.org
With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse
domains, measuring language model behavior on realistic data is imperative. For example …
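
A minimal sketch of the self-supervised idea as suggested by the abstract: score each input and a transformed copy with the model under test and measure the gap, with no gold labels required. The scoring function below is a hypothetical placeholder, not the paper's metric.

```python
def pseudo_log_likelihood(text: str) -> float:
    """Hypothetical placeholder for a real LM score (e.g., mean token log-likelihood)."""
    return -(sum((i + 1) * ord(c) for i, c in enumerate(text)) % 100) / 100.0

def word_order_shuffle(text: str) -> str:
    """Deterministic 'transformation' for the demo: reverse the word order."""
    return " ".join(reversed(text.split()))

corpus = [
    "The model answered the question correctly.",
    "Evaluation on realistic data is imperative.",
]

# Sensitivity of the score to the transformation, averaged over the user's own data;
# no labels are needed.
gaps = [abs(pseudo_log_likelihood(x) - pseudo_log_likelihood(word_order_shuffle(x)))
        for x in corpus]
print(f"mean sensitivity to word-order perturbation: {sum(gaps) / len(gaps):.3f}")
```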

Quantifying social biases using templates is unreliable

P Seshadri, P Pezeshkpour, S Singh - arXiv preprint arXiv:2210.04337, 2022 - arxiv.org
Recently, there has been an increase in efforts to understand how large language models
(LLMs) propagate and amplify social biases. Several works have utilized templates for …
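
The sketch below illustrates template-based probing and why it can be fragile: the same (person, occupation) probe is instantiated with several paraphrased templates and the spread of a model score across them is reported. All names, templates, and the scoring function are invented for illustration.

```python
from itertools import product

templates = [
    "{person} worked as a {occupation}.",
    "{person} had a job as a {occupation}.",      # paraphrase of the same probe
    "As a {occupation}, {person} was very busy.",
]
persons = ["Aisha", "John"]
occupations = ["nurse", "engineer"]

def score_sentiment(text: str) -> float:
    """Hypothetical stand-in for the model under test (e.g., a sentiment scorer)."""
    return (len(text) % 7) / 7.0  # dummy score, illustration only

for person, occupation in product(persons, occupations):
    scores = [score_sentiment(t.format(person=person, occupation=occupation))
              for t in templates]
    spread = max(scores) - min(scores)
    # A large spread means the bias estimate depends on the template wording.
    print(f"{person:6s}/{occupation:9s} score spread across templates: {spread:.2f}")
```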

Why only micro-F1? Class weighting of measures for relation classification

D Harbecke, Y Chen, L Hennig, C Alt - arXiv preprint arXiv:2205.09460, 2022 - arxiv.org
Relation classification models are conventionally evaluated using only a single measure,
e.g., micro-F1, macro-F1 or AUC. In this work, we analyze weighting schemes, such as micro …
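
A small worked example of the contrast the paper examines: on an imbalanced toy relation-classification output, micro-F1 (which pools counts over classes and, for single-label data, equals accuracy) rewards the frequent class, while macro-F1 averages per-class scores and exposes poor performance on the rare relation. The labels and predictions below are invented.

```python
def f1_for_class(y_true, y_pred, label):
    """Per-class F1 from raw true-positive, false-positive, and false-negative counts."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Invented, imbalanced predictions: "no_relation" dominates, "founded_by" is rare.
y_true = ["no_relation"] * 90 + ["founded_by"] * 10
y_pred = ["no_relation"] * 90 + ["no_relation"] * 9 + ["founded_by"]

labels = sorted(set(y_true))
macro_f1 = sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)
# For single-label classification, micro-F1 equals plain accuracy.
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"micro-F1 = {micro_f1:.2f}, macro-F1 = {macro_f1:.2f}")  # 0.91 vs. 0.57
```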

ReCode: Robustness evaluation of code generation models

S Wang, Z Li, H Qian, C Yang, Z Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Code generation models have achieved impressive performance. However, they tend to be
brittle as slight edits to a prompt could lead to very different generations; these robustness …
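
A minimal sketch of robustness evaluation by prompt perturbation, in the spirit of the abstract rather than the paper's actual transformations: apply small, semantics-preserving edits to a natural-language prompt and check whether the generated code still passes the tests. The generator below is a stub standing in for a real model.

```python
def perturb_prompt(prompt: str) -> list[str]:
    """Small, semantics-preserving edits to the natural-language prompt."""
    return [
        prompt,
        "Please " + prompt[0].lower() + prompt[1:],   # politeness prefix
        prompt.replace("list", "List"),               # casing change
        prompt + "  ",                                # trailing whitespace
    ]

def generate_code(prompt: str) -> str:
    """Stub for a real code-generation model call."""
    return ("def first_even(xs):\n"
            "    return next((x for x in xs if x % 2 == 0), None)")

def passes_tests(code: str) -> bool:
    namespace: dict = {}
    exec(code, namespace)                             # run the candidate solution
    return namespace["first_even"]([1, 3, 4, 5]) == 4

prompt = "Return the first even number in a list of integers."
results = [passes_tests(generate_code(p)) for p in perturb_prompt(prompt)]
print(f"robust accuracy: {sum(results)}/{len(results)} perturbed prompts passed")
```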

GEMv2: Multilingual NLG benchmarking in a single line of code

S Gehrmann, A Bhattacharjee, A Mahendiran… - arXiv preprint arXiv …, 2022 - arxiv.org
Evaluation in machine learning is usually informed by past choices, for example which
datasets or metrics to use. This standardization enables the comparison on equal footing …

Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition

M Geva, T Wolfson, J Berant - Transactions of the Association for …, 2022 - direct.mit.edu
Recent efforts to create challenge benchmarks that test the abilities of natural language
understanding models have largely depended on human annotations. In this work, we …
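
A hypothetical sketch of the idea suggested by the title: represent a question as a chain of decomposition steps, perturb one step (here, flipping a comparison), and recompose a variant whose expected answer changes predictably. The decomposition below is invented for illustration.

```python
decomposition = [
    "return countries",
    "return the population of #1",
    "return #1 where #2 is higher than 100 million",   # step to perturb
]

def perturb_comparison(steps: list[str]) -> list[str]:
    """Flip the comparison in the filtering step, changing the expected answer."""
    return [s.replace("higher than", "lower than") for s in steps]

perturbed = perturb_comparison(decomposition)
print("original :", decomposition[-1])
print("perturbed:", perturbed[-1])
```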

Measuring the measuring tools: An automatic evaluation of semantic metrics for text corpora

G Kour, S Ackerman, O Raz, E Farchi, B Carmeli… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to compare the semantic similarity between text corpora is important in a variety
of natural language processing applications. However, standard methods for evaluating …
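
As an assumed illustration of corpus-level comparison (not the paper's method), the sketch below represents each corpus by a normalized bag-of-words vector and reports the cosine similarity between the two; the embedding-based metrics the paper evaluates would replace the word-count vectors with model representations.

```python
import math
from collections import Counter

def corpus_vector(corpus: list[str]) -> Counter:
    """Normalized bag-of-words frequencies for a whole corpus."""
    counts: Counter = Counter()
    for doc in corpus:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return Counter({w: c / total for w, c in counts.items()})

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus_a = ["the cat sat on the mat", "a dog barked at the cat"]
corpus_b = ["the kitten rested on the rug", "a puppy barked at the kitten"]
print(f"corpus similarity: {cosine(corpus_vector(corpus_a), corpus_vector(corpus_b)):.2f}")
```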