Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
Language generation models can cause harm: So what can we do about it? An actionable survey
Recent advances in the capacity of large language models to generate human-like text have
resulted in their increased adoption in user-facing settings. In parallel, these improvements …
NL-Augmenter: A framework for task-sensitive natural language augmentation
Data augmentation is an important component in the robustness evaluation of models in
natural language processing (NLP) and in enhancing the diversity of the data they are …
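For a sense of what such augmentations look like, here is a minimal sketch of a typo-style perturbation in the spirit of the transformations NL-Augmenter collects (this is not the NL-Augmenter API; the function below is a hypothetical stand-in):

```python
import random

def swap_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters inside words to simulate typos, a common
    robustness-evaluation augmentation (illustrative only)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(swap_typos("Data augmentation improves robustness evaluation."))
```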
Bring your own data! Self-supervised evaluation for large language models
With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse
domains, measuring language model behavior on realistic data is imperative. For example …
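The core idea can be illustrated with a minimal sketch: apply a meaning-preserving transformation to unlabeled text and measure how much the model's score shifts, with no gold labels required. The `perplexity` helper below is a hypothetical stand-in for a real language-model call, not the paper's implementation:

```python
def perplexity(text: str) -> float:
    # Placeholder for a real language-model scoring call (assumption);
    # a deterministic toy value keeps the sketch self-contained.
    return 1.0 + (sum(map(ord, text)) % 50) / 10

def sensitivity(original: str, transform) -> float:
    # Relative change in the score under a meaning-preserving edit;
    # a robust model should change little.
    p0, p1 = perplexity(original), perplexity(transform(original))
    return abs(p1 - p0) / p0

corpus = ["The model was deployed in production last week."]
uppercase = lambda s: s.upper()  # one simple invariance probe
print(sum(sensitivity(doc, uppercase) for doc in corpus) / len(corpus))
```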
Quantifying social biases using templates is unreliable
Recently, there has been an increase in efforts to understand how large language models
(LLMs) propagate and amplify social biases. Several works have utilized templates for …
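A minimal sketch of template-based probing shows why this can be fragile: the measured gap depends on the exact template wording. `lm_score` below is a hypothetical stand-in for a real model scorer such as pseudo-log-likelihood:

```python
TEMPLATES = [
    "The {group} person worked as a doctor.",
    "A {group} individual was employed as a doctor.",  # paraphrase of the same probe
]
GROUPS = ["young", "old"]

def lm_score(sentence: str) -> float:
    # Placeholder for a real LM scorer (assumption); a deterministic
    # toy value keeps the sketch runnable.
    return (sum(map(ord, sentence)) % 100) / 100

for template in TEMPLATES:
    scores = {g: lm_score(template.format(group=g)) for g in GROUPS}
    gap = scores[GROUPS[0]] - scores[GROUPS[1]]
    print(f"{template!r}: bias gap = {gap:+.2f}")
# The instability the paper quantifies shows up as gaps that vary with
# the paraphrase even though the probe's meaning is unchanged.
```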
Why only micro-F1? Class weighting of measures for relation classification
Relation classification models are conventionally evaluated using only a single measure,
e.g., micro-F1, macro-F1, or AUC. In this work, we analyze weighting schemes, such as micro …
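To see why the choice of measure matters, the following sketch (toy predictions, scikit-learn) shows micro- and macro-F1 diverging on an imbalanced label distribution of the kind common in relation classification:

```python
from sklearn.metrics import f1_score

# Toy relation-classification output: class 0 dominates, class 2 is rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]  # rare classes mostly missed

# Micro-F1 pools all decisions, so the majority class dominates the score.
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))  # 0.70
# Macro-F1 averages per-class F1, so failures on rare classes weigh heavily.
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.49
```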
ReCode: Robustness evaluation of code generation models
Code generation models have achieved impressive performance. However, they tend to be
brittle as slight edits to a prompt could lead to very different generations; these robustness …
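The robustness check itself is simple to sketch: perturb the prompt in a meaning-preserving way and test whether the generation still passes the same tests. `generate` below is a hypothetical stand-in for a real code model, not the ReCode harness:

```python
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
perturbed = prompt.replace("Return the sum of", "Compute the total of")

def generate(p: str) -> str:
    # Placeholder: a real model would complete the function body.
    return p + "    return a + b\n"

def passes_tests(src: str) -> bool:
    ns: dict = {}
    exec(src, ns)  # run the candidate completion
    return ns["add"](2, 3) == 5

robust = passes_tests(generate(prompt)) and passes_tests(generate(perturbed))
print("robust under this perturbation:", robust)
```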
GEMv2: Multilingual NLG benchmarking in a single line of code
Evaluation in machine learning is usually informed by past choices, for example which
datasets or metrics to use. This standardization enables comparison on an equal footing …
Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition
Recent efforts to create challenge benchmarks that test the abilities of natural language
understanding models have largely depended on human annotations. In this work, we …
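A toy sketch of the idea: represent a question's decomposition as explicit steps, perturb one step, and update the gold answer accordingly (the representation below is a simplification for illustration, not the paper's format):

```python
steps = ["find the population of each city",
         "select the city with the MAX population"]
answer = "Tokyo"

# Perturb: flip the aggregation in the final step, which changes the answer.
perturbed_steps = [s.replace("MAX", "MIN") for s in steps]
perturbed_answer = "Vatican City"  # new gold answer implied by the flipped step

print(" -> ".join(perturbed_steps), "| answer:", perturbed_answer)
```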
Measuring the measuring tools: An automatic evaluation of semantic metrics for text corpora
The ability to compare the semantic similarity between text corpora is important in a variety
of natural language processing applications. However, standard methods for evaluating …
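One simple corpus-level metric of the kind such a meta-evaluation covers is cosine similarity between mean document embeddings; in the sketch below, `embed` is a hypothetical stand-in for a real sentence encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(doc: str) -> np.ndarray:
    # Placeholder: a real system would use a sentence encoder here;
    # random vectors keep the sketch runnable.
    return rng.normal(size=16)

def corpus_vec(corpus: list[str]) -> np.ndarray:
    # Represent a corpus by the mean of its document embeddings.
    return np.mean([embed(d) for d in corpus], axis=0)

def corpus_similarity(a: list[str], b: list[str]) -> float:
    va, vb = corpus_vec(a), corpus_vec(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(corpus_similarity(["a news article", "a blog post"], ["two", "tweets"]))
```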