The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation
B Plank - arXiv preprint arXiv:2211.02570, 2022 - arxiv.org
Human variation in labeling is often considered noise. Annotation projects for machine
learning (ML) aim at minimizing human label variation, with the assumption to maximize …
TRUE: Re-evaluating factual consistency evaluation
Grounded text generation systems often generate text that contains factual inconsistencies,
hindering their real-world applicability. Automatic factual consistency evaluation may help …
Supporting human-ai collaboration in auditing llms with llms
Large language models (LLMs) are increasingly becoming all-powerful and pervasive via
deployment in sociotechnical systems. Yet these language models, be it for classification or …
Facet: Fairness in computer vision evaluation benchmark
Computer vision models have known performance disparities across attributes such as
gender and skin tone. This means during tasks such as classification and detection, model …
AI's regimes of representation: A community-centered study of text-to-image models in South Asia
This paper presents a community-centered study of cultural limitations of text-to-image (T2I)
models in the South Asian context. We theorize these failures using scholarship on …
Designing responsible ai: Adaptations of ux practice to meet responsible ai challenges
Technology companies continue to invest in efforts to incorporate responsibility in their
Artificial Intelligence (AI) advancements, while efforts to audit and regulate AI systems …
Dices dataset: Diversity in conversational ai evaluation for safety
Machine learning approaches often require training and evaluation datasets with a
clear separation between positive and negative examples. This requirement overly …
Labelling instructions matter in biomedical image analysis
Biomedical image analysis algorithm validation depends on high-quality annotation of
reference datasets, for which labelling instructions are key. Despite their importance, their …
Ground (less) truth: A causal framework for proxy labels in human-algorithm decision-making
A growing literature on human-AI decision-making investigates strategies for combining
human judgment with statistical models to improve decision-making. Research in this area …
Skin deep: Investigating subjectivity in skin tone annotations for computer vision benchmark datasets
To investigate the well-observed racial disparities in computer vision systems that analyze
images of humans, researchers have turned to skin tone as a more objective annotation …