The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

B Plank - arXiv preprint arXiv:2211.02570, 2022 - arxiv.org
Human variation in labeling is often considered noise. Annotation projects for machine
learning (ML) aim at minimizing human label variation, with the assumption to maximize …

TRUE: Re-evaluating factual consistency evaluation

O Honovich, R Aharoni, J Herzig, H Taitelbaum… - arXiv preprint arXiv …, 2022 - arxiv.org
Grounded text generation systems often generate text that contains factual inconsistencies,
hindering their real-world applicability. Automatic factual consistency evaluation may help …

Supporting human-AI collaboration in auditing LLMs with LLMs

C Rastogi, M Tulio Ribeiro, N King, H Nori… - Proceedings of the 2023 …, 2023 - dl.acm.org
Large language models (LLMs) are increasingly becoming all-powerful and pervasive via
deployment in sociotechnical systems. Yet these language models, be it for classification or …

FACET: Fairness in computer vision evaluation benchmark

L Gustafson, C Rolland, N Ravi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Computer vision models have known performance disparities across attributes such as
gender and skin tone. This means during tasks such as classification and detection, model …

AI's regimes of representation: A community-centered study of text-to-image models in South Asia

R Qadri, R Shelby, CL Bennett, E Denton - Proceedings of the 2023 …, 2023 - dl.acm.org
This paper presents a community-centered study of cultural limitations of text-to-image (T2I)
models in the South Asian context. We theorize these failures using scholarship on …

Designing responsible AI: Adaptations of UX practice to meet responsible AI challenges

Q Wang, M Madaio, S Kane, S Kapania… - Proceedings of the …, 2023 - dl.acm.org
Technology companies continue to invest in efforts to incorporate responsibility in their
Artificial Intelligence (AI) advancements, while efforts to audit and regulate AI systems …

DICES dataset: Diversity in conversational AI evaluation for safety

L Aroyo, A Taylor, M Diaz, C Homan… - Advances in …, 2023 - proceedings.neurips.cc
Machine learning approaches often require training and evaluation datasets with a
clear separation between positive and negative examples. This requirement overly …

Labelling instructions matter in biomedical image analysis

T Rädsch, A Reinke, V Weru, MD Tizabi… - Nature Machine …, 2023 - nature.com
Biomedical image analysis algorithm validation depends on high-quality annotation of
reference datasets, for which labelling instructions are key. Despite their importance, their …

Ground(less) truth: A causal framework for proxy labels in human-algorithm decision-making

L Guerdan, A Coston, ZS Wu, K Holstein - Proceedings of the 2023 ACM …, 2023 - dl.acm.org
A growing literature on human-AI decision-making investigates strategies for combining
human judgment with statistical models to improve decision-making. Research in this area …

Skin deep: Investigating subjectivity in skin tone annotations for computer vision benchmark datasets

T Barrett, Q Chen, A Zhang - Proceedings of the 2023 ACM Conference …, 2023 - dl.acm.org
To investigate the well-observed racial disparities in computer vision systems that analyze
images of humans, researchers have turned to skin tone as a more objective annotation …