Self-supervised learning in medicine and healthcare

R Krishnan, P Rajpurkar, EJ Topol - Nature Biomedical Engineering, 2022 - nature.com
The development of medical applications of machine learning has required manual
annotation of data, often by medical experts. Yet, the availability of large-scale unannotated …

Advances, challenges and opportunities in creating data for trustworthy AI

W Liang, GA Tadesse, D Ho, L Fei-Fei… - Nature Machine …, 2022 - nature.com
As artificial intelligence (AI) transitions from research to deployment, creating the appropriate
datasets and data pipelines to develop and evaluate AI models is increasingly the biggest …

Bloom: A 176b-parameter open-access multilingual language model

T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow… - 2023 - inria.hal.science
Large language models (LLMs) have been shown to be able to perform new tasks based on
a few demonstrations or natural language instructions. While these capabilities have led to …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (eg, BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Power to the people? Opportunities and challenges for participatory AI

A Birhane, W Isaac, V Prabhakaran, M Diaz… - Proceedings of the 2nd …, 2022 - dl.acm.org
Participatory approaches to artificial intelligence (AI) and machine learning (ML) are gaining
momentum: the increased attention comes partly with the view that participation opens the …

Pervasive label errors in test sets destabilize machine learning benchmarks

CG Northcutt, A Athalye, J Mueller - arXiv preprint arXiv:2103.14749, 2021 - arxiv.org
We identify label errors in the test sets of 10 of the most commonly-used computer vision,
natural language, and audio datasets, and subsequently study the potential for these label …

Madlad-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

Into the laion's den: Investigating hate in multimodal datasets

A Birhane, S Han, V Boddeti… - Advances in Neural …, 2024 - proceedings.neurips.cc
AbstractScale the model, scale the data, scale the compute'is the reigning sentiment in the
world of generative AI today. While the impact of model scaling has been extensively …

Dataperf: Benchmarks for data-centric ai development

M Mazumder, C Banbury, X Yao… - Advances in …, 2024 - proceedings.neurips.cc
Abstract Machine learning research has long focused on models rather than datasets, and
prominent datasets are used for common ML tasks without regard to the breadth, difficulty …

Do datasets have politics? Disciplinary values in computer vision dataset development

MK Scheuerman, A Hanna, E Denton - … of the ACM on Human-Computer …, 2021 - dl.acm.org
Data is a crucial component of machine learning. The field is reliant on data to train, validate,
and test models. With increased technical capabilities, machine learning research has …