“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI

R Krishnan, P Rajpurkar, EJ Topol - Nature Biomedical Engineering, 2022 - nature.com

The development of medical applications of machine learning has required manual
annotation of data, often by medical experts. Yet, the availability of large-scale unannotated …

Cited by 266 · Related articles · All 4 versions

Advances, challenges and opportunities in creating data for trustworthy AI

W Liang, GA Tadesse, D Ho, L Fei-Fei… - Nature Machine …, 2022 - nature.com

As artificial intelligence (AI) transitions from research to deployment, creating the appropriate
datasets and data pipelines to develop and evaluate AI models is increasingly the biggest …

Cited by 229 · Related articles · All 3 versions

BLOOM: A 176B-parameter open-access multilingual language model

T Le Scao, A Fan, C Akiki, E Pavlick, S Ilić, D Hesslow… - 2023 - inria.hal.science

Large language models (LLMs) have been shown to be able to perform new tasks based on
a few demonstrations or natural language instructions. While these capabilities have led to …

Cited by 1349 · Related articles · All 16 versions

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Cited by 3359 · Related articles · All 2 versions

Power to the people? Opportunities and challenges for participatory AI

A Birhane, W Isaac, V Prabhakaran, M Diaz… - Proceedings of the 2nd …, 2022 - dl.acm.org

Participatory approaches to artificial intelligence (AI) and machine learning (ML) are gaining
momentum: the increased attention comes partly with the view that participation opens the …

Cited by 157 · Related articles · All 5 versions

Pervasive label errors in test sets destabilize machine learning benchmarks

CG Northcutt, A Athalye, J Mueller - arXiv preprint arXiv:2103.14749, 2021 - arxiv.org

We identify label errors in the test sets of 10 of the most commonly-used computer vision,
natural language, and audio datasets, and subsequently study the potential for these label …

Cited by 530 · Related articles · All 9 versions

MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc

We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

Cited by 45 · Related articles · All 6 versions

Into the LAION's den: Investigating hate in multimodal datasets

A Birhane, S Han, V Boddeti… - Advances in Neural …, 2024 - proceedings.neurips.cc

'Scale the model, scale the data, scale the compute' is the reigning sentiment in the
world of generative AI today. While the impact of model scaling has been extensively …

Cited by 35 · Related articles · All 7 versions

DataPerf: Benchmarks for data-centric AI development

M Mazumder, C Banbury, X Yao… - Advances in …, 2024 - proceedings.neurips.cc

Machine learning research has long focused on models rather than datasets, and
prominent datasets are used for common ML tasks without regard to the breadth, difficulty …

Cited by 93 · Related articles · All 6 versions

Do datasets have politics? Disciplinary values in computer vision dataset development

MK Scheuerman, A Hanna, E Denton - … of the ACM on Human-Computer …, 2021 - dl.acm.org

Data is a crucial component of machine learning. The field is reliant on data to train, validate,
and test models. With increased technical capabilities, machine learning research has …

Cited by 211 · Related articles · All 6 versions