Self-supervised learning in medicine and healthcare
The development of medical applications of machine learning has required manual
annotation of data, often by medical experts. Yet, the availability of large-scale unannotated …
Advances, challenges and opportunities in creating data for trustworthy AI
As artificial intelligence (AI) transitions from research to deployment, creating the appropriate
datasets and data pipelines to develop and evaluate AI models is increasingly the biggest …
BLOOM: A 176B-parameter open-access multilingual language model
Large language models (LLMs) have been shown to be able to perform new tasks based on
a few demonstrations or natural language instructions. While these capabilities have led to …
On the opportunities and risks of foundation models
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
Power to the people? Opportunities and challenges for participatory AI
Participatory approaches to artificial intelligence (AI) and machine learning (ML) are gaining
momentum: the increased attention comes partly with the view that participation opens the …
Pervasive label errors in test sets destabilize machine learning benchmarks
We identify label errors in the test sets of 10 of the most commonly-used computer vision,
natural language, and audio datasets, and subsequently study the potential for these label …
MADLAD-400: A multilingual and document-level large audited dataset
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …
Into the LAION's den: Investigating hate in multimodal datasets
'Scale the model, scale the data, scale the compute' is the reigning sentiment in the
world of generative AI today. While the impact of model scaling has been extensively …
DataPerf: Benchmarks for data-centric AI development
Machine learning research has long focused on models rather than datasets, and
prominent datasets are used for common ML tasks without regard to the breadth, difficulty …
Do datasets have politics? Disciplinary values in computer vision dataset development
Data is a crucial component of machine learning. The field is reliant on data to train, validate,
and test models. With increased technical capabilities, machine learning research has …