Data stream clustering: A survey

JA Silva, ER Faria, RC Barros, ER Hruschka… - ACM Computing …, 2013 - dl.acm.org
Data stream mining is an active research area that has recently emerged to discover
knowledge from large amounts of continuously generated data. In this context, several data …

DataComp: In search of the next generation of multimodal datasets

SY Gadre, G Ilharco, A Fang… - Advances in …, 2024 - proceedings.neurips.cc
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable
Diffusion and GPT-4, yet their design does not receive the same research attention as model …

CAFE: Learning to condense dataset by aligning features

K Wang, B Zhao, X Peng, Z Zhu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Dataset condensation aims at reducing the network training effort through condensing a
cumbersome training set into a compact synthetic one. State-of-the-art approaches largely …
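For orientation, a sketch of the bilevel objective that dataset condensation methods build on; the notation here is generic and assumed, not taken from this paper:
  S^* = \arg\min_{S} \mathcal{L}^{T}(\theta^{S}) \quad \text{subject to} \quad \theta^{S} = \arg\min_{\theta} \mathcal{L}^{S}(\theta),
where T is the original training set, S the small synthetic set, and \mathcal{L}^{T}, \mathcal{L}^{S} the training losses computed on each.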

Dataset condensation with differentiable siamese augmentation

B Zhao, H Bilen - International Conference on Machine …, 2021 - proceedings.mlr.press
In many machine learning problems, large-scale datasets have become the de-facto
standard to train state-of-the-art deep networks at the price of heavy computation load. In this …

Improved distribution matching for dataset condensation

G Zhao, G Li, Y Qin, Y Yu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Dataset Condensation aims to condense a large dataset into a smaller one while
maintaining its ability to train a well-performing model, thus reducing the storage cost and …
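As a rough sketch only (generic notation, not this paper's exact formulation): distribution-matching approaches to condensation typically learn S by matching feature statistics of real and synthetic data under randomly sampled embedding networks \psi_{\theta},
  \min_{S} \; \mathbb{E}_{\theta \sim P_{\theta}} \Big\| \tfrac{1}{|T|}\sum_{x \in T} \psi_{\theta}(x) - \tfrac{1}{|S|}\sum_{s \in S} \psi_{\theta}(s) \Big\|^{2}.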

Dataset condensation with gradient matching

B Zhao, KR Mopuri, H Bilen - arXiv preprint arXiv:2006.05929, 2020 - arxiv.org
As the state-of-the-art machine learning methods in many fields rely on larger datasets,
storing datasets and training models on them become significantly more expensive. This …
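A sketch of the gradient-matching idea named in the title, in generic (assumed) notation: the synthetic set S is optimized so that gradients of the network loss on S track those on the real data T along the training trajectory,
  \min_{S} \sum_{t} D\big( \nabla_{\theta} \mathcal{L}^{S}(\theta_{t}), \; \nabla_{\theta} \mathcal{L}^{T}(\theta_{t}) \big),
where D is a gradient distance (e.g., a layer-wise cosine-based distance).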

Coresets for data-efficient training of machine learning models

B Mirzasoleiman, J Bilmes… - … Conference on Machine …, 2020 - proceedings.mlr.press
Incremental gradient (IG) methods, such as stochastic gradient descent and its variants, are
commonly used for large scale optimization in machine learning. Despite the sustained effort …

Dataset pruning: Reducing training data by examining generalization influence

S Yang, Z Xie, H Peng, M Xu, M Sun, P Li - arXiv preprint arXiv …, 2022 - arxiv.org
The great success of deep learning heavily relies on increasingly larger training data, which
comes at a price of huge computational and infrastructural costs. This poses crucial …

Turning Big Data Into Tiny Data: Constant-Size Coresets for k-Means, PCA, and Projective Clustering

D Feldman, M Schmidt, C Sohler - SIAM Journal on Computing, 2020 - SIAM
We develop and analyze a method to reduce the size of a very large set of data points in a
high-dimensional Euclidean space R^d to a small set of weighted points such that the result …
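For reference, the standard (strong) coreset guarantee such constructions aim at, written in generic notation: a weighted set C with weights w approximates the cost of the full point set P for every candidate solution Q (e.g., any set of k centers),
  (1 - \varepsilon)\,\mathrm{cost}(P, Q) \;\le\; \sum_{c \in C} w(c)\,\mathrm{cost}(c, Q) \;\le\; (1 + \varepsilon)\,\mathrm{cost}(P, Q).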

T-MARS: Improving visual representations by circumventing text feature learning

P Maini, S Goyal, ZC Lipton, JZ Kolter… - arXiv preprint arXiv …, 2023 - arxiv.org
Large web-sourced multimodal datasets have powered a slew of new methods for learning
general-purpose visual representations, advancing the state of the art in computer vision …