Training data influence analysis and estimation: A survey

Z Hammoudeh, D Lowd - Machine Learning, 2024 - Springer
Good models require good training data. For overparameterized deep models, the causal
relationship between training data and model predictions is increasingly opaque and poorly …
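For orientation (a standard definition, my paraphrase rather than the survey's own notation): the gold-standard notion of training-data influence underlying this line of work is the leave-one-out retraining effect,

\mathcal{I}(z_i, z_{\text{test}}) = \mathcal{L}(z_{\text{test}}; \hat\theta_{-i}) - \mathcal{L}(z_{\text{test}}; \hat\theta),
\qquad \hat\theta_{-i} = \arg\min_{\theta} \sum_{j \neq i} \mathcal{L}(z_j; \theta).

Retraining once per training point is intractable for deep models, which is why the estimators surveyed here approximate this quantity instead.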

Make every example count: On the stability and utility of self-influence for learning from noisy NLP datasets

I Bejan, A Sokolov, K Filippova - arXiv preprint arXiv:2302.13959, 2023 - arxiv.org
Increasingly large datasets have become a standard ingredient in advancing the state-of-
the-art in NLP. However, data quality might have already become the bottleneck to unlock …
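A minimal sketch of the self-influence idea the abstract refers to, assuming a TracIn-style estimate (learning-rate-weighted squared per-example gradient norms summed over checkpoints) on a toy logistic-regression model; the model, data, and checkpoint schedule are illustrative choices, not the paper's setup:

```python
# Sketch (illustrative, not the paper's exact method): TracIn-style
# self-influence. A point's self-influence is the summed, lr-weighted squared
# gradient norm of its own loss across checkpoints; unusually high values
# often flag mislabeled or atypical examples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)
y[:5] = 1.0 - y[:5]                            # inject label noise into points 0-4

w = np.zeros(5)
self_influence = np.zeros(len(X))
lr = 0.1
for step in range(100):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad_per_example = (p - y)[:, None] * X    # d loss_i / d w for each i
    if step % 10 == 0:                         # sample "checkpoints"
        self_influence += lr * (grad_per_example ** 2).sum(axis=1)
    w -= lr * grad_per_example.mean(axis=0)    # full-batch gradient step

print("highest self-influence indices:", np.argsort(-self_influence)[:5])
```

On this toy data the injected noisy points tend to dominate the top of the ranking, which is the property the paper's stability and utility analysis probes.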

Influenciæ: A library for tracing the influence back to the data-points

A Picard, L Hervier, T Fel, D Vigouroux - World Conference on Explainable …, 2024 - Springer
In today's AI-driven world, understanding model behavior is becoming more important than
ever. While libraries abound for doing so via traditional XAI methods, the domain of …

Unveiling privacy, memorization, and input curvature links

D Ravikumar, E Soufleri, A Hashemi, K Roy - arXiv preprint arXiv …, 2024 - arxiv.org
Deep Neural Nets (DNNs) have become a pervasive tool for solving many emerging
problems. However, they tend to overfit to and memorize the training set. Memorization is of …
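A hedged sketch of one quantity in the title, the curvature of the loss with respect to an input, estimated here by a random-direction second difference (a Hutchinson-style trace proxy); the model, step size h, and direction count are my illustrative choices, not the paper's method:

```python
# Estimate input-loss curvature for one example: average the second
# directional difference of the loss over random unit directions, which
# approximates tr(H_x)/d for the input Hessian H_x.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)                              # stand-in trained weights

def loss(x, y):
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_curvature(x, y, h=1e-3, n_dirs=64):
    est = 0.0
    for _ in range(n_dirs):
        v = rng.normal(size=x.shape)
        v /= np.linalg.norm(v)                      # unit direction
        est += (loss(x + h * v, y) - 2 * loss(x, y) + loss(x - h * v, y)) / h**2
    return est / n_dirs

x, y = rng.normal(size=4), 1.0
print("input-curvature proxy:", input_curvature(x, y))
```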

Training Data Attribution via Approximate Unrolled Differentiation

J Bae, W Lin, J Lorraine, R Grosse - arXiv preprint arXiv:2405.12186, 2024 - arxiv.org
Many training data attribution (TDA) methods aim to estimate how a model's behavior would
change if one or more data points were removed from the training set. Methods based on …
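Since the snippet defines TDA by its counterfactual target, here is a brute-force sketch of that target, the leave-one-out retraining effect, on a toy sklearn model; the paper's contribution is to approximate this cheaply via unrolled differentiation, which the sketch below does not implement:

```python
# Ground-truth quantity TDA methods estimate: the change in a test loss when
# one training point is removed and the model is retrained from scratch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(int)
x_test, y_test = X[:10], y[:10]
X_tr, y_tr = X[10:], y[10:]

def test_loss(model):
    p = model.predict_proba(x_test)[np.arange(10), y_test]
    return -np.log(np.clip(p, 1e-12, None)).mean()

base = test_loss(LogisticRegression().fit(X_tr, y_tr))
scores = []
for i in range(len(X_tr)):
    keep = np.arange(len(X_tr)) != i           # drop training point i
    scores.append(test_loss(LogisticRegression().fit(X_tr[keep], y_tr[keep])) - base)

# Removing a helpful point raises the test loss, so the largest score wins.
print("most helpful training point:", int(np.argmax(scores)))
```

The one-retraining-per-point cost here is exactly what makes approximations necessary at deep-learning scale.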

LayerMatch: Do Pseudo-labels Benefit All Layers?

C Liang, G Yang, L Qiao, Z Huang, H Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep neural networks have achieved remarkable performance across various tasks when
supplied with large-scale labeled data. However, the collection of labeled data can be time …

SoK: Memorisation in machine learning

D Usynin, M Knolle, G Kaissis - arXiv preprint arXiv:2311.03075, 2023 - arxiv.org
Quantifying the impact of individual data samples on machine learning models is an open
research problem. This is particularly relevant when complex and high-dimensional …

Outlier Gradient Analysis: Efficiently Improving Deep Learning Model Performance via Hessian-Free Influence Functions

A Chhabra, B Li, J Chen, P Mohapatra, H Liu - arXiv preprint arXiv …, 2024 - arxiv.org
Influence functions offer a robust framework for assessing the impact of each training data
sample on model predictions, serving as a prominent tool in data-centric learning. Despite …
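A minimal sketch of the general recipe the title points to: score training points directly in gradient space, with no Hessian or inverse-Hessian products, and flag gradient-space outliers. The outlier score below (distance from the mean per-example gradient) is my stand-in, not necessarily the paper's algorithm:

```python
# Hessian-free outlier scoring in gradient space (illustrative). After brief
# training, per-example loss gradients of corrupted points stay large, so
# their distance from the mean gradient marks them as candidates to inspect.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
y[:10] = 1.0 - y[:10]                          # corrupt a few labels

w = rng.normal(scale=0.1, size=4)
for _ in range(200):                           # briefly train the model
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * ((p - y)[:, None] * X).mean(axis=0)

p = 1.0 / (1.0 + np.exp(-X @ w))
G = (p - y)[:, None] * X                       # per-example loss gradients

dist = np.linalg.norm(G - G.mean(axis=0), axis=1)
print("candidate outliers:", np.argsort(-dist)[:10])
```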

Causal Estimation of Memorisation Profiles

P Lesci, C Meister, T Hofmann, A Vlachos… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding memorisation in language models has practical and societal implications,
e.g., studying models' training dynamics or preventing copyright infringements. Prior work …

Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions

J Liu, Z Yang - arXiv preprint arXiv:2408.10468, 2024 - arxiv.org
The responses generated by Large Language Models (LLMs) can include sensitive
information from individuals and organizations, leading to potential privacy leakage. This …