Training data influence analysis and estimation: A survey
Z Hammoudeh, D Lowd - Machine Learning, 2024 - Springer
Good models require good training data. For overparameterized deep models, the causal
relationship between training data and model predictions is increasingly opaque and poorly …
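
For orientation across several of the results below: the classical influence-function estimate (Koh & Liang, 2017) approximates how infinitesimally upweighting a training point z changes the loss at a test point z_test. A standard statement, assuming a twice-differentiable loss L and an invertible empirical Hessian:

    \mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top \, H_{\hat\theta}^{-1} \, \nabla_\theta L(z, \hat\theta),
    \qquad
    H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat\theta)

A large negative score predicts that upweighting z lowers the test loss (z is helpful for z_test); removing z would do the opposite.
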
Make every example count: On the stability and utility of self-influence for learning from noisy NLP datasets
Increasingly large datasets have become a standard ingredient for advancing the state of the art in NLP. However, data quality might have already become the bottleneck to unlock …
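
A popular Hessian-free proxy for self-influence, an example's effect on its own loss, is the TracIn-style squared gradient norm summed over training checkpoints; high scores tend to flag mislabelled or atypical examples. A minimal PyTorch sketch, where model, checkpoints, examples, and loss_fn are illustrative placeholders:

    import torch

    def self_influence(model, checkpoints, examples, loss_fn):
        # TracIn-style self-influence: sum over checkpoints of the squared
        # gradient norm of each example's own loss.
        scores = torch.zeros(len(examples))
        params = [p for p in model.parameters() if p.requires_grad]
        for ckpt in checkpoints:                    # paths to saved state_dicts
            model.load_state_dict(torch.load(ckpt))
            for i, (x, y) in enumerate(examples):   # one (input, label) pair each
                loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
                grads = torch.autograd.grad(loss, params)
                scores[i] += sum(g.pow(2).sum() for g in grads).item()
        return scores                               # high = candidate noisy label
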
Influenciæ: A library for tracing the influence back to the data-points
In today's AI-driven world, understanding model behavior is becoming more important than
ever. While libraries abound for doing so via traditional XAI methods, the domain of …
Unveiling privacy, memorization, and input curvature links
Deep Neural Nets (DNNs) have become a pervasive tool for solving many emerging
problems. However, they tend to overfit to and memorize the training set. Memorization is of …
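
One common way to quantify the input curvature the title refers to is a Hutchinson-style estimate of the loss Hessian with respect to the input, approximated by finite differences of input gradients; the sketch below follows that generic recipe, not necessarily the paper's exact estimator:

    import torch

    def input_curvature(model, x, y, loss_fn, h=1e-3, n_probes=8):
        # Finite-difference Hutchinson estimate of v^T (d^2 L / dx^2) v,
        # averaged over random Rademacher directions v.
        x = x.detach()

        def grad_x(inp):
            inp = inp.clone().requires_grad_(True)
            loss = loss_fn(model(inp.unsqueeze(0)), y.unsqueeze(0))
            return torch.autograd.grad(loss, inp)[0]

        est = 0.0
        for _ in range(n_probes):
            v = torch.randint(0, 2, x.shape).float() * 2 - 1  # random +/-1 direction
            hvp = (grad_x(x + h * v) - grad_x(x)) / h         # ~ (d^2 L / dx^2) v
            est += (v * hvp).sum().item()
        return est / n_probes
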
Training Data Attribution via Approximate Unrolled Differentiation
Many training data attribution (TDA) methods aim to estimate how a model's behavior would
change if one or more data points were removed from the training set. Methods based on …
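
Concretely, the object such unrolling-based methods differentiate admits a simple recursion: if example z's loss is upweighted by ε during SGD with step size η, then differentiating the parameter trajectory at ε = 0 gives, with H_t the training-loss Hessian at step t,

    \theta_{t+1}(\epsilon) = \theta_t(\epsilon) - \eta \bigl( \nabla L(\theta_t) + \epsilon \, \nabla \ell(z; \theta_t) \bigr),
    \qquad
    \frac{d\theta_{t+1}}{d\epsilon} = (I - \eta H_t) \, \frac{d\theta_t}{d\epsilon} - \eta \, \nabla \ell(z; \theta_t)

Exact unrolling must store or recompute the whole trajectory, which is the cost that approximate schemes aim to avoid.
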
LayerMatch: Do Pseudo-labels Benefit All Layers?
Deep neural networks have achieved remarkable performance across various tasks when
supplied with large-scale labeled data. However, the collection of labeled data can be time …
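
For context, the standard pseudo-labelling baseline that layer-wise analyses of this kind build on turns a model's own confident predictions into hard labels on unlabelled data. A minimal FixMatch-style sketch in PyTorch (the generic baseline, not the LayerMatch algorithm itself):

    import torch
    import torch.nn.functional as F

    def pseudo_label_loss(model, x_unlabeled, threshold=0.95):
        # Keep only unlabelled inputs whose top predicted probability
        # exceeds the threshold; train on the predicted hard labels.
        with torch.no_grad():
            probs = F.softmax(model(x_unlabeled), dim=-1)
            conf, pseudo = probs.max(dim=-1)
            mask = conf >= threshold
        if not mask.any():
            return torch.zeros((), requires_grad=True)  # nothing confident yet
        return F.cross_entropy(model(x_unlabeled[mask]), pseudo[mask])
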
SoK: Memorisation in machine learning
Quantifying the impact of individual data samples on machine learning models is an open
research problem. This is particularly relevant when complex and high-dimensional …
Outlier Gradient Analysis: Efficiently Improving Deep Learning Model Performance via Hessian-Free Influence Functions
Influence functions offer a robust framework for assessing the impact of each training data
sample on model predictions, serving as a prominent tool in data-centric learning. Despite …
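
The Hessian-free recipe the title suggests: represent each training sample by a loss-gradient feature and flag detrimental samples as outliers in that gradient space. In the sketch below, featurising by the final layer's weight gradient and using an isolation forest are illustrative assumptions, not the paper's exact procedure:

    import torch
    import numpy as np
    from sklearn.ensemble import IsolationForest

    def flag_detrimental(model, dataset, loss_fn, last_layer):
        # Per-sample gradients of the final layer's weights; outliers in this
        # gradient space are flagged without forming or inverting a Hessian.
        feats = []
        for x, y in dataset:  # one (input, label) example at a time
            loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
            g = torch.autograd.grad(loss, last_layer.weight)[0]
            feats.append(g.detach().flatten().numpy())
        feats = np.stack(feats)
        return IsolationForest(random_state=0).fit_predict(feats) == -1  # True = flagged
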
Causal Estimation of Memorisation Profiles
Understanding memorisation in language models has practical and societal implications,
e.g., studying models' training dynamics or preventing copyright infringements. Prior work …
Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions
J Liu, Z Yang - arXiv preprint arXiv:2408.10468, 2024 - arxiv.org
The responses generated by Large Language Models (LLMs) can include sensitive
information from individuals and organizations, leading to potential privacy leakage. This …
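
For reference, the expensive step in applying influence functions at LLM scale is the inverse-Hessian-vector product. A widely used approximation is the LiSSA recursion with damping λ, assuming the Hessian is scaled so the iteration converges; writing z_resp for the generated response being traced (notation assumed here, and the paper's specific adjustment is not reproduced):

    v_0 = \nabla_\theta L(z_{\text{resp}}, \hat\theta),
    \qquad
    v_{j+1} = v_0 + \bigl( I - (H_{\hat\theta} + \lambda I) \bigr) v_j
    \;\xrightarrow{\; j \to \infty \;}\;
    (H_{\hat\theta} + \lambda I)^{-1} v_0

Each training point z_i is then scored by \mathcal{I}(z_i) = -v^\top \nabla_\theta L(z_i, \hat\theta).
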