Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments

Jinkun Lin, Anqi Zhang, Mathias Lécuyer, Jinyang Li, Aurojit Panda, Siddhartha Sen

Research output: Contribution to journal › Conference article › peer-review

Abstract

We develop a new, principled algorithm for estimating the contribution of training data points to the behavior of a deep learning model, such as a specific prediction it makes. Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data, sampled from a given distribution. When subsets are sampled from the uniform distribution, the AME reduces to the well-known Shapley value. Our approach is inspired by causal inference and randomized experiments: we sample different subsets of the training data to train multiple submodels, and evaluate each submodel's behavior. We then use a LASSO regression to jointly estimate the AME of each data point, based on the subset compositions. Under sparsity assumptions (only k ≪ N data points have a large AME), our estimator requires only O(k log N) randomized submodel trainings, improving upon the best prior Shapley value estimators.

Original language: English (US)
Pages (from-to): 13468-13504
Number of pages: 37
Journal: Proceedings of Machine Learning Research
Volume: 162
State: Published - 2022
Event: 39th International Conference on Machine Learning, ICML 2022 - Baltimore, United States
Duration: Jul 17, 2022 - Jul 23, 2022

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
