TY - GEN
T1 - DataPrism
T2 - 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD 2022
AU - Galhotra, Sainyam
AU - Fariha, Anna
AU - Lourenço, Raoni
AU - Freire, Juliana
AU - Meliou, Alexandra
AU - Srivastava, Divesh
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/6/10
Y1 - 2022/6/10
AB - As data is a central component of many modern systems, the cause of a system malfunction may reside in the data and, specifically, in particular properties of the data. For example, a health-monitoring system designed under the assumption that weight is reported in pounds will malfunction when it encounters weight reported in kilograms. Like software debugging, which aims to find bugs in the source code or runtime conditions, our goal is to debug data to identify potential sources of disconnect between the assumptions about some data and the systems that operate on that data. We propose DataPrism, a framework to identify data properties (profiles) that are the root causes of performance degradation or failure of a data-driven system. Such identification is necessary to repair data and resolve the disconnect between data and systems. Our technique is based on causal reasoning through interventions: when a system malfunctions for a dataset, DataPrism alters the data profiles and observes changes in the system's behavior due to the alteration. Unlike statistical observational analysis that reports mere correlations, DataPrism reports causally verified root causes of the system malfunction, expressed in terms of data profiles. We empirically evaluate DataPrism on seven real-world and several synthetic data-driven systems that fail on certain datasets for a diverse set of reasons. In all cases, DataPrism identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques.
KW - causal testing
KW - data profiles
KW - debugging
KW - root-cause identification
UR - http://www.scopus.com/inward/record.url?scp=85132742654&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132742654&partnerID=8YFLogxK
U2 - 10.1145/3514221.3517864
DO - 10.1145/3514221.3517864
M3 - Conference contribution
AN - SCOPUS:85132742654
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 217
EP - 231
BT - SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 12 June 2022 through 17 June 2022
ER -