TY - GEN
T1 - DataSynthesizer
T2 - 29th International Conference on Scientific and Statistical Database Management, SSDBM 2017
AU - Ping, Haoyue
AU - Stoyanovich, Julia
AU - Howe, Bill
PY - 2017/6/27
Y1 - 2017/6/27
N2 - To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance.
AB - To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance.
KW - Data sharing
KW - Differential privacy
KW - Synthetic data
UR - http://www.scopus.com/inward/record.url?scp=85025678631&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85025678631&partnerID=8YFLogxK
U2 - 10.1145/3085504.3091117
DO - 10.1145/3085504.3091117
M3 - Conference contribution
AN - SCOPUS:85025678631
T3 - ACM International Conference Proceeding Series
BT - SSDBM 2017
PB - Association for Computing Machinery
Y2 - 27 June 2017 through 29 June 2017
ER -