DataSynthesizer: Privacy-preserving synthetic datasets

Haoyue Ping, Julia Stoyanovich, Bill Howe

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator, and ModelInspector. The first, DataDescriber, investigates the data types, correlations, and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance.
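
The describe-then-generate flow in the abstract can be illustrated with a minimal sketch for a single categorical attribute. This is not DataSynthesizer's actual API or noise calibration: the function names `describe` and `generate`, the fixed `domain` argument, the `epsilon` parameter, and the choice of Laplace noise with scale 1/epsilon (assuming neighboring datasets differ by adding or removing one record) are all assumptions made for illustration. Attribute correlations, which the abstract says DataDescriber also models, are left out.

```python
import random
from collections import Counter


def describe(values, domain, epsilon, rng):
    """Return a noisy distribution {value: probability} for one attribute.

    Assumption: neighboring datasets differ by adding or removing one record,
    so each count changes by at most 1 and Laplace noise of scale 1/epsilon
    is used. `domain` must be fixed in advance, not derived from the data.
    """
    counts = Counter(values)
    rate = epsilon  # Laplace scale b = 1/epsilon, so exponential rate = 1/b
    noisy = {}
    for v in domain:
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        noise = rng.expovariate(rate) - rng.expovariate(rate)
        noisy[v] = max(counts[v] + noise, 0.0)  # clip negative counts to zero
    total = sum(noisy.values())
    if total == 0.0:
        # Every noisy count was clipped; fall back to a uniform distribution.
        return {v: 1.0 / len(domain) for v in domain}
    return {v: c / total for v, c in noisy.items()}


def generate(summary, n, rng):
    """Sample n synthetic values from the noisy summary only."""
    values = list(summary)
    weights = [summary[v] for v in values]
    return rng.choices(values, weights=weights, k=n)


# Hypothetical usage: the private list is read once by describe();
# generate() never touches it.
rng = random.Random(0)
private = ["bus", "bus", "car", "bike", "bus", "car"]
summary = describe(private, ["bus", "car", "bike", "walk"], epsilon=1.0, rng=rng)
synthetic = generate(summary, n=10, rng=rng)
print(summary)
print(synthetic)
```

The point of the split is that only the noisy summary crosses the boundary between the two steps, so the generator can be run any number of times without further access to the private data.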

    Original language: English (US)
    Title of host publication: SSDBM 2017
    Subtitle of host publication: 29th International Conference on Scientific and Statistical Database Management
    Publisher: Association for Computing Machinery
    ISBN (Electronic): 9781450352826
    State: Published - Jun 27 2017
    Event: 29th International Conference on Scientific and Statistical Database Management, SSDBM 2017 - Chicago, United States
    Duration: Jun 27 2017 → Jun 29 2017

    Publication series

    Name: ACM International Conference Proceeding Series
    Volume: Part F128636

    Other

    Other: 29th International Conference on Scientific and Statistical Database Management, SSDBM 2017
    Country/Territory: United States
    City: Chicago
    Period: 6/27/17 → 6/29/17

    Keywords

    • Data sharing
    • Differential privacy
    • Synthetic data

    ASJC Scopus subject areas

    • Software
    • Human-Computer Interaction
    • Computer Vision and Pattern Recognition
    • Computer Networks and Communications
