Fast and Near Optimal Algorithms for Approximating Distributions by Histograms

Jayadev Acharya, Ilias Diakonikolas, Chinmay Hegde, Jerry Li, Ludwig Schmidt

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and nearoptimal algorithms for approximating arbitrary one dimensional data distributions by histograms. A κ-histogram is a piecewise constant function with κ pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld [ILR12] in PODS 2012: Given samples from a distribution p over (1, ... , n), compute a κ-histogram that minimizes the l2-distance from p, up to an additive ". We design an algorithm for this problem that uses the information{ theoretically minimal sample size of m = O(1=ε2), runs in sample-linear time O(m), and outputs an O(k){ histogram whose l2-distance from p is at most O(optκ)+ε, where optκ is the minimum l2-distance between p and any κ-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size n. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.

    Original languageEnglish (US)
    Title of host publicationPODS 2015 - Proceedings of the 34th ACM Symposium on Principles of Database Systems
    PublisherAssociation for Computing Machinery
    Pages249-263
    Number of pages15
    ISBN (Electronic)9781450327572
    DOIs
    StatePublished - May 20 2015
    Event34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2015 - Melbourne, Australia
    Duration: May 21 2015Jun 4 2015

    Publication series

    NameProceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
    Volume31

    Conference

    Conference34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2015
    CountryAustralia
    CityMelbourne
    Period5/21/156/4/15

    ASJC Scopus subject areas

    • Software
    • Information Systems
    • Hardware and Architecture

    Fingerprint Dive into the research topics of 'Fast and Near Optimal Algorithms for Approximating Distributions by Histograms'. Together they form a unique fingerprint.

    Cite this