TY - GEN

T1 - Fast and Near Optimal Algorithms for Approximating Distributions by Histograms

AU - Acharya, Jayadev

AU - Diakonikolas, Ilias

AU - Hegde, Chinmay

AU - Li, Jerry

AU - Schmidt, Ludwig

N1 - Funding Information:
The authors would like to thank Piotr Indyk for helpful discussions. I.D. was supported by aMarie Curie CIG, EPSRC grant EP/L021749/1, and a SICSA grant. J.A., C.H., and L.S. were supported by a grant from the MIT-Shell Energy Initiative. J.L. was supported by NSF grant CCF-1217921 and DOE grant DE-SC0008923.

PY - 2015/5/20

Y1 - 2015/5/20

N2 - Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and nearoptimal algorithms for approximating arbitrary one dimensional data distributions by histograms. A κ-histogram is a piecewise constant function with κ pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld [ILR12] in PODS 2012: Given samples from a distribution p over (1, ... , n), compute a κ-histogram that minimizes the l2-distance from p, up to an additive ". We design an algorithm for this problem that uses the information{ theoretically minimal sample size of m = O(1=ε2), runs in sample-linear time O(m), and outputs an O(k){ histogram whose l2-distance from p is at most O(optκ)+ε, where optκ is the minimum l2-distance between p and any κ-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size n. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.

AB - Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and nearoptimal algorithms for approximating arbitrary one dimensional data distributions by histograms. A κ-histogram is a piecewise constant function with κ pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld [ILR12] in PODS 2012: Given samples from a distribution p over (1, ... , n), compute a κ-histogram that minimizes the l2-distance from p, up to an additive ". We design an algorithm for this problem that uses the information{ theoretically minimal sample size of m = O(1=ε2), runs in sample-linear time O(m), and outputs an O(k){ histogram whose l2-distance from p is at most O(optκ)+ε, where optκ is the minimum l2-distance between p and any κ-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size n. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.

UR - http://www.scopus.com/inward/record.url?scp=84955260637&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84955260637&partnerID=8YFLogxK

U2 - 10.1145/2745754.2745772

DO - 10.1145/2745754.2745772

M3 - Conference contribution

AN - SCOPUS:84955260637

T3 - Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

SP - 249

EP - 263

BT - PODS 2015 - Proceedings of the 34th ACM Symposium on Principles of Database Systems

PB - Association for Computing Machinery

T2 - 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2015

Y2 - 21 May 2015 through 4 June 2015

ER -