TY - GEN
T1 - Fast and Near Optimal Algorithms for Approximating Distributions by Histograms
AU - Acharya, Jayadev
AU - Diakonikolas, Ilias
AU - Hegde, Chinmay
AU - Li, Jerry
AU - Schmidt, Ludwig
N1 - Publisher Copyright:
© 2015 ACM.
PY - 2015/5/20
Y1 - 2015/5/20
N2 - Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and nearoptimal algorithms for approximating arbitrary one dimensional data distributions by histograms. A κ-histogram is a piecewise constant function with κ pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld [ILR12] in PODS 2012: Given samples from a distribution p over (1, ... , n), compute a κ-histogram that minimizes the l2-distance from p, up to an additive ". We design an algorithm for this problem that uses the information{ theoretically minimal sample size of m = O(1=ε2), runs in sample-linear time O(m), and outputs an O(k){ histogram whose l2-distance from p is at most O(optκ)+ε, where optκ is the minimum l2-distance between p and any κ-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size n. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.
AB - Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and nearoptimal algorithms for approximating arbitrary one dimensional data distributions by histograms. A κ-histogram is a piecewise constant function with κ pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld [ILR12] in PODS 2012: Given samples from a distribution p over (1, ... , n), compute a κ-histogram that minimizes the l2-distance from p, up to an additive ". We design an algorithm for this problem that uses the information{ theoretically minimal sample size of m = O(1=ε2), runs in sample-linear time O(m), and outputs an O(k){ histogram whose l2-distance from p is at most O(optκ)+ε, where optκ is the minimum l2-distance between p and any κ-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size n. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.
UR - http://www.scopus.com/inward/record.url?scp=84955260637&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84955260637&partnerID=8YFLogxK
U2 - 10.1145/2745754.2745772
DO - 10.1145/2745754.2745772
M3 - Conference contribution
AN - SCOPUS:84955260637
T3 - Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
SP - 249
EP - 263
BT - PODS 2015 - Proceedings of the 34th ACM Symposium on Principles of Database Systems
PB - Association for Computing Machinery
T2 - 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2015
Y2 - 21 May 2015 through 4 June 2015
ER -