### Abstract

Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and nearoptimal algorithms for approximating arbitrary one dimensional data distributions by histograms. A κ-histogram is a piecewise constant function with κ pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld [ILR12] in PODS 2012: Given samples from a distribution p over (1, ... , n), compute a κ-histogram that minimizes the l2-distance from p, up to an additive ". We design an algorithm for this problem that uses the information{ theoretically minimal sample size of m = O(1=ε^{2}), runs in sample-linear time O(m), and outputs an O(k){ histogram whose l2-distance from p is at most O(optκ)+ε, where optκ is the minimum l2-distance between p and any κ-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size n. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.

Original language | English (US) |
---|---|

Title of host publication | PODS 2015 - Proceedings of the 34th ACM Symposium on Principles of Database Systems |

Publisher | Association for Computing Machinery |

Pages | 249-263 |

Number of pages | 15 |

ISBN (Electronic) | 9781450327572 |

DOIs | |

State | Published - May 20 2015 |

Event | 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2015 - Melbourne, Australia Duration: May 21 2015 → Jun 4 2015 |

### Publication series

Name | Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems |
---|---|

Volume | 31 |

### Conference

Conference | 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2015 |
---|---|

Country | Australia |

City | Melbourne |

Period | 5/21/15 → 6/4/15 |

### ASJC Scopus subject areas

- Software
- Information Systems
- Hardware and Architecture

## Fingerprint Dive into the research topics of 'Fast and Near Optimal Algorithms for Approximating Distributions by Histograms'. Together they form a unique fingerprint.

## Cite this

*PODS 2015 - Proceedings of the 34th ACM Symposium on Principles of Database Systems*(pp. 249-263). (Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems; Vol. 31). Association for Computing Machinery. https://doi.org/10.1145/2745754.2745772