Optimal sampling from distributed streams

Graham Cormode, S. Muthukrishnan, Ke Yi, Qin Zhang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

A fundamental problem in data management is to draw a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol and track parameters of the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this paper, we present communication-efficient protocols for sampling (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent items, or arrivals within the last w time units. We show that our protocols are optimal, not just in terms of the communication used, but also that they use minimal or near minimal (up to logarithmic factors) time to process each new item, and space to operate.

Original languageEnglish (US)
Title of host publicationPODS'10 - Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
Pages77-86
Number of pages10
DOIs
StatePublished - Jul 23 2010
Externally publishedYes
Event29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010 - Indianapolis, IN, United States
Duration: Jun 6 2010Jun 11 2010

Publication series

NameProceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

Conference

Conference29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010
CountryUnited States
CityIndianapolis, IN
Period6/6/106/11/10

Keywords

  • distributed tracking
  • random sampling

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture

Fingerprint Dive into the research topics of 'Optimal sampling from distributed streams'. Together they form a unique fingerprint.

  • Cite this

    Cormode, G., Muthukrishnan, S., Yi, K., & Zhang, Q. (2010). Optimal sampling from distributed streams. In PODS'10 - Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 77-86). (Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems). https://doi.org/10.1145/1807085.1807099