Optimal sampling from distributed streams

Graham Cormode, S. Muthukrishnan, Ke Yi, Qin Zhang

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    A fundamental problem in data management is to draw a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol and track parameters of the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this paper, we present communication-efficient protocols for sampling (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent items, or arrivals within the last w time units. We show that our protocols are optimal, not just in terms of the communication used, but also that they use minimal or near minimal (up to logarithmic factors) time to process each new item, and space to operate.

    Original languageEnglish (US)
    Title of host publicationPODS'10 - Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
    Pages77-86
    Number of pages10
    DOIs
    StatePublished - 2010
    Event29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010 - Indianapolis, IN, United States
    Duration: Jun 6 2010Jun 11 2010

    Publication series

    NameProceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

    Conference

    Conference29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010
    Country/TerritoryUnited States
    CityIndianapolis, IN
    Period6/6/106/11/10

    Keywords

    • distributed tracking
    • random sampling

    ASJC Scopus subject areas

    • Software
    • Information Systems
    • Hardware and Architecture

    Fingerprint

    Dive into the research topics of 'Optimal sampling from distributed streams'. Together they form a unique fingerprint.

    Cite this