Continuous sampling from distributed streams

Graham Cormode, S. Muthukrishnan, Ke Yi, Qin Zhang

    Research output: Contribution to journalArticlepeer-review

    Abstract

    A fundamental problem in data management is to draw and maintain a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The main challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol on the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this article, we present communication-efficient protocols for continuously maintaining a sample (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent elements, or arrivals within the last w time units. We show that our protocols are optimal (up to logarithmic factors), not just in terms of the communication used, but also the time and space costs for each participant.

    Original languageEnglish (US)
    Article number2160163
    JournalJournal of the ACM
    Volume59
    Issue number2
    DOIs
    StatePublished - Apr 2012

    Keywords

    • Distributed tracking
    • Random sampling

    ASJC Scopus subject areas

    • Software
    • Control and Systems Engineering
    • Information Systems
    • Hardware and Architecture
    • Artificial Intelligence

    Fingerprint

    Dive into the research topics of 'Continuous sampling from distributed streams'. Together they form a unique fingerprint.

    Cite this