Sampling algorithms in a stream operator

Theodore Johnson, S. Muthukrishnan, Irina Rozenbaum

    Research output: Contribution to journalConference articlepeer-review

    Abstract

    Complex queries over high speed data streams often need to rely on approximations to keep up with their input. The research community has developed a rich literature on approximate streaming algorithms for this application. Many of these algorithms produce samples of the input stream, providing better properties than conventional random sampling. In this paper, we abstract the stream sampling process and design a new stream sample operator. We show how it can be used to implement a wide variety of algorithms that perform sampling and sampling-based aggregations. Also, we show how to implement the operator in Gigascope - a high speed stream database specialized for IP network monitoring applications. As an example study, we apply the operator within such an enhanced Gigascope to perform subset-sum sampling which is of great interest for IP network management. We evaluate this implemention on a live, high speed internet traffic data stream and find that (a) the operator is a flexible, versatile addition to Gigascope suitable for tuning and algorithm engineering, and (b) the operator imposes only a small evaluation overhead. This is the first operational implementation we know of, for a wide variety of stream sampling algorithms at line speed within a data stream management system.

    Original languageEnglish (US)
    Pages (from-to)1-12
    Number of pages12
    JournalProceedings of the ACM SIGMOD International Conference on Management of Data
    DOIs
    StatePublished - 2005
    EventSIGMOD 2005: ACM SIGMOD International Conference on Management of Data - Baltimore, MD, United States
    Duration: Jun 14 2005Jun 16 2005

    ASJC Scopus subject areas

    • Software
    • Information Systems

    Fingerprint

    Dive into the research topics of 'Sampling algorithms in a stream operator'. Together they form a unique fingerprint.

    Cite this