Timely Reporting of Heavy Hitters using External Memory

Prashant Pandey, Shikha Singh, Michael A. Bender, Jonathan W. Berry, Martín Farach-Colton, Rob Johnson, Thomas M. Kroeger, Cynthia A. Phillips

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    Given an input stream of size N, a †-heavy hitter is an item that occurs at least † N times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = † N-th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, and with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ∼100K observations per second.

    Original languageEnglish (US)
    Title of host publicationSIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    PublisherAssociation for Computing Machinery
    Pages1431-1446
    Number of pages16
    ISBN (Electronic)9781450367356
    DOIs
    StatePublished - Jun 14 2020
    Event2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020 - Portland, United States
    Duration: Jun 14 2020Jun 19 2020

    Publication series

    NameProceedings of the ACM SIGMOD International Conference on Management of Data
    ISSN (Print)0730-8078

    Conference

    Conference2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
    Country/TerritoryUnited States
    CityPortland
    Period6/14/206/19/20

    Keywords

    • dictionary data structure
    • external-memory algorithms
    • streaming algorithms

    ASJC Scopus subject areas

    • Software
    • Information Systems

    Fingerprint

    Dive into the research topics of 'Timely Reporting of Heavy Hitters using External Memory'. Together they form a unique fingerprint.

    Cite this