TY - JOUR
T1 - Timely Reporting of Heavy Hitters Using External Memory
AU - Singh, Shikha
AU - Pandey, Prashant
AU - Bender, Michael A.
AU - Berry, Jonathan W.
AU - Farach-Colton, Martín
AU - Johnson, Rob
AU - Kroeger, Thomas M.
AU - Phillips, Cynthia A.
N1 - Publisher Copyright:
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2021/12
Y1 - 2021/12
N2 - Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy-hitters is extensively studied in the database literature.We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω (N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
AB - Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy-hitters is extensively studied in the database literature.We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω (N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
KW - Dictionary data structure
KW - external-memory algorithms
KW - streaming algorithms
UR - http://www.scopus.com/inward/record.url?scp=85121288710&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85121288710&partnerID=8YFLogxK
U2 - 10.1145/3472392
DO - 10.1145/3472392
M3 - Article
AN - SCOPUS:85121288710
SN - 0362-5915
VL - 46
JO - ACM Transactions on Database Systems
JF - ACM Transactions on Database Systems
IS - 4
M1 - 3472392
ER -