Finding subcube heavy hitters in analytics data streams

Branislav Kveton, S. Muthukrishnan, Hoa T. Vu, Yikun Xian

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    Modern data streams typically have high dimensionality. For example, digital analytics streams consist of user online activities (e.g., web browsing activity, commercial site activity, apps and social behavior, and response to ads). An important problem is to find frequent joint values (heavy hitters) of subsets of dimensions. Formally, the data stream consists of d-dimensional items and a \em k-dimensional subcube T is a subset of k distinct coordinates. Given a theshold '3, a \em subcube heavy hitter query $\rm Query (T,v)$ outputs YES if $f-T(v) \geq '3$ and NO if $f-T(v) g3/4$ where $f-T$ is the ratio of the number of stream items whose coordinates T have joint values v. The \em all subcube heavy hitters query $\rm AllQuery(T)$ outputs all joint values v that return YES to $\rm Query (T,v)$. The problem is to answer these queries correctly for all T and v. We present a simple one-pass sampling algorithm to solve the subcube heavy hitters problem in $\tildeO (kd/3)$ space. $\tildeO(\cdot)$ suppresses polylogarithmic factors. This is optimal up to polylogarithmic factors based on the lower bound of Liberty et al. In the worst case, this bound becomes (d^2/3)$ which is prohibitive for large d. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass, $\tildeO (d/3)$-space algorithm for our problem, and a fast algorithm for answering $\rm AllQuery (T)$ in $\tO((k/)^2)$ time. We demonstrate the effectiveness of our approach on a synthetic dataset as well as real datasets from Adobe and Yandex. Our work shows the potential of model-based approach to data streams.

    Original languageEnglish (US)
    Title of host publicationThe Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018
    PublisherAssociation for Computing Machinery, Inc
    Pages1705-1714
    Number of pages10
    ISBN (Electronic)9781450356398
    DOIs
    StatePublished - Apr 10 2018
    Event27th International World Wide Web, WWW 2018 - Lyon, France
    Duration: Apr 23 2018Apr 27 2018

    Publication series

    NameThe Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018

    Conference

    Conference27th International World Wide Web, WWW 2018
    Country/TerritoryFrance
    CityLyon
    Period4/23/184/27/18

    Keywords

    • Algorithms
    • Data streams
    • Graphical models
    • Heavy hitters

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Software

    Fingerprint

    Dive into the research topics of 'Finding subcube heavy hitters in analytics data streams'. Together they form a unique fingerprint.

    Cite this