## Abstract

Modern data streams typically have high dimensionality. For example, digital analytics streams consist of user online activities (e.g., web browsing activity, commercial site activity, apps and social behavior, and response to ads). An important problem is to find frequent joint values (heavy hitters) of subsets of dimensions. Formally, the data stream consists of d-dimensional items and a \em k-dimensional subcube T is a subset of k distinct coordinates. Given a theshold '3, a \em subcube heavy hitter query $\rm Query (T,v)$ outputs YES if $f-T(v) \geq '3$ and NO if $f-T(v) g3/4$ where $f-T$ is the ratio of the number of stream items whose coordinates T have joint values v. The \em all subcube heavy hitters query $\rm AllQuery(T)$ outputs all joint values v that return YES to $\rm Query (T,v)$. The problem is to answer these queries correctly for all T and v. We present a simple one-pass sampling algorithm to solve the subcube heavy hitters problem in $\tildeO (kd/3)$ space. $\tildeO(\cdot)$ suppresses polylogarithmic factors. This is optimal up to polylogarithmic factors based on the lower bound of Liberty et al. In the worst case, this bound becomes (d^2/3)$ which is prohibitive for large d. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass, $\tildeO (d/3)$-space algorithm for our problem, and a fast algorithm for answering $\rm AllQuery (T)$ in $\tO((k/)^2)$ time. We demonstrate the effectiveness of our approach on a synthetic dataset as well as real datasets from Adobe and Yandex. Our work shows the potential of model-based approach to data streams.

Original language | English (US) |
---|---|

Title of host publication | The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018 |

Publisher | Association for Computing Machinery, Inc |

Pages | 1705-1714 |

Number of pages | 10 |

ISBN (Electronic) | 9781450356398 |

DOIs | |

State | Published - Apr 10 2018 |

Event | 27th International World Wide Web, WWW 2018 - Lyon, France Duration: Apr 23 2018 → Apr 27 2018 |

### Publication series

Name | The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018 |
---|

### Conference

Conference | 27th International World Wide Web, WWW 2018 |
---|---|

Country | France |

City | Lyon |

Period | 4/23/18 → 4/27/18 |

## Keywords

- Algorithms
- Data streams
- Graphical models
- Heavy hitters

## ASJC Scopus subject areas

- Computer Networks and Communications
- Software