TY - GEN
T1 - Finding subcube heavy hitters in analytics data streams
AU - Kveton, Branislav
AU - Muthukrishnan, S.
AU - Vu, Hoa T.
AU - Xian, Yikun
N1 - Publisher Copyright:
© 2018 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License.
PY - 2018/4/10
Y1 - 2018/4/10
N2 - Modern data streams typically have high dimensionality. For example, digital analytics streams consist of user online activities (e.g., web browsing activity, commercial site activity, apps and social behavior, and response to ads). An important problem is to find frequent joint values (heavy hitters) of subsets of dimensions. Formally, the data stream consists of d-dimensional items and a \em k-dimensional subcube T is a subset of k distinct coordinates. Given a theshold '3, a \em subcube heavy hitter query $\rm Query (T,v)$ outputs YES if $f-T(v) \geq '3$ and NO if $f-T(v) g3/4$ where $f-T$ is the ratio of the number of stream items whose coordinates T have joint values v. The \em all subcube heavy hitters query $\rm AllQuery(T)$ outputs all joint values v that return YES to $\rm Query (T,v)$. The problem is to answer these queries correctly for all T and v. We present a simple one-pass sampling algorithm to solve the subcube heavy hitters problem in $\tildeO (kd/3)$ space. $\tildeO(\cdot)$ suppresses polylogarithmic factors. This is optimal up to polylogarithmic factors based on the lower bound of Liberty et al. In the worst case, this bound becomes (d^2/3)$ which is prohibitive for large d. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass, $\tildeO (d/3)$-space algorithm for our problem, and a fast algorithm for answering $\rm AllQuery (T)$ in $\tO((k/)^2)$ time. We demonstrate the effectiveness of our approach on a synthetic dataset as well as real datasets from Adobe and Yandex. Our work shows the potential of model-based approach to data streams.
AB - Modern data streams typically have high dimensionality. For example, digital analytics streams consist of user online activities (e.g., web browsing activity, commercial site activity, apps and social behavior, and response to ads). An important problem is to find frequent joint values (heavy hitters) of subsets of dimensions. Formally, the data stream consists of d-dimensional items and a \em k-dimensional subcube T is a subset of k distinct coordinates. Given a theshold '3, a \em subcube heavy hitter query $\rm Query (T,v)$ outputs YES if $f-T(v) \geq '3$ and NO if $f-T(v) g3/4$ where $f-T$ is the ratio of the number of stream items whose coordinates T have joint values v. The \em all subcube heavy hitters query $\rm AllQuery(T)$ outputs all joint values v that return YES to $\rm Query (T,v)$. The problem is to answer these queries correctly for all T and v. We present a simple one-pass sampling algorithm to solve the subcube heavy hitters problem in $\tildeO (kd/3)$ space. $\tildeO(\cdot)$ suppresses polylogarithmic factors. This is optimal up to polylogarithmic factors based on the lower bound of Liberty et al. In the worst case, this bound becomes (d^2/3)$ which is prohibitive for large d. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass, $\tildeO (d/3)$-space algorithm for our problem, and a fast algorithm for answering $\rm AllQuery (T)$ in $\tO((k/)^2)$ time. We demonstrate the effectiveness of our approach on a synthetic dataset as well as real datasets from Adobe and Yandex. Our work shows the potential of model-based approach to data streams.
KW - Algorithms
KW - Data streams
KW - Graphical models
KW - Heavy hitters
UR - http://www.scopus.com/inward/record.url?scp=85075678839&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075678839&partnerID=8YFLogxK
U2 - 10.1145/3178876.3186082
DO - 10.1145/3178876.3186082
M3 - Conference contribution
AN - SCOPUS:85075678839
T3 - The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018
SP - 1705
EP - 1714
BT - The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018
PB - Association for Computing Machinery, Inc
T2 - 27th International World Wide Web, WWW 2018
Y2 - 23 April 2018 through 27 April 2018
ER -