TY - GEN
T1 - Real-Time Clustering for Large Sparse Online Visitor Data
AU - Chan, Gromit Yeuk Yin
AU - Du, Fan
AU - Rossi, Ryan A.
AU - Rao, Anup B.
AU - Koh, Eunyee
AU - Silva, Cláudio T.
AU - Freire, Juliana
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/4/20
Y1 - 2020/4/20
N2 - Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behavior. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to provide real-time responses on user parameter changes, which the current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incur a quadratic computation overhead and is infeasible for any moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation compared to a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20 × speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory. Finally, we present an interface to explore customer segments from millions of online visitor records in real-time.
AB - Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behavior. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to provide real-time responses on user parameter changes, which the current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incur a quadratic computation overhead and is infeasible for any moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation compared to a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20 × speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory. Finally, we present an interface to explore customer segments from millions of online visitor records in real-time.
KW - Clustering
KW - Density peaks
KW - Sketching
KW - Spark
KW - Sparse binary data
UR - http://www.scopus.com/inward/record.url?scp=85086594257&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086594257&partnerID=8YFLogxK
U2 - 10.1145/3366423.3380183
DO - 10.1145/3366423.3380183
M3 - Conference contribution
AN - SCOPUS:85086594257
T3 - The Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020
SP - 1049
EP - 1059
BT - The Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020
PB - Association for Computing Machinery, Inc
T2 - 29th International World Wide Web Conference, WWW 2020
Y2 - 20 April 2020 through 24 April 2020
ER -