Real-Time Clustering for Large Sparse Online Visitor Data

Gromit Yeuk Yin Chan, Fan Du, Ryan A. Rossi, Anup B. Rao, Eunyee Koh, Cláudio T. Silva, Juliana Freire

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behavior. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to provide real-time responses on user parameter changes, which the current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incur a quadratic computation overhead and is infeasible for any moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation compared to a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20 × speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory. Finally, we present an interface to explore customer segments from millions of online visitor records in real-time.

Original languageEnglish (US)
Title of host publicationThe Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020
PublisherAssociation for Computing Machinery, Inc
Number of pages11
ISBN (Electronic)9781450370233
StatePublished - Apr 20 2020
Event29th International World Wide Web Conference, WWW 2020 - Taipei, Taiwan, Province of China
Duration: Apr 20 2020Apr 24 2020

Publication series

NameThe Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020


Conference29th International World Wide Web Conference, WWW 2020
Country/TerritoryTaiwan, Province of China


  • Clustering
  • Density peaks
  • Sketching
  • Spark
  • Sparse binary data

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software


Dive into the research topics of 'Real-Time Clustering for Large Sparse Online Visitor Data'. Together they form a unique fingerprint.

Cite this