Conquering the divide: Continuous clustering of distributed data streams

Graham Cormode, S. Muthukrishnan, Wei Zhuang

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Data is often collected over a distributed network, but in many cases it is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost while still providing guaranteed accuracy of the clustering. We focus on k-center clustering, and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and whether they maintain a single global clustering or many local clusterings that can be merged together. We show that these algorithms can be designed to give accuracy guarantees that are close to the best possible even in the centralized case. In our experiments, we see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering, with only a small fraction of the communication required to collect all the data in a single location.
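
For readers unfamiliar with k-center clustering, the "local clusterings that can be merged" idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the classic greedy farthest-point heuristic (a well-known 2-approximation for centralized k-center) as the per-site method, and it assumes a naive merge step that re-clusters the union of the local centers. The names `farthest_point_centers`, `merge_local_clusterings`, and `radius` are hypothetical, and no accuracy guarantee is claimed for the merged result.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def farthest_point_centers(points: Sequence[Point], k: int) -> List[Point]:
    """Greedy farthest-point heuristic for k-center (illustrative only)."""
    if not points or k <= 0:
        return []
    centers = [points[0]]
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=dist.__getitem__)
        if dist[i] == 0.0:
            break  # every remaining point coincides with a chosen center
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers


def merge_local_clusterings(sites: Sequence[Sequence[Point]], k: int) -> List[Point]:
    """Assumed merge step: each site sends only its k local centers,
    and the coordinator re-clusters the union of those centers."""
    local_centers = [c for site in sites for c in farthest_point_centers(site, k)]
    return farthest_point_centers(local_centers, k)


def radius(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """k-center cost: largest distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


# Two hypothetical sites holding 2-D points.
sites = [[(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)], [(11.0, 0.0), (0.0, 1.0)]]
centers = merge_local_clusterings(sites, k=2)
all_points = [p for site in sites for p in site]
print(centers, radius(all_points, centers))
```

In this sketch each site transmits at most k points regardless of how much data it holds, which is the kind of communication saving the abstract refers to; handling continuously arriving data would need additional machinery that is not shown here.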

    Original language: English (US)
    Title of host publication: 23rd International Conference on Data Engineering, ICDE 2007
    Pages: 1036-1045
    Number of pages: 10
    State: Published - 2007
    Event: 23rd International Conference on Data Engineering, ICDE 2007 - Istanbul, Turkey
    Duration: Apr 15 2007 to Apr 20 2007

    Publication series

    Name: Proceedings - International Conference on Data Engineering
    ISSN (Print): 1084-4627

    Other

    Other: 23rd International Conference on Data Engineering, ICDE 2007
    Country/Territory: Turkey
    City: Istanbul
    Period: 4/15/07 to 4/20/07

    ASJC Scopus subject areas

    • Software
    • Signal Processing
    • Information Systems
