TY - GEN
T1 - Conquering the divide
T2 - 23rd International Conference on Data Engineering, ICDE 2007
AU - Cormode, Graham
AU - Muthukrishnan, S.
AU - Zhuang, Wei
N1 - Copyright:
Copyright 2008 Elsevier B.V., All rights reserved.
PY - 2007
Y1 - 2007
N2 - Data is often collected over a distributed network, but in many cases, is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost, still providing guaranteed accuracy of the clustering. We focus on the k-center clustering, and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and whether they maintain a single global clustering or many local clusterings that can be merged together. We. show that these algorithms can be designed to give accuracy guarantees that are. close to the best possible even in the centralized case. In our experiments, we. see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering, with only a small fraction of the communication required to collect all the data in a single location.
AB - Data is often collected over a distributed network, but in many cases, is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost, still providing guaranteed accuracy of the clustering. We focus on the k-center clustering, and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and whether they maintain a single global clustering or many local clusterings that can be merged together. We. show that these algorithms can be designed to give accuracy guarantees that are. close to the best possible even in the centralized case. In our experiments, we. see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering, with only a small fraction of the communication required to collect all the data in a single location.
UR - http://www.scopus.com/inward/record.url?scp=34548803489&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34548803489&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2007.368962
DO - 10.1109/ICDE.2007.368962
M3 - Conference contribution
AN - SCOPUS:34548803489
SN - 1424408032
SN - 9781424408030
T3 - Proceedings - International Conference on Data Engineering
SP - 1036
EP - 1045
BT - 23rd International Conference on Data Engineering, ICDE 2007
Y2 - 15 April 2007 through 20 April 2007
ER -