TY - GEN
T1 - Optimal sampling from distributed streams
AU - Cormode, Graham
AU - Muthukrishnan, S.
AU - Yi, Ke
AU - Zhang, Qin
N1 - Copyright:
Copyright 2010 Elsevier B.V., All rights reserved.
PY - 2010
Y1 - 2010
N2 - A fundamental problem in data management is to draw a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol and track parameters of the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this paper, we present communication-efficient protocols for sampling (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent items, or arrivals within the last w time units. We show that our protocols are optimal, not just in terms of the communication used, but also that they use minimal or near minimal (up to logarithmic factors) time to process each new item, and space to operate.
AB - A fundamental problem in data management is to draw a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol and track parameters of the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this paper, we present communication-efficient protocols for sampling (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent items, or arrivals within the last w time units. We show that our protocols are optimal, not just in terms of the communication used, but also that they use minimal or near minimal (up to logarithmic factors) time to process each new item, and space to operate.
KW - distributed tracking
KW - random sampling
UR - http://www.scopus.com/inward/record.url?scp=77954709478&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954709478&partnerID=8YFLogxK
U2 - 10.1145/1807085.1807099
DO - 10.1145/1807085.1807099
M3 - Conference contribution
AN - SCOPUS:77954709478
SN - 9781450300339
T3 - Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
SP - 77
EP - 86
BT - PODS'10 - Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
T2 - 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010
Y2 - 6 June 2010 through 11 June 2010
ER -