TY - GEN
T1 - Query-aware partitioning for monitoring massive network data streams
AU - Johnson, Theodore
AU - Muthukrishnan, S.
AU - Shkapenyuk, Vladislav
AU - Spatscheck, Oliver
PY - 2008
Y1 - 2008
N2 - Data Stream Management Systems (DSMS) are gaining acceptance for applications that need to process very large volumes of data in real time. The load generated by such applications frequently exceeds by far the computation capabilities of a single centralized server. In particular, a single-server instance of our DSMS, Gigascope, cannot keep up with the processing demands of the new OC-786 networks, which can generate more than 100 million packets per second. In this paper, we explore a mechanism for the distributed processing of very high speed data streams. Existing distributed DSMSs employ two mechanisms for distributing the load across the participating machines: partitioning of the query execution plans and partitioning of the input data stream in a query-independent fashion. However, for a large class of queries, both approaches fail to reduce the load as compared to centralized system, and can even lead to an increase in the load. In this paper we present an alternative approach - query-aware data stream partitioning that allows for more efficient scaling. We present methods for analyzing any given query set and choose the optimal partitioning scheme, and show how to reconcile potentially conflicting requirements that different queries might place on partitioning. We conclude with experiments on a small cluster of processing nodes on high-rate network traffic feed that demonstrates with different query sets that our methods effectively distribute the load across all processing nodes and facilitate efficient scaling whenever more processing nodes becomes available.
AB - Data Stream Management Systems (DSMS) are gaining acceptance for applications that need to process very large volumes of data in real time. The load generated by such applications frequently exceeds by far the computation capabilities of a single centralized server. In particular, a single-server instance of our DSMS, Gigascope, cannot keep up with the processing demands of the new OC-786 networks, which can generate more than 100 million packets per second. In this paper, we explore a mechanism for the distributed processing of very high speed data streams. Existing distributed DSMSs employ two mechanisms for distributing the load across the participating machines: partitioning of the query execution plans and partitioning of the input data stream in a query-independent fashion. However, for a large class of queries, both approaches fail to reduce the load as compared to centralized system, and can even lead to an increase in the load. In this paper we present an alternative approach - query-aware data stream partitioning that allows for more efficient scaling. We present methods for analyzing any given query set and choose the optimal partitioning scheme, and show how to reconcile potentially conflicting requirements that different queries might place on partitioning. We conclude with experiments on a small cluster of processing nodes on high-rate network traffic feed that demonstrates with different query sets that our methods effectively distribute the load across all processing nodes and facilitate efficient scaling whenever more processing nodes becomes available.
KW - Data streams
KW - Partitioning
KW - Query optimization
UR - http://www.scopus.com/inward/record.url?scp=57149138292&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=57149138292&partnerID=8YFLogxK
U2 - 10.1145/1376616.1376730
DO - 10.1145/1376616.1376730
M3 - Conference contribution
AN - SCOPUS:57149138292
SN - 9781605581026
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1135
EP - 1146
BT - SIGMOD 2008
T2 - 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08
Y2 - 9 June 2008 through 12 June 2008
ER -