TY - GEN
T1 - Query-aware sampling for data streams
AU - Johnson, Theodore
AU - Muthukrishnan, S.
AU - Shkapenyuk, Vladislav
AU - Spatscheck, Oliver
PY - 2007
Y1 - 2007
N2 - Data Stream Management Systems are useful when large volumes of data need to be processed in real time. Examples include monitoring network traffic, monitoring financial transactions, and analyzing large scale scientific data feeds. These applications have varying data rates and often show bursts of high activity that overload the system, often during the most critical instants (e.g., network attacks, financial spikes) for analysis. Therefore, load shedding is necessary to preserve the stability of the system, gracefully degrade its performance and extract answers. Existing methods for load shedding in a general purpose data stream query system use random sampling of tuples, essentially independent of the query. While this technique is acceptable for some queries, the results may be meaningless or even incorrect for other queries. In principle, a number of different query-dependent sampling methods exist, but they work only for particular queries. In this paper, we show how to perform query-aware sampling (semantic sampling) which works in general. We present methods for analyzing any given query, choosing sampling methods judiciously, and reconciling the sampling methods required by different queries in a query set. We conclude with experiments on a high-speed data stream that demonstrate with different query sets that our method produces accurate results while decreasing the load significantly.
AB - Data Stream Management Systems are useful when large volumes of data need to be processed in real time. Examples include monitoring network traffic, monitoring financial transactions, and analyzing large scale scientific data feeds. These applications have varying data rates and often show bursts of high activity that overload the system, often during the most critical instants (e.g., network attacks, financial spikes) for analysis. Therefore, load shedding is necessary to preserve the stability of the system, gracefully degrade its performance and extract answers. Existing methods for load shedding in a general purpose data stream query system use random sampling of tuples, essentially independent of the query. While this technique is acceptable for some queries, the results may be meaningless or even incorrect for other queries. In principle, a number of different query-dependent sampling methods exist, but they work only for particular queries. In this paper, we show how to perform query-aware sampling (semantic sampling) which works in general. We present methods for analyzing any given query, choosing sampling methods judiciously, and reconciling the sampling methods required by different queries in a query set. We conclude with experiments on a high-speed data stream that demonstrate with different query sets that our method produces accurate results while decreasing the load significantly.
UR - http://www.scopus.com/inward/record.url?scp=48349088657&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=48349088657&partnerID=8YFLogxK
U2 - 10.1109/ICDEW.2007.4401053
DO - 10.1109/ICDEW.2007.4401053
M3 - Conference contribution
AN - SCOPUS:48349088657
SN - 1424408326
SN - 9781424408320
T3 - Proceedings - International Conference on Data Engineering
SP - 664
EP - 673
BT - Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07
T2 - Workshops in Conjunction with the 23rd International Conference on Data Engineering - ICDE 2007
Y2 - 15 April 2007 through 20 April 2007
ER -