TY - GEN
T1 - How to scalably and accurately skip past streams
AU - Bhattacharyya, Supratik
AU - Madeira, André
AU - Muthukrishnan, S.
AU - Ye, Tao
PY - 2007
Y1 - 2007
N2 - Data stream methods look at each new item of the stream, perform a small number of operations while keeping a small amount of memory, and still perform much-needed analyses. However, in many situations, the update speed per item is extremely critical and not every item can be extensively examined. In practice, this has been addressed by only examining every Nth item from the input; decreasing the input rate by a fraction 1/N, but resulting in loss of guarantees on the accuracy of the post-hoc analyses. In this paper, we present a technique of skipping past streams and looking at only a fraction of the input. Unlike traditional methods, our skipping is performed in a principled manner based on the "norm" of the stream seen. Using this technique on top of well-known sketches, we show several-fold improvement in the update time for processing streams with a given guaranteed accuracy, for a number of stream processing problems including data summarization, heavy hitters detection and self-join size estimation. We present experimental results of our methods over synthetic data and integrate our methods into Sprint's Continuous Monitoring (CMON) system for live network traffic analyses. Furthermore, aiming at future scalable stream processing systems and going beyond state-of-art packet header analyses, we show how the packet contents can be analyzed at streaming speeds, a more challenging task because each packet content can result in many updates.
AB - Data stream methods look at each new item of the stream, perform a small number of operations while keeping a small amount of memory, and still perform much-needed analyses. However, in many situations, the update speed per item is extremely critical and not every item can be extensively examined. In practice, this has been addressed by only examining every Nth item from the input; decreasing the input rate by a fraction 1/N, but resulting in loss of guarantees on the accuracy of the post-hoc analyses. In this paper, we present a technique of skipping past streams and looking at only a fraction of the input. Unlike traditional methods, our skipping is performed in a principled manner based on the "norm" of the stream seen. Using this technique on top of well-known sketches, we show several-fold improvement in the update time for processing streams with a given guaranteed accuracy, for a number of stream processing problems including data summarization, heavy hitters detection and self-join size estimation. We present experimental results of our methods over synthetic data and integrate our methods into Sprint's Continuous Monitoring (CMON) system for live network traffic analyses. Furthermore, aiming at future scalable stream processing systems and going beyond state-of-art packet header analyses, we show how the packet contents can be analyzed at streaming speeds, a more challenging task because each packet content can result in many updates.
UR - http://www.scopus.com/inward/record.url?scp=48349093467&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=48349093467&partnerID=8YFLogxK
U2 - 10.1109/ICDEW.2007.4401052
DO - 10.1109/ICDEW.2007.4401052
M3 - Conference contribution
AN - SCOPUS:48349093467
SN - 1424408326
SN - 9781424408320
T3 - Proceedings - International Conference on Data Engineering
SP - 654
EP - 663
BT - Workshops in Conjunction with the International Conference on Data Engineering - ICDE' 07
T2 - Workshops in Conjunction with the 23rd International Conference on Data Engineering - ICDE 2007
Y2 - 15 April 2007 through 20 April 2007
ER -