Abstract
Many applications generate massive data streams. Summarizing such massive data requires fast, small space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, z, that captures the amount of skew. We present a data stream summary that can answer point queries with ε accuracy and show that the space needed is only O(ε-min{1, 1/z}). This is the first o(1/ε) space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the L2 norm of the stream in o(1/ε2) space for z > 1/2, another improvement over the existing Ω(1/ε2) methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.
Original language | English (US) |
---|---|
Pages | 44-55 |
Number of pages | 12 |
State | Published - 2005 |
Event | 5th SIAM International Conference on Data Mining, SDM 2005 - Newport Beach, CA, United States Duration: Apr 21 2005 → Apr 23 2005 |
Conference
Conference | 5th SIAM International Conference on Data Mining, SDM 2005 |
---|---|
Country/Territory | United States |
City | Newport Beach, CA |
Period | 4/21/05 → 4/23/05 |
Keywords
- Data mining
- Data stream analysis
- Heavy hitters
- Massive data
- Power laws
- Zipf distribution
ASJC Scopus subject areas
- Software