Summarizing and mining Skewed data streams

Graham Cormode, S. Muthukrishnan

    Research output: Contribution to conferencePaper

    Abstract

    Many applications generate massive data streams. Summarizing such massive data requires fast, small space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, z, that captures the amount of skew. We present a data stream summary that can answer point queries with ε accuracy and show that the space needed is only O(ε-min{1, 1/z}). This is the first o(1/ε) space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the L2 norm of the stream in o(1/ε2) space for z > 1/2, another improvement over the existing Ω(1/ε2) methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.

    Original languageEnglish (US)
    Pages44-55
    Number of pages12
    StatePublished - 2005
    Event5th SIAM International Conference on Data Mining, SDM 2005 - Newport Beach, CA, United States
    Duration: Apr 21 2005Apr 23 2005

    Conference

    Conference5th SIAM International Conference on Data Mining, SDM 2005
    CountryUnited States
    CityNewport Beach, CA
    Period4/21/054/23/05

    Keywords

    • Data mining
    • Data stream analysis
    • Heavy hitters
    • Massive data
    • Power laws
    • Zipf distribution

    ASJC Scopus subject areas

    • Software

    Fingerprint Dive into the research topics of 'Summarizing and mining Skewed data streams'. Together they form a unique fingerprint.

  • Cite this

    Cormode, G., & Muthukrishnan, S. (2005). Summarizing and mining Skewed data streams. 44-55. Paper presented at 5th SIAM International Conference on Data Mining, SDM 2005, Newport Beach, CA, United States.