Estimating rarity and similarity over data stream windows

Mayur Datar, S. Muthukrishnan

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    In the windowed data stream model, we observe items coming in over time. At any time t, we consider the window of the last N observations at-(N-1), at-(N-2),…, at, each ai Є{1,…, u}; we are required to support queries about the data in the window. A crucial restriction is that we are only allowed o(N) (often polylogarithmic in N) storage space, so not all items within the window can be archived. We study two basic problems in the windowed data stream model. The first is the estimation of the rarity of items in the window. Our second problem is one of estimating similarity between two data stream windows using the Jacard’s coefficient. The problems of estimating rarity and similarity have many applications in mining massive data sets. We present novel, simple algorithms for estimating rarity and similarity on windowed data streams, accurate up to factor 1 ± e using space only logarithmic in the window size.

    Original languageEnglish (US)
    Title of host publicationAlgorithms - ESA 2002 - 10th Annual European Symposium, Proceedings
    EditorsRolf Möhring, Rajeev Raman
    PublisherSpringer Verlag
    Pages323-335
    Number of pages13
    ISBN (Electronic)3540441808, 9783540441809
    DOIs
    StatePublished - 2002
    Event10th Annual European Symposium on Algorithms, ESA 2002 - Rome, Italy
    Duration: Sep 17 2002Sep 21 2002

    Publication series

    NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume2461
    ISSN (Print)0302-9743
    ISSN (Electronic)1611-3349

    Other

    Other10th Annual European Symposium on Algorithms, ESA 2002
    CountryItaly
    CityRome
    Period9/17/029/21/02

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • Computer Science(all)

    Fingerprint Dive into the research topics of 'Estimating rarity and similarity over data stream windows'. Together they form a unique fingerprint.

    Cite this