TY - GEN
T1 - Estimating statistical aggregates on probabilistic data streams
AU - Jayram, T. S.
AU - McGregor, Andrew
AU - Muthukrishnan, S.
AU - Vee, Erik
N1 - Copyright:
Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2007
Y1 - 2007
N2 - The probabilistic-stream model was introduced by Jayram et al. [20]. It is a generalization of the data stream model that is suited to handling "probabilistic" data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over a potentially exponential number of classical "deterministic" streams where each item is deterministically one of the domain values. Designing efficient aggregation algorithms for probabilistic data is crucial for handling uncertainty in data-centric applications such as OLAP. Such algorithms are also useful in a variety of other setting including analyzing search engine traffic and aggregation in sensor networks. We present algorithms for computing commonly used aggregates on a probabilistic stream. We present the first one pass streaming algorithms for estimating the expected mean of a probabilistic stream, improving upon results in [20]. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We present the first known streaming algorithms for estimating F0, the number of distinct items on probabilistic streams. Our work also gives an efficient one-pass algorithm for estimating the median of a probabilistic stream.
AB - The probabilistic-stream model was introduced by Jayram et al. [20]. It is a generalization of the data stream model that is suited to handling "probabilistic" data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over a potentially exponential number of classical "deterministic" streams where each item is deterministically one of the domain values. Designing efficient aggregation algorithms for probabilistic data is crucial for handling uncertainty in data-centric applications such as OLAP. Such algorithms are also useful in a variety of other setting including analyzing search engine traffic and aggregation in sensor networks. We present algorithms for computing commonly used aggregates on a probabilistic stream. We present the first one pass streaming algorithms for estimating the expected mean of a probabilistic stream, improving upon results in [20]. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We present the first known streaming algorithms for estimating F0, the number of distinct items on probabilistic streams. Our work also gives an efficient one-pass algorithm for estimating the median of a probabilistic stream.
KW - Frequency moments
KW - Mean
KW - Median
KW - OLAP
KW - Probabilistic streams
UR - http://www.scopus.com/inward/record.url?scp=35448934923&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35448934923&partnerID=8YFLogxK
U2 - 10.1145/1265530.1265565
DO - 10.1145/1265530.1265565
M3 - Conference contribution
AN - SCOPUS:35448934923
SN - 1595936858
SN - 9781595936851
T3 - Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
SP - 243
EP - 252
BT - Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
T2 - 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
Y2 - 11 June 2007 through 13 June 2007
ER -