TY - GEN
T1 - Checks and balances
T2 - 29th International Conference on Very Large Data Bases, VLDB 2003
AU - Korn, Flip
AU - Muthukrishnan, S.
AU - Zhu, Yunyue
N1 - Funding Information: Work supported in part by NSF CCR 00-87022, NSF ITR 0220280 and EIA 02-05116. This work was started while the author was a DIMACS visitor at AT&T Labs.
PY - 2003
Y1 - 2003
AB - Internet Service Providers (ISPs) use real-time data feeds of aggregated traffic in their network to support technical as well as business decisions. A fundamental difficulty with building decision support tools based on aggregated traffic data feeds is one of data quality. Data quality problems stem from network-specific issues (irregular polling caused by UDP packet drops and delays, topological mislabelings, etc.), and make it difficult to distinguish between artifacts and actual phenomena, rendering data analysis based on such data feeds ineffective. In principle, traditional integrity constraints and triggers may be used to enforce data quality. In practice, data cleaning is done outside the database and is ad hoc. Unfortunately, these approaches are too rigid and limited for the subtle data quality problems arising from network data, where existing problems morph with network dynamics, new problems emerge over time, and poor quality data in a local region may itself indicate an important phenomenon in the underlying network. We need a new approach, both in principle and in practice, to face data quality problems in network traffic databases. We propose a continuous data quality monitoring approach based on probabilistic, approximate constraints (PACs). These are simple, user-specified rule templates with open parameters for tolerance and likelihood. We use statistical techniques to instantiate suitable parameter values from the data, and show how to apply them for monitoring data quality. In principle, our PAC-based approach can be applied to data quality problems in any data feed. We present PAC-Man, the system that manages PACs for the entire aggregate network traffic database in a large ISP, and show that it is very effective in monitoring data quality problems.
UR - http://www.scopus.com/inward/record.url?scp=57449119633&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=57449119633&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:57449119633
T3 - Proceedings - 29th International Conference on Very Large Data Bases, VLDB 2003
SP - 536
EP - 547
BT - Proceedings - 29th International Conference on Very Large Data Bases, VLDB 2003
A2 - Selinger, Patricia G.
A2 - Carey, Michael J.
A2 - Freytag, Johann Christoph
A2 - Abiteboul, Serge
A2 - Lockemann, Peter C.
A2 - Heuer, Andreas
PB - Morgan Kaufmann
Y2 - 9 September 2003 through 12 September 2003
ER -