TY - GEN
T1 - Scalable computation of distributions from large scale data sets
AU - Chaudhuri, Abon
AU - Lee, Teng Yok
AU - Zhou, Bo
AU - Wang, Cong
AU - Xu, Tiantian
AU - Shen, Han Wei
AU - Peterka, Tom
AU - Chiang, Yi Jen
N1 - Copyright 2013 Elsevier B.V., All rights reserved.
PY - 2012
Y1 - 2012
AB - As we approach the era of exascale computing, the role of distributions to summarize, analyze and visualize large scale data is becoming more and more important. Since histograms continue to be a popular way of modeling the underlying data distribution, we propose a scalable and distributed framework for computing histograms from scalar and vector data at different levels of detail required by various types of analysis algorithms. We present efficient parallel techniques for histogram computation from regular as well as rectilinear grid data. We also study a technique called cross-validation to estimate the quality of computed histograms as a model of the actual data distribution. We parallelize cross-validation in a scalable manner to support histogram evaluation and selection of histogram parameters such as number of bins. We also present our distributed software framework for supporting science applications which require large scale distribution-based data analysis. The presented case studies highlight how the proposed algorithms and the related software benefit information theoretic and other distribution-driven analysis.
UR - http://www.scopus.com/inward/record.url?scp=84872198627&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84872198627&partnerID=8YFLogxK
U2 - 10.1109/LDAV.2012.6378985
DO - 10.1109/LDAV.2012.6378985
M3 - Conference contribution
AN - SCOPUS:84872198627
SN - 9781467347334
T3 - IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings
SP - 113
EP - 120
BT - IEEE Symposium on Large Data Analysis and Visualization 2012, LDAV 2012 - Proceedings
T2 - 2nd Symposium on Large-Scale Data Analysis and Visualization, LDAV 2012
Y2 - 14 October 2012 through 19 October 2012
ER -