TY - GEN
T1 - Forest density estimation
AU - Gupta, Anupam
AU - Lafferty, John
AU - Liu, Han
AU - Wasserman, Larry
AU - Xu, Min
PY - 2010
Y1 - 2010
N2 - We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal's algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to sparse Gaussian graphical models.
AB - We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal's algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to sparse Gaussian graphical models.
UR - http://www.scopus.com/inward/record.url?scp=84877780403&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84877780403&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84877780403
SN - 9780982252925
T3 - COLT 2010 - The 23rd Conference on Learning Theory
SP - 394
EP - 406
BT - COLT 2010 - The 23rd Conference on Learning Theory
T2 - 23rd Conference on Learning Theory, COLT 2010
Y2 - 27 June 2010 through 29 June 2010
ER -