TY - JOUR
T1 - Exact recovery in the stochastic block model
AU - Abbe, Emmanuel
AU - Bandeira, Afonso S.
AU - Hall, Georgina
N1 - Funding Information:
A. S. Bandeira was supported by the Air Force Office of Scientific Research under Grant FA9550-12-1-0317. The authors are grateful to Dustin G. Mixon for thoroughly reading a preliminary version of this manuscript and providing useful comments, as well as to Philippe Rigollet and Van Vu for stimulating discussions on the efficient recovery via partial recovery.
Publisher Copyright:
© 2015 IEEE.
PY - 2016/1/1
Y1 - 2016/1/1
N2 - The stochastic block model with two communities, or equivalently the planted bisection model, is a popular model of random graph exhibiting a cluster behavior. In the symmetric case, the graph has two equally sized clusters and vertices connect with probability p within clusters and q across clusters. In the past two decades, a large body of literature in statistics and computer science has focused on providing lower bounds on the scaling of |p - q| to ensure exact recovery. In this paper, we identify a sharp threshold phenomenon for exact recovery: if α = pn/log(n) and β = qn/log(n) are constant (with α> β), recovering the communities with high probability is possible if (α + β/2) - √αβ > 1 and is impossible if (α + β/2) - √αβ < 1. In particular, this improves the existing bounds. This also sets a new line of sight for efficient clustering algorithms. While maximum likelihood (ML) achieves the optimal threshold (by definition), it is in the worst case NP-hard. This paper proposes an efficient algorithm based on a semidefinite programming relaxation of ML, which is proved to succeed in recovering the communities close to the threshold, while numerical experiments suggest that it may achieve the threshold. An efficient algorithm that succeeds all the way down to the threshold is also obtained using a partial recovery algorithm combined with a local improvement procedure.
AB - The stochastic block model with two communities, or equivalently the planted bisection model, is a popular model of random graph exhibiting a cluster behavior. In the symmetric case, the graph has two equally sized clusters and vertices connect with probability p within clusters and q across clusters. In the past two decades, a large body of literature in statistics and computer science has focused on providing lower bounds on the scaling of |p - q| to ensure exact recovery. In this paper, we identify a sharp threshold phenomenon for exact recovery: if α = pn/log(n) and β = qn/log(n) are constant (with α> β), recovering the communities with high probability is possible if (α + β/2) - √αβ > 1 and is impossible if (α + β/2) - √αβ < 1. In particular, this improves the existing bounds. This also sets a new line of sight for efficient clustering algorithms. While maximum likelihood (ML) achieves the optimal threshold (by definition), it is in the worst case NP-hard. This paper proposes an efficient algorithm based on a semidefinite programming relaxation of ML, which is proved to succeed in recovering the communities close to the threshold, while numerical experiments suggest that it may achieve the threshold. An efficient algorithm that succeeds all the way down to the threshold is also obtained using a partial recovery algorithm combined with a local improvement procedure.
KW - Clustering algorithms
KW - Communities
KW - Detection algorithms
KW - Network theory (graphs)
KW - Statistical learning
UR - http://www.scopus.com/inward/record.url?scp=84958842054&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84958842054&partnerID=8YFLogxK
U2 - 10.1109/TIT.2015.2490670
DO - 10.1109/TIT.2015.2490670
M3 - Article
AN - SCOPUS:84958842054
SN - 0018-9448
VL - 62
SP - 471
EP - 487
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 1
M1 - 7298436
ER -