Deep significance clustering: A novel approach for identifying risk-stratified and predictive patient subgroups

Yufang Huang, Yifan Liu, Peter A.D. Steel, Kelly M. Axsom, John R. Lee, Sri Lekha Tummalapalli, Fei Wang, Jyotishman Pathak, Lakshminarayanan Subramanian, Yiye Zhang

Research output: Contribution to journal › Article › peer-review


Objective: Deep significance clustering (DICE) is a self-supervised learning framework. DICE identifies clinically similar and risk-stratified subgroups that neither unsupervised clustering algorithms nor supervised risk prediction algorithms alone are guaranteed to generate.

Materials and Methods: Enabled by an optimization process that enforces statistical significance between the outcome and subgroup membership, DICE jointly trains 3 components (representation learning, clustering, and outcome prediction) while providing interpretability to the deep representations. DICE also allows unseen patients to be assigned to trained subgroups for population-level risk stratification. We evaluated DICE using electronic health record datasets derived from 2 urban hospitals. The outcomes and cohorts were discharge disposition to home among heart failure (HF) patients and acute kidney injury among COVID-19 patients (Cov-AKI).

Results: Compared to baseline approaches including principal component analysis, DICE demonstrated superior performance in the cluster purity metrics: Silhouette score (0.48 for HF, 0.51 for Cov-AKI), Calinski-Harabasz index (212 for HF, 254 for Cov-AKI), and Davies-Bouldin index (0.86 for HF, 0.66 for Cov-AKI), and in the prediction metric: area under the receiver operating characteristic (ROC) curve (0.83 for HF, 0.78 for Cov-AKI). Clinical evaluation of DICE-generated subgroups revealed more meaningful distributions of member characteristics across subgroups, and higher risk ratios between subgroups. Furthermore, DICE-generated subgroup membership alone was moderately predictive of outcomes.

Discussion: DICE addresses a gap in current machine learning approaches where predicted risk may not lead directly to actionable clinical steps.

Conclusion: DICE demonstrated potential for application in heterogeneous populations, where having the same quantitative risk does not equate to having a similar clinical profile.
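The abstract evaluates subgroups by the risk ratio between them, by the statistical significance of the association between outcome and subgroup membership, and by how well membership alone predicts the outcome. The sketch below illustrates these three quantities on synthetic toy data. It is not the DICE implementation: the abstract does not name the significance test, so a Pearson chi-square test on a 2x2 table is used here as a stand-in, and all function names (`outcome_rates`, `risk_ratio`, `chi_square_2x2`, `auc_from_scores`) and the two subgroups "A" and "B" are hypothetical.

```python
import math
from collections import Counter


def outcome_rates(clusters, outcomes):
    """Observed outcome rate per subgroup for binary outcomes coded 0/1."""
    totals = Counter(clusters)
    events = Counter(c for c, y in zip(clusters, outcomes) if y == 1)
    return {c: events[c] / totals[c] for c in totals}


def risk_ratio(rates, exposed, reference):
    """Ratio of the outcome rate in one subgroup to that in a reference subgroup."""
    return rates[exposed] / rates[reference]


def chi_square_2x2(clusters, outcomes, a, b):
    """Pearson chi-square test (1 degree of freedom) of outcome vs. membership in a or b.

    Assumption: this is an illustrative stand-in, not the test used by DICE.
    """
    table = [[0, 0], [0, 0]]  # rows: subgroup a, subgroup b; columns: outcome 0, outcome 1
    for c, y in zip(clusters, outcomes):
        if c == a:
            table[0][y] += 1
        elif c == b:
            table[1][y] += 1
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def auc_from_scores(scores, outcomes):
    """ROC AUC as P(score of a positive > score of a negative), counting ties as 1/2."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy data: 10 patients per subgroup, 8 events in "A" and 2 in "B".
clusters = ["A"] * 10 + ["B"] * 10
outcomes = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

rates = outcome_rates(clusters, outcomes)
rr = risk_ratio(rates, "A", "B")
stat, p_value = chi_square_2x2(clusters, outcomes, "A", "B")

# "Membership alone" as a predictor: score each patient by the subgroup's outcome rate.
scores = [rates[c] for c in clusters]
auc = auc_from_scores(scores, outcomes)

print(f"risk ratio A vs B: {rr:.2f}")
print(f"chi-square: {stat:.2f}, p-value: {p_value:.4f}")
print(f"AUC from membership alone: {auc:.2f}")
```

Because the scores here are computed from the same patients they are evaluated on, the AUC is an in-sample figure; a held-out set would be needed to estimate how well subgroup membership predicts outcomes for unseen patients.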

Original language: English (US)
Pages (from-to): 2641-2653
Number of pages: 13
Journal: Journal of the American Medical Informatics Association
Issue number: 12
State: Published - Dec 1 2021


Keywords

  • Machine learning
  • Predictive clustering
  • Risk stratification

ASJC Scopus subject areas

  • Health Informatics

