TY - JOUR
T1 - Fast and Scalable Private Genotype Imputation Using Machine Learning and Partially Homomorphic Encryption
AU - Sarkar, Esha
AU - Chielle, Eduardo
AU - Gursoy, Gamze
AU - Mazonka, Oleg
AU - Gerstein, Mark
AU - Maniatakos, Michail
N1 - Funding Information:
This work was supported in part by the New York University Abu Dhabi Global Ph.D. Fellowship Program and in part by the U.S. National Institutes of Health under Grant K99 HG010909 and Grant R01 HG010749.
Publisher Copyright:
© 2013 IEEE.
PY - 2021
Y1 - 2021
AB - Recent advances in genome sequencing technologies provide unprecedented opportunities to understand the relationship between human genetic variation and disease. However, genotyping whole genomes from a large cohort of individuals is still cost-prohibitive. Imputation methods that predict the genotypes of missing genetic variants are widely used, especially for genome-wide association studies. Accurate genotype imputation requires complex statistical methods. Because the problem is data- and compute-intensive, imputation is increasingly outsourced, raising serious privacy concerns. In this work, we investigate solutions for fast, scalable, and accurate privacy-preserving genotype imputation using Machine Learning (ML) and a standardized homomorphic encryption scheme, the Paillier cryptosystem. ML-based privacy-preserving inference has been largely optimized for computation-heavy non-linear functions in a single-output multi-class classification setting. However, having a large number of multi-class outputs per genome per individual calls for further optimizations and/or approximations specific to this application. Here we explore the effectiveness of linear models for genotype imputation, with the aim of converting them to privacy-preserving equivalents using standardized homomorphic encryption schemes. Our results show that the performance of our privacy-preserving genotype imputation method is equivalent to that of state-of-the-art plaintext solutions, achieving up to 99% micro area under curve score, even on real-world large-scale datasets with up to 80,000 targets.
KW - Genotype imputation
KW - machine learning
KW - privacy-preserving computation
UR - http://www.scopus.com/inward/record.url?scp=85112212890&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112212890&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2021.3093005
DO - 10.1109/ACCESS.2021.3093005
M3 - Article
AN - SCOPUS:85112212890
SN - 2169-3536
VL - 9
SP - 93097
EP - 93110
JO - IEEE Access
JF - IEEE Access
M1 - 9466098
ER -