TY - JOUR
T1 - Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data
AU - Conderino, Sarah
AU - Divers, Jasmin
AU - Dodson, John A.
AU - Thorpe, Lorna E.
AU - Weiner, Mark G.
AU - Adhikari, Samrachana
N1 - Publisher Copyright: © 2025 The Author(s). Health Services Research published by Wiley Periodicals LLC.
PY - 2025
Y1 - 2025
N2 - Objective: To compare anonymized and non-anonymized approaches for imputing race and ethnicity in descriptive studies of chronic disease burden using electronic health record (EHR)-based datasets. Study Setting and Design: In this New York City-based study, we first conducted simulation analyses under different missing data mechanisms to assess the performance of Bayesian Improved Surname Geocoding (BISG), single imputation using neighborhood majority information, random forest imputation, and multiple imputation with chained equations (MICE). Imputation performance was measured using sensitivity, precision, and overall accuracy; agreement with self-reported race and ethnicity was measured with Cohen's kappa (κ). We then applied these methods to impute race and ethnicity in two EHR-based data sources and compared chronic disease burden (95% CIs) by race and ethnicity across imputation approaches. Data Sources and Analytic Sample: Our data sources included EHR data from NYU Langone Health and the INSIGHT Clinical Research Network from 3/6/2016 to 3/7/2020 extracted for a parent study on older adults in NYC with multiple chronic conditions. Principal Findings: Under simulation analyses, the non-anonymized BISG imputation provided the most accurate classification of race and ethnicity, ranging from 66% to 73% across missing data mechanisms. Anonymized imputation methods were more sensitive to the missing data mechanism, with agreement dropping when race and ethnicity was missing not at random (MNAR) (κsingle = 0.25, κMICE = 0.25, κrandomforest = 0.33). When these methods were applied to the NYU and INSIGHT cohorts, however, racial and ethnic distributions and chronic disease burden were consistent across all imputation methods. Slight improvements in the precision of estimates were observed under all imputation approaches compared to a complete case analysis. Conclusions: BISG imputation may provide a more accurate racial and ethnic classification than single or multiple imputation using anonymized covariates, particularly if the missing data mechanism is MNAR. Descriptive studies of disease burden may not be sensitive to methods for imputing missing data.
AB - Objective: To compare anonymized and non-anonymized approaches for imputing race and ethnicity in descriptive studies of chronic disease burden using electronic health record (EHR)-based datasets. Study Setting and Design: In this New York City-based study, we first conducted simulation analyses under different missing data mechanisms to assess the performance of Bayesian Improved Surname Geocoding (BISG), single imputation using neighborhood majority information, random forest imputation, and multiple imputation with chained equations (MICE). Imputation performance was measured using sensitivity, precision, and overall accuracy; agreement with self-reported race and ethnicity was measured with Cohen's kappa (κ). We then applied these methods to impute race and ethnicity in two EHR-based data sources and compared chronic disease burden (95% CIs) by race and ethnicity across imputation approaches. Data Sources and Analytic Sample: Our data sources included EHR data from NYU Langone Health and the INSIGHT Clinical Research Network from 3/6/2016 to 3/7/2020 extracted for a parent study on older adults in NYC with multiple chronic conditions. Principal Findings: Under simulation analyses, the non-anonymized BISG imputation provided the most accurate classification of race and ethnicity, ranging from 66% to 73% across missing data mechanisms. Anonymized imputation methods were more sensitive to the missing data mechanism, with agreement dropping when race and ethnicity was missing not at random (MNAR) (κsingle = 0.25, κMICE = 0.25, κrandomforest = 0.33). When these methods were applied to the NYU and INSIGHT cohorts, however, racial and ethnic distributions and chronic disease burden were consistent across all imputation methods. Slight improvements in the precision of estimates were observed under all imputation approaches compared to a complete case analysis. Conclusions: BISG imputation may provide a more accurate racial and ethnic classification than single or multiple imputation using anonymized covariates, particularly if the missing data mechanism is MNAR. Descriptive studies of disease burden may not be sensitive to methods for imputing missing data.
KW - Bayesian Improved Surname Geocoding
KW - electronic health record
KW - ethnicity
KW - multiple imputation with chained equations
KW - race
KW - random forest imputation
UR - http://www.scopus.com/inward/record.url?scp=105006746317&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105006746317&partnerID=8YFLogxK
U2 - 10.1111/1475-6773.14649
DO - 10.1111/1475-6773.14649
M3 - Article
AN - SCOPUS:105006746317
SN - 0017-9124
JO - Health Services Research
JF - Health Services Research
ER -