TY - JOUR
T1 - Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms
AU - DM DREAM Consortium
AU - Schaffter, Thomas
AU - Buist, Diana S.M.
AU - Lee, Christoph I.
AU - Nikulin, Yaroslav
AU - Ribli, Dezső
AU - Guan, Yuanfang
AU - Lotter, William
AU - Jie, Zequn
AU - Du, Hao
AU - Wang, Sijia
AU - Feng, Jiashi
AU - Feng, Mengling
AU - Kim, Hyo Eun
AU - Albiol, Francisco
AU - Albiol, Alberto
AU - Morrell, Stephen
AU - Wojna, Zbigniew
AU - Ahsen, Mehmet Eren
AU - Asif, Umar
AU - Jimeno Yepes, Antonio
AU - Yohanandan, Shivanthan
AU - Rabinovici-Cohen, Simona
AU - Yi, Darvin
AU - Hoff, Bruce
AU - Yu, Thomas
AU - Chaibub Neto, Elias
AU - Rubin, Daniel L.
AU - Lindholm, Peter
AU - Margolies, Laurie R.
AU - McBride, Russell Bailey
AU - Rothstein, Joseph H.
AU - Sieh, Weiva
AU - Ben-Ari, Rami
AU - Harrer, Stefan
AU - Trister, Andrew
AU - Friend, Stephen
AU - Norman, Thea
AU - Sahiner, Berkman
AU - Strand, Fredrik
AU - Guinney, Justin
AU - Stolovitzky, Gustavo
AU - Mackey, Lester
AU - Cahoon, Joyce
AU - Shen, Li
AU - Sohn, Jae Ho
AU - Trivedi, Hari
AU - Shen, Yiqiu
AU - Buturovic, Ljubomir
AU - Pereira, Jose Costa
AU - Cho, Kyunghyun
N1 - Publisher Copyright: © 2020 Schaffter T et al.
PY - 2020/3/2
Y1 - 2020/3/2
N2 - IMPORTANCE Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives. OBJECTIVE To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms. DESIGN, SETTING, AND PARTICIPANTS In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants comprising 126 teams from 44 countries participated. Analysis began November 18, 2016. MAIN OUTCOMES AND MEASUREMENTS Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2) and output a score that translated to cancer yes/no within 12 months. Algorithm accuracy for breast cancer detection was evaluated using area under the curve and algorithm specificity compared with radiologists’ specificity with radiologists’ sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists’ recall assessment was developed and evaluated. RESULTS Overall, 144 231 screening mammograms from 85 580 US women (952 cancer positive ≤12 months from screening) were used for algorithm training and validation. A second independent validation cohort included 166 578 examinations from 68 008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden) and 66.2% (United States) and 81.2% (Sweden) specificity at the radiologists’ sensitivity, lower than community-practice radiologists’ specificity of 90.5% (United States) and 98.5% (Sweden). Combining top-performing algorithms and US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity. CONCLUSIONS AND RELEVANCE While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods for enhancing mammography screening interpretation.
UR - http://www.scopus.com/inward/record.url?scp=85081118470&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85081118470&partnerID=8YFLogxK
U2 - 10.1001/jamanetworkopen.2020.0265
DO - 10.1001/jamanetworkopen.2020.0265
M3 - Article
C2 - 32119094
AN - SCOPUS:85081118470
SN - 2574-3805
VL - 3
SP - E200265
JO - JAMA network open
JF - JAMA network open
IS - 3
ER -