TY - JOUR
T1 - Dataset Construction to Explore Chemical Space with 3D Geometry and Deep Learning
AU - Lu, Jianing
AU - Xia, Song
AU - Lu, Jieyu
AU - Zhang, Yingkai
N1 - Funding Information:
We would like to acknowledge the support by NIH (R35-GM127040) and computing resources provided by NYU-ITS.
PY - 2021/3/22
Y1 - 2021/3/22
N2 - A dataset is the basis of deep learning model development, and the success of deep learning models relies heavily on the quality and size of the dataset. In this work, we present a new data preparation protocol and build a large fragment-based dataset, Frag20, which consists of optimized 3D geometries and calculated molecular properties from the Merck molecular force field (MMFF) and DFT at the B3LYP/6-31G* level of theory for more than half a million molecules composed of H, B, C, O, N, F, P, S, Cl, and Br with no more than 20 heavy atoms. Based on the new dataset, we develop robust molecular energy prediction models using a simplified PhysNet architecture for both DFT-optimized and MMFF-optimized geometries, which achieve better than or close to chemical accuracy (1 kcal/mol) on multiple test sets, including CSD20 and Plati20, which are based on experimental crystal structures.
UR - http://www.scopus.com/inward/record.url?scp=85103305584&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85103305584&partnerID=8YFLogxK
U2 - 10.1021/acs.jcim.1c00007
DO - 10.1021/acs.jcim.1c00007
M3 - Article
C2 - 33683885
AN - SCOPUS:85103305584
SN - 1549-9596
VL - 61
SP - 1095
EP - 1104
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 3
ER -