TY - JOUR
T1 - Comparison of the Performance of Machine Learning Models in Representing High-Dimensional Free Energy Surfaces and Generating Observables
AU - Cendagorta, Joseph R.
AU - Tolpin, Jocelyn
AU - Schneider, Elia
AU - Topper, Robert Q.
AU - Tuckerman, Mark E.
N1 - Funding Information:
M.E.T. acknowledges support from the National Science Foundation (NSF), Award No. CHE-1565980. J.T. acknowledges REU support from the New York University Materials Research Science and Engineering Center (MRSEC) program of the NSF, Award No. DMR-1420073. We are grateful to Steven Topper (University of South Carolina) for valuable advice regarding scalable shuffling algorithms. E.S. is grateful to Roberto Covino (Max Planck) for valuable discussions regarding oligopeptide MD simulations.
Publisher Copyright:
© 2020 American Chemical Society.
PY - 2020/5/7
Y1 - 2020/5/7
N2 - Free energy surfaces of chemical and physical systems are often generated using a popular class of enhanced sampling methods that target a set of collective variables (CVs) chosen to distinguish the characteristic features of these surfaces. While some of these approaches are typically limited to low (∼1-3)-dimensional CV subspaces, methods such as driven adiabatic free-energy dynamics/temperature-accelerated molecular dynamics have been shown to be capable of generating free energy surfaces of quite high dimension by sampling the associated marginal probability distribution via full sweeps over the CV landscape. These approaches repeatedly visit conformational basins, producing a scattering of points within the basins on each visit. Consequently, they are particularly amenable to synergistic combination with regression machine learning methods for filling in the surfaces between the sampled points and for providing a compact and continuous (or semicontinuous) representation of the surfaces that can be easily stored and used for further computation of observable properties. Given the central role of machine learning techniques in this combined approach, it is timely to provide a detailed comparison of the performance of different machine learning strategies and models, including neural networks, kernel ridge regression, support vector machines, and weighted neighbor schemes, for their ability to learn these high-dimensional surfaces as a function of the amount of sampled training data and, once trained, to subsequently generate accurate ensemble averages corresponding to observable properties of the systems. In this article, we perform such a comparison on a set of oligopeptides, in both gas and aqueous phases, corresponding to CV spaces of 2-10 dimensions and assess their ability to provide a global representation of the free energy surfaces and to generate accurate ensemble averages.
UR - http://www.scopus.com/inward/record.url?scp=85084379508&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084379508&partnerID=8YFLogxK
U2 - 10.1021/acs.jpcb.0c01218
DO - 10.1021/acs.jpcb.0c01218
M3 - Article
C2 - 32275148
AN - SCOPUS:85084379508
SN - 1520-6106
VL - 124
SP - 3647
EP - 3660
JO - Journal of Physical Chemistry B
JF - Journal of Physical Chemistry B
IS - 18
ER -