TY - JOUR
T1 - Identification of novel RNA design candidates by clustering the extended RNA-As-Graphs library
AU - Jain, Swati
AU - Zhu, Qiyao
AU - Paz, Amiel S.P.
AU - Schlick, Tamar
N1 - Funding Information:
This work has been supported by the National Institute of General Medical Sciences, National Institutes of Health (NIH) grant R35GM122562 to T.S. Research in this article was supported (in part) by Philip Morris USA Inc. and Philip Morris International. The funding institutes did not have any say in the design of the study, analysis of the results, or the decision to publish.
Publisher Copyright:
© 2020 Elsevier B.V.
PY - 2020/6
Y1 - 2020/6
AB - Background: We re-evaluate our RNA-As-Graphs clustering approach, using our expanded graph library and new RNA structures, to identify potential RNA-like topologies for design. Our coarse-grained approach represents RNA secondary structures as tree and dual graphs, with vertices and edges corresponding to RNA helices and loops. The graph-theoretical framework facilitates graph enumeration, partitioning, and clustering approaches to study RNA structure and its applications. Methods: Clustering graph topologies based on features derived from graph Laplacian matrices and known RNA structures allows us to classify topologies into ‘existing’ or ‘hypothetical’, and the latter into ‘RNA-like’ or ‘non-RNA-like’ topologies. Here we update our list of existing tree graph topologies and our RAG-3D database of atomic fragments to include newly determined RNA structures. We then use linear and quadratic regression, optionally with dimensionality reduction, to derive graph features and apply several clustering algorithms to our tree-graph library and recently expanded dual-graph library to classify the topologies into the three groups. Results: The unsupervised PAM and K-means clustering approaches correctly classify 72–77% of all existing graph topologies and 75–82% of newly added ones as RNA-like. For supervised k-NN clustering, the cross-validation accuracy ranges from 57 to 81%. Conclusions: Using linear regression with unsupervised clustering, or quadratic regression with supervised clustering, provides better accuracies than supervised/linear clustering. All accuracies are better than random, especially for newly added existing topologies, thus lending credibility to our approach. General significance: Our updated RAG-3D database and motif classification by clustering present new RNA substructures and RNA-like motifs as novel design candidates.
KW - Graph clustering
KW - RAG-3D database
KW - RNA design
KW - RNA-like motifs
KW - Tree and dual graph topologies
UR - http://www.scopus.com/inward/record.url?scp=85079902324&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85079902324&partnerID=8YFLogxK
U2 - 10.1016/j.bbagen.2020.129534
DO - 10.1016/j.bbagen.2020.129534
M3 - Article
C2 - 31954797
AN - SCOPUS:85079902324
SN - 0304-4165
VL - 1864
JO - Biochimica et Biophysica Acta - General Subjects
JF - Biochimica et Biophysica Acta - General Subjects
IS - 6
M1 - 129534
ER -