TY - JOUR
T1 - Application of a gene modular approach for clinical phenotype genotype association and sepsis prediction using machine learning in meningococcal sepsis
AU - Rashid, Asrar
AU - Anwary, Arif R.
AU - Al-Obeidat, Feras
AU - Brierley, Joe
AU - Uddin, Mohammed
AU - Alkhzaimi, Hoda
AU - Sarpal, Amrita
AU - Toufiq, Mohammed
AU - Malik, Zainab A.
AU - Kadwa, Raziya
AU - Khilnani, Praveen
AU - Shaikh, M. Guftar
AU - Benakatti, Govind
AU - Sharief, Javed
AU - Zaki, Syed Ahmed
AU - Zeyada, Abdulrahman
AU - Al-Dubai, Ahmed
AU - Hafez, Wael
AU - Hussain, Amir
N1 - Funding Information:
The authors thank the anonymous reviewers for their insightful comments and suggestions. We are extremely grateful to Professor Delawar Uddin, Professor Harish Vyas, Dr. David Thomas, the Charity For Lucie, and Dr. Mark Peters at the Institute of Child Health. We thank Dr. Ege Ulgen for assistance with the data pre-processing. Professor Hussain acknowledges the support of the UK Engineering and Physical Sciences Research Council (EPSRC), Grants Ref. EP/M026981/1, EP/T021063/1, EP/T024917/1. We also thank the team at NMC Royal Hospital Abu Dhabi, Dr. Mouhamad Al Zoubhi, Dr. Husam Saleh, Dr. Maki Hamad, Dr. Ali Nawaz, Mr. Juju Thomas, and NMC Corporate, Mr. Frank Delisi, Dr. Alan Stewart, Ms. Kate Hoffman, and Mr. David Hadley, for supporting international research at NMC Healthcare. Finally, and certainly not least, we thank Professor Hector Wong, whose decades-long contribution to the field of sepsis genomics remains an enduring legacy; may he rest in peace.
Funding Information:
The dataset contained 29 instances, 23 in the survival class and 6 in the non-survival class, so the classes were unequally distributed. In machine learning, class imbalance is one of the major causes of reduced classification accuracy: with so few non-survival instances, the models could not effectively learn the patterns of that class, and the results for it would be unreliable. To overcome this challenge, the synthetic minority oversampling technique (SMOTE) was applied to the imbalanced data [24]. This popular approach is often used in classification problems with imbalanced datasets, and SMOTE is considered one of the most powerful, reliable, and adaptable pre-processing techniques in machine learning [25]. After balancing the dataset, it is important to identify patterns in the data and express them so that similarities and differences can be observed, and to reduce the dimensionality without losing too much information. Principal component analysis (PCA) is a multivariate technique that reduces the complexity of the input variables: it analyses highly correlated variables in the dataset and reduces their complexity and dimension, thereby extracting the most significant information. PCA was therefore applied to strip out the low-influence features from the dataset. After pre-processing, six popular machine learning techniques, Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbor (KNN), Random Forest, Naïve Bayes, and Artificial Neural Network (ANN), were applied to assess how well each classified the survival and non-survival data. SVM is a supervised machine learning algorithm that separates classes with a decision boundary known as a hyperplane (in two dimensions, a line that distinguishes the two classes). 
DT is a classifier that uses a tree-like structure of decision rules learned from the training data. KNN is a classifier that labels a point according to how similar it is to the training examples: it computes the distance between the point and every example in the data, selects the K examples closest to the point, and assigns the most frequent label among them by vote. Random Forest builds many decision trees and combines their outputs through ensemble learning to produce a classification. Naïve Bayes is a simple probabilistic classification technique that applies Bayes' theorem with strong independence assumptions. Bayes' theorem is used to calculate the probability of each class given the attributes present, and the class with the highest probability is selected. ANN is another classification technique, one that mimics the functioning of the human brain: a number of input parameters are processed in the hidden layer (multiplication, addition, division, etc.) and then processed again in the output layer to produce an output. For these machine learning techniques, the pre-processed data were partitioned into training and testing sets in a 70%:30% ratio. The training set was fitted to each machine learning classifier, and predictions were then obtained using the testing set. All six machine learning techniques were applied in this way, and the results were obtained.
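The oversampling step described in this note can be sketched as follows. This is a minimal, pure-Python illustration of the SMOTE idea (interpolating between a minority-class row and one of its nearest minority-class neighbours), not the authors' implementation. The function name `smote`, the default `k=5`, the random seeds, and the two-feature toy data are all assumptions made for illustration; only the class sizes (23 and 6) come from the text.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data with the class sizes reported in the note: 23 survivors and
# 6 non-survivors. The feature values are random placeholders, not study data.
data_rng = random.Random(42)
survivors = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(23)]
non_survivors = [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(6)]

# Generate enough synthetic non-survivor rows to match the majority class.
new_rows = smote(non_survivors, n_new=len(survivors) - len(non_survivors))
balanced_minority = non_survivors + new_rows
print(len(survivors), len(balanced_minority))  # prints: 23 23
```

Each synthetic row lies on a line segment between two real minority rows, so no value falls outside the range already observed in the non-survival class. In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn package, would normally be used instead of hand-written code.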
Publisher Copyright:
© 2023 The Authors
PY - 2023/1
Y1 - 2023/1
N2 - Sepsis is a major global health concern causing high morbidity and mortality rates. Our study utilized a Meningococcal Septic Shock (MSS) temporal dataset to investigate the correlation between gene expression (GE) changes and clinical features. The research used Weighted Gene Co-expression Network Analysis (WGCNA) to establish links between gene expression and clinical parameters in infants admitted to the Pediatric Critical Care Unit with MSS. Additionally, various machine learning (ML) algorithms, including Support Vector Machine (SVM), Naive Bayes, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Artificial Neural Network (ANN) were implemented to predict sepsis survival. The findings revealed a transition in gene function pathways from nuclear to cytoplasmic to extracellular, corresponding with Pediatric Logistic Organ Dysfunction score (PELOD) readings at 0, 24, and 48 h. ANN was the most accurate of the six ML models applied for survival prediction. This study successfully correlated PELOD with transcriptomic data, mapping enriched GE modules in acute sepsis. By integrating network analysis methods to identify key gene modules and using machine learning for sepsis prognosis, this study offers valuable insights for precision-based treatment strategies in future research. The observed temporal-spatial pattern of cellular recovery in sepsis could prove useful in guiding clinical management and therapeutic interventions.
AB - Sepsis is a major global health concern causing high morbidity and mortality rates. Our study utilized a Meningococcal Septic Shock (MSS) temporal dataset to investigate the correlation between gene expression (GE) changes and clinical features. The research used Weighted Gene Co-expression Network Analysis (WGCNA) to establish links between gene expression and clinical parameters in infants admitted to the Pediatric Critical Care Unit with MSS. Additionally, various machine learning (ML) algorithms, including Support Vector Machine (SVM), Naive Bayes, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Artificial Neural Network (ANN) were implemented to predict sepsis survival. The findings revealed a transition in gene function pathways from nuclear to cytoplasmic to extracellular, corresponding with Pediatric Logistic Organ Dysfunction score (PELOD) readings at 0, 24, and 48 h. ANN was the most accurate of the six ML models applied for survival prediction. This study successfully correlated PELOD with transcriptomic data, mapping enriched GE modules in acute sepsis. By integrating network analysis methods to identify key gene modules and using machine learning for sepsis prognosis, this study offers valuable insights for precision-based treatment strategies in future research. The observed temporal-spatial pattern of cellular recovery in sepsis could prove useful in guiding clinical management and therapeutic interventions.
KW - Artificial neural network
KW - Gene modular approach
KW - Machine learning
KW - Meningococcal septic shock
UR - http://www.scopus.com/inward/record.url?scp=85165250631&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85165250631&partnerID=8YFLogxK
U2 - 10.1016/j.imu.2023.101293
DO - 10.1016/j.imu.2023.101293
M3 - Article
AN - SCOPUS:85165250631
SN - 2352-9148
VL - 41
JO - Informatics in Medicine Unlocked
JF - Informatics in Medicine Unlocked
M1 - 101293
ER -