Feature selection for effective prediction of SARS-COV-2 using machine learning

Punacha, Gagan; Adiga, Rama

doi:10.1007/s13258-023-01467-6

Research Article
Published: 20 November 2023

Genes & Genomics, Volume 46, pages 341–354 (2024)

Abstract

Background

With the rise in SARS-CoV-2 variants, emerging strains need to be classified for early detection, which in turn can reduce human transmission. Genomic and proteomic information has rarely been used as input to machine learning (ML) classifiers for detecting SARS-CoV-2.

Objective

To this end, we used nucleoprotein and viral proteomic evolutionary information, together with the charge and basicity distribution of amino acids from various SARS-CoV-2 strains, to build an ML-based disease severity model.

Methods

All sequence and clinical data were obtained from GISAID. Proteomic-level calculations were added to complete the dataset. The training set was used for feature selection. The Select K-Best feature selection method was employed, cross-validated against the testing set, and its performance was evaluated. DeLong's test was also performed. We also applied BIRCH clustering to group the SARS-CoV-2 strains.
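The feature selection and clustering steps named in the Methods can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' pipeline: the data are synthetic (not the GISAID dataset), and the scoring function `f_classif`, the split ratio, and `n_clusters=3` are assumptions made only for the example.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a per-strain feature table (NOT the GISAID data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the selector on the training set only, then apply the same
# column mask to the test set so no test information leaks into selection.
selector = SelectKBest(score_func=f_classif, k=5)  # f_classif is an assumption
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
selected_idx = selector.get_support(indices=True)
print("Selected feature columns:", selected_idx)

# Unsupervised grouping of the samples with BIRCH.
# n_clusters=3 is illustrative, not a value taken from the article.
birch = Birch(n_clusters=3)
cluster_labels = birch.fit_predict(X)
print("Number of clusters found:", len(set(cluster_labels)))
```

Fitting the selector on the training split alone mirrors the statement that the training set was used for feature selection; the test split only receives the already-chosen columns.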

Results

Of the six ML models, four were trained and tested successfully. The Extra Trees algorithm achieved a micro-averaged F1-score of 74.2% and a weighted-average area under the receiver operating characteristic curve (AUC-ROC) of 73.7% in the multi-class setting. Setting the number of selected features to 5 raised the AUC-ROC from 73.7% to 76.4%. The selected model achieved an accuracy of 86.9%.
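As a rough illustration of how the two reported metrics are typically computed for a multi-class Extra Trees model, the sketch below uses scikit-learn on synthetic data. The dataset, the number of trees, and the one-vs-rest (`multi_class="ovr"`) setting are assumptions for the example; the article does not state which multi-class scheme was used, and the numbers printed here have no relation to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for severity classes (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6,
    n_classes=3, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Extremely randomized trees; n_estimators=200 is an illustrative choice.
clf = ExtraTreesClassifier(n_estimators=200, random_state=1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)         # hard class labels, used for F1
y_proba = clf.predict_proba(X_test)  # class probabilities, used for AUC

# Micro-averaging pools true/false positives over all classes.
micro_f1 = f1_score(y_test, y_pred, average="micro")

# Weighted one-vs-rest AUC: per-class AUCs averaged by class support.
weighted_auc = roc_auc_score(
    y_test, y_proba, multi_class="ovr", average="weighted"
)

print(f"micro-averaged F1: {micro_f1:.3f}")
print(f"weighted ROC AUC:  {weighted_auc:.3f}")
```

For single-label multi-class problems, the micro-averaged F1-score computed this way coincides with plain accuracy, which is worth keeping in mind when comparing the two figures.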

Conclusion

The unique features identified by the ML approach were able to classify disease severity into classes and show potential for predicting risk in newer variants.

Availability of data and materials

There will be data transparency for the present research work. All downloaded data are available in public databases for which due acknowledgement and citations have been provided.

Acknowledgements

The authors wish to thank Dr. Anirban Chakraborthy, Director, Nitte University Centre for Science Education and Research (NUCSER), and the Management of Nitte (Deemed to be University), Deralakatte, Mangalore, Karnataka, India, for their support in establishing the center, providing facilities, and continuously encouraging research, including the present work.

Funding

This research received no external funding. The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information

Authors and Affiliations

Nitte (Deemed to be University), Department of Molecular Genetics & Cancer, Nitte University Centre for Science Education & Research (NUCSER), Mangalore, Karnataka, India
Gagan Punacha & Rama Adiga

Authors

Gagan Punacha
Rama Adiga

Contributions

The first draft of the manuscript and the primary analysis were prepared by Gagan Punacha. Rama Adiga performed all other analyses. Both authors commented on previous versions of the manuscript and read and approved the final manuscript.

Corresponding author

Correspondence to Rama Adiga.

Ethics declarations

Conflict of interest

The authors declare no competing interests and have no relevant financial or non-financial interests to disclose.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors. All the information used is freely available in the public domain, and no patient or clinical information that would violate research ethics was used in the study. Hence, no formal consent is required.

Consent to participate

All authors have consented to participate in the present research.

Consent to publish

All authors have read the final manuscript and consented to its publication.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (CSV 6 KB)

Supplementary file2 (ZIP 195 KB)

Supplementary file3 (DOCX 57 KB)

Supplementary file4 (DOCX 45 KB)

Supplementary file5 (PPTX 57 KB)

Supplementary file6 (DOCX 46 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Punacha, G., Adiga, R. Feature selection for effective prediction of SARS-COV-2 using machine learning. Genes Genom 46, 341–354 (2024). https://doi.org/10.1007/s13258-023-01467-6

Received: 06 May 2023
Accepted: 01 October 2023
Published: 20 November 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s13258-023-01467-6
