A robust innovative pipeline-based machine learning framework for predicting COVID-19 in Mexican patients

Farnoosh, Rahman; Abnoosian, Karlo

doi:10.1007/s13198-024-02354-3

ORIGINAL ARTICLE
Published: 14 May 2024

International Journal of System Assurance Engineering and Management

Abstract

The emergence of COVID-19 in Wuhan, China, in late 2019 led to a global health crisis that has claimed many lives worldwide. A thorough understanding of the available COVID-19 datasets can help healthcare professionals identify cases at an early stage. This study presents an innovative pipeline-based framework for predicting survival and mortality in patients with COVID-19 using the Mexican COVID-19 patient dataset (COVID-19-MPD dataset). Preprocessing plays a pivotal role in ensuring that the framework delivers high-quality outcomes. Within the framework, we deploy various machine learning models with optimized hyperparameters. Using the same dataset and consistent experimental conditions, we conducted multiple experiments with diverse preprocessing techniques and models to maximize the area under the receiver operating characteristic curve (AUC) for COVID-19 prediction. Given the considerable dimensionality of the dataset, feature selection is crucial for identifying the factors that influence COVID-19 mortality or survival. We employ feature dimension reduction methods, such as principal component analysis and independent component analysis, in addition to feature selection techniques such as maximum relevance minimum redundancy and permutation feature importance. Identifying the features that most affect patient outcomes can aid experts in disease management by enhancing treatment efficacy and control measures. Across these experiments, the framework achieved its best result on standardized data with the k-nearest neighbor algorithm and four components, attaining an AUC of 100%. Given its effectiveness in COVID-19 prediction, this framework has the potential for integration into medical decision support systems.
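To make the evaluation loop in the abstract concrete, the following is a minimal sketch of three of its ingredients: standardization, k-nearest neighbor scoring, and AUC. It is not the authors' implementation. The data are synthetic, the function names (standardize, knn_scores, auc, make_synthetic) and the choice k=5 are assumptions made for illustration, and the dimension reduction and feature selection steps are omitted to keep the example dependency-free.

```python
import math
import random


def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics only."""
    n_features = len(train[0])
    means = [sum(row[j] for row in train) / len(train) for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # guard against constant features

    def scale(rows):
        return [[(r[j] - means[j]) / stds[j] for j in range(n_features)] for r in rows]

    return scale(train), scale(test)


def knn_scores(train_x, train_y, test_x, k=5):
    """Score each test point by the fraction of positive labels among its k nearest neighbors."""
    scores = []
    for x in test_x:
        nearest = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))[:k]
        scores.append(sum(train_y[i] for i in nearest) / k)
    return scores


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n, seed):
    """Hypothetical two-feature data; NOT the COVID-19-MPD dataset."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        # Feature scales differ on purpose so that standardization matters for kNN.
        xs.append([rng.gauss(2.0 * y, 1.0), rng.gauss(300.0 * y, 200.0)])
        ys.append(y)
    return xs, ys


train_x, train_y = make_synthetic(200, seed=0)
test_x, test_y = make_synthetic(100, seed=1)
train_s, test_s = standardize(train_x, test_x)
test_auc = auc(test_y, knn_scores(train_s, train_y, test_s, k=5))
print(f"held-out AUC: {test_auc:.3f}")
```

The scaler is fitted on the training split only and the AUC is computed on a separate held-out split. This separation matters when interpreting very high scores: if preprocessing or model fitting sees the evaluation data, the reported AUC can be optimistically biased.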

Data availability

The data used in this study come from a publicly available dataset, accessible online at https://www.gob.mx/salud/documentos/datos-abiertos-152127

Acknowledgements

We express our gratitude to Dr. Mitra Esmaeili Azad, MD, of Shahid Beheshti University of Medical Sciences, for her help with the medical aspects of this research.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

The School of Mathematics and Computer Science, Statistics, Iran University of Science and Technology, Tehran, 1684613114, Iran
Rahman Farnoosh & Karlo Abnoosian

Contributions

Karlo Abnoosian and Rahman Farnoosh conceived the method. Karlo Abnoosian developed the algorithm and performed the simulations. Karlo Abnoosian and Rahman Farnoosh analysed the results and wrote the paper. All the authors have read and approved the final manuscript.

Corresponding author

Correspondence to Rahman Farnoosh.

Ethics declarations

Conflict of interest

The authors declare no competing interests. On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethics approval

This article is exempt and does not require ethics approval.

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Consent to publish

The authors affirm that human research participants provided informed consent for publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Farnoosh, R., Abnoosian, K. A robust innovative pipeline-based machine learning framework for predicting COVID-19 in Mexican patients. Int J Syst Assur Eng Manag (2024). https://doi.org/10.1007/s13198-024-02354-3

Received: 30 August 2023
Revised: 03 April 2024
Accepted: 16 April 2024
Published: 14 May 2024
DOI: https://doi.org/10.1007/s13198-024-02354-3
