1 Introduction

The worldwide pandemic of coronavirus disease 2019 (COVID-19) has been the most serious health problem of the last 3 years [1]. With considerably high transmission and mortality rates, COVID-19 has become one of the most widespread and lethal diseases in human history [2, 3]. Nevertheless, most patients present with mild symptoms including fever, myalgia, cough, headache, gastrointestinal problems, shortness of breath, loss of smell and taste, and even consciousness disorders [4, 5]. Up to 30% of patients need hospitalization during the course of the disease, with an intensive care unit (ICU) admission rate of 23%, accompanied by mechanical ventilation in most of those admitted. These rates increase with age and with pre-existing high-risk comorbidities [6,7,8]. Timely detection and treatment of severe cases of COVID-19 are essential to reduce the mortality rate and to avoid unnecessary occupation of hospital and ICU beds. Artificial intelligence (AI) algorithms, including machine learning (ML) and deep learning (DL) models, may help distinguish these severe cases from patients who can be treated with hospitalization.

With the development of digital systems and the emergence of big data, AI algorithms have been widely integrated into healthcare systems and have, in many applications, superseded classical statistical analysis [9]. ML, as a branch of AI, can be divided into two main categories: supervised and unsupervised learning. In supervised learning, the data set carries pre-defined labels. Classification and regression, the two main types of supervised learning, are used to predict categorical and continuous outputs, respectively. By contrast, unsupervised algorithms attempt to identify patterns in unlabeled data sets, for example for clustering and dimensionality reduction of large data sets [10]. AI algorithms have been used to automate the analysis of various data types in medical applications, including text, signals, sound, images, and videos [11,12,13,14].

ML algorithms may have a great impact on various aspects of COVID-19 management, from diagnosis (early prediction of disease severity, ICU admission, and mortality risk) to therapy (evaluation of treatment response, drug discovery, and social control) [15,16,17,18,19]. In addition to ML algorithms, DL models have been used in the diagnosis and treatment of COVID-19, with applications including outbreak prediction, virus spread tracking, and vaccine and drug discovery research. DL models, which can extract features directly from images, have received much attention for the early diagnosis of COVID-19 and the assessment of its spread based on lung computed tomography and radiography images [20].

Early severity and mortality predictions in COVID-19 patients could optimize the decision-making process, improve healthcare outcomes, and facilitate effective use of healthcare resources during peaks [21]. Mortality prediction in COVID-19 patients has been investigated in a number of studies with various data sets, ML and DL algorithms, and feature selection methods [22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48]. Although the related studies have no obvious specific limitations, each of them used different models and features, drawn from different study populations, for mortality prediction in COVID-19. Healthcare systems therefore need these findings to be reconfirmed in other populations with various features.

This study set out to predict the mortality risk of patients with COVID-19 disease by investigating 14 ML models using extensive clinical, laboratory, and image-based features.

2 Materials and Methods

The following flowchart outlines the step-by-step procedures employed in this study to ensure replicability and transparency. It provides a visual representation of the study’s processes, guiding researchers through the various stages from data collection to analysis and interpretation (Fig. 1).

Fig. 1 Flow chart of the study: data cleaning, statistical data analysis, training and optimization of machine learning models, and feature selection

2.1 Data of the Study

In this study, data from 252 COVID-19 patients were used for building, comparing, and evaluating the applied models. All diagnoses were confirmed by reverse transcription-polymerase chain reaction (RT-PCR) diagnostic assay. The patient data were collected retrospectively during the 5th peak of the COVID-19 pandemic (July 2021–September 2021) from Vasei Hospital of Sabzevar in Iran, with the approval of the Ethics Committee of Sabzevar University of Medical Sciences (IR.MEDSAB.REC.1400.132). No algorithm was applied for data correction. At the beginning of the study, the patients' data were checked for completeness of the recorded features. Features that were not reported for all patients, and patients whose information was incomplete, were excluded from the study in the data-cleaning step. After cleaning, 42 clinical, laboratory, and image features remained for each patient and were used for training the ML models.

2.2 Statistical Data Evaluation

Raw data were checked for null features. Statistical analysis and correlations between features were evaluated with the SciPy library for Python. Finally, all quantitative features were scaled for the training of the ML models.
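The null check and scaling steps can be sketched as follows. This is a minimal illustration only: the data are synthetic, the three columns are hypothetical stand-ins for quantitative features, and standardization with scikit-learn's StandardScaler is an assumption, since the scaling method is not specified above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three quantitative features (hypothetical values):
# columns are age (years), SpO2 (%), and respiratory rate (breaths/min).
rng = np.random.default_rng(0)
X_quant = np.column_stack([
    rng.normal(60, 15, 100),
    rng.normal(90, 5, 100),
    rng.normal(20, 4, 100),
])

# Count missing (NaN) entries per feature.
null_counts = np.isnan(X_quant).sum(axis=0)

# Standardize each feature to zero mean and unit variance
# (assumed scaling method; the text does not name one).
X_scaled = StandardScaler().fit_transform(X_quant)
```

When a held-out test set is used, the scaler is commonly fitted on the training portion only and then applied to the test portion, so that no test-set statistics leak into training.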

2.3 Training of ML Models

In this study, the performance of 14 ML models was investigated for the prediction of mortality risk in COVID-19 patients: Logistic Regression (LR), Passive Aggressive Classifier (PAC), Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Decision Tree Classifier (DTC), Random Forest Classifier (RFC), Gradient Boosting Classifier (GBC), Hist Gradient Boosting Classifier (HGBC), AdaBoost Classifier (ABC), Bagging Classifier (BC), Extra Trees Classifier (ETC), Multi-Layer Perceptron Classifier (MLPC), Gaussian Naive Bayes (GNB), and Linear Discriminant Analysis (LDA). The theory behind these ML models is described on the website of the Python machine learning library scikit-learn and in related articles [50, 51].

Models were developed on a randomly drawn 70% of the data (the training-validation data set), and their performance was evaluated on the remaining 30% (the test data set). Because K-fold cross-validation and the GridSearchCV technique were used to select the optimum hyperparameters, the training-validation data set was not split further into fixed training and validation subsets at the first step. GridSearchCV performs hyperparameter tuning by evaluating every combination of candidate values in a user-defined grid. The best hyperparameter values cannot be known in advance, so ideally all candidate values should be tried. Doing this manually would take a considerable amount of time and resources; we therefore used GridSearchCV to automate the tuning of hyperparameters.

In the first step, GridSearchCV with fivefold cross-validation was used to tune the hyperparameters of each model, selecting the combination with the highest accuracy on the training data set. Each model was then trained with its own optimized hyperparameters.
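The splitting and tuning procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the study's actual configuration: the data are synthetic (252 samples and 42 features, matching only the dimensions reported above), the stratified split and random seeds are assumptions, and the parameter grid for the SVC is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the study data (252 x 42).
X, y = make_classification(n_samples=252, n_features=42,
                           n_informative=8, random_state=0)

# 70% training-validation set, 30% held-out test set
# (stratification is an assumption).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Hypothetical grid; the grids actually searched are not given in the text.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Fivefold cross-validation on the training-validation set,
# selecting the combination with the highest mean accuracy.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_trainval, y_trainval)

# GridSearchCV refits the best combination on the whole training-validation set.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
```

Because the cross-validation folds are drawn inside GridSearchCV, no separate validation subset has to be set aside by hand; the test set is touched only once, for the final evaluation.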

2.4 Evaluation of ML Models

The designed models were evaluated using the independent test data set. The confusion matrix was calculated, and the evaluation metrics accuracy, precision, sensitivity, specificity, AUC, and F1 score were reported for each model.
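As an illustration of how these metrics follow from the confusion matrix, the sketch below computes them with scikit-learn on a small set of made-up labels, predictions, and scores (hypothetical values, not study data). Specificity has no dedicated scikit-learn function, so it is derived directly from the confusion matrix counts.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth (1 = died, 0 = alive), hard predictions,
# and predicted scores for ten test patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.05, 0.6]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)      # (tp + tn) / all
precision = precision_score(y_true, y_pred)    # tp / (tp + fp)
sensitivity = recall_score(y_true, y_pred)     # tp / (tp + fn)
specificity = tn / (tn + fp)                   # tn / (tn + fp)
f1 = f1_score(y_true, y_pred)                  # harmonic mean of precision and sensitivity
auc = roc_auc_score(y_true, y_score)           # uses scores, not hard predictions
```

Note that AUC is computed from the continuous scores (for example, decision_function or predict_proba outputs), whereas the other metrics depend on the thresholded predictions.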

2.5 Feature Importance

To reduce the number of features and the complexity of the models, and to improve their accuracy, various feature selection methods can be used to compare models and build the final classification model [31, 32]. Feature importance indicates the contribution of each feature to the model's prediction. Feature selection methods can guide the removal of features that have a low impact on the model's predictions, so that model improvement can focus on the significant features. In simple terms, the main purpose of feature selection is to maximize the performance of a model with a minimal number of features [33].

Feature importance was investigated for all models using three methods: (a) coef_ for the linear models LR and PAC; (b) feature_importances_ for the DTC, RFC, GBC, ABC, and ETC models; and (c) permutation_importance for the SVC, KNN, HGBC, BC, MLPC, GNB, and LDA models.
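The three approaches can be sketched with one representative model each. The example below is illustrative only: it uses synthetic data with eight features, takes the absolute value of the linear coefficients as the importance score (an assumption), and computes permutation importance on the training data for brevity, whereas a held-out set is often preferred.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (hypothetical): 200 samples, 8 features.
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

# (a) Linear model: magnitude of the fitted coefficients.
lr = LogisticRegression(max_iter=1000).fit(X, y)
coef_importance = np.abs(lr.coef_[0])

# (b) Tree ensemble: impurity-based importances, which sum to 1.
etc = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
tree_importance = etc.feature_importances_

# (c) Model-agnostic: mean drop in score when one feature is shuffled.
knn = KNeighborsClassifier().fit(X, y)
perm = permutation_importance(knn, X, y, n_repeats=10, random_state=0)
perm_importance = perm.importances_mean
```

Scores from the three methods are on different scales, so they are best compared as within-model rankings rather than as absolute values across models.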

All 14 ML models were trained and evaluated using all 42 features. Then, to assess the effect of the number of features on model accuracy, the four models with the highest AUC were chosen. The wrapper-type feature selection algorithm Recursive Feature Elimination (RFE) was used to select 20, 10, and 5 features for these four models. The four models were then trained and evaluated with 20, 10, and 5 features, following the methods described in Sects. 2.3 and 2.4.
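A minimal sketch of RFE-based selection of 20, 10, and 5 features is given below. The data are synthetic, and wrapping a linear-kernel SVC is an assumption: RFE needs an estimator that exposes coef_ or feature_importances_, and the text does not state which estimator was wrapped for each model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in with the study's dimensions (252 samples, 42 features).
X, y = make_classification(n_samples=252, n_features=42,
                           n_informative=8, random_state=0)

selected = {}
for k in (20, 10, 5):
    # RFE repeatedly fits the estimator and drops the weakest feature
    # until k features remain (linear-kernel SVC is an assumed choice).
    rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=k)
    rfe.fit(X, y)
    # Column indices of the features that were kept.
    selected[k] = rfe.support_.nonzero()[0]
```

Each reduced feature set can then be passed through the same tuning and evaluation steps as the full 42-feature set, for example by training on X[:, selected[10]].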

3 Results

3.1 Data Description

Descriptions of the continuous and categorical features are shown in Tables 1 and 2, respectively. The distribution of the continuous variables is shown in Fig. 2.

Table 1 The statistical description of quantitative continuous features (output 1 and 0 are for died and alive patients, respectively)
Table 2 The statistical description of categorical features
Fig. 2 Statistical distribution of quantitative continuous variable (features). sBP: systolic blood pressure, dBP: diastolic blood pressure, HR: heart rate, RR: respiratory rate

3.2 Data Correlations

Data correlation was assessed in two stages: (a) full correlation between all of the features and (b) correlation of each feature with the outcome (target) only. The correlation between features is shown as a heatmap in Fig. 3.

Fig. 3 Heatmap correlation of 42 clinical, laboratory, and image features. DM: diabetes mellitus, HTN: hypertension, IHD: ischemic heart disease, COPD: chronic obstructive pulmonary disease, LOC: level of consciousness, sBP: systolic blood pressure, dBP: diastolic blood pressure, Spo2: peripheral oxygen saturation, HR: heart rate, RR: respiratory rate, GOO: gastric outlet obstruction

The correlation of each feature with the outcome is shown in Fig. 4 and listed in sorted order in Table 3. The five features with the highest correlation with the outcome were mechanical ventilation, ICU admission, malignancy, steroid therapy, and level of consciousness (LOC).

Fig. 4 The correlation of outcome (dead or alive) with all 42 clinical, laboratory, and image features. DM: diabetes mellitus, HTN: hypertension, IHD: ischemic heart disease, COPD: chronic obstructive pulmonary disease, LOC: level of consciousness, sBP: systolic blood pressure, dBP: diastolic blood pressure, Spo2: peripheral oxygen saturation, HR: heart rate, RR: respiratory rate, GOO: gastric outlet obstruction

Table 3 The correlation values of output (died or alive) with all 42 clinical, laboratory, and image features
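Ranking features by their correlation with a binary outcome can be sketched as below. The example is a sketch under assumptions: the data and feature names are synthetic and hypothetical, and the Pearson coefficient from SciPy is used because the specific correlation coefficient applied in the study is not stated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
outcome = rng.integers(0, 2, n)  # 1 = died, 0 = alive (synthetic)

# Hypothetical features with decreasing dependence on the outcome.
features = {
    "strong_feature": outcome + rng.normal(0, 0.5, n),
    "weak_feature": outcome + rng.normal(0, 2.0, n),
    "unrelated_feature": rng.normal(0, 1.0, n),
}

# Pearson correlation of each feature with the outcome
# (the first element of the result is the coefficient r).
correlations = {name: stats.pearsonr(values, outcome)[0]
                for name, values in features.items()}

# Sort by absolute correlation, strongest first.
ranking = sorted(correlations.items(),
                 key=lambda item: abs(item[1]), reverse=True)
```

Sorting by absolute value keeps strongly negative associations near the top of the list alongside strongly positive ones.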

3.3 Evaluation of ML Models

The performance of the considered models on the test data set is reported in Table 4. The highest values of accuracy, precision, sensitivity, specificity, AUC, and F1 score were seen for LDA, KNN, GNB, KNN, PAC, and LDA, respectively. Following similar studies, the AUC was considered the most important metric for model evaluation. As shown in Table 4, four models (PAC, SVC, LDA, and ETC) perform better than the other ML models based on AUC values. The ROC curves of these four models are shown in Fig. 5.

Table 4 Evaluation metrics of 14 optimized ML models using all 42 clinical, laboratory, and image features
Fig. 5 Illustration of the ROC curves related to the PAC, SVC, ETC, and LDA ML models with AUC > 90%. PAC: Passive Aggressive Classifier, SVC: Support Vector Classifier, ETC: Extra Trees Classifier, LDA: Linear Discriminant Analysis

3.4 Feature Importance and Feature Selection

To evaluate how each variable (feature) influences mortality prediction in ML models, we performed feature importance analysis for PAC, SVC, LDA, and ETC as they were the best-performing models (based on AUC scores) among all 14 ML models for mortality prediction.

Based on the results in Fig. 6, the 10 most important features, those with the largest impact on the performance of the PAC, SVC, ETC, and LDA models, are reported in Table 5.

Fig. 6 Depicted feature importance for PAC, SVC, ETC, and LDA models (AUC > 90%). PAC: Passive Aggressive Classifier, SVC: Support Vector Classifier, ETC: Extra Trees Classifier, LDA: Linear Discriminant Analysis

Table 5 The 10 most important features, with the largest impact on the performance of the four ML models PAC, SVC, ETC, and LDA (AUC > 90%)

The performance of the four selected models with 20, 10, and 5 features is reported in Table 6. Interestingly, feature reduction had no marked influence on the performance of the models. In some models, reducing the number of features even increased the evaluation metrics. The highest AUC of 93.40% was obtained for the SVC model with 10 features.

Table 6 Evaluation metrics (Accuracy, Precision, Sensitivity, Specificity, AUC, F1 score) of PAC, SVC, ETC, and LDA ML models using 20, 10, and 5 selected features

4 Discussion

In this study, we evaluated the ability of 14 ML algorithms to predict the clinical outcome of mortality using data available from 252 COVID-19 patients.

Based on the AUC metric, we found that the linear PAC method performs slightly better than the other ML algorithms when all 42 features are used. The KNN and DTC models, with 72% and 75% AUC, respectively, have the lowest AUC, while the other models, with only slight differences among them, showed an AUC of more than 82% in predicting patient death.

Some studies have predicted the severity of the disease or the possibility of death using the biomarkers LDH, hs-CRP, ferritin, and IL-10 [23, 29]. In the current study, by contrast, mechanical ventilation, ICU admission, LOC, malignancy, steroid therapy, calcification, consolidation, and fatigue emerged as important features for the early prediction of patient death, based on the features used to train the ML models and on the top four models selected by AUC.

The evaluation metrics for the PAC, SVC, ETC, and LDA models with 20, 10, and 5 features, reported in Table 6, showed that feature reduction reduces the complexity and can increase the performance of the models. Among these four models, the SVC model using 10 features has the highest AUC of 93.40%. With just five features, an AUC of 92.07% was obtained for the LDA model. The best models for the prediction of mortality risk with 42, 20, 10, and 5 features, based on the AUC metric, are shown in Table 7.

Table 7 The best ML models in the prediction of mortality risk of COVID-19 patients with 42, 20, 10, and 5 features

Among all the models evaluated with varying numbers of features, the SVC model was found to be the best-performing, achieving an AUC of 93.40% using 10 features. The SVC is a machine learning model that constructs a maximum-margin hyperplane to separate data points belonging to different classes. This hyperplane is positioned to maximize the distance between the decision boundary and the nearest data points from each class, creating a robust classification model. The SVC algorithm identifies the optimal hyperplanes that divide the input data into distinct classes. It then determines the boundaries between the input classes, with the input elements defining these boundaries. The resulting maximum-margin hyperplane provides the best separation between the training data samples belonging to the different classes [48].
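The maximum-margin idea can be illustrated on a toy problem. The sketch below fits a linear-kernel SVC to six hand-picked, linearly separable 2D points (hypothetical data) and reads off the margin width 2/||w|| from the fitted weight vector w. The very large C is an assumption used to approximate a hard margin; it is not the setting used in this study.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy classes in 2D (hypothetical points).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM (assumed setting).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The separating hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# Distance between the two margin boundaries w . x + b = -1 and +1.
margin_width = 2.0 / np.linalg.norm(w)

# Support vectors are the training points that lie on the margin
# boundaries, so their decision values are close to -1 or +1.
sv_scores = clf.decision_function(clf.support_vectors_)
```

Only the support vectors determine the hyperplane: moving any other training point, as long as it stays outside the margin, leaves w and b unchanged.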

Mortality prediction in COVID-19 patients has been carried out in a number of studies using various machine learning algorithms [22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49]. Several studies that investigated and compared different models are reviewed in the text below, and the others are shown in Table 8.

Table 8 The prediction of mortality risk in COVID-19 patients in similar studies with various data sets and ML models in comparison with the current study

An et al. investigated the performance of the least absolute shrinkage and selection operator (LASSO), linear support vector machine (SVM), random forest (RF), and k-nearest neighbors (KNN) models for mortality prediction of 10,237 COVID-19 patients within 14 and 30 days after the initial diagnosis. In their study, linear SVM achieved the highest performance, with AUC = 0.962, compared to the other models. They also found that age, chronic lung disease, diabetes mellitus, and cancer can increase the risk of mortality [34]. Pourhomayoun et al. used Artificial Neural Network (ANN), Decision Tree (DT), Logistic Regression (LR), SVM, RF, and KNN models for mortality prediction using 57 features in three categories: symptoms, pre-existing conditions (or comorbidities), and demographics. In the best case, ANN with tenfold cross-validation predicted the mortality of patients with an accuracy of 89.98% [33].

Aljameel et al. developed LR, RF, and extreme gradient boosting (XGB) to predict the severity of disease in COVID-19 patients. Their findings indicated that the best model was RF, which achieved an accuracy of 0.952 and an AUC of 0.99 [38]. Yu et al. compared the performance of CatBoost, a novel gradient-boosting algorithm, with the XGBoost model in predicting mechanical ventilation and mortality. The study found that CatBoost achieved an accuracy of 86.2% for predicting mechanical ventilation and 80% accuracy for predicting mortality, which was either comparable to or better than the performance of the XGBoost model [36]. Using two datasets from Wolfram and GitHub, Li et al. examined the accuracy of autoencoder, LR, RF, and SVM models, all of which achieved accuracy values above 0.9 according to their findings [37].

Subudhi et al. compared 18 models for the prediction of severity and mortality. They indicated that ensemble-based models, with mean F1 scores ≥ 0.8, have the highest performance. LR, DT, LDA, Quadratic Discriminant Analysis (QDA), and MLPClassifier (MLPC) followed, with high F1 scores between 77 and 79%. In contrast, PAC, perceptron, and linear SVC models had relatively low F1 scores [38].

In another study, by Jamshidi et al., the efficiency of LR, RF, ANN, KNN, LDA, and Naive Bayes models was tested for predicting the risk of mortality in 23,749 patients. They showed that the RF method outperformed the other methods, with an AUC value of 0.79 for the test data [9]. Rustam et al., in a time-series study, investigated four regression models (LR, LASSO, SVM, and Exponential Smoothing (ES)) for the prediction of recovery and death rates. According to their findings, the ES algorithm demonstrated the best performance, followed by LR and LASSO, whereas SVM exhibited poor performance across all prediction scenarios using the available dataset [3]. Table 8 shows some studies and the relevant AUC scores for the prediction of COVID-19 mortality, compared to the current study.

In our study, suffering from malignancies, being unconscious at admission, and undergoing mechanical ventilation, ICU admission, and steroid therapy during the course of treatment were the strongest predictors of mortality in patients hospitalized due to COVID-19 infection. Previous research has indicated that patients with pre-existing conditions, including cancer, who contract COVID-19 are at an increased risk of mortality. Despite the possibility of atypical symptoms, the risk of death is significant for these patients [52,53,54]. This finding can be attributed to several factors, including delayed diagnosis resulting from patients being asymptomatic or presenting uncommon symptoms. Additionally, other confounding factors, such as older age, are more common in this population and may contribute to the delayed diagnosis [52, 55]. Moreover, while altered consciousness is not a frequent symptom in patients with COVID-19, its presence indicates severe disease and potential underlying multi-organ failure, which together are accompanied by higher mortality rates [56, 57].

Regarding the increased mortality rates in COVID-19 patients receiving steroid therapy, it is worth noting that corticosteroids used to be prescribed only to severely ill patients. This was before recent clinical trials showed the importance of earlier initiation of corticosteroids, not only in the management of the acute phase of the disease but also in the mitigation of long-term complications of COVID-19 [58, 59]. Therefore, steroid therapy, mechanical ventilation, and ICU admission were all signs of a more severe COVID-19 infection. Numerous investigators consider the severity of COVID-19 infection to be the main prognostic and predictive factor [60,61,62].

Our study has a few limitations from a clinical perspective. One of the most important is that the features were measured only once, at the time of the patient's initial hospitalization, 2 to 14 days after the onset of the initial symptoms, whereas time-series studies have shown that using features measured at different times can help predict the patient's acute condition [22].

It is also worth noting that the limited data availability and incomplete patient information necessitated the exclusion of some participants from the study. Future research focusing on training neural network and DL models with larger, more comprehensive datasets, potentially including time-series data, may enable earlier identification of high-risk individuals. This, in turn, could lead to changes in treatment regimens that could positively impact the mortality prediction capabilities of the ML models.

5 Conclusion

Performance evaluation of the 14 ML models trained with all 42 features showed that the highest values of accuracy (87.30%), precision (100%), sensitivity (77.27%), specificity (100%), AUC (91.90%), and F1 score (77.99%) were obtained by the LDA, KNN, GNB, KNN, PAC, and LDA models, respectively. With feature selection, the SVC model can predict the mortality risk of COVID-19 patients with an AUC of 93.40% using 10 features: mechanical ventilation, consolidation, fatigue, malignancy, dry cough, LOC, gender, diarrhea, O2 therapy, and SpO2.

From a technical point of view, correctly dividing COVID-19 patients into low-risk and high-risk groups can reduce unnecessary hospital visits, costs, and psychological and physical stress on the medical staff. It can also speed up the treatment process and ultimately decrease mortality. Several potential research directions build upon the findings of this study: collecting and analyzing larger datasets of COVID-19 patients across multiple healthcare settings to further validate the predictive performance of the developed ML models; incorporating time-series data and tracking changes in patient features over the course of illness to improve the early identification of high-risk individuals; exploring deep learning algorithms, which may have greater capacity than traditional ML models to extract complex patterns from the clinical, laboratory, and imaging data; and investigating the incremental value of adding novel biomarkers or other data modalities (e.g., wearable sensor data) to further enhance the mortality prediction capabilities.

By addressing these future research directions, the predictive models developed in this study could be further refined and optimized to provide more accurate and actionable insights for improving COVID-19 patient outcomes.