Background

Complications associated with coronavirus disease (COVID-19) are a major global health concern [1]. COVID-19 causes upper respiratory infections that can progress to acute respiratory syndrome, pneumonia, cardiac, liver, and kidney injury, secondary infections, sepsis, and even death, with a mortality rate of 2–3% [2,3,4]. Common symptoms include fever, dry cough, myalgia, anorexia, diarrhea, nausea/vomiting, and anosmia [5,6,7]. As of February 2023, there had been more than 757 million infections and 6.8 million deaths worldwide [8]. Reports have demonstrated higher mortality and disease severity among active or former tobacco smokers than among non-smokers [9,10,11,12], attributable to smokers’ higher likelihood of developing respiratory disease [13].

The large number of COVID-19-related hospitalizations has placed an unexpected burden on healthcare systems and caused resource shortages [14, 15]. Timely and effective healthcare service delivery is an important factor in COVID-19 management [16]. In this regard, machine learning (ML) models have shown great promise for predicting disease prognosis and complications and improving patient management [17,18,19].

ML algorithms have been explored in many aspects of COVID-19 management, such as detecting epidemiological outbreaks, identifying and diagnosing COVID-19, and predicting severity or mortality [20,21,22,23,24]. These ML models are beneficial tools for the management of COVID-19 patients [20, 25,26,27].

Iran was among the first countries facing widespread COVID-19 and had one of the highest mortality rates [28]. The high prevalence of infections and scarce healthcare resources underscore the need for an effective predictive model trained on data from, and reflecting the characteristics of, the Iranian population [29]. Furthermore, mortality prediction models developed during the early period of the pandemic showed low prediction performance, and recent models often suffer from selection bias and training on unbalanced data, which can make models appear to perform well by accurately identifying negative cases while failing to detect positive ones [30, 31]. Additionally, ML models may be biased in subpopulations with different mortality rates [20], such as smokers.

To our knowledge, mortality prediction models for COVID-19 patients with a focus on patients who smoke have scarcely been investigated. The current study aims to develop ML models for mortality prediction in COVID-19 patients with a history of smoking in the Iranian population. Models were developed for use at the time of admission (at admission) and during hospitalization after admission (post-admission).

Methods

Data source and study population

Retrospective cohort data were extracted from the Imam Khomeini hospital complex COVID-19 registry, which collects data on hospitalized patients from six hospitals in Tehran. Data are collected when patients are hospitalized and whenever the level of care changes (for example, admission to the ICU). Eight trained nurses and health information technology specialists collect data from patients’ medical records using a documented protocol and enter them into the registry software. The cohort included active/former smoker patients with a COVID-19 diagnosis who were admitted to one of the six hospitals between 18 March 2020 and 15 March 2022. Patients were included based on a positive reverse transcriptase-PCR test or CT scan results.

Features were excluded if past evidence indicated they were irrelevant to COVID-19 mortality, if they had a missing rate above 30%, or if more than 95% of their data fell into a single class. After applying these criteria, the initial dataset of 678 smoker patients and 183 features was reduced to 678 patients and 31 features for analysis. Table S1 (Additional file 1) lists the 183 variables collected from the registry.

Data preprocessing

A data point was considered an outlier if it was three or more standard deviations from the mean of the feature. Outliers were replaced with the upper or lower boundary of the interquartile range, as appropriate.

Numerical values were scaled using normalization, and categorical values were encoded (1 for “Yes” and 0 for “No”).

Missingness across 11 variables ranged between 0.15% and 27.64%. Numerical variables with a skewed distribution were imputed with the median, and the rest with the mean. Categorical values were imputed with the most frequent value. Table S2 (Additional file 1) presents the missing rate of each feature.
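As a rough illustration of this preprocessing pipeline, the sketch below (not the authors’ published code) caps outliers at the IQR boundaries, applies min-max normalization, encodes yes/no variables, and imputes missing values. The DataFrame `df`, the column lists, and the |skewness| > 1 rule for choosing median over mean imputation are our own assumptions.

```python
# Minimal preprocessing sketch under assumed column names and thresholds.
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Treat values >= 3 SD from the mean as outliers; cap them at the IQR boundaries."""
    z = (s - s.mean()) / s.std()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return s.mask(z >= 3, upper).mask(z <= -3, lower)

def preprocess(df: pd.DataFrame, numeric_cols, binary_cols) -> pd.DataFrame:
    df = df.copy()
    for col in numeric_cols:
        df[col] = cap_outliers(df[col])
        # Skewed features -> median imputation; otherwise mean (|skew| > 1 is assumed).
        fill = df[col].median() if abs(df[col].skew()) > 1 else df[col].mean()
        df[col] = df[col].fillna(fill)
        # Min-max normalization to [0, 1].
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    for col in binary_cols:
        # Mode imputation, then encode "Yes"/"No" as 1/0.
        df[col] = df[col].fillna(df[col].mode()[0]).map({"Yes": 1, "No": 0})
    return df
```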

Features and feature selection

The main outcome was confirmed COVID-19-related in-hospital mortality, recorded as a binary variable (yes/no). The dataset comprised 31 variables, including patient demographics (e.g., age, sex, and BMI), signs and symptoms, comorbidities, medication history and medications prescribed in hospital, and lifestyle factors (e.g., tobacco/narcotic consumption).

Eight different feature sets were developed based on three main approaches:

  1. Univariate analysis using chi-square tests for categorical variables and t-tests for numerical variables (feature set 1). Features with a p-value less than 0.2 were selected.

  2. Feature importance algorithms, namely recursive feature elimination with cross-validation (RFECV) and the Gini importance criterion (feature sets 2–7): feature vectors were used as inputs for RFECV with logistic regression, random forest, and gradient boosting, and the top 20 features by Gini importance were selected for ExtraTreesClassifier, random forest, and gradient boosting (see the sketch after this list). Figures S1-S6 (Additional file 2) show the selected features based on Gini importance for the “at admission” and “post-admission” models.

  3. Physician opinion (feature set 8): we developed and distributed a questionnaire among 32 specialists (infectious disease specialists, pulmonologists, intensive care specialists, and anesthesiologists), who were asked to rate each candidate mortality risk factor as important or not important (yes/no). The Kuder-Richardson 20 test was used to assess the reliability of the questionnaire (reliability = 0.96). Factors endorsed by more than 60% of the specialists were included in this feature set.
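The sketch below illustrates the first two selection approaches under our own assumptions: the base estimator settings and cv = 10 are our choices, while the p < 0.2 and top-20 thresholds follow the text above. It is not the study’s code; random forest and gradient boosting would be substituted analogously.

```python
# Illustrative feature selection, assuming a DataFrame `df` with a binary
# `outcome` column and a numeric feature matrix X with labels y.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

def univariate_select(df, outcome, categorical, numerical, alpha=0.2):
    """Feature set 1: chi-square for categorical, t-test for numerical features."""
    pvals = {}
    for col in categorical:
        pvals[col] = chi2_contingency(pd.crosstab(df[col], df[outcome]))[1]
    for col in numerical:
        died = df.loc[df[outcome] == 1, col]
        survived = df.loc[df[outcome] == 0, col]
        pvals[col] = ttest_ind(died, survived)[1]
    return [c for c, p in pvals.items() if p < alpha]

def rfecv_select(X, y, feature_names):
    """RFECV around a base estimator (logistic regression shown)."""
    selector = RFECV(LogisticRegression(max_iter=1000), cv=10, scoring="f1").fit(X, y)
    return [f for f, keep in zip(feature_names, selector.support_) if keep]

def gini_top_k(X, y, feature_names, k=20):
    """Top-k features by Gini importance (ExtraTreesClassifier shown)."""
    model = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X, y)
    return [feature_names[i] for i in np.argsort(model.feature_importances_)[::-1][:k]]
```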

Data balancing

Initially, base models were developed using XGBoost on the different feature sets. These models performed poorly because of the imbalance between surviving and deceased patients (79.9% vs. 20.1%, a ratio of 3.98). Table S3 (Additional file 1) shows model performance before balancing. As a solution, we oversampled the minority class using the synthetic minority oversampling technique (SMOTE) and observed an improvement over the base models. SMOTE synthetically oversamples the minority class by selecting examples that are close in the feature space, drawing a line between them, and generating a new sample at a point along that line [32]. This method has previously been used in ML-based mortality prediction [33]. Subsequently, all models were developed using the balanced datasets.
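As a minimal sketch (assuming imbalanced-learn’s SMOTE; the paper does not state whether oversampling was applied inside the cross-validation folds), wrapping SMOTE in a pipeline keeps synthetic samples out of each held-out fold:

```python
# Assumed sketch: SMOTE inside an imblearn pipeline so oversampling happens
# only on the training portion of each of the 10 cross-validation folds.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),  # oversample the minority (death) class
    ("model", XGBClassifier()),         # base model from the text
])
scores = cross_val_score(pipe, X, y, cv=10, scoring="f1")  # X, y: preprocessed data (assumed)
```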

Model development, evaluation, and explainability

Figure 1 depicts the study process. Our binary classification models were developed with eight feature sets utilizing XGBoost, support vector machine (SVM), multi-layer perceptron (MLP), k-nearest neighbor (KNN), random forest (RF), decision tree, logistic regression, and naive Bayes with 10-fold cross-validation.

Fig. 1 Study process

Logistic regression is a statistical method built around the sigmoid function and is used for machine learning models whose target variable is binary (e.g., death/survival) [34,35,36]. The algorithm is easy to implement, interpret, and train; however, it can overfit on high-dimensional data and fails to capture complex relationships [37].

Naive Bayes is a binary and multi-class classification algorithm based on Bayes’ theorem [38, 39]. It is a statistical classifier that predicts the probability that a given sample belongs to a specific class, and it is fast and robust on large datasets [40, 41].

SVM is a supervised machine learning algorithm used for both classification and regression. It finds a hyperplane in an n-dimensional space that best separates the data points [42, 43]. SVM can handle complex non-linear data, such as health data, and is less prone to overfitting [44]. In addition to a linear kernel, SVM can use non-linear kernel functions; the most common kernels are linear, polynomial, and radial basis function (RBF) [44, 45].

MLP is a type of feed-forward neural network consisting of interconnected neurons that transfer information to one another [46, 47]. Each connection between neurons carries a weight, and these weights are adjusted during training to learn to predict the output [44]. MLP is simple and works well with both small and large datasets; however, its computations are complex and time-consuming [48].

Decision tree is a supervised machine learning algorithm used for classification and regression. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes [49, 50], and it aims to represent the structural information stored in the data. The algorithm is fast, easy to use, and able to handle high-dimensional data [44].

Random forest is an ensemble learning algorithm that constructs multiple decision trees and decides the output by voting [51, 52]. This combined output makes random forest less prone to noise and outliers than a single decision tree [53]. However, it is computationally complex, and its results can change with small changes in the data [53, 54].

K-nearest neighbor is also a supervised machine learning algorithm used for both classification and regression. It uses proximity to classify or predict the grouping of an individual data point [55, 56]. The algorithm is easy to use and understand; however, it has a high computational cost, is sensitive to the structure of the data, and requires large storage space [44].

XGBoost (extreme gradient boosting) is an ensemble learning algorithm designed for speed, ease of use, and performance on large datasets [57, 58]. In XGBoost, decision trees are created sequentially: weights assigned to the independent variables are given as input to a decision tree and, based on the prediction result, adjusted before being passed to the next tree. This ensemble prediction scheme yields a more precise and robust model [59].

Furthermore, ensemble models were developed from the aforementioned algorithms on each feature set using the scikit-learn ML library and Python (version 3.9.7). Hyperparameters were optimized by defining a parameter grid for each algorithm and using GridSearchCV to identify the best parameters for each model.
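A hedged sketch of this tuning step is shown below; the parameter grid is illustrative rather than the grid used in the study, and the soft-voting ensemble is one plausible construction of the ensemble models. `X_bal`, `y_bal` denote a SMOTE-balanced training set (assumed names).

```python
# Illustrative hyperparameter tuning and ensembling.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {  # example grid, not the study's actual search space
    "n_estimators": [100, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=10, scoring="f1", n_jobs=-1)
search.fit(X_bal, y_bal)
best_xgb = search.best_estimator_

# One way to build an ensemble over tuned learners: soft voting averages
# the predicted class probabilities of the member models.
ensemble = VotingClassifier(
    estimators=[("xgb", best_xgb),
                ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
).fit(X_bal, y_bal)
```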

Models were evaluated and compared based on accuracy, area under the receiver operating characteristic curve (ROC AUC), precision, recall, F1 score, logistic loss, and Brier score. To select the best-performing model, models were compared on their F1 score and AUC. The top five “at admission” and “post-admission” models were then selected and their probabilities calibrated.
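The evaluation and calibration steps might look like the sketch below. Platt scaling (sigmoid) is shown only because the paper does not name its calibration method, and `X_test`, `y_test` stand for held-out data (assumed names).

```python
# Assumed evaluation/calibration sketch; `best_xgb` is the tuned model above.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import (accuracy_score, brier_score_loss, f1_score,
                             log_loss, roc_auc_score)

calibrated = CalibratedClassifierCV(best_xgb, method="sigmoid", cv=10).fit(X_bal, y_bal)
proba = calibrated.predict_proba(X_test)[:, 1]  # calibrated death probabilities
pred = (proba >= 0.5).astype(int)

print("accuracy :", accuracy_score(y_test, pred))
print("F1       :", f1_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))
print("log loss :", log_loss(y_test, proba))
print("Brier    :", brier_score_loss(y_test, proba))
```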

Finally, SHapley Additive exPlanations (SHAP) was applied to provide explainability for the models. SHAP is an approach based on cooperative game theory that explains the output of ML models by calculating the contribution of each feature to the prediction [60].
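For tree models such as XGBoost, SHAP values can be computed with TreeExplainer, as sketched below; the column name "current_smoking" is hypothetical.

```python
# Assumed SHAP usage for the tuned XGBoost model.
import shap

explainer = shap.TreeExplainer(best_xgb)
shap_values = explainer.shap_values(X_test)  # per-feature contributions per patient

# Global importance (beeswarm) and a single-feature dependence plot.
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
shap.dependence_plot("current_smoking", shap_values, X_test,
                     feature_names=feature_names)  # hypothetical column name
```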

Results

Descriptive data

In total, 542 (79.9%) patients survived to hospital discharge and 136 (20.1%) died. Age, oxygen saturation (SpO2%), duration of intubation, sweating, abnormal lung signs, hypertension, cancer, cardiovascular disease, CKD, anti-hypertensive drugs, use of pantoprazole, hospitalization in the 14 days before the current admission, and admission to an intensive care unit (ICU) were significantly associated with death. Table 1 presents the baseline characteristics of patients.

Table 1 Characteristics of surviving vs. non-surviving patients

Feature selection

Tables S4 and S5 (Additional file 1) show the details of the feature sets created for “at admission” and “post-admission” death prediction with the different feature selection methods. Cancer, CKD, oxygen saturation, BMI, age, hypertension, abnormal lung signs, and drug history were among the features most frequently chosen across methods. Active smoking was also considered important by many of the feature selection methods. According to our results, feature set 7 performed best for the “at admission” models and feature set 8 for the “post-admission” models. Details of these feature sets are presented in Table 2.

Table 2 Best performing feature sets for “at admission” and “post admission” models

Model performance and evaluation

Details of our “at admission” models on the different feature sets are reported in Tables S6-S13 (Additional file 1). XGBoost outperformed the other models on the majority of feature sets (except feature set 2, where random forest performed best). Across feature sets, naive Bayes and logistic regression performed weakest.

Tables S14-S21 (Additional file 1) present details of the “post-admission” models’ performance on the different feature sets. XGBoost outperformed the other algorithms except on feature set 6, where the ensemble model had better results. Again, naive Bayes and logistic regression performed weakest across feature sets.

The probabilities of the top five models were calibrated. After calibration, accuracy, AUC, and F1 slightly decreased; however, logistic loss and Brier score improved, indicating better-calibrated probability estimates.

The best “at admission” model was XGBoost which was trained using feature set seven (accuracy = 0.875, F1 score = 0.862). In addition, among “post admission” models, XGBoost trained on feature set eight (accuracy = 0.905, F1 score = 0.899) had the highest performance after calibration. Tables 3 and 4 report the performance of the top five calibrated and uncalibrated models. Figure 2 depicts the AUC of the top five “at admission” and “post admission” models. Figure 3 also shows the calibration curve for the best “at admission” and “post admission” XGBoost models.

Table 3 Performance results of top five “at admission” models
Table 4 Performance results of top five “post-admission” models
Fig. 2 ROC AUC for the top “at admission” and “post-admission” models

Fig. 3 Calibration curve of the XGBoost model for “at admission” and “post-admission” mortality prediction

Feature importance

Based on the SHAP method, age, hospitalization in the 14 days prior to admission, current smoking, SpO2%, BMI, diastolic and systolic blood pressure, respiratory rate, diabetes, and sex (in that order) contributed most to “at admission” mortality prediction. Figure 4 depicts each feature’s contribution to the “at admission” XGBoost model based on SHAP.

Fig. 4 SHAP-based feature importance of the “at admission” XGBoost model

As presented in Fig. 5, older age, cancer, and CKD increase the SHAP value of current smoking, whereas lower SpO2%, diabetes, COPD, and use of pantoprazole decrease it. The relationships between current smoking and the other features show mixed effects (Figure S7, Additional file 2).

Fig. 5 Current smoking SHAP dependence plots for the “at admission” model

As presented in Fig. 6, ICU admission, age, current smoking, duration of intubation, BMI, SpO2%, systolic blood pressure, fever, and diastolic blood pressure contributed most to the “post-admission” XGBoost model’s mortality prediction.

Fig. 6 SHAP-based feature importance of the “post-admission” XGBoost model

According to Fig. 7, older age, cancer, and CKD increase the SHAP value of current smoking, while fever, dyspnea, chest pain, diabetes, and a history of hookah consumption decrease it. As presented in Figure S8 (Additional file 2), the relationships between current smoking and the other features show mixed effects.

Fig. 7 Current smoking SHAP dependence plots for the “post-admission” model

Error analysis

Our “at admission” model made 140 errors, of which 52 were false positives and 88 false negatives. Most errors occurred in male patients (93.2%), and most of the misclassified patients had no COPD (76.7%), previous hospitalization (74.6%), diabetes (79.9%), drug history (72.4%), abnormal lung signs (75.4%), or CKD (76.1%).

Our “post-admission” model made 103 errors, of which 43 were false positives and 60 false negatives. Most of these cases were male (90.3%), and the majority had no history of hookah consumption (92.2%), chest pain (93.2%), diabetes (77.7%), cancer (73.8%), CKD (75.7%), COPD (86.4%), or immunosuppressant drug use (99%).

Discussion

In the current study, multiple ML models were developed to predict in-hospital mortality of COVID-19 patients with a history of smoking. The models were evaluated, and the highest-performing models for predicting patients’ chances of survival were identified.

Our results demonstrate that the best model for predicting mortality from patients’ information at admission was XGBoost (accuracy = 0.875, F1 score = 0.862) trained on feature set 7 (20 features). The best model for predicting mortality during hospitalization was also XGBoost (accuracy = 0.905, F1 score = 0.899), trained on feature set 8 (24 features). Naive Bayes and logistic regression performed substantially worse than XGBoost, random forest, and the ensemble models.

These ML-based tools can assist clinicians and providers in patient triage [20, 61], resource allocation [26, 27], and providing the best possible care for patients [25, 62]. Input data of these models consist of patient demographics, comorbidities, medications, and levels of care that can be easily collected.

The results suggest that active smoking, age, sex, ICU admission, hospitalization in the 14 days prior to admission, SpO2%, duration of intubation, BMI, diastolic and systolic blood pressure, fever, respiratory rate, diabetes, CKD, COPD, cancer, and drug history were among the most important predictors of COVID-19 mortality. This is in line with previous studies showing that age, sex, oxygen saturation, diabetes, use of opioids, respiratory diseases, CKD, and cancer can increase mortality [63,64,65,66,67]. Another study similarly identified age and SpO2% as independent markers of survival in COVID-19 patients [68], and SpO2% was also found to be an important feature for predicting in-hospital mortality elsewhere [69]. Yanyan et al. [70] reported that age, sex, and diabetes are important mortality risk factors in COVID-19 patients, in accordance with our results. These studies were not specific to smokers; it can therefore be concluded that these risk factors are important among both smokers and non-smokers.

In contrast to previous studies suggesting no association between prior smoking history and mortality in COVID-19 patients [71,72,73], or even potential protective effects [74, 75], our results indicate that smoking is an important risk factor for COVID-19 mortality. This accords with previous studies that identified smoking as an important mortality risk factor because of lung impairment and respiratory disease [9,10,11,12,13].

Based on our results, active smoking was among the most important features in predicting mortality (the third most important feature in both models). Salah et al. [76] suggest that patients who were either active or former smokers have a higher mortality risk, with active smokers at twice the risk of former smokers. Bellan et al. [11], using cohort data from Italian patients, identified smoking as an independent mortality predictor in COVID-19 patients. A meta-analysis [77] of 60 studies and 51,225 patients from 13 countries found smoking to be one of the major predictors of mortality in COVID-19 patients. In contrast, Parra-Bracamonte et al. [78], analyzing a large dataset from Mexico, found that smoking was not a mortality risk factor. Our results indicate that active smoking may have a mixed effect on mortality: according to Figs. 4 and 6, active smoking contributes to predicted mortality in some patients but not in others. Further research is therefore needed to establish the role of smoking in patient mortality.

Kar et al. [79] developed a COVID-19 prediction model for patients at admission using retrospective cohort data. Their model had higher accuracy (97%) than ours, which could be due to their larger sample size (2370 patients); moreover, they did not consider smokers. Fink et al. [80] developed a prediction model using data from 24 h after admission; our best models outperformed theirs (AUC = 0.85). In a previous study [68], the mortality prediction model reached an accuracy of 89% and an AUC of 86%, lower than our best models. Our models also outperform the in-hospital mortality prediction model developed by Shiri et al. [69], who used the XGBoost algorithm with demographic, clinical, imaging, and laboratory data to achieve 88% accuracy, lower than our post-admission model; however, they did not include features related to smoking and opioid use.

Limitations

Because few patients with a history of smoking were registered in our database, we were not able to perform external validation. The small sample size also prevented training separate models for subpopulations such as age groups; future studies should develop mortality prediction models for smoking COVID-19 patients across age groups and levels of care. Some features identified as important predictors of COVID-19 mortality had high missing rates (including BMI, hospitalization in the 14 days prior to admission, respiratory rate, and systolic and diastolic blood pressure), so further studies are needed to investigate the role of these features in patient mortality.

Conclusion

In the present study, multiple mortality prediction models were developed and evaluated for use at admission and after admission during patients’ hospital stay. The best calibrated models were XGBoost at admission (accuracy = 0.875, F1 score = 0.862) and XGBoost post-admission (accuracy = 0.905, F1 score = 0.899). The study also reported model explainability in terms of SHAP-based feature importance, identifying variables strongly associated with mortality. Previous studies indicate that mortality prediction models can be biased for subpopulations with different mortality risks, such as smokers [20]. The current study demonstrates the potential of ML-based predictive models for quantifying at-admission and post-admission COVID-19 mortality risk, facilitating effective decision-making in the management of patients with a history of smoking.