Background

Coronavirus disease 2019 (COVID-19) has become a substantial cause of morbidity and mortality across the world [1]. It presents with a wide range of clinical features, spanning from no symptoms to multi-organ failure [2]. Although SARS-CoV-2 mainly affects the lungs and is associated with the development of acute respiratory distress syndrome (ARDS), it can also cause cardiovascular, neurological, renal, and vascular complications associated with high mortality [3]. Precise prognostication of COVID-19 clinical outcome is challenging because of the high variability in disease severity, yet it would be helpful for effective triage and efficient allocation of limited resources (e.g., beds, ventilators). More accurate subclassification of COVID-19 is therefore essential for prognostication and identification of severity [4].

It has been shown that pathological, physiological, and immunological responses do not sufficiently discriminate between patients with non-severe and severe forms of the disease because of the high complexity of these features [4]. A combination of clinical features and biochemical markers has been studied to identify the clinical subtype of COVID-19. Data mining and machine learning (ML) approaches could potentially be applied to such diverse multimodal data for the classification of patients with COVID-19 [4]. Artificial intelligence (AI) has accordingly been used for the diagnosis of COVID-19 pneumonia, stratification of patients, and prediction of patterns of spread [5]. AI- and ML-based approaches can be used either as a diagnostic tool or as a prognostic model to predict outcome [6]. Many studies have characterized the association of major risk factors with COVID-19 mortality, such as higher age, cardiovascular disease, chronic respiratory disease, diabetes, hypertension, smoking history, and obesity [7]. However, conventional statistical analysis could not establish them as strong individual predictors because of the high degree of complexity and collinearity among the data.

In the present study, we aimed to apply ML-based algorithms to generate a mortality prediction model for hospitalized COVID-19 patients and to classify patients into low- and high-risk groups.

Methods and materials

Data collection

In a retrospective study, we used clinical data from 250 patients with COVID-19 confirmed by a polymerase chain reaction (PCR) test. Data were collected from patients admitted to the University of Miami Hospital, Miller School of Medicine, Miami, FL, USA, since June 2020. A total of 250 variables, including biochemical and clinical data, were collected at various times (hospital admission, ICU admission, hospital discharge). The data at admission were considered the data at presentation. These data, including demographic variables, comorbidities, patients’ vitals, anthropometric measurements, chronic treatments, and laboratory work, were obtained from the patients’ electronic records. During data processing, the proportion of missing values was determined for each variable in the cohort; the maximum was 7%. All missing values were replaced with estimated values using mean imputation. Continuous variables were median fold normalized, log-transformed, and scaled to unit variance before statistical analysis.
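The preprocessing chain described above can be illustrated with a minimal sketch. It is not the software used in the study: the function name `preprocess` and the toy values are hypothetical, and median fold normalization is read here as dividing each variable by its cohort median, which is an assumption because the exact definition is not given in the text.

```python
import numpy as np

def preprocess(X):
    """Mean-impute, median-fold normalize, log-transform, and unit-variance scale.

    X: 2-D array (patients x continuous variables); NaN marks a missing value.
    Values are assumed strictly positive so that the logarithm is defined.
    """
    X = np.array(X, dtype=float)
    missing_fraction = np.isnan(X).mean(axis=0)       # share of missing values per variable
    col_means = np.nanmean(X, axis=0)                 # means over observed values only
    X = np.where(np.isnan(X), col_means, X)           # mean imputation
    X = X / np.median(X, axis=0)                      # median fold normalization (assumed per variable)
    X = np.log2(X)                                    # log transformation
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # center and scale to unit variance
    return X, missing_fraction

# Hypothetical toy values, not study data.
toy = [[1.0, 10.0], [2.0, np.nan], [4.0, 40.0], [8.0, 20.0]]
scaled, missing = preprocess(toy)
```

Returning the per-variable missing fraction alongside the scaled matrix makes it easy to check a ceiling such as the 7% reported for this cohort before imputing.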

Definitions of variables

Table 1 summarizes patients’ demographics, clinical variables, comorbidities, and their association with hospital mortality and survival of patients with COVID-19.

Table 1 Distribution of patients’ demographics, clinical variables, and comorbidities between hospital mortality and survival of patients with COVID-19

In this table, the patient’s level of consciousness, when available, is shown based on the Glasgow Coma Scale (GCS). Temperature is reported in degrees Fahrenheit. Respiratory rate (RR) indicates the number of breaths per minute, and heart rate (HR) the number of heart beats per minute. The patients’ systolic and diastolic blood pressure (BP) is presented in millimeters of mercury. O2 saturation is the percentage of oxygen-saturated hemoglobin relative to total hemoglobin, and ynO2 shows whether the patient was on oxygen during the hospitalization. FiO2 (the fraction of inspired oxygen) is the percentage of oxygen in the air that the patient inhales. O2 flow (lpm) indicates the required oxygen flow in liters per minute. Nursing home shows whether the patient was in a nursing home or long-term care facility before hospitalization. Patient delay ≥ 7 identifies patients who waited at least seven days after the onset of symptoms before seeking medical assistance.

Smoking and alcohol are used to show the patient’s history of exposure to these toxins. The patient’s vaccination status against influenza (flu vaccine) and pneumonia (pneumonia vaccine) is included as per medical records or informed by the patient at the time of inclusion in the study.

Altered Mental Status (AMS) refers to any decline in the patient’s mental capacity noted on physical exam. The loss of the senses of smell and taste is listed as anosmia and ageusia, respectively. We collected data related to the use of any chronic treatments or chemotherapy. Home O2 shows whether the patient was on supplemental oxygen therapy at home. We also determined whether the patients were on local (inhaled steroids) or systemic corticosteroids (prednisone). ACE inhibitors indicates that the patient was on chronic treatment with angiotensin-converting enzyme inhibitors, and ARBs refers to the chronic use of angiotensin II receptor blockers. To evaluate the predictive value of imaging tests, we collected data about radiological findings in the patient’s chest X-ray. Consolidation on the imaging refers to the existence of dense material in the alveoli and small airways. The presence of excess fluid in the pleural space is listed as pleural effusion on the imaging, and the existence of dense material in the interstitium is listed as pulmonary infiltrates on the imaging.

The chronic health conditions of participants were collected to determine the impact of comorbidities on the outcome. These conditions include diabetes, chronic obstructive pulmonary disease (COPD), emphysema, pulmonary embolism (PE), bronchiectasis, interstitial lung disease (ILD), congestive heart failure (CHF), coronary artery disease (CAD), acute myocardial infarction (AMI), atrial fibrillation (AFib), hypertension, peripheral vascular disease, stroke, dementia, any stage of chronic renal failure (CRF), liver disease, peptic ulcer disease (PUD), connective tissue disorder, leukemia, lymphoma, dependence on hemodialysis, and asthma.

Statistical analysis

To establish a prediction model, we used statistically inspired modification of partial least squares (SIMPLS) analysis of the clinical data and blood markers collected at admission. SIMPLS, an algorithm for PLS (a linear machine learning method) [8, 9], was carried out with training and validation sets. First, a primary SIMPLS-based prediction model was built using all variables. SIMPLS predicts the outcome response by fitting a regression model (Y = XB) derived from the variables. Because not all variables were important for predicting outcome, variable reduction was then performed to identify the predictors that were useful in explaining variation among the predictor variables as well as their correlation with outcome. Variables that were not useful in predicting outcome were removed according to their variable importance in projection (VIP) values. VIP values were obtained as a weighted sum of squares of the SIMPLS weights [10]; thus, the contribution of each variable to the SIMPLS models was assessed using its VIP score. By general agreement, variables with VIP values greater than 1.0 were considered important predictors [11]. Variables lacking predictive ability (VIP < 1.0) were removed from the primary prediction model.
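To make the model fitting and the VIP-based variable reduction concrete, the following is a minimal sketch of SIMPLS for a single response, with the usual VIP formula applied to the SIMPLS weights. It is an illustration under stated assumptions rather than the implementation used in the study: the function name `simpls_vip` is hypothetical, the data are synthetic, and a statistical package may compute VIP with slightly different conventions.

```python
import numpy as np

def simpls_vip(X, y, n_components=2):
    """SIMPLS for one response; returns regression coefficients B and VIP scores."""
    Xc = X - X.mean(axis=0)                  # center predictors
    yc = y - y.mean()                        # center response
    n, p = Xc.shape
    R = np.zeros((p, n_components))          # weight vectors
    V = np.zeros((p, n_components))          # orthonormal basis of the loadings
    q = np.zeros(n_components)               # response loadings
    s = Xc.T @ yc                            # cross-product X'y
    for a in range(n_components):
        r = s.copy()                         # single response: the weight is s itself
        t = Xc @ r                           # scores
        norm_t = np.linalg.norm(t)
        t /= norm_t                          # unit-length scores
        r /= norm_t
        load = Xc.T @ t                      # X loadings
        q[a] = yc @ t
        v = load - V[:, :a] @ (V[:, :a].T @ load)  # orthogonalize against earlier loadings
        v /= np.linalg.norm(v)
        s = s - v * (v @ s)                  # deflate the cross-product
        R[:, a], V[:, a] = r, v
    coef = R @ q                             # B in Y = XB (for centered data)
    ss = q ** 2                              # response variance explained per factor (t't = 1)
    W = R / np.linalg.norm(R, axis=0)        # unit-length weights
    vip = np.sqrt(p * (W ** 2 @ ss) / ss.sum())
    return coef, vip

# Synthetic example: only the first two of six variables drive the response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
coef, vip = simpls_vip(X_demo, y_demo, n_components=2)
selected = np.flatnonzero(vip > 1.0)         # keep variables with VIP > 1.0
```

Because the squared VIP scores average to 1 across variables by construction, the 1.0 cut-off keeps the variables that contribute more than the average predictor.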

The prediction model was created using the most differentiating clinical and biochemical variables (VIP > 1.0). The validation set was created automatically and at random and comprised 35% of the 250 hospitalized patients. In the absence of an external validation cohort, splitting the study cohort into training and validation sets is the best-known approach for internal validation of multivariate and machine-learning-based prediction models.

SIMPLS was performed using leave-one-out cross-validation (CV), a method also known as internal validation. The SIMPLS analysis was assessed using Q2, a measure of predictability, and R2Y, a measure of explained variability. The best model was selected as the number of factors at which Q2 was largest and had not started decreasing, together with the highest R2Y. R2 ranges between 0 and 1, and Q2 is at most 1 (it can fall below zero for a non-predictive model); higher values indicate higher predictive accuracy. The thresholds for model performance depend on the data; generally, R2 greater than 0.67 is considered high and R2 greater than 0.33 moderate predictive accuracy. Although a Q2 value greater than zero shows that the model is predictive, a Q2 value in the range 0.2–0.4 is considered to indicate moderate predictability. Close R2 and Q2 values indicate a lack of overfitting and show that the SIMPLS model works independently of the specific data [12, 13].
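The relationship between R2 and the cross-validated Q2 can be sketched as follows, with Q2 computed as 1 - PRESS/TSS from leave-one-out predictions. This is a hedged illustration: ordinary least squares is swapped in as a stand-in for the SIMPLS model so that the block stays self-contained, and the names `least_squares_predict` and `r2_and_q2` as well as the synthetic data are hypothetical.

```python
import numpy as np

def least_squares_predict(X_train, y_train, X_test):
    """Stand-in linear model (ordinary least squares with intercept), not SIMPLS."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta

def r2_and_q2(X, y, fit_predict=least_squares_predict):
    """R2 from the fit on all samples; Q2 = 1 - PRESS / TSS from leave-one-out CV."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    tss = np.sum((y - y.mean()) ** 2)                 # total sum of squares
    rss = np.sum((y - fit_predict(X, y, X)) ** 2)     # residuals of the full fit
    press = 0.0                                       # predicted residual sum of squares
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                 # leave sample i out
        pred = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
        press += (y[i] - pred) ** 2
    return 1.0 - rss / tss, 1.0 - press / tss

# Synthetic example: one informative predictor out of three.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.8, size=60)
r2, q2 = r2_and_q2(X_demo, y_demo)
```

Any model can be passed as `fit_predict`, so the same loop could score models with different numbers of factors and retain the one at which Q2 stops increasing.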

The Q2 and R2Y were computed using the training set and verified using the validation set, which makes the model more realistic. The validation set was randomly selected from the study cohort in a blinded manner.

In addition, partition analysis was used to create a decision tree that partitions the data according to the relationship between the outcome and the predictors. The data were partitioned into training and validation sets. The partition algorithm searches all possible splits of the predictors to best predict the response. The most differentiating clinical predictors obtained by SIMPLS were used for the partition analysis. The area under the receiver operating characteristic curve (AUC) was obtained for both the training and validation sets through partition analysis based on the variables selected as strong predictors in the SIMPLS-based prediction model.
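The exhaustive search over candidate splits can be illustrated for a single continuous predictor. The sketch below scores each candidate cut by weighted Gini impurity, which is an assumption: the splitting criterion of the partition platform used in the study is not stated in the text. The function names and the ages and outcomes are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Try every midpoint between sorted distinct values; return (cut, weighted impurity)."""
    distinct = sorted(set(values))
    best_cut, best_score = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x < cut]
        right = [y for x, y in zip(values, labels) if x >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:               # lower impurity means a cleaner partition
            best_cut, best_score = cut, score
    return best_cut, best_score

# Hypothetical ages and outcomes (1 = died), not study data.
ages = [34, 45, 52, 58, 63, 67, 71, 78, 84, 90]
died = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
cut, impurity = best_split(ages, died)       # cut == 65.0 for these toy values
```

A full decision tree repeats this search over all predictors at each node and recurses on the two resulting subsets.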

We also used partition analysis to obtain cut-off values for continuous or categorical (nominal or ordinal) variables such as age, heart rate, respiratory rate, and BMI. Principal component analysis (PCA) and clustering were performed to identify subgroups, particularly among survivors. PCA was carried out in two steps: the first used all variables to find outliers and trends, and the second used the most differentiating predictors obtained by SIMPLS. PCA and clustering were intended to help find a subgroup of survivors with a tendency toward hospital death. Latent class analysis (LCA) was carried out to cluster the patients with COVID-19 and thereby help identify the patients at high risk of dying. All paraclinical variables were normalized and transformed so that they could be used independently or in combination with clinical data for predicting hospital mortality.
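As an illustration of the PCA step, the sketch below computes principal component scores through a singular value decomposition of the centered data. It is a generic sketch, not the software used in the study; the function name `pca` and the two synthetic groups are hypothetical.

```python
import numpy as np

def pca(X, n_components=2):
    """Return scores, loadings, and explained-variance ratios of the leading components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                            # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U diag(s) Vt
    scores = U[:, :n_components] * s[:n_components]    # coordinates of each patient
    explained = s ** 2 / np.sum(s ** 2)                # share of total variance per component
    return scores, Vt[:n_components], explained[:n_components]

# Two synthetic groups with shifted means, standing in for patient subgroups.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=0.0, size=(30, 4))
group_b = rng.normal(loc=3.0, size=(20, 4))
scores, loadings, explained = pca(np.vstack([group_a, group_b]))
```

Plotting the first two columns of `scores` gives the kind of scatter plot in which outliers, trends, and candidate subgroups can be inspected visually.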

Results

Patients’ characteristics

A total of 250 hospitalized patients with RT-PCR-confirmed COVID-19 were enrolled in the study, and 31 (12.4%) patients died in hospital. Table 1 shows the demographic characteristics, comorbidities, and outcomes of patients with COVID-19 who were admitted to the MICU. As the table shows, age, respiratory rate, FiO2%, O2 flow (lpm), having been in a nursing home, chest pain, Altered Mental Status (AMS), having been on home supplemental O2 therapy, pulmonary consolidation on imaging, congestive heart failure (CHF), coronary artery disease (CAD), acute myocardial infarction (AMI), dementia, hypertension, and diabetes mellitus were significantly different between the two cohorts. Table 2 shows the laboratory variables among patients who survived and those who died.

Table 2 Distribution of patients’ laboratory variables between hospital mortality and survival of patients with COVID-19

Predicting hospital mortality using clinical and paraclinical data

The multivariate approach showed that patients’ demographics, clinical variables, comorbidities, and biochemical markers can be used for predicting hospital mortality outcomes. SIMPLS analysis was carried out using the most differentiating variables (VIP > 1.0) [11] to establish the prediction model. The prediction model was developed on 172 patients in the training set and 78 patients in the validation set. The two-factor SIMPLS model had moderate predictability (Q2 = 0.24) with explained variability of R2 = 0.37, using a total of 21 variables that contributed to the prediction model. Table 3 also shows that CAD is the most important variable associated with mortality, followed by diabetes mellitus, AMS, and age > 65.

Table 3 Importance values (VIP) of the 21 most differentiating variables among the 108 variables used in the primary model

Further, the coefficient plot revealed that age > 65, nursing home, headache, dyspnea, AMS, consolidation, O2 saturation < 88, ynO2, CAD, diabetes, alcohol, hypertension, stroke, dementia, prothrombin, and CRP were positively correlated with mortality among patients with COVID-19. On the other hand, chest pain, smoking, hypertension, atrial fibrillation, and peripheral vascular disease were negatively correlated with mortality. The two-factor scatter plot adequately discriminated between patients who died of COVID-19 in hospital and those who survived, supporting the predictive value of the clinical variables (Fig. 1).

Fig. 1

SIMPLS-based scatter plot showing good separation between patients with COVID-19 who died in hospital and survivors. The figure illustrates only the training set-based scatter plot

Further multivariate correlation analysis (Table 3) showed that CAD, diabetes, hypertension, AMS, dementia, stroke, atrial fibrillation, O2 saturation < 88, ynO2, nursing home, and age > 65 were correlated with one another and with mortality. Also, O2 saturation < 88, lactate, dyspnea, consolidation in chest images, AMS, respiratory rate > 20, and ynO2 were correlated with one another. Age > 65, dementia, hypertension, and nursing home were closely intercorrelated. The correlation analysis also showed that alcohol and headache were negatively correlated with most variables, such as nursing home, diabetes, dementia, hypertension, CAD, and AMS. Prothrombin and CRP were correlated only with each other, and lactate was correlated with O2 saturation < 88, ynO2, and atrial fibrillation (Table 3). Predictive partition analysis verified that the above-mentioned most differentiating clinical and blood marker variables are strong predictors for partitioning hospital deaths from survivors, with AUC = 0.95 and AUC = 0.91 for the training and validation sets, respectively (Fig. 2). The sensitivity, specificity, and accuracy were 80%, 92%, and 90% for the training set and 75%, 90%, and 87% for the validation set, respectively.
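For reference, the performance measures reported above can be computed from predicted risks and observed outcomes as in the following sketch. The labels and scores are hypothetical toy values, not the study data, and the 0.5 decision threshold and the function names are assumptions made for illustration.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from 0/1 outcomes and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)             # deaths correctly flagged
    specificity = tn / (tn + fp)             # survivors correctly flagged
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy

def auc(y_true, scores):
    """Probability that a random death is scored above a random survivor (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = died) and predicted risks.
outcome = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
risk = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
predicted = [1 if r >= 0.5 else 0 for r in risk]     # assumed threshold of 0.5
sens, spec, acc = confusion_metrics(outcome, predicted)
area = auc(outcome, risk)
```

Unlike sensitivity and specificity, the AUC does not depend on the chosen threshold, which is why it is reported separately for the training and validation sets.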

Fig. 2

AUC for the separation of hospital mortality and survivors from COVID-19

Decision tree-based partition analysis revealed that age < 65, together with either the absence or presence of diabetes, partitioned at least 50% of survivors. Also, age > 65, the O2 saturation condition, chest pain, and CAD contributed most to partitioning hospital deaths from survivors (Fig. 3).

Fig. 3

Predictive partition platform analysis shows the decision tree that predicts the hospital mortality in patients with COVID-19 from survivors. Blue square: survivors, red square: hospital mortality

Identification of high-risk patients with COVID-19

Further investigations using PCA and LCA showed that patients with COVID-19 can be clustered to identify the high-risk patients (Fig. 4) based on the clinical data.

Fig. 4

PCA plot illustrates the LCA-based clustering of patients with COVID-19. Clusters 2 and 3 are associated with a higher rate of mortality. Black circle: Survivors, red square: Hospital mortality

LCA was performed using the most differentiating clinical variables obtained from the SIMPLS prediction models. LCA-based clustering revealed three main clusters in the cohort of patients with COVID-19 (survivors and non-survivors). Clusters 3 and 2 had mortality rates of 38% and 12.5%, respectively, whereas cluster 1 had the lowest mortality rate (0–1.3%). All three clusters were well depicted in a PCA plot, which verifies the clustering using two unsupervised methods. Table 4 shows that although the variables contributed differently to each cluster, several variables markedly affected the clustering. Age < 65, lack of hypertension, lack of diabetes, alcohol consumption, and headache were highly correlated with cluster 1 and with a lower rate of mortality. On the other hand, age > 65, nursing home, AMS, stroke, atrial fibrillation, CAD, and dementia were the most important variables correlated with cluster 3; chest pain and dyspnea were the most important variables correlated with cluster 2. Also, hypertension, ynO2, consolidation, O2 saturation < 88, and diabetes had a similarly high probability in clusters 2 and 3. This result showed that nursing home, dementia, O2 saturation < 88, diabetes, hypertension, and age > 65 are risk factors for COVID-19 survivors in clusters 2 and 3. Table 4 shows the probability of all 18 variables for each cluster in the analysis. A multivariate correlation analysis was carried out on the 19 most differentiating clinical and comorbidity predictors obtained by SIMPLS; correlation values > 0.2 are shown in red with highlighted cells (Table 5).
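To show how conditional probabilities of the kind reported for each cluster translate into cluster membership, the sketch below applies Bayes' rule under the standard latent class assumption that variables are independent given the class. It covers only the assignment step, not the estimation of the model, and all names and numbers are hypothetical rather than taken from Table 4.

```python
import math

def class_posteriors(observed, class_priors, cond_probs):
    """Posterior class membership for one patient under a latent class model.

    observed:     list of 0/1 indicators, one per variable
    class_priors: class prevalences (sum to 1)
    cond_probs:   cond_probs[k][j] = P(variable j present | class k)
    """
    log_joint = []
    for prior, probs in zip(class_priors, cond_probs):
        ll = math.log(prior)
        for x, p in zip(observed, probs):
            ll += math.log(p if x == 1 else 1.0 - p)  # conditional independence given class
        log_joint.append(ll)
    top = max(log_joint)                              # subtract the maximum for stability
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical three-class model on three indicators: age > 65, diabetes, dementia.
priors = [0.5, 0.3, 0.2]
probabilities = [
    [0.10, 0.20, 0.05],   # class 1: low-risk profile
    [0.60, 0.60, 0.10],   # class 2
    [0.90, 0.60, 0.70],   # class 3: high-risk profile
]
older_patient = class_posteriors([1, 1, 1], priors, probabilities)
younger_patient = class_posteriors([0, 0, 0], priors, probabilities)
```

Each patient is then assigned to the class with the largest posterior probability, which is how a conditional probability table leads to the clusters shown in a PCA plot.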

Table 4 The conditional probabilities for each cluster are shown for each response category of 20 variables in the analysis
Table 5 Multivariate correlation analysis of the 19 most differentiating clinical and comorbidity predictors obtained by SIMPLS

Further analysis showed that the three clusters were separated from each other by a SIMPLS-based model built on the most differentiating variables, with very good predictability (Q2 = 0.69) and high explained variability (R2Y = 0.81) (Fig. 5).

Fig. 5

SIMPLS-based scatter plot showing very good separation between the three clusters obtained by LCA. Cluster 1 includes the patients with a lower risk of dying, and clusters 2 and 3 include patients with a higher risk of dying

Further investigation revealed that hospital mortality was poorly predicted by paraclinical data such as blood cell characteristics (e.g., numbers of leukocytes, neutrophils, lymphocytes, eosinophils, hemoglobin) and biochemical measures (e.g., BUN, creatinine, sodium, CRP, procalcitonin [PCT], lactate) compared to clinical data and comorbidities.

Discussion

In the current study, machine learning algorithms were applied to predict hospital mortality using a prediction model based on the demographics, clinical predictors, comorbidities, and biochemical markers of patients with COVID-19. The two-component SIMPLS-based prediction model had moderate predictive power (Q2 = 0.24) for hospital mortality. The prediction model was associated with high accuracy (AUC of 0.91–0.95) in the training and validation sets of the patient cohort. It was developed from 18 clinical and comorbidity variables and 3 paraclinical biochemical markers, uncovering the most differentiating predictors, some of which have not been recognized through conventional statistical methods. CAD showed the highest predictive importance for in-hospital death, followed by diabetes, age > 65, Altered Mental Status, dementia, and O2 saturation < 88%. Also, LCA clustering successfully identified high- and low-risk clusters in COVID-19 survivors. The clusters were discriminated by a model with high predictive power (Q2 = 0.69). Age < 65, lack of hypertension, and lack of diabetes were highly correlated with a lower rate of mortality among survivors, while residing in a nursing home, age > 65, AMS, stroke, atrial fibrillation, CAD, and dementia were risk factors for in-hospital mortality in COVID-19 survivors. Multivariate analysis demonstrated that some of the most differentiating predictors, such as ynO2, dyspnea, alcohol, O2 saturation, and stroke, were not identified by the univariate method (Table 1). Moreover, the multivariate analysis helped to determine the weight of the clinical predictors based on their importance in the prediction model (VIP), which is an advantage of multivariate over univariate analysis. On the other hand, acute MI, CHF, O2 flow rate (lpm), FiO2, and blood pressure were significantly different between the two groups but were not selected as most differentiating predictors by SIMPLS. The combination of paraclinical data with patient demographics and comorbidities significantly improved the prediction of hospital mortality compared with either patient demographics and comorbidities or paraclinical data used independently, each of which was a poor predictor of hospital mortality. Lactate, CRP, and prothrombin were the most heavily weighted biochemical variables contributing to the prediction of hospital mortality.

Several other studies have been published on the development of COVID-19 mortality prediction models. In a large cohort, Yadaw et al. developed a highly accurate (AUC = 0.91) ML-based mortality prediction model using the patient’s age, O2 saturation throughout the medical encounter, and type of patient encounter (inpatient versus outpatient and telehealth visits) [14]. Age and minimum O2 saturation during the encounter were the most predictive factors, which is in line with our results. Individuals aged 60 years and older represent nearly 85% of all deaths in COVID-19 hot spots across the USA [15]. Not surprisingly, the severity of hypoxia at presentation has been extensively reported as a significant indicator of the severity of illness, specifically in acute respiratory distress syndrome, and there is strong justification for it being an important predictive factor in the clinical course of COVID-19 [16, 17]. Although the development and validation datasets were larger in that study, the collected data were limited to those routinely collected during hospital encounters and did not include a comprehensive list of demographics, comorbidities, biochemical tests, imaging, and omics data. Additionally, despite the large datasets, the number of participants who died was small. Knight et al. conducted a large prospective cohort study evaluating an 8-item scoring system (score range 0–21 points) for in-hospital mortality due to COVID-19 [18]. The variables included age, gender, number of comorbidities, respiratory rate, O2 saturation, level of consciousness, urea level, and CRP. This scoring system showed high discrimination for mortality (derivation cohort: AUC 0.79; validation cohort: AUC 0.77); however, some potentially relevant comorbidities such as hypertension, previous myocardial infarction, and stroke were not included in data collection. Moreover, given the 32.2% mortality rate and the elderly patient population (median age of 73 years), this model could function differently in younger patients and/or populations at lower risk of death.

LASSO and multivariate data analysis-based prediction models showed that higher age, coronary heart disease (CHD), percentage of lymphocytes (LYM%), procalcitonin (PCT), urea, CRP, and D-dimer (DD) could be potential risk factors for COVID-19 mortality. These variables could classify patients with COVID-19 into low- and high-risk groups using a good prediction model (AUC = 0.91) [19].

Considerable heterogeneity exists among COVID-19 mortality prediction models. Unlike our results, which showed that paraclinical and biochemical data have limited predictive value, in the model developed by Zhao et al. (AUC 0.83), lactate dehydrogenase and procalcitonin were among the top mortality prediction factors [20], and the COVID-AID study showed that renal failure at presentation (defined by creatinine > 2 mg/dL), regardless of chronicity, has a high impact on in-hospital mortality in hospitalized COVID-19 patients [21]. Recent studies have reported that prothrombin and CRP are associated with COVID-19 severity and mortality [22, 23]. In this study, we showed a correlation between decreased O2 and increased lactate, which may indicate a higher level of anaerobic metabolism [24] in patients with COVID-19 and which is associated with mortality.

In late April 2020, a systematic review and meta-analysis showed a significantly higher rate of hypertension, diabetes, cardiovascular disease, and respiratory disease in critically ill COVID-19 patients compared to non-critical patients [25]. Another systematic review and meta-analysis of risk factors for mortality in COVID-19 patients then demonstrated that dyspnea, chest tightness, hemoptysis, expectoration, and fatigue were the clinical variables most significantly associated with increased risk of COVID-19 mortality. That study also showed a significantly increased leukocyte count and decreased lymphocyte count in non-survivors [26]. ML was successfully applied to determine COVID-19 severity by predicting the need for ICU (AUC = 0.80) and the need for mechanical ventilation (AUC = 0.82) [27]. Random forest analysis showed that PCT, DD, CRP, respiratory rate, SpO2, albumin, AST/SGOT, calcium, influenza-like symptoms, and ALT/SGPT were the most important variables for predicting the need for ICU. Also, CRP, DD, PCT, SpO2, respiratory rate, creatinine, total protein, albumin, calcium, and age were the most important variables for predicting the need for mechanical ventilation [27]. In a similar study, SpO2/FiO2, CRP, estimated glomerular filtration rate (eGFR), age, Charlson score, lymphocyte count, and PCT were the most important variables for the prediction of COVID-19 severity [28]. A LASSO-based prediction model showed that lymphocyte percentage, lactic dehydrogenase (LDH), neutrophil count, and DD, in combination with four quantitative CT findings (pneumonia percentage in the lateral basal segment of the left lower lung, the volume of the whole lung with a density of -300 to -200 HU, pneumonia volume in both lungs, and pneumonia volume in the right lung), can be the most important variables for prognosticating critical illness risk in hospitalized patients with COVID-19 pneumonia [29]. In a cohort from New York, age, PCT, CRP, LDH, DD, and lymphocytes were the top mortality predictors, and PCT, LDH, CRP, O2 saturation, temperature, and ferritin were important predictors of the need for ICU, with AUCs of 89% and 79%, respectively [30].

Leon et al. applied the ML approach to cluster the patients with COVID into 3 groups including higher, moderate, and low rate of mortality. This study showed that the higher and lower AST, ALT, LDH, CRP, and number of neutrophils were associated with a higher and lower rate of mortality, respectively [31]. The percentages of monocytes and lymphocytes were negatively correlated with mortality [31]. Unlike our results, Leon’s study showed that age, sex, and comorbidities did not contribute to the above clustering model [31].

The strengths of our study include the assessment of a comprehensive list of demographic, clinical, and paraclinical variables at all stages of hospitalization (admission, during hospital stay, and hospital discharge); the development of an internally validated in-hospital mortality prediction model with accurate discrimination; the identification of high-risk and low-risk clusters of COVID-19 patients whose healthcare needs are different; and the enrollment of PCR-proven cases of SARS-CoV-2 infection rather than possible COVID-19 patients. SIMPLS is considered a suitable multivariate method for investigating big and complex datasets that have a relatively small sample size and many variables [32]. Validation in an external cohort may help make the results more practicable and applicable to other cohorts. The findings of this study may improve the precise prognostication of COVID-19 mortality, the classification of low- and high-risk patients, and the identification of potential risk factors.

Our study has a few limitations. First, this is a single-center retrospective study, which might affect data quality and generalizability. Second, although we had an acceptable sample size, the subset of patients who died was small (n = 31). A major reason for this concern is that the number of predictor parameters considered by ML approaches usually exceeds that for regression, even when the same set of predictors is applied, especially since multiple interaction terms are constantly examined and continuous predictors are routinely categorized. Therefore, ML methodologies require “big data” to ensure that their developed models have minimal overfitting and that their potential advantages (i.e., dealing with highly nonlinear relations and complex interactions) are realized.

Conclusion

In conclusion, we presented an accurate ML-based in-hospital mortality prediction model for COVID-19, which can aid in clinical decision making and resource allocation. This model needs to be externally validated in larger populations and multicenter settings.