Background

The novel coronavirus disease 2019 (COVID-19) has become a pandemic. The most common symptoms of COVID-19 are fever, dry cough, fatigue and dyspnea [1, 2]. A minority of patients have digestive symptoms, such as nausea, vomiting and diarrhea [3, 4]. A study [5] by the Chinese Center for Disease Control and Prevention showed that about 81% of COVID-19 patients had mild disease. Severe and critical patients, who should be hospitalized or transferred to an intensive care unit (ICU) for urgent treatment, accounted for 14% and 5%, respectively. Mortality was 3.2% in the overall population but rose to 49% among critical patients. Hence, how to use effective biomarkers to identify patients at high risk of poor clinical outcomes has attracted extensive attention.

COVID-19 patients with comorbidities are considered prone to poor clinical outcomes. One study revealed that COVID-19 patients with chronic obstructive pulmonary disease, diabetes, hypertension or malignancy had a higher risk of ICU admission, invasive ventilation or death [6]. Another study demonstrated that the risk factors included older age, a high Sequential Organ Failure Assessment score, and a higher D-dimer level on admission [7].

During the early outbreak of COVID-19 in Wuhan, the centre of the early stage of the pandemic, medical resources were extremely scarce. It is of great clinical significance to use effective biomarkers to quickly identify patients at high risk of death, who should be given priority in accessing medical resources. In this study, we retrospectively enrolled patients from Taikang Hospital and other temporary hospitals during the outbreak of COVID-19 in Wuhan, China. We analyzed the differences in clinical characteristics between severe and non-severe patients, as well as between survivors and non-survivors. Furthermore, we developed a clinically operable and easy-to-interpret decision tree model to distinguish COVID-19 patients at high risk of death from those at low risk.

Methods

Data sources

A total of 2169 adult patients (aged ≥ 18 years) were enrolled in Wuhan, China between February 10th and April 15th, 2020. All patients had COVID-19 infection confirmed by real-time reverse-transcription polymerase-chain-reaction (RT-PCR) assay. Medical records of all patients, including demographics, clinical characteristics and laboratory test results on admission, were also collected. Our data came from different hospitals or different periods than those of other studies, so this is not a repeat analysis. This study was approved by the Ethics Committee of the Taikang Hospital (TKTJLL-005, TKTJLL-007), and performed in accordance with the Declaration of Helsinki. The Ethics Committee of the Taikang Hospital waived the need for informed consent from each patient. This study was registered in the Clinical Trials Register (NCT04347369, https://clinicaltrials.gov/).

Study design

First, we compared the medical records of the severe and non-severe groups. All patients who met the severity criteria at any time during hospitalization were assigned to the severe group. Disease severity was defined according to the Seventh Revised Trial Version of the COVID-19 Diagnosis and Treatment Guidance (2020) of China [8]. Specifically, COVID-19 patients with a respiratory rate of more than 30 breaths per minute, or an oxygen saturation lower than 93% at rest, or an oxygenation index of less than 300 mmHg, or rapid progression on lung imaging within 24–48 h were regarded as severe. Next, we compared the medical records of survivors and non-survivors. Survivors were defined as patients who, by the end of the study, had been discharged from hospital or transferred to other local hospitals because of advanced age or underlying diseases other than COVID-19. Last, we developed a decision tree to predict death.
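The severity rule above is a simple "any one criterion" check, which can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function name and parameter names are hypothetical, and the strict inequalities mirror the wording of this section.

```python
from typing import Optional


def is_severe(respiratory_rate: float,
              spo2_percent: float,
              oxygenation_index_mmhg: Optional[float] = None,
              rapid_imaging_progression: bool = False) -> bool:
    """Return True if any one of the severity criteria in the text is met.

    Hypothetical helper: thresholds follow the wording of the Study design
    section (rate > 30/min, SpO2 < 93% at rest, oxygenation index < 300 mmHg,
    or rapid progression on lung imaging within 24-48 h).
    """
    if respiratory_rate > 30:
        return True
    if spo2_percent < 93:
        return True
    # The oxygenation index may not be measured for every patient.
    if oxygenation_index_mmhg is not None and oxygenation_index_mmhg < 300:
        return True
    return rapid_imaging_progression
```

A patient is assigned to the severe group as soon as one criterion is satisfied at any point during hospitalization, so in practice the check would be applied to each recorded observation.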

Development of a clinically operable decision tree

Many machine learning methods are available for developing a predictive model. However, most of them are difficult to interpret because they operate as black boxes. In this study, we chose the decision tree as the predictive model because its recursive tree-based decision system is transparent, clinically operable and easy to interpret.

Before developing the decision tree, the data were preprocessed. First, laboratory indexes with more than 20% missing values were excluded, including interleukin-6 (IL-6), procalcitonin and D-dimer. We also excluded neutrophil count and lymphocyte count but retained the neutrophil-to-lymphocyte ratio (NLR), with which they are strongly correlated. Then all missing values were imputed with the mean value of each remaining laboratory index. Finally, the following factors were used in the development of the decision tree: age, sex, smoking status, body temperature, oxygen saturation, heart rate, respiratory rate, number of comorbidities, number of system symptoms, white blood cell (WBC) count, NLR, monocyte count, eosinophil count, basophil count, red blood cell (RBC) count, hemoglobin, platelet count, lactic dehydrogenase (LDH) and C-reactive protein (CRP).
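The two preprocessing steps described above (dropping indexes with more than 20% missing values, then mean imputation) can be illustrated with a minimal Python sketch. The function name, the dictionary-based data layout and the toy values are all assumptions made for illustration; the original analysis was carried out in R.

```python
from statistics import mean


def preprocess(rows, max_missing=0.20):
    """Drop columns with too many missing values, then mean-impute the rest.

    rows: list of dicts mapping an index name to a float, or None if missing.
    Hypothetical helper; a column is kept when its missing fraction is
    at most max_missing.
    """
    n = len(rows)
    kept = [
        col for col in rows[0]
        if sum(1 for r in rows if r[col] is None) / n <= max_missing
    ]
    # Column means are computed from observed (non-missing) values only.
    col_means = {c: mean(r[c] for r in rows if r[c] is not None) for c in kept}
    return [
        {c: (r[c] if r[c] is not None else col_means[c]) for c in kept}
        for r in rows
    ]


# Toy data (made up): LDH is 20% missing and kept, IL6 is 40% missing and dropped.
toy_rows = [
    {"LDH": 200.0, "CRP": 5.0, "IL6": None},
    {"LDH": None, "CRP": 10.0, "IL6": 7.0},
    {"LDH": 400.0, "CRP": 30.0, "IL6": None},
    {"LDH": 300.0, "CRP": 15.0, "IL6": 9.0},
    {"LDH": 300.0, "CRP": 20.0, "IL6": 3.0},
]
cleaned = preprocess(toy_rows)
```

Mean imputation is simple and keeps every patient in the dataset, at the cost of shrinking the variance of the imputed indexes.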

All severe patients were randomly split into a training dataset and a test dataset at a ratio of 7:3. The training dataset, comprising 452 severe COVID-19 patients, was used to build the decision tree, and the test dataset, comprising 194 severe COVID-19 patients, was used to validate it.
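A random 7:3 split of the 646 severe patients yields the 452 and 194 patients reported above. The sketch below shows one way to do this in Python; the function name and the seed are assumptions, since the original seed and splitting routine are not reported.

```python
import random


def split_train_test(patient_ids, train_fraction=0.7, seed=2020):
    """Randomly split identifiers into training and test lists.

    Hypothetical helper; the seed only makes the example reproducible
    and is not taken from the study.
    """
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# 646 severe patients, represented here by placeholder integer identifiers.
train_ids, test_ids = split_train_test(range(646))
```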

The decision tree was built in a two-stage process, and the resulting model can be represented as a binary tree. In the first stage, the algorithm searched for the variable that best split the data into two groups, and then split each subgroup recursively until the subgroups reached a minimum size or no further improvement could be made. The impurity function we used was "Information". This stage produced a full but complex tree, in which not all variables were essential. In the second stage, we therefore used cross-validation with the 1-SE rule to prune the full tree. Finally, we limited the number of splits to no more than 4 and chose the smallest complexity parameter in order to obtain a simple and meaningful decision tree.
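To make the splitting criterion concrete, the following Python sketch computes the information gain of a single binary split using the textbook entropy definition. It is an illustration under stated assumptions: the function names are hypothetical, and the exact internal scaling used by the authors' R software may differ. The example counts are the training-set figures for the LDH split reported in the Discussion (4 deaths among 378 patients below the threshold, 53 deaths among 74 patients at or above it).

```python
from math import log2


def entropy(n_pos, n_total):
    """Shannon entropy (in bits) of a two-class node."""
    if n_total == 0:
        return 0.0
    h = 0.0
    for count in (n_pos, n_total - n_pos):
        if count > 0:
            p = count / n_total
            h -= p * log2(p)
    return h


def information_gain(left_pos, left_total, right_pos, right_total):
    """Parent entropy minus the size-weighted entropy of the two children."""
    total = left_total + right_total
    parent = entropy(left_pos + right_pos, total)
    children = (left_total / total) * entropy(left_pos, left_total) \
        + (right_total / total) * entropy(right_pos, right_total)
    return parent - children


# LDH split in the training dataset: deaths / patients on each side.
ldh_gain = information_gain(4, 378, 53, 74)
```

At each node, the tree-growing step evaluates candidate thresholds for every variable in this way and keeps the split with the largest gain.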

The performance of the model was evaluated by the area under the receiver operating characteristic curve (AUC), accuracy and a confusion matrix, which shows how many cases were correctly and incorrectly classified. These metrics were calculated in both the training dataset and the test dataset.
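The relationship between a confusion matrix and the reported metrics can be sketched in Python as follows. The function names are hypothetical and the example counts and scores are made up for illustration; they are not the study's confusion matrix. The AUC is computed in its pairwise form, as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def auc(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, for illustration only.
example_metrics = classification_metrics(tp=40, fp=5, fn=10, tn=145)
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
```

Because non-survivors are the minority class, precision and recall for that class are more informative than accuracy alone.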

Statistical analysis

Continuous variables were described as median with interquartile range (IQR) and compared using the Mann–Whitney U test. Categorical variables were presented as frequencies and compared using Pearson's χ² test. All statistical analyses were performed and the decision tree model was developed using R software (version 3.5.2). The following R packages were used: CBCgrps, rpart, rpart.plot, MICE and pROC. A two-sided p value < 0.05 was considered statistically significant.
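For a 2 × 2 table, such as sex by severity group, Pearson's χ² statistic has a closed form, sketched below in Python. This is an illustrative sketch only: the function name is hypothetical, the counts are made up, and no continuity correction is applied (whether the authors' software applied one is not stated). With one degree of freedom, the two-sided p value equals erfc(sqrt(χ²/2)).

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom and no
    continuity correction. Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are two groups, columns are male / female.
chi2_stat, chi2_p = chi_square_2x2(60, 40, 45, 55)
```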

Results

Among the 2169 COVID-19 patients confirmed by RT-PCR, the median age was 61 years (IQR 50–70; range 18–100 years). Male patients accounted for 48% (1036 cases) and female patients for 52% (1133 cases). Approximately 8% of patients (184 cases) had a smoking history. On admission, 117 (5%) patients had a high body temperature (≥ 37.3 ℃), 270 (12%) had low oxygen saturation (≤ 93%), 359 (17%) had an abnormal heart rate and 596 (27%) had an elevated respiratory rate (> 20 per minute). In total, 1134 (52%) patients had at least one comorbidity, the most common being hypertension, diabetes and coronary heart disease. In addition, 728 (34%) patients had one system symptom, 1130 (52%) had two system symptoms and 218 (10%) had three or more system symptoms. The most common system symptoms were respiratory symptoms, systemic symptoms and digestive symptoms (Table 1).

Table 1 Demographics, clinical characteristics and laboratory findings of severe and non-severe COVID-19 patients

A total of 646 (29.8%) patients were diagnosed with severe illness during hospitalization. Compared with the non-severe group, the severe group had a significantly higher median age (68 vs. 58 years, p < 0.001) and a higher proportion of male patients (56% vs. 44%, p < 0.001). On admission, the severe group had higher proportions of patients with high body temperature (9%), low oxygen saturation (42%), abnormal heart rate (20%) and elevated respiratory rate (47%). Moreover, patients in the severe group had higher proportions of comorbidities (70%) and system symptoms (98%). No difference was found in smoking history (Table 1). When comparing laboratory test results between the two groups, we found that the severe group had significantly higher WBC count, neutrophil count, NLR, CRP, LDH, IL-6, procalcitonin and D-dimer levels, but lower lymphocyte count, eosinophil count, basophil count, RBC count, hemoglobin and platelet count. No difference was found in monocyte count (Table 1).

From February 10th to April 15th, 2020, 75 patients died of COVID-19. Differences in demographics and clinical characteristics between survivors and non-survivors were similar to those between the severe and non-severe groups. In the laboratory comparison, non-survivors had much higher WBC count, neutrophil count and NLR, as well as higher CRP, LDH, IL-6, procalcitonin and D-dimer levels (Table 2). RBC count and hemoglobin level showed no difference between the two groups. The other laboratory indexes were lower in non-survivors (Table 2).

Table 2 Demographics, clinical characteristics and laboratory findings of survivors and non-survivors

To identify crucial predictive biomarkers of mortality in severe patients, we used a machine learning model, the decision tree. A total of 452 patients were included in the training dataset, including 57 non-survivors. In this step, a decision tree model was developed to differentiate non-survivors from survivors. As shown in Fig. 1, three biomarkers were included in the decision tree model: LDH, NLR and CRP. The threshold of each biomarker helped to classify each patient into the survivor group or the non-survivor group. The AUC of the receiver operating characteristic curve of this model was 0.96, which was higher than that of each single biomarker (Fig. 2). The associated confusion matrix of the training dataset is presented in Additional file 1: Table S1. The accuracy of this model was 0.98. The precision, recall and F1 score for survivor prediction were 0.97, 1.00 and 0.98, respectively. For non-survivors, the precision, recall and F1 score were 1.00, 0.81 and 0.90, respectively (Table 3).

Fig. 1

A decision tree model using three biomarkers and their thresholds in absolute value to predict death outcome in severe COVID-19 patients. Num, the number of patients in a class; T, the number of correctly classified patients; F, the number of misclassified patients; NLR, neutrophil-to-lymphocyte ratio; CRP, C-reactive protein; LDH, lactic dehydrogenase; COVID-19, novel coronavirus disease 2019

Fig. 2

ROC curves for the decision tree model and each biomarker. A ROC curve for the decision tree model; B ROC curve for LDH; C ROC curve for NLR; D ROC curve for CRP. ROC, receiver operating characteristic; NLR, neutrophil-to-lymphocyte ratio; CRP, C-reactive protein; LDH, lactic dehydrogenase; AUC, area under the curve of ROC

Table 3 Performance of the decision tree on the training and test datasets

To validate the performance of the decision tree, we applied it to the test dataset, which included 194 severe patients. The associated confusion matrix of the test dataset is presented in Additional file 1: Table S1. The accuracy in the test dataset was 0.98. The precision, recall and F1 score for survivor prediction in the test dataset were 0.98, 0.99 and 0.98, respectively. For non-survivor prediction in the test dataset, the precision, recall and F1 score were 0.94, 0.83 and 0.88, respectively (Table 3).

Discussion

In this study, we found that COVID-19 patients in the severe group or the non-survivor group had a higher median age. These patients also had higher proportions of comorbidities and symptoms than their counterparts. Zhang et al. [9] reported that the median age in a small cohort of COVID-19 non-survivors was 72.5 years, similar to our findings. In the early outbreak in China, the case fatality ratio (CFR) of COVID-19 was 0.4%, 1.3%, 3.6%, 8% and 14.8% among patients in their 40s or younger, 50s, 60s, 70s, and 80s or older, respectively [10]. Some studies outside China also showed that the CFR of older patients was much higher than that of younger patients [11,12,13]. Impairment of immune defense against COVID-19 infection, immunosenescence, and increased risk of immunopathology were thought to be related to higher severity and mortality in older patients [14]. Other proposed hypotheses regarding the vulnerability of aged patients to COVID-19 include age-related chronic inflammation [15] and immunosenescence secondary to cytomegalovirus infection [16, 17]. Fortunately, COVID-19 vaccines might have high efficacy and safety in protecting older people from COVID-19 infection [18].

We found that male COVID-19 patients accounted for the majority of severe patients and non-survivors. A previous study also showed that approximately 60% of patients who died of COVID-19 worldwide were male [19]. Males had a hazard ratio of 1.59 for COVID-19 related death compared with females [20]. A probable reason is that male patients have higher levels of several important proinflammatory innate immune chemokines and cytokines, such as IL-8, IL-18 and CCL5, but a weaker T cell response than female patients [21]. In addition, behavioral and lifestyle risk factors, prevalence of comorbidities, aging, and underlying biological sex differences might also contribute to the differences in CFR and severity between male and female patients [22].

Above all, this study proposed a simple and clinically operable decision tree model to quickly quantify the risk of COVID-19 related death based on three biomarkers (LDH, NLR and CRP), which can be easily obtained on admission. Taking the training dataset as an example (Fig. 1), the first biomarker, LDH, divided all 452 patients with severe COVID-19 into two subgroups. Only 4 out of 378 (1.1%) patients with LDH < 330 IU/L died, whereas 53 out of 74 (71.6%) patients with LDH ≥ 330 IU/L died. The next biomarker, NLR, further stratified the subgroup with LDH ≥ 330 IU/L. Within this subgroup, those with NLR < 6.9 had a relatively low risk of death compared with those with NLR ≥ 6.9 (16.7% vs. 89.3%). Moreover, among patients with LDH ≥ 330 IU/L and NLR ≥ 6.9, all those with CRP ≥ 27 mg/L died, whereas 4 out of 10 of those with CRP < 27 mg/L died. In short, we recommend that COVID-19 patients with LDH ≥ 330 IU/L and NLR ≥ 6.9 be closely monitored or transferred to the ICU. Those with LDH ≥ 330 IU/L but NLR < 6.9 also need to be carefully observed. This simple decision tree model helps physicians quickly identify patients at high risk of death so that healthcare can be prioritized accordingly, which is especially important in crowded hospitals or during COVID-19 outbreaks with a shortage of medical resources.
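The thresholds described above can be written as a short lookup that returns the leaf a patient falls into, together with the mortality observed in that leaf in the training dataset. This Python sketch is illustrative only: the function name and the return format are hypothetical, and the proportions are taken directly from the figures quoted in this paragraph.

```python
def death_risk_leaf(ldh_iu_l, nlr, crp_mg_l):
    """Return (leaf description, observed training-set mortality).

    Hypothetical helper that walks the three thresholds in the order
    LDH, NLR, CRP, as described in the text.
    """
    if ldh_iu_l < 330:
        return ("LDH < 330 IU/L", 4 / 378)
    if nlr < 6.9:
        # The text reports this leaf's mortality as a percentage (16.7%).
        return ("LDH >= 330 IU/L, NLR < 6.9", 0.167)
    if crp_mg_l >= 27:
        return ("LDH >= 330 IU/L, NLR >= 6.9, CRP >= 27 mg/L", 1.0)
    return ("LDH >= 330 IU/L, NLR >= 6.9, CRP < 27 mg/L", 4 / 10)


# Example: a hypothetical patient with LDH 400 IU/L, NLR 8.0, CRP 30 mg/L.
leaf, observed_mortality = death_risk_leaf(400, 8.0, 30)
```

Because CRP is consulted only after both LDH and NLR exceed their thresholds, most patients are stratified by the first one or two measurements.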

Individually, these three biomarkers also have important clinical significance. An increase in LDH is a marker of tissue and cell damage. In patients with idiopathic pulmonary fibrosis, the LDH level can reflect the extent of lung injury [23]. For patients with severe COVID-19, a rise in LDH might indicate active lung injury. Evidence indicates that LDH is a biomarker of severe illness and poor prognosis in COVID-19 patients [24]. Zeng et al. found that LDH decreased within 10 days after admission in non-critical COVID-19 patients, but did not decrease appreciably in critical patients or non-survivors [25]. NLR is a widely studied inflammatory biomarker in infectious diseases. It reflects both the inflammatory response and the immune status of patients with infectious diseases [26,27,28]. In COVID-19 patients, elevated NLR on admission was reported to be significantly associated with disease severity [29, 30]. Liu and colleagues proposed a simple model based on NLR and age to stratify COVID-19 patients into four groups [31]. COVID-19 patients with age < 50 years and NLR < 3.13 or NLR ≥ 3.13 had no risk of severity, and these patients should be treated in a community hospital, home isolation or a general isolation ward. By contrast, COVID-19 patients with age ≥ 50 years and NLR < 3.13 or NLR ≥ 3.13 had a higher risk of severity, and these patients should be admitted to an isolation ward or ICU with active treatment and care. In addition, Yang and coworkers found that approximately 46.1% of mild COVID-19 patients with age ≥ 49.5 years and NLR ≥ 3.3 could become severely ill [30]. The dynamic change of NLR could also be used to distinguish severe patients from mild/moderate patients. One study demonstrated that NLR in the severe group remained higher than in the mild/moderate group on days 1, 4 and 14 [32]. CRP reflects persistent inflammatory activity and helps in assessing the severity of infection [33]. A few studies have shown that CRP levels on admission were higher in severe COVID-19 patients than in non-severe COVID-19 patients [33, 34].

Several limitations of this study should be acknowledged. First, because of the limited data source, external validation needs to be performed in further studies. Second, the dynamic changes of some important biomarkers should be followed up to identify patients at higher risk of death more accurately and in a timely manner. Third, because some markers, such as IL-6, procalcitonin and D-dimer, had too many missing values to be included, further studies should consider more markers in the development of the decision tree.

Conclusion

In summary, this study found that male COVID-19 patients were more prone to severe illness and death. Clinical characteristics and laboratory findings differed significantly between severe and non-severe groups, as well as between survivors and non-survivors. Most importantly, we proposed a simple, clinically operable and easy-to-interpret decision tree based on three biomarkers (LDH, NLR and CRP) that are easily obtained on admission in clinical practice, to help clinicians rapidly identify COVID-19 patients at high risk of death, who should be given priority treatment and intensive care.