Background

Chronic obstructive pulmonary disease (COPD) is associated with chronic airway inflammation and is characterized by persistent respiratory symptoms and airflow limitation [1]. COPD is the fourth leading cause of death and, according to the prediction of the World Health Organization, may become the third leading cause of death by 2030 [2]. A national cross-sectional study in 2018 investigated the lung health status of adults > 20 years old in 10 provinces of China and showed that the prevalence of COPD was 8.6% in adults > 20 years old and as high as 13.7% in adults > 40 years old, indicating a significant disease burden [3]. Acute exacerbation of COPD (AECOPD) refers to an aggravation of respiratory symptoms and is the main reason for hospitalization and medical expenditure among COPD patients [1, 4]. Approximately 63% of COPD patients have at least one readmission due to exacerbation within 1 year after hospitalization [5, 6]. AECOPD accelerates disease progression, reduces the quality of life of patients and increases the risk of death [7]. Early identification of patients at high risk of AECOPD and readmission, together with timely interventions to reduce their incidence, is therefore of great clinical significance for improving the prognosis of COPD patients and delaying the progression of the disease.

Several studies have explored the risk factors associated with AECOPD and readmission and reported that gender, hospital stay, medical aid care and duration of systemic steroid use contributed to AECOPD and readmission [8]. Age, tobacco use, diabetes mellitus, infections, obesity and frequency of hospital visits were also reported to influence the occurrence of AECOPD and readmission [9]. Currently, there is no internationally accepted model for predicting the readmission of AECOPD patients within one year after discharge. Existing prediction models of readmission in patients with AECOPD were established on data from populations in the USA or UK, and some of them focused on the risk of readmission within 30 days based on social factors or the LACE index (length of stay, acuity of admission, co-morbidities, and emergency department visits within the last 6 months) [10, 11]. Prediction models for readmission of AECOPD patients within 90 days were also established based on the PEARL score (previous admissions, eMRCD score, age, right-sided heart failure and left-sided heart failure) or the COPD-2-HOME score (CAT score, hyperinflation, obstruction, prior admission, eosinophilia) [12, 13]. Another prediction model of readmission within 90 days considered the importance of multimorbidity, frailty and poor socioeconomic status [14]. Njoku et al. [15] indicated that the prevalence of COPD-related readmission was about 2.6–82.2% within 30 days, 11.8–44.8% at 31–90 days, 17.9–63.0% at 6 months, and 25.0–87.0% at 12 months post-discharge, which suggests that it is important to predict the readmission of AECOPD patients not only within 30 or 90 days but also within one year. One prediction model for one-year readmission of COPD patients has been established, but it had a low area under the curve (AUC) value and its results were not validated [16]. No prediction model based on data from the Chinese population was available for the readmission of AECOPD patients within one year after discharge, so establishing such a model in China is of great value.

Gradient boosting machine (GBM) is a machine learning algorithm that assembles weak learners into a strong learner, improving the performance of the prediction model step by step through gradient descent on the loss function. The extreme gradient boosting (XGBoost) model is an extension of GBM that combines multiple learners to achieve better predictive performance than any of the constituent learners alone [17]. XGBoost applies a second-order Taylor expansion to the loss function, so that both the first and the second derivatives are used in the optimization. In addition, a regularization term is added to the objective function to limit the complexity of each tree and improve the generalizability of the model [18]. The XGBoost model is widely used for disease diagnosis and prediction because of its speed and good classification performance.
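
For readers who prefer a formula, the second-order approximation and the regularization term described above are commonly written as follows. The notation is the usual one from the XGBoost literature and is not taken from this study: l is the loss, f_t is the tree added at iteration t, g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction, T is the number of leaves, w_j are the leaf weights, and γ and λ are regularization constants.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^2,

g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2}.
```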

In this study, we collected the data of 650 patients with AECOPD from the Second Affiliated Hospital of Nanjing Medical University from Jan. 2016 to Dec. 2019. We investigated the risk factors for readmission within one year after treatment and discharge, and constructed an XGBoost model and a logistic regression model to compare their predictive performance for such readmission.

Methods

Study population

In the current study, 650 patients with AECOPD were recruited from the Second Affiliated Hospital of Nanjing Medical University between Jan. 2016 and Dec. 2019. Patient data were retrospectively extracted through a broad search of coding records and a review of COPD assessments routinely completed by clinicians or nurses. After excluding 12 patients who were readmitted because of pneumonia, 1 patient who was readmitted because of congestive heart failure and 1 patient who was readmitted because of perianal condyloma acuminatum, 636 participants were finally included. A hospitalization for AECOPD was identified through the International Classification of Diseases-10 code J44.1 [19]. Readmission within one year after discharge was defined as a readmission occurring within one year, counted from the day of discharge to the day of readmission [20]. All subjects were divided into a readmission group (n = 187) and a non-readmission group (n = 449). This study was approved by the Ethics Committee of the Second Affiliated Hospital of Nanjing Medical University (approval No. [2021]-KY-091-01). The screening process of the participants is shown in Fig. 1.

Fig. 1 The screening process of the participants in this study

Data collection

Participant data were collected to analyze the risk factors for readmission within one year. Gender, age (years), body mass index (BMI, kg/m²), acute exacerbation in the previous 1 year, smoking status, daily amount of smoking, duration of smoking, pack-years of smoking, the use of long-acting muscarinic antagonist (LAMA), long-acting β agonist (LABA), short-acting muscarinic antagonist (SAMA), short-acting β agonist (SABA), phosphodiesterase 4 inhibitor (PDE4I) and inhaled corticosteroids (ICS), and history of congestive heart failure, diabetes, hypertension, atrial fibrillation and other comorbid diseases were recorded at recruitment. Systolic blood pressure (mmHg), diastolic blood pressure (mmHg), respiratory rate (breaths/minute), temperature (℃), heart rate (beats/minute), hemoglobin (Hb, g/L), red blood cells (RBC, 10¹²/L), white blood cells (WBC, 10⁹/L), platelets (PLT, 10⁹/L), neutrophil count (NEUT, 10⁹/L), percentage of neutrophils (%), lymphocyte count (LYM, 10⁹/L), percentage of lymphocytes (%), monocyte count (MONO, 10⁹/L), percentage of monocytes (%), eosinophil count (EOS, 10⁹/L), percentage of eosinophils (%), red blood cell distribution width (RDW), alanine aminotransferase (ALT, U/L), aspartate aminotransferase (AST, U/L), total bilirubin (TBIL, μmol/L), albumin (ALB, g/L), blood urea nitrogen (BUN, μmol/L), creatinine (Cr, μmol/L), uric acid (μmol/L), high-sensitivity C-reactive protein (mg/dL), modified Medical Research Council (mMRC) grade, and total COPD assessment test (CAT) score were collected at admission. Data on forced expiratory volume in 1 s (FEV1, mL), forced vital capacity (FVC, mL), FEV1/FVC, FEV1 as a percentage of the predicted value, and the use of systemic glucocorticoids, antibacterial agents, oxygen therapy and mechanical ventilation were collected during treatment. During the 12 months ± 30 days of follow-up, the frequency of and reasons for any readmission were recorded.

Definitions of the variables

Congestive heart failure was defined as current symptoms and/or signs of heart failure together with either a left ventricular ejection fraction (LVEF) < 40%, or an LVEF ≥ 40% with elevated brain natriuretic peptide and at least one of the following: (1) left ventricular hypertrophy and/or left atrial enlargement; (2) abnormal diastolic function.

Diabetes was defined as a blood glucose level ≥ 11.1 mmol/L at any time after a meal, a fasting blood glucose level ≥ 7.0 mmol/L, a blood glucose level ≥ 11.1 mmol/L in the 2-h glucose tolerance test, or glycosylated hemoglobin ≥ 6.3%.

Hypertension was defined as systolic blood pressure ≥ 140 mmHg and/or diastolic blood pressure ≥ 90 mmHg, measured three times on different days without the use of antihypertensive drugs. Patients with a history of hypertension who were currently taking antihypertensive drugs were also classified as hypertensive even if their blood pressure was lower than 140/90 mmHg.

Smoking status was classified as never smoking, former smoking or current smoking. Never smoking was defined as having smoked fewer than 100 cigarettes in a lifetime, and former smoking as having stopped smoking for more than 1 year. Pack-years of smoking were calculated as the number of packs smoked per day (20 cigarettes counted as 1 pack) × the number of years of smoking [21].
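
The pack-year calculation defined above can be sketched in a few lines. This is only an illustration of the arithmetic; the function name and the input validation are our own choices and do not come from the study.

```python
def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Pack-years = packs per day (20 cigarettes = 1 pack) x years of smoking."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    packs_per_day = cigarettes_per_day / 20
    return packs_per_day * years_smoked
```

For example, a patient who smoked 30 cigarettes a day for 20 years would be recorded with pack_years(30, 20), that is, 1.5 packs per day over 20 years.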

XGBoost model

The XGBoost model is an ensemble learning algorithm based on gradient-boosted trees. It handles sparse data with a sparsity-aware learning algorithm and uses a weighted quantile sketch for approximate tree learning [22].

Statistical analysis

All statistical tests were two-sided. Normally distributed measurement data were described as mean ± standard deviation (mean ± SD), and the independent-samples t test was applied for comparisons between groups. Non-normally distributed data were expressed as median and interquartile range [M (Q1, Q3)], and differences between groups were compared with the Mann–Whitney U rank sum test. Enumeration data were shown as n (%), and the chi-square test or Fisher’s exact test was used for comparisons between groups. Missing values were imputed by the random forest method with 100 trees via the missForest package in R version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria) [23], and a sensitivity analysis was performed on the data before and after imputation. To explore the risk factors for readmission in AECOPD patients within one year, the differences between groups were analyzed first, and the variables with statistical significance were entered into a multivariate logistic model with backward stepwise selection. For the prediction models, 70% of the samples formed the training set used to construct the models and the remaining 30% formed the testing set used to evaluate them [24, 25]; an equilibrium (balance) analysis was conducted between the training set and the testing set. Variables with P < 0.1 were included in the logistic model and the extreme gradient boosting (XGBoost) model, and the model parameters were tuned. The area under the curve (AUC), the Kolmogorov–Smirnov (KS) statistic, sensitivity, specificity and accuracy were used to evaluate the performance of the models, and receiver operating characteristic (ROC) curves were plotted. SAS 9.4 and R 3.6 were used for data analysis, and P < 0.05 was considered statistically significant.
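
To make two of the evaluation measures concrete, the sketch below computes the AUC (by the trapezoidal rule over the ROC points) and the KS statistic (the largest gap between the true-positive and false-positive rates) from predicted probabilities. It is a plain-Python illustration under our own assumptions and is not the SAS or R code used in the study; the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold, from high to low."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Treat tied scores as a single step so ties do not inflate the curve.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def ks_statistic(points):
    """Kolmogorov-Smirnov statistic: maximum of TPR minus FPR over all thresholds."""
    return max(tpr - fpr for fpr, tpr in points)
```

Both measures are derived from the same ROC points, so they can be computed in one pass for the training set and again for the testing set.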

Results

Handling of missing data

Variables with more than 25% missing values were removed (most of them were discharge-related data, including arterial partial pressure of oxygen, arterial partial pressure of carbon dioxide, arterial oxygenation, pH, and medications at discharge), and the random forest method was used to impute missing values for the remaining variables. The sensitivity analysis before and after imputation is shown in Table 1. No evident bias was introduced by imputing the missing data.

Table 1 Sensitivity analysis of the data before and after interpolation

Comparisons of baseline data between readmission group and non-readmission group

As shown in Table 2, age (72.21 years vs. 70.24 years, t = −2.295, P = 0.022), MONO count (0.48 × 10⁹/L vs. 0.43 × 10⁹/L, Z = 2.438, P = 0.015), mMRC grade (t = 5.963, P < 0.001), total CAT score (22.04 vs. 19.55, t = −5.475, P < 0.001), and the proportions of patients with acute exacerbation in the previous 1 year (54.01% vs. 23.39%, χ² = 56.542, P < 0.001), patients using LAMA (32.09% vs. 21.60%, χ² = 7.802, P = 0.005), LABA (47.06% vs. 30.07%, χ² = 16.741, P < 0.001) and ICS (43.32% vs. 29.62%, χ² = 11.089, P < 0.001), and patients receiving systemic glucocorticoids (32.62% vs. 22.94%, χ² = 6.465, P = 0.011), oxygen therapy (85.03% vs. 78.17%, χ² = 3.903, P = 0.048) and mechanical ventilation (2.67% vs. 0.67%, χ² = 4.276, P = 0.039) were higher in the readmission group than in the non-readmission group. The percentage of lymphocytes (17.60 vs. 19.80, Z = −2.031, P = 0.042), ALT (13.30 U/L vs. 16.22 U/L, Z = −3.176, P = 0.002), AST (16.50 U/L vs. 19.00 U/L, Z = −2.896, P = 0.004), FEV1/FVC (63.36 vs. 68.87, t = 4.845, P < 0.001) and FEV1 as a percentage of the predicted value (52.60 vs. 58.40, Z = −3.076, P = 0.002) were lower in the readmission group than in the non-readmission group.

Table 2 Comparisons of characteristics of patients between readmission group and non-readmission group

Risk factors of readmission in patients with AECOPD within one year

Variables that differed significantly between groups at baseline were entered into the multivariate logistic regression analysis with backward stepwise selection, adjusted for age and gender. Patients with acute exacerbations within the previous 1 year had 4.086 times the odds of readmission of those without acute exacerbations within the previous 1 year (OR = 4.086, 95% CI 2.723–6.133, P < 0.001). Patients using LABA had 4.550 times the odds of readmission of those not using LABA (OR = 4.550, 95% CI 1.587–13.042, P = 0.005). ICS use was associated with 77.3% lower odds of readmission compared with no ICS use (OR = 0.227, 95% CI 0.076–0.672, P = 0.007). Each one-unit increase in ALT was associated with 1.5% lower odds of readmission (OR = 0.985, 95% CI 0.971–0.999, P = 0.042). Each 1-point increase in total CAT score was associated with 9.1% higher odds of readmission (OR = 1.091, 95% CI 1.048–1.136, P < 0.001) (Table 3).
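
As a reading aid for the odds ratios in Table 3, the sketch below converts an odds ratio into a percentage change in the odds and scales a per-unit odds ratio to a larger increment. It relies only on the standard logistic-regression relationship (log-odds are linear in the predictor); the function names are hypothetical and the code is not part of the original analysis.

```python
def percent_change_in_odds(odds_ratio: float) -> float:
    """Percentage change in the odds implied by an odds ratio (negative = lower odds)."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return (odds_ratio - 1.0) * 100.0


def odds_ratio_for_increment(or_per_unit: float, units: float) -> float:
    """Odds ratio for a change of `units` when the per-unit odds ratio is `or_per_unit`."""
    if or_per_unit <= 0:
        raise ValueError("an odds ratio must be positive")
    return or_per_unit ** units
```

Under this reading, an odds ratio of 0.227 corresponds to odds that are about 77.3% lower, and an odds ratio of 1.091 per CAT point corresponds to about 9.1% higher odds per point. Odds ratios approximate relative risks only when the outcome is rare, which should be kept in mind for an outcome as frequent as one-year readmission.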

Table 3 Predictors analysis of readmission

The equilibrium test of training set and testing set

All samples were randomly divided into the training set and the testing set (7:3). The results of equilibrium analysis after division showed that there was no statistical significance in the differences of variables between the training set and the testing set (Table 4).

Table 4 The equilibrium test of training set and testing set

Construction of the logistic model and validation of its predictive value in the testing set

Variables with statistically significant differences were entered into the logistic model using backward stepwise selection, with age and gender included. The results are shown in Table 5. Patients with acute exacerbations within the previous 1 year had 3.863 times the odds of readmission of those without acute exacerbations within the previous 1 year (OR = 3.863, 95% CI 2.349–6.351, P < 0.001). LABA use was associated with 5.556 times the odds of readmission (OR = 5.556, 95% CI 1.577–19.577, P = 0.008). ICS treatment was associated with 75.3% lower odds of readmission compared with no ICS treatment (OR = 0.247, 95% CI 0.067–0.908, P = 0.035). Each 1-point increase in total CAT score was associated with 11.0% higher odds of readmission (OR = 1.110, 95% CI 1.055–1.168, P < 0.001).

Table 5 Construction of logistic model

The AUC value of the logistic model was 0.743 (95% CI 0.692–0.795) in the training set and 0.699 (95% CI 0.617–0.780) in the testing set (Fig. 2). The sensitivity was 0.702 (95% CI 0.621–0.782) in the training set and 0.667 (95% CI 0.550–0.783) in the testing set. The specificity was 0.726 (95% CI 0.677–0.775) in the training set and 0.664 (95% CI 0.582–0.746) in the testing set. The accuracy was 0.719 (95% CI 0.677–0.761) in the training set and 0.665 (95% CI 0.598–0.732) in the testing set. The cutoff point in the training set was 0.282 (Table 6).
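
The threshold-based measures reported above can be illustrated as follows: predicted probabilities are turned into class labels at a cutoff, and sensitivity, specificity and accuracy are computed with a 95% interval. This is a minimal sketch under two assumptions that the text does not state: that a probability equal to or above the cutoff counts as a predicted readmission, and that a normal-approximation (Wald) interval is used. The function names are hypothetical.

```python
import math


def wald_interval(successes, total, z=1.96):
    """Proportion with a normal-approximation (Wald) interval, clipped to [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def evaluate_at_cutoff(y_true, scores, cutoff):
    """Sensitivity, specificity and accuracy (each with an interval) at a given cutoff."""
    # Assumption: score >= cutoff is classified as readmission (label 1).
    y_pred = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": wald_interval(tp, tp + fn),
        "specificity": wald_interval(tn, tn + fp),
        "accuracy": wald_interval(tp + tn, len(y_true)),
    }
```

Calling evaluate_at_cutoff(labels, probabilities, 0.282) on a data set returns each measure as a (estimate, lower, upper) tuple. The cutoff is fixed on the training set and then reused unchanged on the testing set.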

Fig. 2 The ROC curve showing the AUC value of the logistic model

Table 6 The predictive value of the logistic model

Construction of the XGBoost model and validation of its predictive value in the testing set

Variables with P < 0.1 were entered into the XGBoost model, together with age and gender. After hyperparameter tuning with GridSearchCV, the optimal parameters of the model were a tree depth of 2, 50 trees and a learning rate of 0.01. Variable importance was evaluated with the weight method, that is, by the number of times each variable was used to split a node across the trees of the model. Acute exacerbation in the previous 1 year, the CAT score, and SABA and LABA use ranked among the more important variables in the XGBoost model (Fig. 3).
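
A grid search of this kind might look like the sketch below, which uses scikit-learn's GridSearchCV with the XGBoost scikit-learn wrapper. Only the reported optimum (depth 2, 50 trees, learning rate 0.01) comes from the text. The other candidate values, the 5-fold cross-validation, the AUC scoring and the function names are our assumptions, so this should not be read as the authors' actual code.

```python
# Hypothetical search grid: it contains the reported optimum, but the
# remaining candidate values are assumed for illustration only.
PARAM_GRID = {
    "max_depth": [2, 3, 4],
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1],
}


def tune_xgboost(X_train, y_train, param_grid=PARAM_GRID, cv=5):
    """Grid-search an XGBoost classifier; cv=5 and AUC scoring are assumptions."""
    # Imported lazily so the grid can be inspected without the libraries installed.
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    search = GridSearchCV(
        estimator=XGBClassifier(objective="binary:logistic"),
        param_grid=param_grid,
        scoring="roc_auc",
        cv=cv,
    )
    search.fit(X_train, y_train)
    return search


def split_count_importance(fitted_model):
    """'Weight' importance: how many times each feature is used to split a node."""
    return fitted_model.get_booster().get_score(importance_type="weight")
```

After fitting, search.best_params_ holds the selected combination and split_count_importance(search.best_estimator_) returns the split counts per feature, which is the quantity plotted by a weight-based importance chart.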

Fig. 3 The weight method revealing the importance of variables in the XGBoost model

As shown in Table 7, the AUC value of the XGBoost model was 0.814 (95% CI 0.812–0.815) in the training set and 0.722 (95% CI 0.720–0.725) in the testing set (Fig. 4). The sensitivity was 0.702 (95% CI 0.621–0.782) in the training set and 0.635 (95% CI 0.516–0.754) in the testing set. The specificity was 0.826 (95% CI 0.784–0.867) in the training set and 0.750 (95% CI 0.675–0.825) in the testing set. The accuracy was 0.791 (95% CI 0.753–0.829) in the training set and 0.712 (95% CI 0.648–0.776) in the testing set. The cutoff point in the training set was 0.451.

Table 7 The predictive value of XGBoost model

Fig. 4 The ROC curve showing the AUC value of the XGBoost model

Comparisons of the predictive abilities of the two prediction models

Both the logistic model and the XGBoost model were built as prediction models, and their performance was compared. The AUC value of the logistic model was 0.743 in the training set but fell to 0.699 in the testing set, whereas the XGBoost model achieved higher AUC values in both the training set (0.814) and the testing set (0.722).

Discussion

This study collected the data of 650 patients with AECOPD, evaluated the risk factors for readmission within one year after treatment and discharge, and constructed a logistic model and an XGBoost model to predict that risk. Acute exacerbation within the previous 1 year, LABA use and a higher total CAT score were risk factors for readmission of AECOPD patients within one year, whereas ICS use and a higher ALT level were protective factors. We also compared the predictive performance of the two models and found that the XGBoost model showed better predictive value.

A history of exacerbation has previously been reported to be an independent predictor of future exacerbations [26]. Bernabeu-Mora et al. indicated that the number of hospitalizations due to exacerbations in the previous year increased the risk of readmission by 4.44 times [27]. These results support the finding of this study that patients with acute exacerbations within the previous 1 year had 4.086 times the odds of readmission of those without. The CAT is a simple and quick questionnaire for measuring the severity and impact of symptoms in COPD patients and for guiding the choice of appropriate treatment in clinical practice [28]. Multiple studies have shown that the CAT score has good internal consistency and test–retest reliability in both stable and exacerbating COPD [29]. CAT scores were higher in patients with a history of frequent exacerbations [30]. Herein, higher CAT scores were associated with a higher risk of readmission within one year. This may be because higher CAT scores are correlated with higher concentrations of serum C-reactive protein and plasma fibrinogen, indicating that systemic inflammation is more severe in patients with higher CAT scores [31]. AECOPD patients with a high CAT score should receive timely intervention and regular follow-up after discharge to prevent readmission. At present, the treatment of COPD aims to prevent the deterioration of lung function and to alleviate symptoms in order to decrease the risk of exacerbations [32]; short-acting and long-acting bronchodilators, including LAMA, LABA, SAMA and SABA, as well as ICS, are frequently used [33]. SABA is often applied on an as-needed basis for symptom relief in COPD patients [34]. However, previous studies also found that SABA use might carry a higher risk of readmission than other therapies, including arformoterol tartrate [35, 36]. In our study, SABA was identified as an important variable influencing the risk of readmission in AECOPD patients within one year after discharge. ICS use reduces the frequency and severity of exacerbations [37]. In the present study, patients receiving ICS treatment had a lower risk of readmission than patients without ICS treatment, which is consistent with previous studies. For patients with AECOPD, ICS treatment should be provided appropriately to prevent readmission [38]. Another study reported that the incidence of adverse events was higher in patients with LABA treatment (50.2%) than in patients without LABA treatment, and that exacerbation of COPD was the most commonly reported adverse event [39]. This may increase readmissions and supports our finding that LABA treatment may be associated with a higher risk of readmission [34, 35]. For AECOPD patients, LABA should therefore be used with caution.

This study assessed the predictors of readmission within one year in patients with AECOPD and established two prediction models, a logistic model and an XGBoost model. Missing values were imputed with the random forest method, which constructs multiple decision trees; the imputed data retain randomness and uncertainty and can therefore better reflect the real distribution of the unknown data. The random forest method is well suited to imputing high-dimensional data because each branch node considers a random subset of the features rather than all features when the decision trees are constructed, which supports the accuracy and reliability of the imputed data. The predictive values of the models were validated in the testing set. Because of the small sample size, 70% of the samples were assigned to the training set and 30% to the testing set; this split retained enough samples to construct more reliable models while leaving some samples to validate their performance. ROC curves were drawn to display the results of the respective models. The AUC value of the logistic model was 0.743 in the training set but 0.699 in the testing set, whereas the XGBoost model showed higher AUC values in both the training set (0.814) and the testing set (0.722), indicating that the XGBoost model was better than the logistic model at predicting the risk of readmission within one year in patients with AECOPD. Currently, prediction of the risk of readmission in patients with AECOPD has focused on 30 days and 90 days after treatment and discharge [10, 11, 12, 40], but these periods cannot fully represent disease progression, whereas readmission within one year reflects the long-term prognosis of patients. A previous study established a prediction model for one-year readmission of COPD patients, but its AUC value was only 0.703 and the results were not validated [16]. In contrast, we compared a logistic model with an XGBoost model and found that the AUC values of the XGBoost model were higher in both the training set and the testing set. The variables in our model are routinely collected by clinicians, and based on these variables the XGBoost model can quickly predict the probability of readmission within one year after discharge in AECOPD patients. In addition, our prediction model has been uploaded to GitHub and is freely accessible (https://github.com/shipingchenmedicine123/XGBoost-model). Instructions for using the model are given in Additional file 1: File 1. We welcome more clinicians to use our model to validate the results of our study. The findings of the current study might help to identify early those patients at high risk of relapse and readmission, especially patients with moderate exacerbations, and to provide timely interventions that reduce the incidence of AECOPD and readmission, improve the prognosis of COPD patients and delay the progression of the disease.

A strength of this study is that missing data were imputed without introducing evident bias, which may make the results more reliable. Internal validation was also conducted to verify the results. The study has several limitations. Firstly, the sample size was small and the data came from a single center, which might decrease the statistical power, especially for variables with few cases, such as SABA use. Therefore, the results of our study should be interpreted with caution. Secondly, external validation of the findings was not performed. Thirdly, subgroup analysis was not performed for patients with mild, moderate and severe exacerbations. In the future, well-designed multi-center studies with large sample sizes and external validation are required to verify the results of the present study.

Conclusions

In the current study, we constructed two models to predict the risk of readmission within one year in AECOPD patients based on predictors including acute exacerbation within the previous 1 year, LABA use, ICS use, ALT level and the total CAT score. The XGBoost model showed better predictive value for this risk than the logistic model. Acute exacerbation in the previous 1 year, the CAT score, and SABA and LABA use were the more important variables in the XGBoost prediction model. These findings might help to identify AECOPD patients at high risk of readmission within one year and to provide timely interventions and treatment to prevent the recurrence of AECOPD.