Introduction

Hip fractures (HF) are devastating osteoporotic fractures, closely associated with high morbidity, high mortality, and poor prognosis [1, 2]. HF is one of the most common fractures in older adults, accounting for more than 14% of fractures in this population [3]. Although its incidence has declined in developed countries, the absolute number of HF cases is growing as population aging progresses worldwide [4,5,6,7,8]. One study projects that the number of people with HF will reach 6.3 million by 2050 [9]. Although different types of HF call for different surgical options [10, 11], surgical treatment can significantly improve patient prognosis. The poor prognosis of HF is closely related to postoperative complications [12], and effective perioperative management of patients with hip fractures can significantly reduce their number [13, 14]. The most frequent postoperative complication of HF is postoperative pneumonia (POP), which increases mortality and length of hospital stay [15, 16]. In patients with POP, the risk of death increased approximately 3-fold (to 43%) at 30 days and 2.4-fold (to 71%) at 1 year [15]. To improve patient prognosis, it is crucial to identify patients at high risk of developing POP early and to intervene accordingly. Machine learning (ML) algorithms are widely used to construct clinical prediction models. As an important subfield of artificial intelligence, ML can learn from databases and often achieves better predictive performance than traditional linear models [17].

The aim of this study was to develop a machine learning prediction model that identifies patients at high risk of POP early, based on data collected at admission. Such a model can assist clinicians in decision-making, enabling early intervention in high-risk patients and reducing the incidence of POP.

Materials and methods

Data collection

This study included patients hospitalized for hip fracture at a university hospital from May 2016 to November 2022. Relevant data were extracted from the electronic medical record system. Inclusion criteria: (1). Admitted to the hospital for hip fracture; (2). Aged 60 years or older. Exclusion criteria: (1). Not treated surgically; (2). Preoperative diagnosis of lung infection; (3). Multiple injuries; (4). More than 20% of data missing; (5). Acute cardiovascular or cerebrovascular disease, cancer, or other diseases with a serious impact on the patient's prognosis; (6). Pathological fractures.

The diagnosis of POP was based on the Centers for Disease Control and Prevention's diagnostic criteria [18]. In this study, POP was diagnosed when the following events were identifiable in the electronic medical record system between 24 h after surgery and discharge: (1) new pulmonary infiltrates, consolidation, or cavity formation on imaging (X-ray or CT); (2) fever (> 38 °C) with other causes excluded, leukopenia (leukocyte count < 4 × 10⁹/L) or leukocytosis (leukocyte count > 12 × 10⁹/L), or, for adults over 70 years of age, altered mental status with other recognized causes excluded; (3) documented increase in respiratory secretions, cough and sputum, dyspnea, pulmonary rales, or bronchial breath sounds.

The variables we extracted included general patient characteristics, common geriatric chronic diseases, and routine laboratory results available within 24 h of admission; the specific items are detailed in Table 1. Because different testing reagents and methods produce different reference ranges for laboratory results, we converted all laboratory results into dichotomous variables according to whether they exceeded the upper limit or fell below the lower limit of the local reference range.
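The dichotomization step described above can be sketched as follows. The study's analysis was performed in R; this minimal Python illustration is purely for clarity, and both the `dichotomize` helper and the WBC reference range shown are hypothetical, not taken from the study.

```python
def dichotomize(value, low, high):
    """Return 1 if a lab value falls outside its reference range, else 0."""
    return int(value < low or value > high)

# Hypothetical reference range for white blood cell count: 4-10 x 10^9/L
wbc_abnormal = dichotomize(11.2, low=4.0, high=10.0)  # -> 1 (above upper limit)
wbc_normal = dichotomize(7.0, low=4.0, high=10.0)     # -> 0 (within range)
```

In practice each laboratory item would use the reference range reported by the local lab, which is why the paper dichotomizes rather than pooling raw values across assays.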

Table 1 Characteristics of patients in the training set

Two authors independently extracted the data, and a third author confirmed the veracity of the data. The study was approved by the hospital ethics review committee (number: KYXM-202302-005). An informed consent waiver was obtained because the study was retrospective and the personal information of the patients was withheld during the analysis. All procedures performed in this study were in accordance with the 1964 Declaration of Helsinki and its amendments.

Statistical analysis

Missing data were handled by multiple imputation using the "mice" package in R. Non-normally distributed continuous variables were expressed as median (interquartile range), and categorical variables as percentages. Continuous variables were compared using the Mann–Whitney U test; categorical variables were compared using the chi-square test or Fisher's exact test.
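As a rough illustration of the rank-based comparison used for continuous variables, the Mann–Whitney U statistic can be computed as below. The original analysis was done in R; this pure-Python sketch (the function name `mann_whitney_u` is ours) computes only the U statistic with midranks for ties, not the p-value.

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for two independent samples (midranks for ties)."""
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # find the run of tied values starting at position i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average (1-based) rank shared by the ties
        for k in range(i, j + 1):
            ranks[combined[k][1]] = midrank
        i = j + 1
    r1 = sum(ranks[:len(x)])             # rank sum of the first sample
    return r1 - len(x) * (len(x) + 1) / 2  # U statistic for sample x
```

U ranges from 0 (every value in `x` below every value in `y`) to `len(x) * len(y)` (the reverse); a p-value would then come from the normal approximation or exact tables.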

All patients included in the analysis were randomly divided into training and validation sets in a 70:30 ratio. To avoid the effect of multicollinearity among variables, we used the Least Absolute Shrinkage and Selection Operator (LASSO) to screen variables [19]. The screened variables were then subjected to correlation tests to check for multicollinearity, and a correlation heatmap was drawn. Correlation coefficients range over [− 1, 1]; the larger the absolute value, the stronger the correlation, and an absolute value greater than 0.4 was taken to indicate a significant correlation. The screened variables were incorporated as final features in the machine learning models. Prediction models were built with seven machine learning algorithms: Classification and Regression Tree (CART), Gradient Boosting Machine (GBM), k-Nearest Neighbors (KNN), Logistic Regression (LR), Neural Network (NNet), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). Tenfold cross-validation repeated ten times was used as the resampling scheme to ensure the stability and reproducibility of model performance. The receiver operating characteristic (ROC) curve was used to evaluate predictive performance; the higher the area under the curve (AUC), the better the model's discrimination. Accuracy, sensitivity, specificity, the Kappa value, and the Matthews correlation coefficient (MCC) were used as additional descriptions of predictive ability. The Kappa value measures the consistency between predicted and actual values; it ranges over [− 1, 1], with values closer to 1 indicating better consistency [20]: > 0.75 is excellent, 0.40–0.75 is good, and < 0.40 is poor.
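The Kappa statistic and the consistency bands quoted above can be sketched directly from their definitions. The study's computations were done in R; this Python version, with hypothetical helper names, is for illustration only.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for class labels: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # observed agreement: fraction of predictions matching the true label
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # expected chance agreement, from the marginal class frequencies
    p_e = sum(
        (sum(t == c for t in y_true) / n) * (sum(p == c for p in y_pred) / n)
        for c in set(y_true) | set(y_pred)
    )
    return (p_o - p_e) / (1 - p_e)

def kappa_label(k):
    """Consistency bands used in the text: >0.75 excellent, 0.40-0.75 good, <0.40 poor."""
    return "excellent" if k > 0.75 else "good" if k >= 0.40 else "poor"
```

Unlike raw accuracy, kappa discounts the agreement a model would reach by predicting the majority class alone, which matters for a dataset where only about 9% of patients are positive.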
Because positive events were rare in this study, the MCC provides a more balanced reflection of predictive accuracy on such an imbalanced dataset [21]. It ranges over [− 1, 1]; the closer to 1, the more accurate the prediction, with values above 0.5 considered good and values above 0.7 indicating high accuracy. The Brier score was used to evaluate model calibration; it ranges over [0, 1], with values closer to 0 indicating better calibration and values below 0.25 considered acceptable [22]. Calibration curves were used as a complementary illustration of calibration. Decision curve analysis (DCA) was used to evaluate the clinical utility of the models in decision-making. These evaluation metrics were combined to select the best machine learning model, and SHapley Additive exPlanations (SHAP) values were used to interpret it [23].
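The two imbalance-aware metrics above follow directly from their definitions, sketched here in Python for illustration (the study itself used R).

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # convention: return 0 when any marginal is empty and the denominator vanishes
    return (tp * tn - fp * fn) / denom if denom else 0.0

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
```

Because MCC uses all four cells of the confusion matrix, a model that scores every patient as negative earns an MCC near 0 despite high accuracy on a 9% positive-rate dataset, which is exactly why the paper reports it alongside AUC.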

All statistical analyses, model construction and validation in this study were based on R software (version 4.1.3).

Results

After screening against the inclusion and exclusion criteria, 805 patients were included in the study, of whom 75 (9.3%) developed POP; the entire screening and analysis process is shown in the flow chart (Fig. 1). The dataset was randomly divided 70:30 into a training set (n = 563) and a validation set (n = 242), with no statistically significant differences between the two sets for almost all variables (Additional file 1: S1). We extracted 28 variables from each patient; the patient characteristics in the training set are shown in Table 1.

Fig. 1
figure 1

Flowchart of data screening and analysis

To avoid multicollinearity among the variables included in the model, LASSO regression was used to screen features. When lambda was set to lambda.min (0.01331355), nine features with nonzero coefficients were selected (Fig. 2): Age, CI, COPD, WBC, HB, GLU, STB, GLOB, and Ka. Correlation analysis among these nine variables was then performed and a correlation heatmap drawn (Fig. 3). All pairwise correlations had absolute values below 0.4, indicating no significant correlations among the screened variables. The screened variables were used as features to construct prediction models with the seven machine learning algorithms (CART, GBM, KNN, LR, NNet, RF, XGBoost).

Fig. 2
figure 2

The potential risk factors were selected using LASSO regression. a Coefficient path plot: each colored curve shows how one variable's coefficient changes with λ. b Cross-validation results: the left vertical line marks λ min and the right vertical line marks λ 1se. λ min is the λ value corresponding to the minimum mean squared error (MSE) across all λ values; λ 1se is the λ value giving the simplest model whose cross-validated error lies within one standard error of that minimum

Fig. 3
figure 3

Heatmap of correlation analysis between variables

The performance of the models constructed by each algorithm was determined by resampling with tenfold cross-validation repeated ten times. AUC values were calculated from the ROC curves. In the training set (Fig. 4a), the AUC values (95% confidence interval) of the CART, GBM, KNN, LR, NNet, RF, and XGBoost algorithms were 0.981 (0.971, 0.991), 0.965 (0.945, 0.985), 0.969 (0.956, 0.983), 0.784 (0.72, 0.849), 0.849 (0.794, 0.904), 0.978 (0.96, 0.996), and 0.996 (0.992, 0.999); in the validation set (Fig. 4b), they were 0.997 (0.993, 1), 0.991 (0.982, 1), 0.983 (0.968, 0.997), 0.75 (0.658, 0.841), 0.907 (0.855, 0.958), 0.99 (0.979, 1), and 0.998 (0.994, 1), respectively (Table 2). The ROC curves of the models constructed by each algorithm are shown in Additional file 1: S2–S15. Accuracy, sensitivity, specificity, Kappa, and MCC values, as additional descriptions of predictive ability, are shown in Table 2. For datasets with an unbalanced outcome distribution, the MCC reflects actual predictive ability better than the AUC; by this measure, only the models constructed by the KNN and XGBoost algorithms achieved good accuracy. The Brier scores of the CART, GBM, KNN, LR, NNet, RF, and XGBoost algorithms were 0.038, 0.038, 0.047, 0.075, 0.065, 0.041, and 0.017 in the training set, and 0.023, 0.029, 0.051, 0.081, 0.058, 0.042, and 0.016 in the validation set, respectively (Table 2). All Brier scores were below 0.25, indicating acceptable calibration for every model. The calibration curves of each model, as a supplement to the calibration assessment, are shown in Additional file 1: S16–S29. The DCA curves show that in both the training set (Fig. 4c) and the validation set (Fig. 4d), the models achieve a higher net benefit than the "treat-all" or "treat-none" strategies over a wide range of thresholds.
The DCA curves of the models constructed by each algorithm are shown in Additional file 1: S30–S43. Combining the results of all performance evaluations, the model constructed by the XGBoost algorithm performed best. We further plotted a SHAP summary plot to interpret the XGBoost model (Fig. 5). For each feature, each point corresponds to a patient, and its position on the x-axis (the SHAP value) indicates the effect of that feature on the model output for that patient. Features are ordered by importance on the y-axis, with Age, STB, and GLU being the three most important variables in the model.

Fig. 4
figure 4

ROC curves and DCA curves for each model in the training and validation sets. a ROC curves in the training set. b ROC curves in the validation set. c DCA curves in the training set. d DCA curves in the validation set

Table 2 Evaluation metrics of the models constructed by each algorithm
Fig. 5
figure 5

Summary plot of SHAP values for the model constructed by the XGBoost algorithm. Features are sorted on the vertical axis in descending order of importance, with the upper variables contributing more to the model. The horizontal position (SHAP value) shows whether a feature value pushes the prediction higher or lower. The color of each point indicates whether the observed feature value is high (purple) or low (yellow)

Discussion

Predictive tools are becoming increasingly common in clinical practice; they are typically developed from datasets and used for prognostic and diagnostic prediction [24,25,26,27]. Traditional linear regression and supervised machine learning algorithms are commonly used to construct such models. In this study, machine learning prediction models were constructed to predict the risk of pneumonia after hip fracture surgery in older adults, based on data available early in admission. Of the seven commonly used algorithms tested, the model based on the XGBoost algorithm performed best. The model constructed in this study can identify patients at high risk of postoperative pneumonia early in the hospital admission. Early intervention in high-risk patients can prevent postoperative pneumonia to a certain extent, improve patient prognosis, and reduce the medical burden.

It has been reported that China's population over 60 years of age reached 249 million in 2018 and is expected to exceed 450 million by 2050 [28]. As the population ages, the number of hip fractures will continue to increase. For hip fracture, often called the "last fracture of life", surgical treatment is essential and can significantly reduce the 1-year mortality rate [29]. POP is the most common postoperative complication of hip fracture in the elderly, with an incidence of 4.9% to 15.2% [30,31,32]. POP is strongly associated with many short-term and long-term outcomes, including prolonged hospital stay, ICU admission, readmission, and mortality [33]. Therefore, earlier intervention that reduces the incidence of POP would benefit patients, their families, and the social healthcare system. Many variables have been identified as risk factors for POP after hip fracture, such as preoperative hypoproteinemia, COPD, CI, age, male sex, anemia, and diabetes mellitus [30, 32, 34]. In addition, surgery-related factors, such as time from injury to surgery, duration of surgery, and type of anesthesia, have also been shown to be risk factors for POP [32, 35]. Our results were similar, and the variables included in our models have been shown to be associated with postoperative pneumonia in many studies [36, 37].

Nomograms for predicting pneumonia after hip fracture surgery in the elderly have been constructed by Zhang et al. [38] and Xiang et al. [35]. Both nomograms have good AUC values (0.84 and 0.905, respectively); however, for data with an unbalanced distribution of dichotomous outcomes, simply reporting AUC values is not sufficient, and the model's ability to predict positive events matters even more. Our study addresses this issue by also reporting MCC values. Furthermore, the variables included in their analyses involved surgery-related factors, such as duration of surgery, time from injury to surgery, and type of surgery, which precludes identifying high-risk patients early in the admission and intervening in time. By contrast, the variables included in our model can be collected quickly and easily in all regions, which facilitates the use of the tool. More importantly, to our knowledge, ours is the first study to apply machine learning algorithms to predict pneumonia after hip fracture surgery. Given the current state of exploration in artificial intelligence, it is worthwhile to apply common machine learning algorithms to this field.

However, this study still has some limitations. (1). It is a retrospective study, so recall bias and selection bias are difficult to avoid. (2). All data were obtained from a single center, introducing some bias in population selection, so the model may have limited applicability to populations in other regions. (3). The amount of data included in the analysis was small; although the sample size met the basic requirements for model construction [39, 40], machine learning algorithms in particular require sufficiently large samples. Multicenter prospective studies with larger sample sizes are therefore needed to further validate our findings.

Summary

In this study, seven machine learning algorithms (CART, GBM, KNN, LR, NNet, RF, and XGBoost) were used to construct models predicting postoperative pneumonia in elderly patients with hip fracture. The model based on the XGBoost algorithm showed excellent performance and can be used to assist physicians in clinical decision-making, identifying high-risk patients early in the hospital admission and enabling earlier intervention.