Introduction

As people live longer, the incidence of hip fracture is increasing, and it is estimated that there will be approximately 4.5 million cases per year worldwide by 2050 [1]. Osteoporotic hip fractures severely affect older adults, resulting in substantially higher mortality and disability and a markedly reduced quality of life [2]. Approximately 25% of older adults who sustain a hip fracture die within the first year [3], a rate 8 times higher than the mortality rate in the general population of older adults [4]. Advanced age, male sex, clinical comorbidities, cognitive impairment, type of fracture, choice of treatment, and ambulatory status have been proposed as potential prognostic factors for mortality after hip fracture [4,5,6,7,8]. However, since the prediction of death is complex and multifactorial, mortality cannot be predicted using a single variable. Identification of patients at high risk of death facilitates the modification of associated perioperative factors, which can in turn reduce mortality.

The recent development of machine learning techniques enables the development of healthcare-related outcome prediction tools that include perioperative parameters and clinical variables [9]. These techniques can evaluate real-world data, which often have complex nonlinear relationships between variables [10], and are capable of building models with performances that exceed those of conventional prediction methods [11]. Many studies have evaluated the performances of machine learning methods, particularly mortality prediction algorithms that have been developed for cardiac surgery [12], liver resection following colorectal cancer metastasis [13], traumatic head injury [14], critically ill influenza patients [15], and surgery for hepatocellular carcinoma [16]. The studies found that these algorithms had a better performance than conventional regression techniques. Developing a high-performance prediction model is beneficial as the goal of predicting mortality is to identify high-risk patients and provide clinicians with opportunities to consider what to do next to improve outcomes in these patients.

The high rate of 1-year mortality among older adults with fragility hip fracture suggests the need for a similar machine learning approach to predict death in this vulnerable population. Several models are already available. Artificial neural networks and logistic regression are well-known methods and have been extensively studied [17,18,19,20,21,22]. Support Vector Machine [23, 24], Naive Bayes [20, 24] and Random Forests [22,23,24] have also been used to predict mortality after hip fracture. However, other novel methods, such as Gradient Boosting, have demonstrated high accuracy in predicting death [15, 25, 26] but have not yet been thoroughly explored for use in patients with hip fracture.

Since osteoporotic hip fracture occurs in a highly vulnerable population [27], an accurate prediction method would help clinicians identify patients who require special attention and additional services. This study aimed to develop and internally validate 7 machine learning models to predict 1-year mortality after fragility hip fracture in patients for whom a treatment decision (i.e. type of surgery or conservative treatment) had already been made. The models were the Gradient Boosting Classifier (GB), Random Forests Classifier (RF), Artificial Neural Network Classifier (ANN), Logistic Regression Classifier (LR), Naive Bayes Classifier (NB), Support Vector Machine Classifier (SVM), and K-Nearest Neighbors Classifier (KNN). We hypothesized that machine learning models could predict 1-year mortality after fragility hip fracture with high predictive performance.

Methods

Study design and population

This retrospective cohort study included patients with fragility hip fractures from a single center (the Department of Orthopedic Surgery, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand) from July 2016 to October 2018. The research protocol was approved by the Siriraj Institutional Review Board (approval number 122/2021), and the study was registered in the Thai Clinical Trials Registry on 22 February 2021 (registration number TCTR20210222003).

We used the International Classification of Diseases, Tenth revision (ICD-10) diagnosis codes S7200 (neck fracture of the femur), S7210 (intertrochanteric fracture of the femur), and S7220 (subtrochanteric fracture of the femur) to retrieve and review patient data from electronic medical records. Patients with fragility hip fracture were eligible for inclusion if they were aged 50 years or older and had a minimum follow-up period of 1 year or until death. The exclusion criteria were multiple fractures or fractures caused by cancer that had been confirmed by pathological study.

Hip fracture treatment protocol

All patients with fragility hip fractures were attended by our fracture liaison service (FLS). The service provided a multidisciplinary care team consisting of orthopedic surgeons, metabolic bone disease specialists, anesthesiologists, geriatricians, physical therapists, physiatrists, and nurses. The team members provided post-fracture care programs and secondary-fracture prevention measures for the patients. Initially, the FLS team was alerted when patients with fragility hip fracture visited the emergency department. The acute pain management protocol was followed by anesthesiologists who specialized in pain medicine. The patients were then seen by the FLS team in an orthopedic ward. A geriatrician evaluated their medical condition and performed preoperative medical optimization. If surgery was chosen, we operated as soon as the condition of the patient was suitable and an operating room was available. Conservative treatment was proposed for patients who already had a low probability of survival. A physical therapist started a rehabilitation program as early as possible to prevent complications from prolonged immobility. The multidisciplinary care team approach continued to play a key role in patient recovery, either after surgery or with conservative treatment. Physical therapists, in conjunction with psychiatrists in some cases, encouraged early mobilization for all patients. The physical therapists also assessed the risk of falling and planned appropriate home modification programs. FLS nurses facilitated the care process and reported each patient’s condition to other team members. As part of the secondary-fracture prevention program, metabolic bone specialists prescribed anti-osteoporosis medications, and osteoporosis education was given to patients and their families by orthopedic surgeons or nurses.
The discharge planning process was carried out from the beginning of admission and was aimed at providing continuing care when patients were ready to leave the hospital. The entire care process was flexible and was adjusted according to the condition of each patient. FLS team meetings were scheduled every week for team members to review the status of each patient and discuss how to improve the care process. All patients were followed up by telephone calls 3 and 12 months after discharge and yearly thereafter.

Data collection

Demographic and clinical data were collected from electronic medical records. The data related to age; sex; body mass index (BMI), as stratified by the World Health Organization expert consultation for Asian populations [28]; Charlson Comorbidity Index (CCI) score; underlying diseases (presence of stage 4 or 5 chronic kidney disease [CKD], heart disease, lung disease, cerebrovascular accident [CVA], or dementia); type of fracture (femoral neck fracture, intertrochanteric fracture, or subtrochanteric fracture); type of treatment (conservative, dynamic hip screw fixation, multiple screw fixation, cephalomedullary nailing, hemiarthroplasty, or total hip arthroplasty); time to surgery; pre-injury ambulatory status (bedridden, indoor dependent, outdoor dependent, indoor independent, or outdoor independent); and walking assistive device (no ambulation, without assistive device, wheelchair, walker, quad cane, tripod cane, or single cane). We interviewed the patients or their relatives by telephone to assess the living status of the patients (1-year mortality after hip fracture). Because these factors had been shown to be essential predictors of mortality after hip fracture, they were used to develop a prediction model [4, 8, 29,30,31,32].

Machine learning development process

Data preprocessing

A de-identified dataset of 492 patients was used in the study. Fifteen variables (3 continuous and 12 categorical) were collected. The continuous variables were age (integer), BMI (decimal number), and CCI score (integer). The categorical variables were sex, pre-injury status, pre-injury gait aid, CKD, heart disease, CVA, lung disease, dementia, diagnosis, type of treatment, time to surgery, and 1-year mortality after hip fracture. After standard dummy coding of the categorical variables, 3 continuous and 11 categorical predictors of 1-year mortality were included in the computational process (Fig. 1a). There were no missing data in the dataset.
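As an illustration of the dummy coding step, the following minimal sketch encodes one patient record. It assumes reference-level coding (the first level of each categorical variable is dropped); the text does not state which coding variant was used, and the function names, dictionary keys, and example patient are hypothetical.

```python
# Category levels taken from the variable descriptions in the text.
FRACTURE_TYPES = ["femoral neck", "intertrochanteric", "subtrochanteric"]


def dummy_code(value, levels):
    """Return one 0/1 indicator per level except the first (the reference)."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


def encode_patient(patient):
    """Build a numeric feature vector: continuous values, then indicators."""
    features = [patient["age"], patient["bmi"], patient["cci"]]
    features.append(1 if patient["sex"] == "male" else 0)
    features.extend(dummy_code(patient["fracture_type"], FRACTURE_TYPES))
    return features


# Hypothetical patient, not drawn from the study data.
example = {"age": 82, "bmi": 21.4, "cci": 5, "sex": "female",
           "fracture_type": "intertrochanteric"}
print(encode_patient(example))  # [82, 21.4, 5, 0, 1, 0]
```

The remaining categorical variables (for example, type of treatment or pre-injury ambulatory status) would be encoded in the same way.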

Fig. 1 Machine learning development process. (a) 3 continuous and 11 categorical predictors of 1-year mortality were taken into the computational process. (b) A stratified random sampling technique was applied to split patients in a 70:30 ratio into a training dataset and a testing dataset. (c) The training dataset was used to identify the optimal hyperparameters, which provided the highest accuracy in a fivefold internal cross-validation of each model. (d) The performance of all algorithms was evaluated with a separate, unseen testing dataset

Algorithm training and validation

We applied a stratified random sampling technique to split patients in a 70:30 ratio into a training dataset and a testing dataset (Fig. 1b). All variables were normalized to a scale of 0–1 to make the training process less sensitive to the scale of the variables. In this study, 7 machine learning classifier algorithms [33] were used: the Gradient Boosting Classifier (GB), Random Forests Classifier (RF), Artificial Neural Network Classifier (ANN), Logistic Regression Classifier (LR), Naive Bayes Classifier (NB), Support Vector Machine Classifier (SVM), and K-Nearest Neighbors Classifier (KNN). Using the training dataset (344 patients), manual parameter tuning, grid search, and random search were conducted to identify the optimal hyperparameters [34], defined as those that provided the highest accuracy in a fivefold internal cross-validation of each model (Fig. 1c). Subsequently, we evaluated the performance of all algorithms on the unseen testing dataset (148 patients; Fig. 1d). The reported evaluation measures consisted of the confusion matrix, accuracy, positive predictive value, negative predictive value, specificity, sensitivity, calibration plots, and area under the receiver operating characteristic curve (AUC). Calibration refers to how well the observed and the predicted outcomes match. For perfect calibration, the slope is 1 and the intercept is 0. We evaluated the contribution of each characteristic to the prediction model using SHAP (SHapley Additive exPlanations) values [35, 36]. In these processes, the Python programming language (version 3.8.3; Python Software Foundation, Wilmington, DE, USA) and Scikit-Learn (version 0.24.2; machine learning library) [33] were used. All computational processes were performed on a Windows Server 2016 Datacenter machine (2.2 GHz × 4 virtual processors, with 15.9 GB of random-access memory).
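The stratified 70:30 split and the 0–1 normalization can be sketched in plain Python as follows. This is an illustrative sketch, not the study code: the function names and seed are hypothetical, the count of 62 deaths is inferred from the reported 12.6% mortality among 492 patients, and fitting the scaling range on the training data only is an assumption that the text does not confirm.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), sampling each outcome class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def min_max_scale(train_values, values):
    """Rescale values to 0-1 using the training minimum and maximum."""
    lo, hi = min(train_values), max(train_values)
    span = (hi - lo) or 1.0  # guard against a constant column
    return [(v - lo) / span for v in values]


# Assumed outcome vector: 62 deaths (1) and 430 survivors (0).
labels = [1] * 62 + [0] * 430
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 344 148

# Hypothetical ages spanning the reported range of 50-101 years.
ages = [50, 65, 78, 101]
scaled_ages = min_max_scale(ages, ages)
```

Under these assumptions, the split reproduces the reported dataset sizes of 344 and 148 patients while keeping the proportion of deaths nearly equal in both sets.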

Statistical analysis

Comparisons were made of the baseline characteristics of the patients in the training and testing groups (Table 1), and of those who died and those who survived (Table 2). Continuous data were compared using Student’s t-test; the results are presented as mean ± standard deviation. Categorical data were compared using the chi-squared test or Fisher’s exact test; these results are given as number and percentage. Data analyses were performed using PASW Statistics for Windows (version 18; SPSS Inc., Chicago, IL, USA). Accuracy, sensitivity, specificity, and positive and negative predictive values of all models were calculated and compared using the DTComPair package (https://cran.r-project.org/web/packages/DTComPair/DTComPair.pdf). We also compared the AUC of all models by performing permutation testing using the coin package (https://cran.r-project.org/web/packages/coin/index.html) in R software version 4.1.1 (http://www.r-project.org/). A two-tailed P value < 0.05 was considered statistically significant.

Table 1 Comparison of the demographic and clinical characteristics of all patients, and of those in the training and testing groups
Table 2 Comparison of the demographic and clinical characteristics of patients who died and those who survived

Results

From July 2016 to October 2018, 498 patients with a hip fracture were admitted to our institution. Six patients had multiple fractures and were excluded from our study, leaving 492 for final analysis. Through telephone interviews, we were able to obtain the living status of all patients 1 year after the respective fragility hip fractures.

Baseline characteristics

The mean age of the study participants was 78.4 years (range, 50–101). Of the 492 enrolled patients, 72.2% were women and 27.8% were men (Table 1). Four hundred fifty-three patients (92.1%) had a CCI score ≥ 3, and 259 (52.6%) walked without an assistive device. The majority (55.5%) could ambulate outdoors independently before the hip fracture. Four hundred and sixty patients (93.5%) underwent operative treatment. During the study period, 50.7% of the patients were able to undergo surgery within 48 h of admission. The median length of stay was 11 days (interquartile range 8–17 days). The median time from admission to surgery was 2 days (interquartile range 1–4 days). The 1-year mortality rate after hip fracture was 12.6%.

We randomly assigned 344 and 148 patients to the training and testing datasets, respectively. There were no significant differences in the patient characteristics of the 2 datasets. A comparison of the characteristics of the patients who survived and those who died is presented in Table 2. It revealed that the deceased group was significantly older (P = 0.007), had a significantly higher proportion of male patients (P = 0.023), and had significantly higher prevalences of all 5 evaluated comorbidities (stage 4 or 5 CKD, heart disease, CVA, lung disease, and dementia; all P = 0.001 or P < 0.001). CCI score, pre-injury ambulatory status, type of fracture, and treatment were also significantly different.

Machine-learning performance comparisons

We used a dataset of 344 patients to train 7 machine learning models to predict 1-year mortality after fragility hip fracture. The performances of the 7 algorithms are detailed in Table 3. For the training dataset, the accuracies were 1.00 for GB, 0.97 for RF, 0.99 for ANN, 0.94 for LR, 0.90 for NB, 0.94 for SVM, and 0.94 for KNN. For the testing dataset, all models achieved high AUCs (between 0.81 and 0.99; Fig. 2). The RF model provided high predictive performance, with an accuracy of 0.95, a positive predictive value of 0.93, and a sensitivity of 0.68. The AUC of the RF model (0.99) differed significantly from those of the ANN (0.92), LR (0.95), NB (0.91), SVM (0.94), and KNN (0.81) models. The calibration of the models showed intercepts ranging from -0.09 to 0.35 and slopes ranging from 0.55 to 1.32 (see Supplementary file 1). The RF model was well calibrated, with a slope and intercept close to the optimal values.
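To make the relationship between the confusion matrix and the reported measures concrete, the sketch below computes them from four cell counts. The counts are hypothetical: they were chosen only because they are consistent with the rounded RF figures quoted above for a 148-patient testing set, and they are not taken from Table 3.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Standard screening-test measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # deaths correctly flagged
        "specificity": tn / (tn + fp),   # survivors correctly cleared
        "ppv": tp / (tp + fp),           # flagged patients who died
        "npv": tn / (tn + fn),           # cleared patients who survived
    }


# Hypothetical counts (sum = 148), consistent with the rounded RF results.
measures = diagnostic_measures(tp=13, fp=1, fn=6, tn=128)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

With these counts, accuracy is 141/148, the positive predictive value is 13/14, and sensitivity is 13/19, which round to 0.95, 0.93, and 0.68, respectively.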

Table 3 Comparison of the performance of each model, by confusion matrix and evaluation measures
Fig. 2 Receiver operating characteristic (ROC) curves of (a) the Random Forests algorithm (RF); (b) the Gradient Boosting algorithm (GB); (c) the Artificial Neural Network algorithm (ANN); (d) the Logistic Regression algorithm (LR); (e) the Naive Bayes algorithm (NB); (f) the Support Vector Machine algorithm (SVM); (g) the K-Nearest Neighbors algorithm (KNN); and (h) all algorithms

Machine-learning-model selection

The RF model demonstrated high accuracy, positive predictive value, and sensitivity. Because the model was designed as a screening test, we focused mainly on sensitivity, that is, the ability to detect at-risk patients who might die within 1 year after their fracture. The sensitivity of the RF model was not significantly different from that of the GB and ANN models. However, the RF model had a higher AUC than the ANN model. The RF model also had good calibration. Accordingly, the RF algorithm was selected for model construction.
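The AUC used in this comparison can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The following sketch computes it directly from that definition; the function name and the example scores are hypothetical and are not study data.

```python
def auc(scores, labels):
    """AUC as P(score of a death > score of a survivor); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for 3 deaths (1) and 3 survivors (0).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))  # 8 of 9 death-survivor pairs are ranked correctly
```

This pairwise form is quadratic in the number of patients, which is acceptable for a testing set of this size; library implementations typically use an equivalent rank-based calculation.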

Analysis of clinical variable contribution

Figure 3 illustrates the impact of each characteristic on the predictions of the trained RF algorithm across the entire dataset. The 5 most influential clinical characteristics were CCI score, heart disease, BMI, dementia, and lung disease. The best-tuned hyperparameters for the RF obtained from hyperparameter optimization were max_depth = 60, max_features = 'sqrt', min_samples_leaf = 4, min_samples_split = 5, n_estimators = 400, and random_state = 8. The best-tuned hyperparameters for all models are listed in Table 4. The receiver operating characteristic curve of the Random Forests algorithm is shown in Fig. 2a.
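For readers who wish to reproduce this kind of tuning, the sketch below shows how the reported RF hyperparameters could be passed to Scikit-Learn and refined with a fivefold, accuracy-scored grid search. It is a sketch under stated assumptions: the data are synthetic stand-ins generated with make_classification (not the study data), and the small search grid and the cross-validation seed are hypothetical, since the grids actually searched are not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training set: 344 rows, 14 predictors,
# roughly 13% positives. These are NOT the study data.
X, y = make_classification(n_samples=344, n_features=14, weights=[0.87],
                           random_state=8)

# Hyperparameters reported in the text for the selected RF model.
reported = dict(max_depth=60, max_features="sqrt", min_samples_leaf=4,
                min_samples_split=5, n_estimators=400, random_state=8)

# A deliberately small, hypothetical grid around two of those values.
param_grid = {"min_samples_leaf": [2, 4], "min_samples_split": [5, 10]}

search = GridSearchCV(
    RandomForestClassifier(**reported),
    param_grid,
    scoring="accuracy",  # the criterion named in the Methods
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=8),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the data here are synthetic, the selected values and the cross-validated accuracy carry no clinical meaning; the sketch only illustrates the mechanics of the search.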

Fig. 3 Characteristics of the selected model (Random Forests model): SHAP value summary graph of the top 20 variables and their impact on the prediction

Table 4 The best-tuned hyperparameters for each model

Machine learning model application

The trained RF algorithm subsequently used demographic and clinical information to construct a predictive model to estimate the probability of 1-year mortality of patients. A programming interface was developed to allow healthcare providers to access the application at www.hipprediction.com. Once details of key characteristics were entered, the prediction model was able to generate the probability of 1-year mortality of individual patients with fragility hip fractures. The characteristics were age; sex; BMI; pre-injury ambulatory status; assistive device usage; CCI score; type of fracture; type of operation; time to surgery; and the presence of CKD, heart disease, CVA, lung disease, and dementia.

Discussion

Patients with osteoporotic hip fracture are likely to suffer higher morbidity and mortality than non-fracture patients within the same age group [4, 37]. Adverse events after hip fracture can occur during hospitalization and the post-discharge period. They include events such as infection [38], heart failure [38], and thromboembolism [39]. As these events can lead to death in high-risk patients [40, 41], an attempt to identify those at risk of complications after hip fracture is a primary objective to reduce mortality. In patients predicted to be at risk for poor outcomes, interventions can be initiated to prevent complications and reduce the likelihood of death. Among the possible interventions, two are paramount. The first requires full and honest communication with the patient and family about the planned treatment and risks. The second requires prompt clinical decision-making by doctors, families, and caregivers to ensure that scarce resources, such as an intensive care unit, are effectively allocated, and that the need for additional services is determined and actioned. Examples of these services are intensive monitoring and optimization of patients’ medical problems, additional home visits and family nursing support, and a personalized exercise program. In this study, we evaluated machine learning methods to develop a model that would predict 1-year mortality after a fragility hip fracture. Our results showed that the tool we designed had high mortality-prediction accuracy.

Our experiment carried out a 3-step, general, machine learning approach: data preprocessing, algorithm training, and algorithm testing with an unseen dataset. We searched for the best performance of each algorithm by fivefold cross-validation using manual search and automated hyperparameter optimization with grid search and random search [34]. After comparing all algorithms, we selected the RF model. It provided the highest performance in predicting 1-year mortality, indicated by its highest sensitivity in detecting high-risk patients (Table 3).

GB and RF models are tree-based methods that combine the results from individual trees. The difference between the 2 models lies in how the trees are built and how their results are combined [42, 43]. The GB model adds trees sequentially, with each new tree correcting the errors of the previous steps, while the RF model builds its trees independently of one another. GB accumulates the results throughout the process, while RF averages the results of all trees once they have been built. In theory, GB usually takes longer to train, but it can provide better performance than RF if the parameters are carefully tuned. GB and RF have also demonstrated high performance in predictive modeling of health outcomes [15, 25, 26, 44].
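The contrast between the two ensembles can be written compactly. The notation below is introduced here for illustration only and does not appear in the cited sources in this form: B is the number of RF trees, T_b is the b-th tree fitted to a bootstrap sample, h_m is the tree added at boosting step m, and \nu is the learning rate.

```latex
% Random Forests: trees are fitted independently and then averaged.
\[
  \hat{f}_{\mathrm{RF}}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)
\]
% Gradient Boosting: each tree h_m is fitted to the errors (negative
% gradient of the loss) of the current model F_{m-1} and then added to it.
\[
  F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, M
\]
```

In the first expression the trees do not depend on one another, whereas in the second each step depends on all earlier steps, which is why GB cannot be trained in parallel across trees.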

The ANN models used in this study were standard feed-forward multilayer perceptrons trained with back-propagation, a supervised training algorithm [45]. Each of the ANN models consists of 3 layers: 1 input layer, 1 hidden layer, and 1 output layer. The neural network takes input variables, which are then passed through the layer of hidden neurons to the output layer. The ANN model is a flexible system that allows complex modeling of nonlinear relationships. It is not adversely influenced by the interconnection of multiple variables, which is common in patients with a hip fracture [46]. Moreover, the ANN model can automatically adjust the weights in the network and self-correct, which produces a better prediction accuracy [47].
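The flow of information through such a 3-layer network can be sketched as a single forward pass. The sigmoid activation, the number of hidden neurons, and all weights below are assumptions made for illustration; the text does not report the activation function or the fitted weights.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: input layer -> one hidden layer -> one output unit."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Hypothetical weights for 3 scaled inputs and 2 hidden neurons.
w_hidden = [[0.5, -0.2, 0.8], [-0.3, 0.9, 0.1]]
b_hidden = [0.0, -0.1]
w_out = [1.2, -0.7]
b_out = -0.5

risk = forward([0.6, 0.3, 0.5], w_hidden, b_hidden, w_out, b_out)
print(round(risk, 3))
```

Back-propagation would then adjust the weights and biases to reduce the difference between this output and the observed outcome; that training step is omitted here.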

In contrast, LR is commonly used to predict the probability of occurrence of an event. It assumes that the log-odds of the outcome have a linear relationship with the variables [48]. The LR method predefines the association among the predictors in a linear manner, which gives it the ability to quantify the association of each variable with the outcome [17, 49, 50]. However, if there is interplay between or among the factors, or if a nonlinear relationship exists, the LR model may be a less appropriate modeling option [51] for our dataset.
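The linearity assumption of LR can be stated explicitly. In the standard formulation, the model is linear in the log-odds of the outcome rather than in the probability itself; the symbols p (probability of death), x_j (predictors), \beta_j (coefficients), and k (number of predictors) are notation introduced here.

```latex
\[
  \log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
  \qquad\Longleftrightarrow\qquad
  p = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{k} \beta_j x_j\right)}
\]
```

Each coefficient \beta_j is therefore the change in log-odds per unit change in x_j, with the other predictors held fixed, and \exp(\beta_j) is the corresponding odds ratio. Interactions or nonlinear effects are captured only if they are added to the model as explicit terms.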

The NB algorithm is a classification technique that applies Bayes’ theorem under the assumption that the variables are independent of one another [52]. This assumption simplifies the learning phase and makes the algorithm easy to implement. However, the assumption also introduces bias, so the NB algorithm might be inaccurate for nonlinear problems. On the other hand, NB might perform comparatively well on datasets that have a small sample size [53].
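Under the independence assumption, the posterior probability of an outcome class factorizes into a product of per-variable terms. The notation is introduced here for illustration: y is the outcome class and x_1, ..., x_k are the predictor values.

```latex
\[
  P(y \mid x_1, \dots, x_k) \;\propto\; P(y) \prod_{j=1}^{k} P(x_j \mid y)
\]
```

Each factor P(x_j | y) can be estimated separately from the training data, which is why the learning phase is simple, but the product ignores any dependence between predictors such as comorbidities that tend to occur together.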

SVM utilizes a geometrical relationship between variables and predicts outcomes by identifying the boundary (or hyperplane) between the data of 2 classes and separating them. It has a good performance in distinguishing between 2 classes, provides flexibility for both linear and nonlinear problems, and has a low risk of overfitting from its regularization feature [54, 55]. However, interpretation of the model is often difficult [56, 57].

The KNN creates decision boundaries to separate different classes [58]. Its advantage is that it is simple to implement and easy to understand. It also utilizes a memory-based approach and is capable of being quickly trained with a new dataset. However, if the K value is not appropriately chosen, the model has a high risk of overfitting [59, 60].
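The memory-based nature of KNN and its sensitivity to the choice of K can be illustrated with a minimal sketch. The Euclidean distance, the function name, and the toy data are assumptions for illustration; the distance metric and the K value used in the study are listed only in Table 4.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the training set size")
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical scaled features for 3 survivors (0) and 2 deaths (1).
train_x = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.8)]
train_y = [0, 0, 0, 1, 1]
query = (0.85, 0.85)

print(knn_predict(train_x, train_y, query, k=3))  # 1
print(knn_predict(train_x, train_y, query, k=5))  # 0
```

With K = 3 the two nearby deaths decide the vote, whereas with K = 5 every training point votes and the majority class wins regardless of distance. A very small K follows individual training points closely and risks overfitting, while a very large K ignores local structure.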

Although most machine learning approaches offer flexibility in modeling complex relationships between variables and outcomes, interpretation problems can arise and present a challenge to implementation [61]. One way to explore how each predictor affects the outcome of interest is to apply SHAP (SHapley Additive exPlanations) values to rank the predictors according to their contribution to a model [35, 36]. In Fig. 3, the SHAP-value graph illustrates the value of the top 20 variables and their impact on the predictions of the RF model. It shows that a high CCI score increased the predicted 1-year mortality. CCI scores are calculated by considering multiple comorbidities of patients, and the scoring system has been validated to predict 1-year mortality [62]. CCI scores have also been reported to be related to reduced survival in women after hip fracture [63]. Accordingly, the CCI score was the most influential characteristic in our model. Furthermore, dementia, lung disease, heart disease, and BMI were found to be important predictive factors for 1-year mortality. When we compared the characteristics of the patients who died and those who survived, all characteristics differed significantly, other than BMI and gait aid. All of these statistically different characteristics had also been reported to be significantly associated with post-fragility fracture mortality by other studies. The significant differences we found in almost all factors support those previous findings and emphasize the essential predictive value of the factors. Advanced age and male sex were significant predictors of mortality [29,30,31]. A difference in mortality was observed between operative and nonoperative treatments [4]. Multiple comorbidities and preoperative mobility were also reported to be associated with death after hip fracture [8, 32].
However, despite a trend toward an improvement in the mortality rate and increased knowledge and awareness of these factors, a recent systematic review found that the mortality rate within 1 year of a hip fracture remains as high as 22% [64]. The development of a tool that can combine multiple variables into a single prediction model would be of great utility to clinicians. With advances in the machine learning approach, we can train and test models with datasets to recognize patterns that would otherwise be hidden in complex relationships between variables [65].
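The Shapley-value attribution discussed above can be illustrated with an exact calculation on a toy model. This sketch is not the SHAP library and not the study model: the risk function, its coefficients, and the single reference patient are hypothetical, and practical SHAP implementations use faster, model-specific algorithms and average over a background dataset.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering of the features (feasible only for a few features)."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # switch this feature on
            value = predict(current)
            totals[name] += value - previous
            previous = value
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in totals.items()}


def toy_risk(f):
    # Hypothetical risk score with an interaction term; not the study model.
    return (0.05 + 0.04 * f["cci"] + 0.10 * f["dementia"]
            + 0.02 * f["cci"] * f["dementia"])


patient = {"cci": 5, "dementia": 1}
reference = {"cci": 2, "dementia": 0}
phi = shapley_values(toy_risk, patient, reference)
print({name: round(value, 3) for name, value in phi.items()})
# {'cci': 0.15, 'dementia': 0.17}
```

The two contributions sum to the difference between the patient's predicted risk (0.45) and the reference prediction (0.13), which is the property that lets SHAP values be read as a decomposition of an individual prediction.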

Various machine learning models for mortality prediction in hip fracture patients have been proposed [17,18,19,20,21, 23, 24, 66]. Unlike most previous studies, we investigated only hip fractures from low-energy trauma, and we did not exclude patients aged less than 65 years because osteoporotic hip fracture can occur at the age of 50 [67]. We also included variables, such as different kinds of hip fracture (ie, neck, intertrochanteric, and subtrochanteric fracture of the femur) and whether the patient received operative or nonoperative treatment. Although most hip fractures are currently managed operatively [68], there is a proportion of patients who are managed nonoperatively (eg, hip fracture patients whose pre-injury status was nonambulatory, and patients with an already low probability of survival). A recent systematic review showed that nonoperative treatment was associated with higher rates of morbidity and mortality [69], which further emphasizes the need for special attention to prevent complications in this group of patients.

Our machine-learning-developed model serves as a screening tool for the identification of high-risk patients and provides information that aids clinical decision-making. For instance, a prediction of death within 1 year would encourage physicians to develop an intensive treatment plan and prepare the resources needed for high-risk patients. These actions and interventions would be expected to reduce complications and improve patient survival. To identify at-risk patients with high accuracy, machine learning algorithms (eg, Gradient Boosting and Random Forests) can be used to develop models with acceptable predictive performance. The prediction tool can also be used to counsel patients and caregivers and encourage them to comply with the medical actions and interventions considered necessary.

Limitations

This study is not without limitations. Like other hospitals, our center admits patients directly from their homes after suffering a hip fracture. However, being a tertiary care center, our hospital also accepts patients with very high-risk comorbidities who have been transferred from primary hospitals that do not have the resources needed for their treatment. This could lead to a selection bias. First, patients with severe comorbidities who are transferred to our hospital have a higher risk of mortality than the general population. In addition, any increase in the duration before surgery resulting from delays in transferring the patients to our hospital only exacerbates the risk of mortality [70]. However, we did not incorporate the time gap between fracture and hospital admission into our model and this could contribute to another potential limitation.

Second, our data were drawn from only a single center with a relatively small sample. This may have resulted in overfitting of the models. Therefore, external validation is essential to confirm the predictive ability of our 1-year mortality prediction tool. It is important to note that each center has different protocols for treating patients with hip fractures. This may also help explain the heterogeneity in the outcomes of earlier studies.

Finally, it should be noted that there are other factors that influence outcomes that were not included in our study, such as complications in the hospital and after discharge. These may prove to have some value in predicting mortality after hip fracture.

Conclusions

Our machine learning approach facilitated the successful development of an accurate model to predict 1-year mortality after fragility hip fracture. Several machine learning algorithms (eg, Gradient Boosting and Random Forests) had the potential to provide high predictive performance, based on the clinical parameters of each patient. The 5 most influential clinical variables in the prediction model were the CCI score, heart disease, BMI, dementia, and lung disease. The web application is available at www.hipprediction.com. External validation in a larger group of patients or in different hospital settings is warranted to evaluate the clinical utility of this tool.