Introduction

The number of hip fractures continues to rise and is predicted to reach an incidence of 6.26 million cases per year worldwide by 2050 [1]. Numerous patient and injury characteristics are associated with a high mortality rate after hip fracture, with incidences ranging from 14 to 35% in the first year [2,3,4]. The treatment of femoral neck fractures has been a frequent topic of discussion in the orthopedic literature, and both treatment decision-making and implant choice remain challenging [5, 6].

Predicting mortality may help identify which patients benefit from arthroplasty surgery (hemi- or total hip arthroplasty), internal fixation (e.g. a sliding hip screw or cancellous screws) or nonoperative management [7, 8]. In patients aged 65 years or above, the choice between arthroplasty and internal fixation remains under debate, and optimal treatment may be individualized depending on patients’ preferences and goals, informed by the risks and benefits of the treatment options [5, 6]. Long-term functional outcomes may be better in healthy older patients undergoing arthroplasty compared to internal fixation, with lower reoperation rates [9, 10]. A recent study showed that nonoperative management of a proximal femoral fracture, chosen through a shared decision-making process, might be a viable option for frail institutionalized patients with limited life expectancy [8]. Identifying patient and injury characteristics associated with mortality may aid surgeons, patients and families in shared decision-making and optimize care in femoral neck fracture patients [11]. In other words, a decision support tool to predict shorter- and longer-term mortality would allow for risk stratification of patients aged 65 years or above with femoral neck fractures to guide treatment decision-making.

Thus, an accurate preoperative prediction model may be required to efficiently target patients benefiting from a specific intervention and facilitate true shared decision-making based on personalized risks and benefits. Many mortality prediction models have been described in the geriatric trauma [12, 13] and hip fracture populations [14,15,16], but only a few studies predict mortality in hip fracture patients beyond the 30-day period with good model performance [14]. Most hip fracture registries have a maximum follow-up period of 1 year [17]; the use of institutionally collected data creates the opportunity to develop prediction models with longer follow-up. Prior prospective randomized controlled trials chose 2 years as the endpoint to account for longer follow-up in the management of the acute hip fracture patient [6, 18]. In addition, clinical decision support using machine learning (ML) algorithms has been employed in the hip fracture population (e.g. 30 day mortality [16] or 30 day delirium [19] prediction), and has also been shown to be useful in helping to predict outcomes in other areas including orthopaedic surgery [1,2,3,4, 20,21,22].

Therefore, this study aimed to develop and internally validate a clinical prediction model using machine learning algorithms for 90 day and 2 year mortality in femoral neck fracture patients aged 65 years or above.

Materials and methods

Data source

This retrospective cohort study was approved and registered with the institutional review board (IRB) prior to study start-up. A search in the Research Patient Data Registry (RPDR) was performed to identify patients older than 65 years of age who underwent operative treatment for a femoral neck fracture, OTA type 31-B (as classified by the Orthopaedic Trauma Association (OTA) [23]), and who presented to our institutions between January 2001 and December 2017. RPDR is a clinical data registry that collects medical records from institutions within the Partners Healthcare System and may be queried after IRB approval. Our institutions comprised two level I trauma centers and three community (non-level I trauma) hospitals. Patients were excluded if they presented with a pathological fracture.

Primary outcomes

The primary outcome was 90 day and 2 year mortality in patients sustaining a femoral neck fracture, OTA type 31-B. Mortality was assessed by cross-referencing the Social Security Death Index (a database of people whose deaths were reported to the Social Security Administration) and through manual chart review. The time endpoints of 90 day and 2 year mortality were chosen on the basis of prior studies [6, 18, 24].

Baseline data

The following preoperative variables were collected: age, gender, race, ethnicity, marital status, veteran status, side of injury, displacement of the fracture, Charlson Comorbidity Index, presence of comorbidities [myocardial infarction, congestive heart failure, peripheral vascular disease, cerebrovascular accident, dementia, chronic obstructive pulmonary disease, rheumatic disease, peptic ulcer disease, liver disease, diabetes, hemi- and paraplegia, renal disease, cancer, coagulopathy, drug abuse, alcohol abuse, depression], preoperative medication use [immunosuppressants, anti-coagulants, steroids, bisphosphonates, angiotensin converting enzyme inhibitors, angiotensin receptor blockers, beta blockers, beta-2 agonists, opioids] and laboratory characteristics [calcium (mg/dL), creatinine (mg/dL), hemoglobin (g/dL), potassium (mEq/L), platelet count (10³/µL), prothrombin time (PT), International Normalized Ratio (INR), white blood cell count (10³/µL), absolute lymphocyte count (10³/µL), absolute neutrophil count (10³/µL), neutrophil/lymphocyte ratio, platelet/lymphocyte ratio]. We did not assess peri- or postoperative variables as candidate input variables, because the aim was to develop a preoperative prediction model to aid treatment decision-making.

Multiple imputation with the missForest methodology was used to impute variables with less than 30% missing data [25].

Variable selection

Variable selection was performed to identify and select those preoperative variables contributing most to our outcome variable, conducted by entering all relevant explanatory variables into random forest algorithms with recursive selection [26]. Given the rule of thumb for developing prediction models with a binary outcome (those with and without the outcome), we ensured at least 10 events for each predictor variable included in the model [27].
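The events-per-variable rule of thumb can be illustrated with a short calculation. The sketch below is only an illustration: the function name is hypothetical, the event counts are those reported for the full cohort in the Results, and the text does not state whether the rule was checked against the full cohort or the training set.

```python
def max_predictors(n_events: int, n_non_events: int, events_per_variable: int = 10) -> int:
    """Largest number of predictors allowed by the events-per-variable rule.

    The limiting count is the smaller of the two outcome groups.
    """
    limiting = min(n_events, n_non_events)
    return limiting // events_per_variable


# Counts taken from the Results: 2478 patients, 225 deaths at 90 days, 582 at 2 years.
N_PATIENTS = 2478
print(max_predictors(225, N_PATIENTS - 225))  # 22
print(max_predictors(582, N_PATIENTS - 582))  # 58
```

Under these assumptions, a model with 10 predictors stays within the rule of thumb for both endpoints.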

Development and internal validation of the clinical prediction model

The following ML algorithms were chosen for modeling based on prior research [19, 22, 28, 29]: Stochastic Gradient Boosting (SGB), Random Forest (RF), Support Vector Machine (SVM), Neural Network (NN) and Elastic-Net Penalized Logistic Regression (PLR).

Internal validation was carried out by performing a stratified 80:20 split of the dataset to create a training set (n = 1983) and a test set (n = 495). Subsequently, the algorithms were trained on the training set with ten-fold cross-validation repeated 3 times. Cross-validation means dividing the data into a selected number of groups, named folds. First, the training data were divided into 10 equally sized folds. Then, the algorithms were trained on 9 of the 10 folds (90% of the training data) and tested on the remaining fold (10% of the training data), so that each fold served once as the held-out fold. Finally, performance was evaluated in the test dataset.
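The splitting scheme described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code used in the study (which was run in R): the function names are hypothetical, the per-class training size is rounded up (an assumption that happens to reproduce the reported 1983/495 split), and the cross-validation folds here are not stratified.

```python
import random


def stratified_split(labels, train_pct=80, seed=0):
    """Return (train_idx, test_idx), sampling train_pct% of each class (rounded up)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = (len(idx) * train_pct + 99) // 100  # integer ceiling of pct%
        train += idx[:n_train]
        test += idx[n_train:]
    return sorted(train), sorted(test)


def repeated_kfold(indices, k=10, repeats=3, seed=0):
    """Yield (fit_idx, holdout_idx) pairs for k-fold cross-validation repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]  # k near-equal folds
        for j in range(k):
            fit = [i for f, fold in enumerate(folds) if f != j for i in fold]
            yield fit, folds[j]


# Outcome vector with the reported 90 day mortality counts (225 of 2478).
labels = [1] * 225 + [0] * (2478 - 225)
train_idx, test_idx = stratified_split(labels)
splits = list(repeated_kfold(train_idx))
print(len(train_idx), len(test_idx), len(splits))  # 1983 495 30
```

Each of the 30 fit/hold-out pairs trains on roughly 90% of the training set and evaluates on the remaining 10%; the 495 test patients are never touched during this stage.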

Model performance

Model performance was evaluated according to a proposed framework for evaluation of a clinical prediction model [30] that includes: discrimination with the c-statistic, calibration slope and intercept (in line with the method by Cox [31]) and the overall performance with the Brier score.

The c-statistic (area under the curve of a receiver operating characteristic curve) is a score ranging from 0.50 to 1.0 with 1.0 indicating the highest discrimination score and 0.50 indicating the lowest. The higher the discrimination score, the better the model’s ability to distinguish patients who got the outcome from those who did not [32].
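As an illustration of the definition above, the c-statistic can be computed directly as the proportion of (event, non-event) patient pairs in which the patient with the event received the higher predicted probability, counting ties as one half. The function name and the toy data are hypothetical and are not taken from the study.

```python
def c_statistic(y_true, y_prob):
    """Concordance between predicted probabilities and binary outcomes (ties count 0.5)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("both outcome classes must be present")
    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Toy example: 2 patients with the outcome, 3 without.
outcomes = [1, 0, 1, 0, 0]
predicted = [0.40, 0.10, 0.35, 0.35, 0.05]
print(c_statistic(outcomes, predicted))  # 5.5 concordant pairs out of 6
```

This pairwise form is quadratic in the number of patients, which is acceptable for a test set of a few hundred patients; rank-based implementations give the same value more efficiently.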

A calibration plot shows the estimated versus the observed probabilities for the primary outcome. A perfect calibration plot has an intercept of 0 (< 0 reflects overestimation, > 0 reflects underestimation of the probability of the outcome) and a slope of 1 (the model performs similarly in the training and test sets) [30, 33]. In a small dataset, the slope is often < 1, reflecting model overfitting: predicted probabilities are too extreme (low probabilities too low, high probabilities too high) [32].
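One common way to obtain these two quantities, sketched here as an assumption about the general approach rather than as the study's implementation, is logistic recalibration: the slope is the coefficient from a logistic regression of the outcome on the logit of the predicted probability, and the intercept is estimated with that slope fixed at 1. The function names and toy data below are hypothetical; the fit uses a plain Newton-Raphson iteration and assumes predicted probabilities strictly between 0 and 1 that are not all identical.

```python
import math


def _logit(p):
    return math.log(p / (1 - p))


def _expit(z):
    return 1 / (1 + math.exp(-z))


def calibration_slope(y, p, iters=25):
    """Fit logit(P(y=1)) = a + b * logit(p) by Newton-Raphson; return (a, b)."""
    x = [_logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iters):
        mu = [_expit(a + b * xi) for xi in x]
        w = [m * (1 - m) for m in mu]
        # Gradient of the log-likelihood.
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))
        # Entries of the 2x2 information matrix.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def calibration_intercept(y, p, iters=25):
    """Fit logit(P(y=1)) = a + logit(p) with the slope fixed at 1; return a."""
    x = [_logit(pi) for pi in p]
    a = 0.0
    for _ in range(iters):
        mu = [_expit(a + xi) for xi in x]
        a += sum(yi - m for yi, m in zip(y, mu)) / sum(m * (1 - m) for m in mu)
    return a


# Toy data, perfectly calibrated by construction: in each group of 10 patients
# the observed event fraction equals the predicted risk.
probs = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
outcomes = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
print(calibration_slope(outcomes, probs))      # intercept close to 0, slope close to 1
print(calibration_intercept(outcomes, probs))  # close to 0
```

If the same model predicted 50% for a group in which only 20% of patients died, the intercept would be negative, matching the interpretation of overestimation given above.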

The null-model Brier score, obtained by assigning every patient a predicted probability equal to the prevalence of mortality in the dataset, was used to benchmark the algorithm’s Brier score. A Brier score lower than the null-model Brier score indicates superior performance of the prediction model relative to this null benchmark. A perfect prediction would have a Brier score of 0, and the poorest possible prediction a score of 1 [30].
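The Brier score and its null benchmark can be written out as follows. This is a minimal sketch with hypothetical function names; the outcome vectors are built from the full-cohort counts reported in the Results, whereas the published null-model scores may have been computed on the test set.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def null_brier_score(y_true):
    """Brier score of a model that predicts the outcome prevalence for every patient."""
    prevalence = sum(y_true) / len(y_true)
    return brier_score(y_true, [prevalence] * len(y_true))


# Full-cohort counts from the Results: 225 and 582 deaths among 2478 patients.
y_90day = [1] * 225 + [0] * (2478 - 225)
y_2year = [1] * 582 + [0] * (2478 - 582)
print(round(null_brier_score(y_90day), 3))  # 0.083
print(round(null_brier_score(y_2year), 3))  # 0.18
```

For a prevalence p, the null-model score reduces to p × (1 − p), so a rarer outcome has a lower null benchmark and leaves less room for a model to improve on it.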

Decision curve analysis

In addition, decision curve analysis was undertaken and visualized to investigate the net benefit of the developed algorithms over the range of risk thresholds for clinical decision-making [34]. The net benefit is a weighted combination of true positives and false positives: net benefit = sensitivity × prevalence – (1 – specificity) × (1 – prevalence) × w, where w is the odds at the threshold probability. The threshold probability is the predicted probability above which a patient is classified as ‘positive’ rather than ‘negative’. In this study, a ‘positive’ outcome is someone at high risk of mortality at 90 days or 2 years. If the threshold is set at 0.5, then patients with a probability > 0.5 are classified as ‘positive’, and those with a probability < 0.5 as ‘negative’. If the threshold is set at 0.8, then patients with a probability > 0.8 are classified as ‘positive’, and those with a probability < 0.8 as ‘negative’. The decision curve of the model is compared to the decision curves of treating everyone as being at risk for shorter- or longer-term mortality (depending on the endpoint), and of treating no one as being at risk.
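The net benefit formula and the two default strategies can be made concrete with a small sketch. The function names are hypothetical, and the sensitivity and specificity in the example are invented for illustration only; they are not results from this study.

```python
def net_benefit(sensitivity, specificity, prevalence, threshold):
    """Net benefit at a given threshold probability (0 < threshold < 1)."""
    odds = threshold / (1 - threshold)  # odds at the threshold probability
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate - false_positive_rate * odds


def net_benefit_treat_all(prevalence, threshold):
    """Everyone is classified as 'positive': sensitivity 1, specificity 0."""
    return net_benefit(1.0, 0.0, prevalence, threshold)


def net_benefit_treat_none():
    """No one is classified as 'positive': no true or false positives."""
    return 0.0


# Hypothetical model at a 25% threshold with the reported 2 year prevalence (23.5%).
print(net_benefit(0.6, 0.7, 0.235, 0.25))
print(net_benefit_treat_all(0.235, 0.25))
print(net_benefit_treat_none())
```

A model adds clinical utility at a given threshold only if its net benefit exceeds both default strategies; in this invented example, treating everyone yields a slightly negative net benefit because the threshold lies above the prevalence.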

For 90 day mortality, risk thresholds in the range of 1:3 (risk of 25%) to 1:5 (risk of 17%) seemed clinically relevant [35]. This effectively means we accept 3 to 5 cases of underestimation (a predicted probability that is too low for surviving up to 90 days, which may result in choosing a less invasive treatment option) per case of overestimation (a predicted probability that is too high for surviving up to 90 days, which may result in choosing a more invasive treatment option).

For 2 year mortality, higher risk thresholds, in the range of 1:2 (risk of 33%) to 1:3 (risk of 25%), seemed clinically relevant [35]. Not performing arthroplasty surgery in patients surviving up to 2 years is worse than in patients surviving up to 90 days. Therefore, we accept fewer cases of underestimation of the mortality probability.
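The risk thresholds quoted above follow from converting odds of a:b into a probability of a / (a + b). A brief sketch, with a hypothetical helper name:

```python
from fractions import Fraction


def odds_to_probability(a: int, b: int) -> Fraction:
    """Convert odds of a:b into the corresponding probability a / (a + b)."""
    return Fraction(a, a + b)


for a, b in [(1, 2), (1, 3), (1, 5)]:
    risk = odds_to_probability(a, b)
    print(f"{a}:{b} -> {float(risk):.0%}")
# 1:2 -> 33%
# 1:3 -> 25%
# 1:5 -> 17%
```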

Open-access web-application and individual patient explanation

The best-performing algorithms across the model performance metrics as described above, for each primary outcome (i.e. 90 day and 2 year mortality), were deployed as an open-access web application accessible on desktops, tablets and smartphones.

Individual patient-level explanations are incorporated in the web application to show how the model arrived at a given prediction. Local model explainability helps in understanding which features of the patient contributed most to the model’s prediction [36].

Statistical analysis

Categorical variables were described as absolute numbers with percentages, and continuous variables as medians with interquartile ranges (IQR). The model performance metrics were calculated with 95% confidence intervals (CI). Given the retrospective study design, post hoc power analyses were conducted to evaluate the sample size of the study with an alpha value of 0.05.

Guidelines

The study was conducted and reported following the Transparent Reporting of Multivariable Prediction Models for Individual Prognosis or Diagnosis Guideline (TRIPOD Statement) (Supplemental Table 1) [37].

Software

Data pre-processing and analysis were performed using R Version 4.1 (“R: A Language and Environment for Statistical Computing” The R Foundation, Vienna, Austria 2013) and R-studio Version 1.2.1335 (R-Studio, Boston, MA, USA). Hyperparameter tuning was performed as recommended in the R package vignettes.

Results

Participants

In total, 2478 patients were included in this study, with 90 day and 2 year mortality rates of 9.1% (n = 225) and 23.5% (n = 582), respectively. Of the included patients, 69.5% (n = 1723) were female, and the median age was 83 years (interquartile range = 76–88) (Table 1). The post hoc power analyses revealed 100% power in both evaluations (α = 0.05).

Table 1 Baseline characteristics of study population, n = 2478

Rates of missing data for covariates were as follows: race (144, 5.8%), ethnicity (144, 5.8%), marital status (98, 4.0%), veteran status (465, 18.8%), calcium (394, 15.9%), creatinine (193, 7.8%), hemoglobin (194, 7.8%), potassium (200, 8.1%), platelet (196, 7.9%), PT (274, 11.1%), INR (386, 15.6%), white blood cell count (193, 7.8%), absolute lymphocyte (567, 22.9%), absolute neutrophil (491, 19.8%), neutrophil/lymphocyte ratio (567, 22.9%), platelet/lymphocyte ratio (572, 23.1%).

90-day mortality prediction model

The following variables were included after variable selection: (1) INR; (2) age; (3) creatinine level; (4) absolute neutrophil; (5) CHF; (6) male gender; (7) hemoglobin; (8) displaced fracture; (9) hemiplegia and (10) COPD (Fig. 1).

Fig. 1 (A) Receiver operating characteristic curve, (B) global variable importance, (C) calibration plot and (D) decision curve analysis for the stochastic gradient boosting algorithm for prediction of 90 day mortality in the testing set, n = 495

The performance of the ML algorithms, as measured by the c-statistic, varied from 0.53 to 0.74 in the independent testing set (Table 3) (performance of cross-validation on the training set can be found in Table 2). On the calibration plot, the intercept ranged from − 0.08 to 0.15 and the slope from 0.71 to 2.13. The Brier scores ranged from 0.078 to 0.082, with a null-model Brier score of 0.083 (Table 3). The SGB algorithm was chosen as the final model with a c-statistic of 0.74, calibration intercept of − 0.05, calibration slope of 1.11 and a Brier score of 0.078.

Table 2 Algorithm performance on cross-validation of training set, n = 1983, mean (95% confidence interval)
Table 3 Algorithm performance in independent testing set, n = 495, mean (95% confidence interval)

2-year mortality prediction model

The following variables were included after variable selection: (1) age; (2) male gender; (3) absolute neutrophil; (4) CHF; (5) use of beta-blocker; (6) COPD; (7) CVA; (8) hemoglobin; (9) creatinine level and (10) INR (Fig. 2).

Fig. 2 (A) Receiver operating characteristic curve, (B) global variable importance, (C) calibration plot and (D) decision curve analysis for the elastic-net penalized logistic regression algorithm for prediction of 2 year mortality in the testing set, n = 495

The performance of the ML algorithms, as measured by the c-statistic, varied from 0.63 to 0.70 in the independent testing set (Table 3) (performance of cross-validation on the training set can be found in Table 2). On the calibration plot, the intercept ranged from − 0.04 to 0.22 and the slope from 0.83 to 0.97. The Brier scores ranged from 0.16 to 0.17, with a null-model Brier score of 0.18 (Table 3). The PLR algorithm was chosen as the final model with a c-statistic of 0.70, calibration intercept of − 0.03, calibration slope of 0.89 and a Brier score of 0.16.

Decision curve analysis

Decision curve analyses of both models revealed that changing management based on the model outperformed the default strategies of changing management for all patients or for no patients (Figs. 1D and 2D). However, within the clinically relevant risk threshold ranges, the benefit was clearer for the 2 year mortality model.

Available web-application

The chosen algorithms were incorporated into a web-based application and deployed as an open-access tool for clinicians: https://sorg-apps.shinyapps.io/hipfracturemortality/.

Individual patient-level explanation

As an example, after the patient and injury characteristics of an 84-year-old male patient were entered into the algorithm, this patient had a 13% and 43% chance of mortality within 90 days and 2 years, respectively, following femoral neck fracture surgery (Figs. 3 and 4).

Fig. 3 Example of individual patient-level explanation for 90 day mortality prediction

Fig. 4 Example of individual patient-level explanation for 2 year mortality prediction

Factors increasing the likelihood of 90 day mortality were an INR of 1.5, male gender, a hemoglobin level of 9, a displaced fracture and an age of 84 years. However, the absence of CHF and a creatinine level of 0.8 reduced the likelihood of mortality following femoral neck fracture surgery. The predicted probability (13%) was higher than the average probability in the total patient cohort (9.1%) (Fig. 3).

Factors increasing the likelihood of 2 year mortality were male gender, a history of COPD and dementia. However, a low absolute neutrophil level of 0.8 and the absence of CHF and of a history of CVA reduced the likelihood of mortality. The predicted probability (43%) was higher than the average probability (23.5%) (Fig. 4).

Discussion

The aim of this study was to develop and internally validate a clinical prediction model that can predict 90 day and 2 year mortality in femoral neck fracture patients aged 65 years or above to aid the challenging treatment decision-making. The developed and internally validated models show promise in estimating mortality in this frail patient population.

Limitations

The results of this study should be viewed in light of several limitations. First, this was a retrospective study, subject to the limitations inherent to such a research design, and prospective validation remains to be performed. Second, the mortality rate in our cohort was relatively low compared to other populations of hip fracture patients [38]. As shown in the calibration plots, this resulted in predicted probabilities of up to 50% and 80% for 90 day and 2 year mortality, respectively. This means that our model is likely more accurate in healthier hip fracture patients. For external validation, our model should be tested in a cohort with representative mortality rates, and future studies should assess the transportability of the developed algorithm to datasets of patients with higher mortality rates. Third, for this study, we chose an 80/20 ratio for splitting the data into training and test sets, which is the ratio most commonly used in previous literature [20,21,22, 39]. There is no fixed rule for the ratio of data splitting, but a different ratio for algorithm training may have led to different model performance. Fourth, preoperative risk stratification for mortality is needed to guide the difficult treatment decision-making, although intraoperative and postoperative factors associated with complications, such as reoperation or postoperative infection, may confound mortality after surgery. Future research may estimate this influence by examining causality for confounding factors [40]. Fifth, only patients undergoing femoral neck fracture surgery were included in the study. Patients whom the clinician expected to have a very short survival (e.g. 30 days) were treated conservatively and were not investigated in this study.
In future studies, both conservatively and surgically treated patients should be included to optimize mortality prediction in all patients sustaining a femoral neck fracture and to guide the challenging treatment decision-making (i.e. whether to operate or not). Sixth, co-injuries sustained during the trauma, some of which may cause significant disability, may influence survival. Evaluating these co-injuries and including their injury severity score as a candidate input variable may have influenced model performance. In addition, we did not investigate the influence of the presence of advance directives, which may influence the decision-making process in patients aged 65 years or above. In future research, when comparing treatment effects in conservatively and operatively treated patients, we recommend that these influences be investigated. Lastly, the 2 year mortality endpoint was chosen on the basis of endpoints in prior prospective randomized controlled trials [5, 6]. The 90 day endpoint was chosen to predict short-term mortality and to account for a possible underestimation of outcomes when only 30 day mortality is considered. From a patient and provider perspective, a death 90 days after a hip fracture is just as significant as one within 30 days. The 90 day endpoint takes into account not just acute in-hospital complications but also short-term complications that may occur in a skilled nursing facility or after discharge to the community. There is growing evidence in other specialties that 30-day mortality underestimates short-term mortality [41, 42]. Future studies may additionally investigate other time points, such as 30 days or 1 year.

Findings

In the ranges of risk where we expect the model to have clinical utility, the 2 year model clearly adds clinical utility over treating all or no patients with total hip arthroplasty. However, we assumed a simplified scenario, since there are multiple treatment options available, namely nonoperative management, surgical fixation and arthroplasty surgery. The 90 day mortality model might add clinical utility for decisions between these tiered treatment options, which are more subtle and complex to model. Moreover, clinical utility should be reassessed after external validation, and with input from multiple institutions in different countries. If found to be externally valid (generalizable to independent populations), the tool should be evaluated prospectively in future studies. In patients with limited life expectancy, i.e. patients predicted to be at high risk of short-term mortality, nonoperative management might be a viable alternative to surgical fixation in the shared decision-making process [8]. If patients have a high chance of surviving beyond the 90 day endpoint, surgical management would be appropriate [43]. For frail patients with a nondisplaced hip fracture, surgical fixation may be favored over arthroplasty surgery [6, 18]. However, arthroplasty is associated with a lower risk of reoperation and better long-term functional outcomes, at the cost of greater infection rates, blood loss and operative time, and possibly an increase in early mortality rates. It may therefore be recommended in patients with a longer life expectancy (e.g., a high probability of surviving beyond the 2 year endpoint) [44].

When aiming to develop a prediction model that is applicable in daily practice, the variables included in the trained algorithm should be readily available, and their definitions should be in line with daily practice. In this study, the variables derived from variable selection are clinically readily available and in line with daily practice. It is important to emphasize that treatment decision-making should not be based solely on the outcome of an individualized probability calculator. The orthopaedic surgeon should discuss the available treatment options and reach a treatment decision following a shared decision-making process. Prediction of mortality is only one of the aspects to be considered in treatment decision-making.

The most important factors associated with a greater risk of 90 day mortality included in the SGB algorithm were INR, age, creatinine level, absolute neutrophil count, CHF, male gender, hemoglobin level, displaced fracture, hemiplegia and COPD. For 2 year mortality, the most important factors were age, male gender, absolute neutrophil count, CHF, use of beta-blockers, COPD, CVA, hemoglobin, creatinine level and INR. Our findings are in line with previous research on proximal femoral fractures in general and in broader populations. Regarding age and sex, prior studies revealed a higher risk with higher age and male gender [45,46,47]. The effect of CHF, CVA and COPD is in line with the high risk reported for a higher ASA classification in earlier studies [48, 49]. A possible explanation for this effect might be a poorer physical condition of the patient at baseline and therefore a less adequate recovery after complications (e.g. pneumonia). Another explanation for comorbidities in general could be a lower life expectancy as a result of the comorbidity itself. With regard to displacement of the fracture, a reasonable explanation for the higher risk might be the disruption of the vascularization of the femoral head, together with the tendency for a displaced fracture to occur in a frailer patient, in whom more displacement occurs than in a younger patient for the same level of trauma energy. This could lead to multiple complications and secondary surgery, eventually resulting in death [50]. The prognostic value of laboratory characteristics in predicting mortality after hip surgery is a less explored subject. However, elevated creatinine and absolute neutrophil count reflect declined renal function and inflammation, respectively [51], which again are linked to a higher ASA score and a poorer baseline physical condition.
A higher INR reflects impaired coagulation and most likely the use of anticoagulants, resulting in a higher risk of bleeding and, consequently, a higher risk of morbidity and mortality [46, 51]. In contrast, a lower hemoglobin level is related to chronic comorbidities, which may explain the lower odds of mortality at higher hemoglobin levels [51].

In recent years, a considerable amount of research has been done on predicting mortality in femoral neck fracture patients. Most of the tools developed estimated risk based on age, gender and the general presence of comorbidity [52, 53], whereas others looked at postoperative factors, such as early ambulation after surgery and postoperative laboratory values [54, 55]. Rather than using the general presence of comorbidity, our study used the ability of ML algorithms to differentiate between the effects of different types of comorbidity in a large database and to estimate the individual value of each factor. This resulted in a more patient-centered prediction tool.

Future perspectives

External validation is essential before testing and implementing the ML algorithm in clinical practice. Subsequently, a prospective observational study comparing the current ML model’s prediction with a physician’s prediction of mortality can assess the clinical usefulness of the developed model. This will show whether the model’s predictions are more accurate than those of the treating physician [56]. Internally and externally validated algorithms can then be integrated into the electronic health record with an active feedback loop to improve model performance, and ultimately be integrated into the clinical workflow [57, 58].

Conclusion

In summary, the developed and internally validated clinical prediction model effectively predicts 90 day and 2 year mortality in femoral neck fracture patients aged 65 years or above with good model performance on discrimination, calibration and Brier score. The model for 2 year mortality in particular would likely improve the challenging treatment decision-making. Nevertheless, the model first requires external validation in an independent cohort. The model can be freely accessed: https://sorg-apps.shinyapps.io/hipfracturemortality/.