Introduction

Prediction of postoperative morbidity and requirement for hospitalization is important for planning of health care resources. With regards to the common surgical procedures of primary total hip (THA) and knee arthroplasty (TKA), the introduction of enhanced recovery or fast-track programs has led to a significant reduction of postoperative length of stay (length of stay) as well as morbidity and mortality [1,2,3]. However, despite such progress, a fraction of patients still have postoperative complications leading to prolonged length of stay or readmissions [1, 3, 4]. Consequently, to prioritize perioperative care, many efforts have been published to preoperatively predict length of stay and morbidity using traditional risk factors such as age, preoperative cardio-pulmonary disease, anemia, diabetes, frailty, etc. [4,5,6,7,8]. These efforts have been based on traditional statistical methods, most often multiple regression analyses, and essentially concluding that it is “better to be young and healthy than old and sick”. Consequently, despite being statistically significant, conventional risk-stratification based on such studies has had a relatively limited clinically relevant ability to predict and reduce potentially preventable morbidity and length of stay [4,5,6,7,8].

More recently, machine-learning methods have been introduced with success in several areas of healthcare and where preliminary data suggest them to improve surgical risk prediction compared to traditional risk calculation in certain anesthetic and surgical conditions [9, 10]. This is also the case in THA, TKA and uni-compartmental knee replacement, where several publications on machine-learning algorithms for prediction of length of stay [11, 12] complications [13], disability [14], potential outpatient setup [15], readmissions [16] or payment models [17], have shown promising predictive value compared to conventional statistical methods [18].

However, few papers have included fast-track programs, and most are based on large databases with the presence of risk factors and complications often relying on administrative coding with limited information on perioperative care, follow-up and discharge destination. In our previous study of 9512 THA and TKAs within a fully implemented fast-track protocol and including the above information, we did not find advantages of machine-learning methods compared to logistic regression in predicting a length of stay > 2 days [19]. However, this may have been due to data imbalance, lack of details on medication and the chosen outcome of length of stay of > 2 days which may not be directly related to preoperative patient characteristics [19]. Furthermore, medical complications resulting in prolonged admission or readmissions may be more clinically relevant than focusing on length of stay when attempting to identify a relevant patient population for future perioperative interventions [20]. Especially within well-established fast-track protocols where LOS is about 1 day [1]. Thus, the combination of modern evidence-based surgical fast-track protocols with machine-learning models remain promising as it may provide an improved and continually developing basis for identifying which patients may benefit from more extensive preoperative evaluation and postoperative medical care.

Consequently, we used a large consecutive cohort of patients undergoing fast-track total hip and knee replacement within a national public health-care system [1] to develop and test a new machine-learning model with an extended number of preoperative variables including information on dispensed reimbursed prescriptions [21], for preoperative prediction of “medical” complications with prolonged length of stay or readmissions.

Our hypothesis was that these changes with regards to preoperative information would make a machine-learning model perform better than logistic regression at predicting which patients would experience postoperative medical complications.

Methods

Study design and population

This study on preoperative prediction is done in accordance with the Transparent reporting of multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement [22] and the Clinical AI Research (CAIR) checklist proposal [23]. The study is based on the Centre for Fast-track Hip and Knee Replacement database which is a prospective database on preoperative patient characteristics and enrolling consecutive patients from 7 departments between 2010 and 2017. Only cases with surgery between 2014 and 2017 were used in the present study to ensure the most up-to date data. The database is registered on ClinicalTrials.gov as a study registry (NCT01515670). Patients completed a preoperative questionnaire with nurse assistance if needed. Additional information on reimbursed prescriptions 6 months prior to surgery was acquired using the Danish National Database of Reimbursed Prescriptions (DNDRP) which records all dispensed prescriptions with reimbursement in Denmark [21]. Finally, data were combined with the Danish National Patient Registry (DNPR) for information on length of stay (counted as postoperative nights spent in hospital), 90-days readmissions with overnight stay and mortality. In case of length of stay > 4 days or readmission, patient discharge summaries were reviewed for information on postoperative morbidity and in case of insufficient information, the entire medical records were reviewed. Readmissions were only included if considered related to the surgical procedure, thus excluding planned procedures like cancer workouts, cataract surgery, etc. Readmissions due to urinary tract infection or dizziness after day 30 were also considered unrelated to the surgical procedure. In case of postoperative mortality, the entire medical record including potential readmissions, was reviewed to identify cause of death. Evaluation of discharge and medical records was performed by PP supervised by CJ. In case of disagreement, records were conferred with HK. Subsequently, causes of length of stay > 4, readmissions or mortality were classified as “medical” when related to perioperative care (renal failure, falls, pain, thrombosis, anemia, venous thromboembolism or infection etc.) and “surgical” if related to surgical technique (prosthetic infection, revision surgery, periprosthetic fracture, hip dislocation, etc.) [1]. In case of a length of stay 4–6 days with a standard discharge summary describing a successful postoperative course, it was assumed that no clinically relevant postoperative complications had occurred. If length of stay was > 6 days but with standard discharge summary, the entire medical record was evaluated to confirm that no relevant complications had occurred.

Perioperative management

All patients had elective unilateral total hip and knee replacement in dedicated arthroplasty departments with similar fast-track protocols, including multimodal opioid sparing analgesia with high-dose (125 mg) methylprednisolone, preference for spinal anesthesia, only in-hospital thromboprophylaxis when length of stay ≤ 5 days, early mobilization, functional discharge criteria and discharge to own home [1]. There are no selection criteria for the fast-track protocol as it is considered standard of care, but we excluded patients with previous major hip or knee surgery within 90-days of THA or TKA and THA due to severe congenital joint disorder or cancer (Additional file 1).

Outcomes

Primary outcome

The primary outcome was to develop a machine-learning model to predict the occurrence of “medical” complications resulting in a length of stay > 4 days or readmission and compare model performance with a traditional logistic regression model. We also investigated the performance of parsimonious models including only the top ten variables from the full machine-learning and logistic regression model, respectively.

Secondary outcome

Secondarily we investigated the performance of the full and parsimonious machine-learning and logistic regression models when including cases with a length of stay > 4 days but no reported “medical” complications.

Statistical analysis

Data consisted of 33 input variables, of which 7 were continuous. All variables were collected prospectively, either through the patient completed questionnaire, through the DNDRP or a combination of both (Table 1). Initially we trimmed the dataset by removing 156 patients (1.7%) who were outliers with regards to weight (< 30 kg or > 250 kg) and height (< 100 cm or > 210 cm) or where these data were missing. To reduce the risk of overfitting and allow for unbiased evaluation of model performance, data was subsequently split into a training set consisting of 18,013 (82.2%) procedures from 2014–2016 and a test set of 3913 (17.8%) procedures from 2017, as is standard in modelling of data with a temporal component [24]. These sample sizes are larger than the proposed minima of 3656, when assuming the model will explain 20% of the variability as suggested by Riley et al. [25]. The data analysis, including sample size calculation, was performed in Python and is available online at https://zenodo.org/record/7330268.

Table 1 Patient demographics with and without the primary outcome (length of stay > 4 days or readmissions due to “medical” morbidity) in the combined test and training dataset

As reference model, we used logistic regression with missing values being handled by multiple imputations. All variables were then normalized to have zero mean and unit standard deviation by subtracting the original mean and dividing by the original standard deviation. In addition, we used boosted decision trees (LightGBM) [26] for the machine-learning models, as such methods work well with categorical data and missing values. We used cross entropy as the objective function for the machine-learning model.

The full machine-learning model was trained and hyperparameter optimized using the Optuna framework [27] with the Tree-structured Parzen Estimator algorithm [28] to efficiently sample hyperparameters and with a median stopping rule to minimize optimization time. The models were trained on the training data and then used for making predictions on the unseen test data (Additional file 1). We did not use cross-validation in order not to assume a constant performance over time. The model classification threshold was intentionally calibrated to include 20% of the total number of patients (positive predictive fraction of 20%). This number was chosen based on clinical assumption of available additional or rearranged resources in the Danish National Healthcare system. We also included results for positive predictive fractions of 25% and 30% to illustrate model performance under such circumstances. Furthermore, we trained two parsimonious models using machine-learning and logistic regression with only the 10 most important features. All mentioned models were calibrated using Platt’s method (Additional file 2) [29]. Calibrated risk score distributions can be found in Additional file 2. Finally, we constructed a model based on age alone (Age) to explore the added value of multiple variable prediction.

To investigate the importance of the included variables, we computed the SHapley Additive exPlanations (SHAP) values, which provide estimates on which variables contribute most to the risk score predictions [30, 31]. Finally, we investigated a potential relation between reimbursed prescribed cardiac drugs, anticoagulants, psychotropics and pulmonary drugs and age. For evaluating model performance, we computed the number of true positives (TP), false positives (FP), false negatives (FN), true negatives (TN), sensitivity (true positive rate = TP / (TP + FN)), precision (positive predictive value = TP / (TP + FP)). Since the data was quite imbalanced (about a 1:20 positive:negative ratio) we also computed the Matthews Correlation Coefficient (MCC) which is independent of class imbalance [32, 33]. The MCC ranges between -1 (the 100% wrong classifier), 0 (the random classifier), and 1 (the perfect classifier). Finally, we computed the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC). To evaluate the statistical difference between the classifiers, we applied a Bayesian metric comparison P(sensitivity) [34], which is the probability that a model will perform better than the machine-learning model relative to the sensitivity. Thus, for two equally performing models P(sensitivity) is ≈ 50%.

Results

Median age in the 3913 patients was 70 years (IQR 62–76), 59% were female and 58% had THA (Table 1). Details on prescribed drug types are shown in Additional File 3. Median length of stay was 2 (IQR: 1–2) days with 7.6% 90-days readmissions and the primary outcome occurring in 182 (4.7%) patients. When applying any model with a positive prediction fraction of 20% to the 3913 patients, 782 qualified as “risk-patients”. The results are summarized in Fig. 1 and Table 2.

Fig. 1
figure 1

a Distribution of full machine learning model risk scores for patients ± the primary outcome. The dashed line marks the classification threshold of 20% positive prediction fraction. b Receiver operating curves (ROC) for the full machine learning model (F-MLM), full logistic regression model (F-LRM), parsimonious machine learning model (P-MLM), parsimonious logistic regression model (P-LRM) and the age-only model (AM)

Table 2 Performance of the models with a predefined positive prediction fraction of 20% for primary outcome

When considering risk scores from the full machine-learning (Fig. 1a) and full logistic regression model leading to this risk-patient selection, 106 and 97 had the primary outcome, respectively. Correspondingly, the sensitivity and precision were 58.2% and 13.6% for the full machine-learning and 53.3% and 12.4% for the full logistic regression model, respectively. The full machine-learning model was superior (Fig. 1b) on all parameters (except AUPRC) compared to any of the other models, although the differences were minor (Table 2). Thus, the likelihood that the full logistic regression model or the parsimonious ML model would actually be better than the full ML model were 17.2 and 26.4% respectively. In contrast, the likelihood that the parsimonious logistic regression and the age-only model would be better that the full ML model were less than 5% (Table 2). The results were similar when using positive prediction fractions of 25% and 30%, but with the sensitivity for the full machine-learning model increasing to 64.3% and 69.2% and precision decreasing to 12.0% and 10.7%, respectively (Additional file 4). Despite age being the single most important variable, age alone had a significantly lower sensitivity at 47.8%.

When evaluating feature importance, we found a strong correlation between the full machine-learning and full logistic regression model, with age and use of walking aids being the most important variables in both (Fig. 2a). From the combined importance of variables outside the top ten, the machine-learning approach extracted more information with fewer variables than logistic regression (Fig. 1b).

Fig. 2
figure 2

a The overall importance of the 10 most important variables measured by the SHAP-values for the full machine-learning and full logistic regression models on the primary outcome (LOS > 4 days or readmission due to “medical” morbidity). Only the importance of prescribed anticholesterols and gender differ between the models. The contributions of the remaining variables are summed in the bottom bar. b The SHAP-values for the full machine-learning model on the primary outcome, where positive values increase and negative values decrease the risk score. Each dot represents a patient and the color is related to the value of the variable with blue being lowest and red highest

For the full machine-learning model, there was a clear signal that increasing age, number of reimbursed prescriptions, and presence of comorbidity, all contributed to an increased risk score. In contrast, a recent date of surgery and an increased hemoglobin level seemed to reduce the calculated risk (Fig. 2b). Individual analysis of the SHAP interaction values for types of anticoagulant prescriptions revealed that prescriptions on vitamin-K antagonists (VKA) or adenosine diphosphate (ADP) antagonists increased, while acetylic salicylic acid and direct oral anticoagulants (DOAC) reduced the risk score of the full machine-learning model, regardless of age (Fig. 3a). The SHAP analysis of prescribed cardiac drugs revealed that prescriptions on Ca2+-antagonists and betablockers in combination with one or two other antihypertensives increased the risk-score, as did prescriptions on nitrates, other antihypertensives and antiarrhythmics. For the remaining cardiac drugs, prescriptions either reduced or had minor influence, and with limited relation with age (Fig. 3b). Preoperative psychotropic prescriptions increased the risk-score except for antipsychotics (0.6%). For users of selective serotonin inhibitors there was a clear age-related distinction with the risk score being increased in elderly patients but decreased in those < 60 years (Fig. 3c). Finally, the risk score increased with prescriptions on inhalation steroid and β-blockers, and more accentuated in the younger patients (Fig. 3d). The results for our secondary outcome which included patients with a length of stay > 4 days, but no reported postoperative complications, were similar as for the primary outcome. In general, we found that the full machine-learning model was slightly superior to the others, although the differences were less than for the primary outcome. (Additional files 5 and 6). While the ten most important variables for the full machine-learning model remained unchanged, familiar disposition for venous thromboembolism replaced gender as one of the top ten important variables in the full logistic regression model (Additional file 7). Furthermore, the SHAP analysis on specific prescribed drugs demonstrated that the machine-learning model found no benefits from information on prescriptions on respiratory drugs, why all SHAP values were zero. In addition, the reduced risk with acetylsalicylic acid and DOAC prescriptions, as well as the influence of practically all cardiac drugs except for nitrates, other antihypertensives and antiarrhythmics, was attenuated (Additional file 8).

Fig. 3
figure 3

SHAP scatter-plot on the contributions to the full machine-learning model on the primary outcome (LOS > 4 days or readmission due to “medical” morbidity), for individual types of prescribed anticoagulants, cardiac drugs, psychotropics and respiratory drugs stratified by age. a Prescribed anticoagulants VKA: vitamin K antagonists ASA: acetylsalicylic acid DOAC: direct oral anticoagulant ADP: Adenosine diphosphate ACE: angiotensin converting enzyme. b Prescribed cardiac drugs ACE: angiotensin converting enzyme AHT: antihypertensive. Other AHT were defined as AHT different from diuretics ANG-II/ACE inhibitors or Ca2+antagonists. IHD: Ischemic heart disease. c Prescribed psychotropics SSRI: Selective serotonin inhibitor SNRI: Serotonin and norepinephrine reuptake inhibitor NaRI: Norepinephrine reuptake inhibitor NaSSA: Norepinephrine and specific serotonergic antidepressants. AD: antidepressants BZ: Benzodiazepines (likely underreported due to limited general reimbursement in Denmark). ADHD: Attention-deficit/hyperactivity disorder. d Prescribed respiratory drugs. SABA: Short-acting beta agonist LABA: long-acting beta agonist LAMA: Long-acting muscarinic antagonist

Discussion

We found that using a machine-learning algorithm including all 33 available variables and a parsimonious machine-learning-algorithm encompassing only the 10 most important predictors improved prediction of patients at increased risk of having a length of stay > 4 days or readmissions due to medical complications compared to traditional logistic regression models. Thus, despite similarities in weighting of predictor variables, using the full machine-learning model resulted in approximately 5% increase in correctly identified risk-patients compared to the full logistic regression model. This corresponded to an increase in AUROC of about 1.5, which is about 3 times larger than what was found in a study investigating potential benefits of machine-learning for the NSQIP risk calculator [35].

In contrast, when also including patients having a length of stay > 4 days but without a well-defined complication as an outcome, the parsimonious machine-learning model was slightly worse than a traditional logistic regression model including all variables. Wei et al. used an artificial neural network model to predict same-day discharge after TKA, based on the NSQUIP database from 2018 and found that six of the ten most important variables were the same compared with logistic regression, similar to our findings [36].

However, patients with one-day length of stay were intentionally excluded due to variations in in-patient vs. out-patient registration [36]. A previous systematic review found that machine-learning algorithms may provide better prediction of postoperative outcomes in THA and TKA [37]. The authors concluded that such models performed best at predicting postoperative complications, pain and patient reported outcomes and were less accurate at predicting readmissions and reoperations [37]. That machine-learning algorithms may improve prediction of complications after THA and TKA compared to traditional logistic regression was also found by Shah et al. who used an automated machine-learning framework to predict selected major complications after THA [13]. However, theirs was a retrospective study based on diagnostic and administrative coding and the selected complications occurred only in 0.61% of patients, potentially limiting clinical relevance. In contrast, we aimed at identifying a cohort which would comprise 20% of patients in which we found about 60% of all medical complications. This we believe, is within the means of the Danish socialized healthcare system to allocate additional resources for intensified perioperative care and with both patient-related and economic benefits due to potentially avoided complications and costs. In this context, the models using 25% and 35% positive prediction thresholds demonstrated that the gain in sensitivity leading to identification of 14–24 more patients with complications was at the cost of 196 to 391 more patients being “wrongly” classified as risk patients. Age has traditionally been a major factor when predicting surgical outcomes and remained the single most important predictor in our study. However, although elderly patients had increased risk of postoperative complications, likely related to decline of physical reserves [38], the use of chronological age alone was inferior compared to both machine-learning and logistic regression models incorporating comorbidity and functional status. Thus, using age by itself for identifying the high-risk population resulted in missing 18% of the “true risk-patients” (87 compared to 106 in the full ML model).

We used the SHAP values for estimation of the impact of the included variables. The SHAP values show which variables contribute most to the risk-score, thus providing a better understanding of the otherwise “black-box” machine-learning model. This approach was also used by Bonde and colleagues, who used deep neural networks to predict postoperative complications across several different surgical procedures [10]. In our study, the SHAP analysis on unique Danish registry data on reimbursed prescriptions, unsurprisingly found a considerable increase in risk-score with an increasing number of prescriptions, especially in elderly patients However, this is a complex relationship where some patients benefit from their treatments, while other may suffer from undesirable side-effects. Nevertheless, the information from the SHAP analysis in machine-learning studies may provide inspiration for new hypothesis-generating studies on risk-factors, e.g. on the potential differences in risk-profile between having preoperative prescribed VKA and DOAKs found in our study. Also, the age-related differences in risk from SSRI’s could guide further studies on “deprescription”.

Another important requirement for machine-learning-algorithms to be clinically useful is user friendliness and not depending on excessive additional data collection by the attending clinicians [9]. In this context, it was disappointing that the parsimonious machine-learning algorithm with only the ten most important variables was slightly worse at predicting the secondary outcome than the full logistic regression model. This could be due to a length of stay > 4 days but without described medical complications more often is related to social and logistical factors not contained within the ten most important patient-related preoperative variables, e.g., having a supportive network, availability of homecare etc. Thus, the information gained by the combination of all available information may be of further importance when merely using LOS as outcomes in prediction studies. However, it also highlights the need for as much detailed, and preferably non-binary, data as possible to fulfill the true potential of machine-learning algorithms. In contrast to several other machine-learning studies, our dataset included only one paraclinical variable, which was preoperative hemoglobin. Although the inclusion of other laboratory tests such as albumin, sodium and alkaline phosphatase has been found to be of importance in some machine-learning algorithms [10, 39] they are not standard in fast-track protocols and not easy to interpret from a pathophysiological point of view. Also, most decisions on intensified postoperative care in elective surgery will likely need to be conducted preoperatively, as there is an increasing need to prioritize limited health-care resources. Thus, although postoperative information such as duration of surgery, perioperative blood length of stays or postoperative hemoglobin have been included in other studies [39], we decided against the use of peri- and postoperative data. The same approach has been used by Ramkumar et al. who used U.S. National Inpatient Sample data including 15 preoperative variables, to predict length of stay, patient charges and disposition after both TKA and THA [17, 40]. However, these studies were not conducted in a socialized health care system, and their main focus was on the need for differentiated payment bundles and without specific information on the reason for increased length of stay or non-home discharge [40].

Our study has some other limitations. First, one of the strengths of machine learning compared to logistic regression is the analysis of multilevel continuous data, whereas we included only a limited number of, often binary, preoperative variables. This could have limited the full realization of our machine-learning model. As previously mentioned, we excluded intraoperative information, including type of anesthesia, surgical approach etc. all of which may influence postoperative outcomes. The observational design of this study means that we cannot exclude unmeasured confounding or confounding by indication. Also, despite that the DNDRP has a near complete registration of dispensed medicine in Denmark, some types or drugs, especially benzodiazepines, are exempt from general reimbursement and thus not sufficiently captured [21]. Furthermore, it is doubtful whether the patients used all types of drugs at the time of surgery (e.g. heparin which is rarely for long-term use). The classification of a complication being “medical” depended on review of the discharge records could also introduce bias. However, we believe our approach to be superior to depending only on diagnostic codes which often are inaccurate [41] and provide limited details on whether the complication may be attributed to a medical or surgical adverse event. The strengths of our study include the use of national registries with high degree of completion (> 99% of all somatic admissions in case of the DNDRP) [42], prospective recording of comorbidity, extensive information on prescription patterns 6 months prior to surgery. Finally, the similar established enhanced recovery protocols in all departments assured that all patients were treated according to the most modern evidence-based principles. Thus, our analysis is based on well-defined time-relevant clinical treatments.

In summary, our results suggest that machine-learning-algorithms may provide slight, but clinically relevant, improved predictions for defining patients in high-risk of medical complications after fast-track THA and TKA compared to logistic regression models. Future studies could benefit from using such algorithms to find a manageable population of patients who may benefit the most from intensified perioperative care.