Background

The prevalence of elective knee arthroplasty for arthritis continues to increase, reaching 1.5% in the general population and 10.4% in those aged 80 years [1, 2]. A pattern of increasing frequency has been reported by many worldwide joint registries [2,3,4]. Knee arthroplasty is being extended to patients who are older than 60 years, or those who have substantially comorbidities, or those who have preoperative symptoms [2]. However, systemic complications such as deep vein thrombosis, delirium, and renal complications can occur in 6–9% of patients who undergo knee arthroplasty, especially in older patients [5, 6]. As the number of knee arthroplasty in older patients and those who have comorbidities is increasing annually, it is preferrable to prevent postoperative complications [6]. Of these complications, postoperative delirium can result in significant delay in rehabilitation, prolonged hospitalization, and increased mortality [4]. The incidence of delirium after elective joint arthroplasty, including knee arthroplasty, is 5–17% [5, 7, 8]. Despite its potential adverse effects, there is no consensus with regard to prevailing risk factors of delirium for developing prevention strategies [6].

Electronic health records (EHRs) accumulate huge amounts of data, facilitating machine-learning and the use of artificial intelligence [9,10,11,12]. Machine-learning models for prediction of delirium have also been developed. These models have performed better than logistic regression models for hospitalized patients [10, 12,13,14]. However, these models don’t target knee arthroplasty patients. In addition, these models use unmodifiable data such as age [10, 14], female [14], number of diagnosis [10], and data that can be collected only retrospectively such as length of hospital stay [10]. Moreover, all these models were based on data from single hospitals without external validation. It remains unclear whether these machine-learning models could improve postoperative prognoses in daily clinical practice.

This study has the following hypotheses 1) patients might have a high-risk of delirium after knee arthroplasty and long hospital stay, and 2) postoperative delirium can be predicted through machine-learning using only preoperative features. Thus, the objectives of this study were: 1) to identify key preoperative risk factors, especially modifiable factors, for delirium development, and 2) to develop and validate a machine-learning model for predicting postoperative delirium in knee arthroplasty patients.

Methods

Study population

This study included the patient who underwent primary and revision knee arthroplasty from January 2016 to Sep 2019 at two tertiary referral hospitals [15]. The developmental cohort included patients from one hospital, and the validation cohort included patients from another. The type of knee arthroplasty was defined as follows: Unilateral knee arthroplasty (UKA), Total knee arthroplasty (TKA), and revision knee arthroplasty. Exclusion criteria were 1) those who were younger than 50 years, 2) those who had established an active delirium at the time of hospitalization [16, 17]. Lowering the age threshold may compromise the integrity of cohort by including patients with underlying disease like bone tumor, RA, osteonecrosis, etc. Increasing the threshold may inefficiently shrink the study population. Fifty years age threshold was set by rule of thumb.

A total of 4,029 patients were eligible (1,973 from hospital A and 2,060 from hospital B). After apply the exclusion criteria, 1,931 patients from hospital A were assigned to the developmental cohort and 2,049 patients from hospital B were assigned to the validation cohort (Fig. 1). Baseline characteristics of both cohorts are listed in Table 1. The mean age was 71.0 (standard deviation [SD]: 6.9) years in the development cohort and 71.3 (SD: 6.7) in the validation cohort. Female comprised 86% in the development cohort and 88% in the validation cohort.

Fig. 1
figure 1

The study population. A total of 1,931 and 2,049 patients from two tertiary teaching hospitals were included in the analysis

Table 1 Baseline characteristics of the developmental and validation cohorts

Surgical protocol

Both cohorts were treated via either a parapatellar or mid-vastus approach depending on surgeons’ preferences [15]. A posteriorly stabilized implant was placed in more than 80% of TKA cases. One gram of intra-articular tranexamic acid (TXA) was given unless patients had the following contraindications: TXA allergy, a history of deep vein thrombosis, pulmonary embolism, or ischemic heart or cerebrovascular disease, and/or a glomerular filtration rate (GFR) less than 60 mL/min [15]. Pain was controlled by Celecoxib 200 mg bid, Tramadol 37.5 mg tid from 1 day before to 1 week after the surgery and IV PCA (nefopam 80mcg, fentanyl 1000mcg) for 3 days after surgery. Continuous passive motion (CPM) was applied 1 day after the surgery [15]. Ambulation was permitted 12 h after the surgery [15]. Periarticular multimodal drug injection (Ropivacaine 225 mg + Ketorolac 30 mg) was applied at 1 day after the surgery.

Outcomes

Primary outcomes were the development of delirium during the first postoperative week. Delirium assessment was conducted retrospectively using EHRs.

First, diagnosis of delirium was made based on the Diagnostic and Statistical Manual of Mental Disorder (DSM-5) [18]. Natural language words were collected from EHRs implicating the presence of delirium. Red flag word set included disturbance in attention, disoriented features, and behavioral alteration [19, 20]. A total of 69 natural language words were included. Second, two medical doctors reviewed postoperative delirium consultation with psychiatrists and postoperative antipsychotic drugs prescription records. Findings were adjudicated by a psychiatrist who had extensive training in delirium assessment [15].

When psychiatrist’s diagnosis was absent, diagnosis of delirium should be conducted in more complicated method. To be more specific, those red flag word set contains “increased irritability, no motor response, no verbal response, aggressive behavior, delusional, inappropriate emotional response, hard to communicate, lose orientation to time, place, person, decreased attention”. Inside EHR, not only physician but nurses also record patients’ status in timely fashion. Thus, it contains pretty much comprehensive and thorough information pertaining to patients’ medical and mental condition. Therefore, records containing those words in red flag set cannot directly diagnose but can assist psychiatrist determine whether patients were delirious at that time or not. For example, when red flag words were present on some patient’s EHR record and antipsychotic drugs were prescribed on the same day, the patient was very likely to be delirious. Otherwise, when antipsychotic drugs were not prescribed, we analyzed entire EHR record of each patient and decision was made in a most conservative manner. To sum up, patients without enough evidence of delirium was classified as non-delirium and there was no missing data relating to the outcome measurement.

Therefore, delirium assessment was conducted by trained staffs using DSM-5 augmented with a validated medical record review method [15, 19,20,21].

Predictor variables

A total of 63 variables were initially chosen as candidate predictors based on findings of previous studies [5, 6, 8, 10, 15, 21, 22]. They are listed in Additional file 1. After removing variables with which less than 10 patients present non-missing values, 54 variables remained. These variables were directly compared with the final outcome value; postoperative delirium; and variables with p-value < 0.05 were only chosen, which made them 20. When calculating p-value, t-test and chi-square test were used in continuous and categorical variables, respectively. These 20 variables are listed in Table 2.

Table 2 Comparison of the delirium and non-delirium groups of the developmental cohort

Demographic data included age, sex, body mass index (BMI), current smoker, and alcohol consumption (more than 5 times/week) [23, 24]. The American Society of Anesthesiologists Classification (ASA Class), fall-down risk, visual impairment, hearing impairment, and sleep impairment were extracted from preoperative checklists. Morse Fall Scale (MFS) was used for fall-down risk (Table 3) [25]. Score was calculated based on the following six patient factors: previous fall down history, presence of secondary diagnosis, usage of walking assistant device, presence of heparin lock, stability of ambulation, and psychiatric condition. The fall-down risk variable was defined as dichotomization of the final risk score. The type of surgery, operation number, and type of anesthesia (general or spinal or epidural) were included. Type of surgery included unilateral knee arthroplasty, total knee arthroplasty, revision total knee arthroplasty. Operation number included unilateral, simultaneous bilateral, staged bilateral (1-week interval).

Table 3 Morse fall risk assessment

Serum laboratory results included blood urea nitrogen (BUN), creatinine, BUN/Cr ratio, eGFR (Modification of Diet in Renal Disease), hemoglobin (Hb), hematocrit (Hct), white blood cell (WBC), c-reactive protein (CRP), erythrocyte sedimentation rate (ESR), total protein, albumin, prothrombin time (INR), ALP, AST, ALT, total bilirubin, total cholesterol, sodium, and potassium (the latest value within 90 days before surgery). Urine laboratory results included albumin level (the latest value within 90 days before surgery). The collection of preoperative laboratory values is a routine procedure in most Korean hospitals. Thus, missing values were quite rare.

To explore preoperative medication status and underlying diseases, admission records were combined with in-hospital drug prescriptions [12]. Three important drug classes (i.e., anticholinergic drugs, hypnotics/sedatives, and opioids) were included in the analyses. Drug categorization was based on the Anatomical Therapeutic Chemical (ATC) classification. Drug details are listed in Additional file 2. Anticholinergic drug cognitive burden scale [26], number of hypnotics and sedatives drugs, number of opioid drugs, and total number of drugs (any kinds of oral medication) were extracted to represent preoperative medication status.

Hypertension, diabetes mellitus, hypoglycemia, hypercholesterolemia, acute kidney injury, end-stage renal disease, atrial fibrillation, pulmonary embolism, ischemic heart disease, neurologic disorders (Parkinson’s disease, dementia, epilepsy, headache disorder), depression, generalized anxiety disorder, schizoaffective disorder, obstructive sleep apnea, cerebrovascular disease, meningitis, adrenal insufficiency, peripheral arterial disease, peripheral vascular disease, malignancy, sepsis, septic arthritis, and HIV + were extracted. History of trauma and history of amputation were also extracted.

Statistical analyses

All statistical analyses were performed using SAS version 9.4 (SAS Inc, Cary, NC, USA). A gradient boosting machine (GBM) was used to predict the probability of delirium, employing all predictor variables. GBM used a series of decision trees, where each tree corrected residuals of previous trees. XGboost is a machine learning based gradient boosting model. Figure 2 is the visualization of one of 300 trees that consists our final model. Tree leaf on the lower end is the prediction score of each tree. On each step, tree is constructed with proper split values. Addition of extra branch on each tree may enhance the accuracy of the model while increasing the complexity of the model. These accuracy score and complexity score are quantified to determine the necessity of extra tree-branching. Following this particular method, XGBoost became a compelling technique in machine learning that learns fast while avoiding overfitting. More details can be found at https://arxiv.org/abs/1603.02754. AUROC was obtained by roc_auc_score function in scikit-learn library. It is calculated as area under the curve when FPR(false positive ratio) and TPR(true positive ratio) are plotted on XY axis with varying threshold values. Python 3.7.11 and Google Colaboratory were used to encode the machine-learning algorithm. Missing values were imputed using a built-in GBM algorithm. Two feature-selection methods were used: sequential feature selection and forward elimination. The stratified K-fold (K = 10) approach was used to select predictor variables and optimize hyperparameters.

Fig. 2
figure 2

Visualization of one of gradient boosting trees [13]

The developmental cohort was divided into two subgroups: a training group (N = 1351, 70%) and a test group (N = 580, 30%). The 20 variables were divided into categorical variables and continuous variables. Feature selection method with stratified tenfold cross validation on training group was applied for each variable subgroup. Top three variables selected by this algorithm were included in the final model. For each variable subgroup, the 4th, 5th and 6th ranked variables presented by this algorithm were tested in conjunction with six pre-selected variables. The variable that maximized the performance of internal validation was incorporated into the final model. The final model contained a total of seven variables.

The final model was trained using the training group (N = 1351) with seven selected variables and tested with the test group (N = 580) to calibrate the internal validation within the developmental cohort. Youden index was used to identify the optimal ROC curve threshold [27]. External validation was performed using all data from one institution as a test set (n = 2,049).

Results

Of 3,980 patients, 196 (4.9%) were diagnosed with delirium after knee arthroplasty. These delirious patients had longer hospital stays (15.4 days vs. 11.6 days, p < 0.001) than non-delirious patients. Of 20 variables, seven key predictors were selected for the model, including four continuous variables (age, serum albumin level, number of hypnotics and sedatives drugs, and total number of drugs) and three categorial variables (neurologic disorders, depression, and fall-down risk). The odds ratio of polypharmacy (total number of drugs >  = 6) patients was 2.38 (95% confidence interval (CI): 1.55–3.30). The XGBoost importance plot is shown in Fig. 3 [12]. The performance of the final model on developmental cohort calibrated as AUC was 0.80 (95% CI: 0.77–0.84) after internal validation. Optimal threshold, sensitivity and specificity of the model was 0.085, 0.85, 0.69 respectively. For validation cohort, AUC score was 0.82 (95% CI: 0.80 – 0.83) and the sensitivity, specificity was 0.72 and 0.73 respectively. Our model was uploaded in an online website, which can be found at https://safetka.connecteve.com. The model automatically calculates the probability of postoperative delirium and visualize the value itself along with weights and significance hierarchy of each seven variables that users enter. The model was saved as “Predict_Delirium_after_knee_arthroplasty.pkl”. Thus, clinicians could still use the model when several value of variables cannot be obtained. The detailed protocol was uploaded in github repository; https://github.com/JasonJeongMed/Postoperative_Delirium_prediction_afterKA. The AUROC curve and the confusion table of internal and external validation are shown in Fig. 4.

Fig. 3
figure 3

The importance factor of the complete model. The feature importance plot was shown from the highest F score. The feature’s higher F score have a greater impact on the prediction of postoperative delirium

Fig. 4
figure 4

The AUROC and confusion table of the model. The pictures on the left from the top to the bottom are the AUROC curve of the internal and external validation, respectively. The pictures on the right from the top to the bottom are the confusion table after internal and external validation, respectively

Discussion

In the present study, the postoperative delirium risk was predicted based on the preoperative EHRs data using a machine-learning algorithm. Thus, this algorithm can be used not only to screen high-risk group, but also to assist orthopedic surgeons to take more proactive approach on delirium prevention [12].

This algorithm can also be applied in independent institutions, because the predictive performance could be maintained in an external validation. Key preoperative variables to predict delirium after knee arthroplasty were incorporated into a machine-learning algorithm. The model yielded sound performance in terms of AUC in both internal and external validation, with comparable sensitivity and specificity values respectively. Thus, the model is not institution-specific. It can be readily accessed in the outpatient clinic [12]. Even when several variables were not able to obtain, given that XGboost model works with missing values, the physician could still use this model with the rest of the variables. Twenty-one (10.7%) patients in the delirious group were hospitalized for more than 3 weeks, leading to high costs of care. Thus, implying that presence of delirium and long hospital stay may be correlated. However, direct causality cannot be guaranteed. Similar causal relationship issue regarding modifiable variables is also addressed below.

Several previous studies have used machine-learning to develop delirium prediction models [14, 24]. However, these models were not validated with an external cohort [14, 24]. Corradi et al. have developed a model using a large dataset (128 variables) with high ROC-AUC (91%) without an external validation for model [24]. In addition, too many variables can compromise an external validation. Our machine-learning model used only seven key variables that appeared to be the most important factors with respect to correlation with delirium. All variables were commonly measured in a clinical setting. Our model was not only internally validated, but also externally validated with patients from an independent institution, confirming that our model was not overfitted. Thus, its application to other institutions can be warranted.

Key preoperative features included in our model have already been discussed in previous studies as risk factors of delirium. Inouye et al. have described that age, fall-down risk, neurologic disorders, depression, and polypharmacy are risk factors of delirium [28]. Suman et al. described that those with lower albumin level and lower nutrition status have far higher risk for delirium using pooled analysis [29]. In our study, pre-operative patients take an average 5.75 drugs (Fig. 5). John et al. have established a prediction model, with polypharmacy having a high rank [24]. In our study, the odds ratio of those with polypharmacy (total number of drugs >  = 6) having delirium was 2.38 (95% CI: 1.55–3.30).

Fig. 5
figure 5

Distribution of total number of drugs at admission. A total of 1,931 patients took average 5.67 pills (SD: 4.19)

Our work had several limitations. First, the presence of delirium was not based on structured methods such as Mini-Mental State Examination (MMSE), CAM-ICU, and CERAD (Consortium to Establish a Registry for Alzheimer’s Disease) neuropsychological battery. Chart based retrospective decision of delirium can omit several patients with delirium when chart information is not sufficient to prove the status of the patients. This could lead to relatively low proportion of delirious patients compared to other studies that focused on delirium prediction [6, 30, 31]. However, various methods including consultation with psychiatrist, prescription of antipsychotic drug, and natural language analysis from medical records were deployed to detect postoperative delirium. Moreover, two psychiatrists directly reviewed medical records to confirm delirium cases to enhance accuracy. Secondly, this algorithm was developed by one medical center. It failed to cover the general population. Substantially higher female proportion of woman (> 80%) compared to western knee arthroplasty recipients (around 60%) shows that demographics may vary by institutions and countries [1, 32, 33]. Although our model was based on one particular institution, numbers of cases in both developmental and validation cohorts were larger than those of other studies. Thirdly, only patients with knee arthroplasty were recruited in the first place. This might have compromised the generalizability of the model when applying to general patients. However, incidence of postoperative delirium varies by type of surgical intervention the patient went through. Thus, by limiting the population to only single surgical intervention recipient, the machine learning model could efficiently highlight the rest of the variables that can contribute to incident delirium after surgery. Fourthly, patient’s underlying history was only identified as the name of the disease itself. It failed to deliver the severity and functional impairment of patients. However, incorporation of the severity index could seriously compromise the simplicity of EHRs information. Furthermore, we collected multiple disease histories with varying comorbidities. These variables could putatively replace the severity index of particular diseases. Lastly, modifiable variables we included in the model (albumin level, number of drugs taken preoperatively) do not actually guarantee that modification of such variables will reduce the incidence of delirium. Especially recommendation on our web-app was based on reduction of the possibility of postoperative delirium. Thus, clinical evidence of impact of these variables on postoperative delirium on real hospital setting has yet to be established. Confounders that affect both postoperative delirium and modifiable variables should be taken under consideration to clarify the causal relationship. However, we suggest that boosting the low albumin value (especially lower than 3.0 mg/dl) and refrain from taking unnecessary duplicated medication are considered as routine procedures in several hospitals. Moreover, studies described these two variables as potential risk factors for delirium [24, 28, 29]. Thus, even clinical evidence lacks, physicians can take this modification as a precautionary measure when the model recommends as such. We currently embarked on a prospective study with delirium prediction including treatment response modeling to investigate potential risk factors that has real impact on postoperative delirium. We expect that this follow-up study will verify the causal mechanism that links modifiable factors we suggested and postoperative delirium.

Conclusion

With just 7 preoperative variables, a web-based machine learning algorithm that can predict delirium after knee arthroplasty was constructed. The model is simple. It was validated to improve both short- and long-term prognoses of knee arthroplasty patients. Postoperative delirium could be a potential correlating factor of longer hospital stay. Thus surgeons should strive to avoid.