INTRODUCTION

Delirium, the most common post-operative complication in adults over the age of 65, has an incidence of 15–25% after major elective surgery.1 Delirium is associated with both short- and long-term clinical and functional complications2,3,4,5,6,7,8,9 and greater risk for developing dementia.10,11 Delirium prediction algorithms could help pre-operatively stratify patients according to risk for delirium to improve patient care, reduce risk of adverse outcomes, and facilitate enrollment into clinical trials.

Multiple approaches have previously been proposed to predict delirium,12,13,14,15,16,17 but few studies have employed machine learning (ML) algorithms.18,19,20,21,22,23 ML methods are optimally applied when data are abundant.24 Nevertheless, it is important to consider whether ML algorithms can be usefully applied to smaller datasets, which are more common in clinical cohort studies of disorders like delirium. High-volume datasets generally derive from administrative data or routinely collected clinical data, which may suffer from measurement error to a greater degree than purposefully collected research data.25 Our objective was to identify the optimal ML approach to predict delirium in a rigorous, well-characterized, prospective, observational cohort study of delirium, and to compare it with a traditional statistical prediction model.

We analyzed data from the Successful Aging after Elective Surgery (SAGES) study, which used reference standard approaches to assess both pre-operative cognitive function and post-operative delirium,26,27,28 which are often not available in large datasets that rely primarily on electronic health records (EHR). Based on prior work,17,23,29 we hypothesized that we could identify a ML model to predict delirium with an area under the receiver operating characteristic curve (AUC) greater than 0.70, indicating good diagnostic accuracy,30 and that this ML model would have a higher AUC than a model derived using stepwise logistic regression. Given the known importance of cognitive function to delirium prediction,31,32,33 we further sought to determine the extent to which prediction can be improved with the inclusion of a measure of pre-operative cognitive function (a variable that may not always be available in the pre-operative setting) in the feature set (i.e., list of predictors).

METHODS

Study Population

The SAGES study design and methods have been described in detail previously.26,27 In brief, eligible participants were aged 70 years and older, English speaking, scheduled to undergo elective surgery at one of two Harvard-affiliated academic medical centers, and with an anticipated length of stay of at least 3 days. Eligible surgical procedures included the following: total hip or knee replacement; lumbar, cervical, or sacral laminectomy; lower extremity arterial bypass; open abdominal aortic aneurysm repair; and open or laparoscopic colectomy. Exclusion criteria were dementia, delirium, hospitalization within the past 3 months, terminal condition, legal blindness, severe deafness, history of schizophrenia or psychosis, and history of alcohol abuse or withdrawal. A total of 560 patients met all eligibility criteria and were enrolled between June 18, 2010, and August 8, 2013. Written informed consent for study participation was obtained from all participants according to procedures approved by the institutional review boards of Beth Israel Deaconess Medical Center and Brigham and Women’s Hospital, the two study hospitals, and Hebrew SeniorLife, the coordinating center for the study.

Data Collection

Participants underwent baseline assessment in their homes approximately 2 weeks (mean [standard deviation] 13 [15] days) prior to surgery.26 All study interviews were conducted by experienced interviewers who underwent 2–4 weeks of intensive training and standardization. Inter-rater reliability assessment and standardization on all key study variables, including delirium assessment, was conducted every 6 months throughout the study and coding questions were addressed in weekly meetings of all study staff. Medical records were reviewed by study clinicians to collect information on surgical procedure, anesthesia type and duration, abnormal laboratory results, baseline diagnoses, development of delirium, precipitating factors for delirium (e.g., medications, iatrogenic events, or catheters), post-operative complications, and death.26 Chart abstraction data were randomly checked for illogical values and against data collected as part of the screening process (e.g., surgery type). In addition, a 10% subset of charts underwent re-abstraction for reliability checks.26

Assessment of Delirium

The delirium assessment, which took 10–15 min, included daily brief cognitive testing,27,34 the Delirium Symptom Interview (DSI),35 and family and nurse interviews conducted from the first postoperative day until discharge. Delirium was rated using the Confusion Assessment Method (CAM).36 The CAM is a standardized approach with high sensitivity (94–100%) and specificity (90–95%) in prior studies.37,38 Inter-rater reliability was high in SAGES (kappa statistic = 0.92 in 71 paired ratings).26 The DSI was used to rate CAM symptoms. An established chart review method was used to capture delirium symptoms between interviews.28,39 Patients were classified as delirious if either the CAM or chart review criteria were met. With this procedure, approximately 83% of identified cases are detected by the patient assessment (31% of which are also identified by chart review), and the remaining 17% of cases are identified through chart review alone, undetected by the patient assessment.28 Given an overall incidence of delirium of 24%, this implies an incidence of CAM delirium of 20% and an incidence of chart delirium of about 10%.

Identification and Formalization of the Predictor Variable Set

Medical records were reviewed with a comprehensive medical record abstraction tool to collect information on the surgical procedure, anesthesia type and duration, baseline diagnoses and comorbidity, abnormal laboratory results, development of delirium, precipitating factors for delirium (e.g., medications, iatrogenic events, catheters, or physical restraints), post-operative complications, and intercurrent illnesses.27 From this information set, we identified features for use in our predictive models. Potential predictors were required either to be readily available in a clinical setting through existing sources (e.g., medical record or standard laboratory data) or through quick screening tests that would be feasible in a busy clinical setting. We decided that although pre-surgical medication use could be predictive of post-operative delirium risk, the process of identifying predictors from among the multitude of medications in various formulations and dosages would require extensive pre-processing and would not satisfy the criterion of being readily available in a clinical setting. Of the remaining potential features, 71 pre-operative variables were selected and included demographic characteristics, lifestyle factors, cognitive function, physical function, psychosocial factors, frailty, sensory function, medical conditions, and laboratory values (Appendix Table 1). We will refer to this set of 71 variables as the full feature set. Missing data in the feature set were multiply imputed by chained equations.
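The chained-equations approach models each incomplete feature in turn as a function of the remaining features, repeating the procedure with different random seeds to obtain multiple imputed datasets. A minimal Python sketch on hypothetical data (the study's analyses were conducted in R; scikit-learn's `IterativeImputer` is used here as an analogue of chained equations):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Toy feature matrix standing in for the pre-operative features,
# with ~10% of values missing at random (hypothetical data, not SAGES data).
X = rng.normal(size=(100, 5))
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

# Chained-equations imputation: each feature with missing values is modeled
# as a function of the other features, iterating until convergence.
# Drawing from the posterior with m different seeds yields the multiple
# imputations described in the text.
imputations = [
    IterativeImputer(random_state=seed, max_iter=10,
                     sample_posterior=True).fit_transform(X)
    for seed in range(5)  # m = 5 imputed datasets
]
```

In practice each imputed dataset would be analyzed separately and the results pooled.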

In addition to using the full feature set, we identified a selected feature set with a more manageable number of predictors. Features were selected through two iterative rounds of review by clinicians (S.K.I., T.G.F., T.T.H., E.D.M., E.R.M., P.A.T.) with expertise in delirium, neurology, geriatrics, geriatric psychiatry, general medicine, and nursing. The final set of 18 predictors is described in Table 1. Because of the known importance of cognitive function in delirium prediction,31 we performed analyses of the selected feature set with and without a summary score from a brief mental status test, the modified mini-mental state examination (3MS).40 Thus, all analyses were performed using three overlapping feature sets: (1) the selected feature set (q = 18 features) chosen by the expert panel; (2) the selected feature set plus 3MS (q = 19 features); and (3) the full feature set (q = 71 features).

Table 1 Pre-operative Patient Characteristics

Machine Learning Algorithms and Comparison Statistical Prediction Model

ML algorithms for prediction of delirium included cross-validated logistic regression, gradient boosting, neural network, random forest, and regularized regression (least absolute shrinkage and selection operator (LASSO) and ridge regularization).41,42,43,44 In addition, we assessed model performance with two ensemble approaches. Ensemble methods combine multiple ML algorithms to obtain better predictive performance.45 We considered two relatively straightforward ensemble methods. The first results in a positive test (i.e., predicts delirium) if any of the 5 individual algorithms is positive (ensemble-union). The second results in a positive test if a majority of the individual tests (≥ 3) are positive (ensemble-majority).
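The two ensemble rules can be sketched as follows (an illustrative Python sketch; the binary predictions shown are hypothetical, not study results):

```python
import numpy as np

def ensemble_union(predictions):
    """Positive if ANY individual algorithm predicts delirium."""
    return np.any(predictions, axis=0).astype(int)

def ensemble_majority(predictions, k=3):
    """Positive if at least k of the individual algorithms agree."""
    return (np.sum(predictions, axis=0) >= k).astype(int)

# Binary predictions from 5 hypothetical algorithms (rows) for 4 patients
# (columns).
preds = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
])

union = ensemble_union(preds)        # [1 1 0 1]
majority = ensemble_majority(preds)  # [1 1 0 0]
```

As the example suggests, the union rule raises sensitivity at the cost of specificity, while the majority rule is more conservative.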

We used two strategies to compare our ML algorithms to standard approaches for delirium prediction. First, we evaluated prediction using standard backwards stepwise logistic regression. The starting model included all predictors; each subsequent step eliminated the predictor whose removal yielded the greatest reduction in the Akaike information criterion (AIC), terminating when no further removal reduced the AIC. Second, we used a previously published delirium risk prediction rule for hospitalized medical patients16 to obtain predictions of delirium in the SAGES sample used for model testing. This predictive model uses vision impairment, severe illness, cognitive impairment, and high blood urea nitrogen/creatinine ratio to stratify patients according to risk for delirium.

Analysis and Comparison of Models

To implement the learning algorithms, we split the SAGES sample into a training set (80%) used for model derivation and a testing set (20%) used for model validation. Random assignment to the training/testing sets was stratified on delirium status. For ML models, we performed repeated k-fold cross-validation (k = 4, 10 repeats) to identify the optimal model parameters based on optimization of the AUC in the training set.46 We compared models based on performance in the test set on the following criteria: AUC, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), detection rate, and balanced accuracy. To compare these performance characteristics across models, we fixed the detection prevalence at 25%, similar to the incidence in our sample (24%) rounded to the nearest 5%. We examined overall calibration by plotting the distribution of predicted probabilities of delirium as a function of observed delirium for all algorithms using violin plots. We also generated calibration curves, which plot the observed proportion classified as delirious against the model-implied proportion with delirium, given predictions derived from a model. These figures are displayed in the supplementary material. All analyses were conducted within the R computing environment (version 3.6.1, R Core Development Team, Vienna, Austria) using several packages, including caret,47 nnet,48 earth,49 glmnet,42 randomForest,50 kernlab,51 and gbm.52 Analysis code is available upon request.
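The split-and-tune workflow above can be sketched as follows (an illustrative Python/scikit-learn sketch on simulated data standing in for the SAGES feature matrix; the study's analyses used R's caret, and gradient boosting stands in for any of the tuned algorithms):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)

# Simulated data: 560 patients, ~24% positive class, as in SAGES.
X, y = make_classification(n_samples=560, n_features=20, weights=[0.76],
                           random_state=0)

# 80/20 split, stratified on delirium status as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Repeated k-fold cross-validation (k = 4, 10 repeats), selecting the
# hyperparameters that maximize the AUC in the training set.
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=10, random_state=0)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [1, 2]},
    scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)
```

The held-out `X_test`/`y_test` would then be used only once, for the final performance comparison.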

RESULTS

Patient characteristics, including the selected feature set, are described in Table 1. The training and test sets were selected at random, so any differences between them were due to chance. None of the differences between the training and testing sets (Cohen's h for proportions, d for continuous variables53) exceeded an effect size of 0.15, well below the conventional threshold for small effects; the mean effect size across all features was 0.07. By design, the incidence of delirium was constant (with minor variation due to rounding) in the full sample (24%), training sample (24%), and test sample (23%).
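For reference, Cohen's h for a difference between two proportions is twice the difference of their arcsine-square-root transforms; a minimal sketch (the 24% vs. 23% delirium incidence is used only as an example):

```python
import math

def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# e.g., comparing a 24% vs. 23% incidence gives a negligible effect size,
# far below the 0.2 threshold conventionally labeled "small."
h = cohens_h(0.24, 0.23)
```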

Figure 1 panels a, b, and c illustrate receiver operating characteristic (ROC) curves for predictive models under the selected, selected + 3MS, and full feature sets, respectively. Under the selected feature set, all models perform comparably, with curves near the main diagonal, indicating performance little better than chance. The models improve with the addition of a measure of cognitive performance, 3MS (panel b), with the highest AUC observed using the full feature set (panel c).

Fig. 1

Comparison of receiver operating characteristic (ROC) curves for prediction of delirium by the various machine learning (ML) algorithms examined. a ROC curves when a measure of pre-operative cognitive function (3MS) was not included in the selected feature set; b ROC curves when 3MS was included in the selected feature set; c ROC curves for the full feature set.

Detailed results of the prediction modeling are summarized in Table 2. In this table, we summarize statistics describing each algorithm and its success in predicting delirium in the test data. These include the area under the receiver operating characteristic curve (AUC, with 95% confidence interval), which can be interpreted as the probability that, if presented with a random case and a random control, the case would receive the higher predicted value. We also present standard confusion matrix results, including the sensitivity (proportion of cases that are predicted to be cases given the model), specificity (proportion of controls predicted to be controls), positive predictive value (proportion of those predicted to be cases that are actually cases), and negative predictive value (proportion of those predicted to be controls that are actually controls). The detection prevalence, the proportion of persons in the testing set classified as delirious by the algorithm, was fixed to 25% for all algorithms other than the ensemble methods. With the selected feature set (3MS not included among the predictors), all models performed similarly and close to chance, with AUCs derived from the testing/validation data ranging from 0.53 to 0.57. With the selected feature set + 3MS, prediction improved for most algorithms, indicated by higher AUCs relative to the selected feature set (Table 2). Using this set, stepwise logistic regression had the highest AUC (0.68) among all the predictive modeling approaches. Among the ML models, regularized regression and cross-validated logistic regression shared the highest AUC (0.66). Regularized regression had a positive predictive value (PPV) of 0.36, negative predictive value (NPV) of 0.81, sensitivity of 0.38, and specificity of 0.79.
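Fixing the detection prevalence amounts to classifying the top 25% of predicted risks as delirious and computing the confusion-matrix statistics at that cutoff; a minimal sketch on hypothetical scores (`metrics_at_detection_prevalence` is a hypothetical helper, not the study's code):

```python
import numpy as np

def metrics_at_detection_prevalence(scores, y, prevalence=0.25):
    """Classify the top `prevalence` fraction of predicted risks as
    delirious, then compute the confusion-matrix statistics."""
    cutoff = np.quantile(scores, 1 - prevalence)
    pred = (scores > cutoff).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn),
        "detection_prevalence": (tp + fp) / len(y),
    }

# Hypothetical predicted risks that are mildly informative about outcome.
rng = np.random.default_rng(2)
y = rng.binomial(1, 0.24, size=200)
scores = 0.5 * rng.random(200) + 0.5 * y * rng.random(200)
m = metrics_at_detection_prevalence(scores, y)
```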

Table 2 Comparison of Machine Learning Algorithms for Prediction of Delirium in Three Overlapping Feature Sets

Prediction was strongest when we used the 71 predictors in the full feature set (Table 2). The highest AUC (0.71) was observed for the neural network algorithm. When the detection prevalence was set to 25% for comparison across models, the neural network model had among the highest PPV (0.46), NPV (0.84), sensitivity (0.50), and specificity (0.82). The ensemble-union approach achieved the highest sensitivity (0.62). Note that the ensemble-union approach is the only one whose detection prevalence (the proportion of the sample identified as a probable case of delirium) deviates from 0.25: for all other algorithms, the detection prevalence was constrained to 0.25 by design, but this constraint cannot be applied to the ensemble-union.

Similar to the ML models, delirium prediction using stepwise logistic regression was poor (AUC = 0.54; Fig. 1a; Table 2) without the inclusion of the 3MS. After 3MS was added to the potential predictor set, the stepwise logistic regression showed improved model performance (AUC = 0.68, sensitivity = 0.42; specificity = 0.80; PPV = 0.39; NPV = 0.82; Fig. 1b; Table 2).

Figure 2 illustrates the range and distribution of predicted probabilities of post-operative delirium among patients identified as having delirium and those without delirium for each of the predictive modeling approaches, in the validation dataset (hold-out sample) and using the full feature set. The violin-like shape of the distributions among those who did not have delirium illustrates the relatively high specificity of the predictive models; the more rectangular shape of the distributions among those who did have delirium illustrates the relatively low sensitivity of all predictive modeling approaches (Table 2). Model calibration is described more completely in the Supplemental Appendix. Overall, model calibration was poor. Poor calibration matters when the model's predicted value is to be used directly as an estimate of the probability of delirium. Our model evaluations addressed poor calibration by fixing the detection prevalence to match the sample prevalence.
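A calibration curve of the kind described in the Supplemental Appendix can be sketched as follows (an illustrative Python/scikit-learn sketch on simulated predictions, not the study's data):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Simulated outcomes and mildly informative predicted probabilities.
rng = np.random.default_rng(4)
y = rng.binomial(1, 0.24, size=500)
prob = np.clip(0.24 + 0.2 * (y - 0.24) + rng.normal(0, 0.1, 500), 0.01, 0.99)

# Observed event proportion within bins of predicted probability; a
# well-calibrated model tracks the 45-degree line.
observed, predicted = calibration_curve(y, prob, n_bins=5)
```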

Fig. 2

Violin plots showing the distribution of the probability of delirium across ML models and stepwise logistic regression for the full feature set. In addition to a marker for the median of the data and a box indicating the interquartile range (as in standard box plots), these violin plots also show the kernel probability density of the data at different values for non-delirious patients (green) and delirious patients (salmon). The horizontal bar indicates a detection prevalence of 25%.

Finally, we used a published delirium prediction model16 to classify hospitalized patients according to risk for delirium. This predictive model uses vision impairment, severe illness, cognitive impairment, and high blood urea nitrogen/creatinine ratio to stratify patients according to risk for delirium; all but vision impairment were among the features in our full feature set. Vision impairment was considered a plausible risk factor by our clinical experts but was dropped from the feature set due to a lack of variance. The published algorithm classified 59/111 (53%) of the testing set as delirious, with a positive predictive value of 25%, a sensitivity of 58%, and a specificity of 48%. The AUC was 0.55.

CONCLUSIONS

Using multiple ML approaches and a standard statistical technique, we were able to predict delirium with moderate accuracy based on variables that are readily available or minimally burdensome to collect in a clinical setting (e.g., on hospital admission). We demonstrated that ML methods can be used to develop prediction algorithms that perform better than chance but fail to demonstrate superior performance relative to models developed using stepwise logistic regression in hold-out validation data from a single clinical cohort study. The derived models achieved higher AUCs than a previously published predictive model derived in a different population. Nevertheless, predictive performance was modest, model calibration was poor, and the general pattern of results (e.g., Fig. 2) suggests that persons at low risk for delirium may tend to be alike (good pre-operative cognition, few vulnerability factors) but that persons who do develop delirium do so for widely varying reasons that are difficult to identify robustly. In agreement with a recently published systematic review, we find no strong evidence for benefit of ML over traditional logistic regression in developing a prediction rule.54

Strong prediction requires strong predictors. Delirium is by nature a heterogeneous, multifactorial condition, and predictive models have been relatively limited in their overall performance and their ability to generalize across populations. In our sample, predictive performance was better when pre-operative cognitive function (3MS) was included in the feature set. The role of cognitive impairment in delirium prediction is well established.31,32 Further improvements in performance were observed when the feature set included a large number of clinical variables, demonstrating the advantage of using ML for cohort studies with a large number of predictors and high-quality data. Modestly improved performance over stepwise regression supports the notion that ML approaches may help improve prediction, with the understanding that the full realization of these advantages may require more data than we had available.

Our study adds to a small but growing body of literature using ML to predict delirium. Among published studies, there is substantial variability in terms of sample size, identification of delirium, patient population, and types of ML algorithms evaluated. For instance, compared with prior studies, our sample size was much smaller (N = 560 compared with N = 9221–64,237). Prior studies variably defined delirium using ICD-10 codes,21,22 CAM alone or CAM and some other instrument,19,20 or DSM criteria.18 The method for detection of delirium is of critical importance, as methods differ in sensitivity and may disproportionately detect certain subtypes of delirium, such as hyperactive delirium. Model development and performance will also vary substantially depending on the population studied and factors included to develop and test delirium prediction models. Although model prediction may be improved by considering post-baseline (but pre-delirium) precipitating factors occurring during hospitalization, these were not examined in the current study since information collected during hospitalization would not be available during the window to recruit patients into a clinical trial, which is the long-term goal of this research. Model performance has also varied across studies using either ML or statistical models to predict delirium, with AUCs ranging from 0.56 to 0.94; the results of the present study (AUC = 0.70) fall within this range.17 While the AUC is the most commonly used metric to quantify the quality of a predictive model,54 its interpretation can be somewhat challenging. It is worth considering the implications of a given AUC under different conditions of disease and screen-positive proportions.
For instance, if both the prevalence and the screen-positive proportion of delirium are 25%, we estimate that an AUC of about 0.75 is required to achieve a PPV greater than 50%, whereas if both are 5%, an AUC of about 0.92 is needed. The delirium prevalence was considerably higher in our study (24%) than the rates observed in other ML studies of delirium (3–9%). A likely reason is that ML is optimally applied in very large datasets, which are typically derived from administrative or routine clinical assessment data rather than from highly controlled research protocols with extensive training and data quality controls.25 Such administrative and routine clinical datasets very likely contain many more false negatives than data derived from rigorous field studies.
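These estimates can be checked with a small Monte Carlo simulation under a binormal assumption, in which case and control scores are unit-variance normals separated so the classifier attains a given AUC (an illustrative sketch, not the authors' calculation; `ppv_at` is a hypothetical helper):

```python
import numpy as np
from scipy.stats import norm

def ppv_at(auc, prevalence, screen_positive, n=200_000, seed=3):
    """Monte Carlo PPV under a binormal model: the top `screen_positive`
    fraction of scores is called positive."""
    rng = np.random.default_rng(seed)
    d = np.sqrt(2) * norm.ppf(auc)   # mean separation implied by the AUC
    y = rng.random(n) < prevalence   # true delirium status
    scores = rng.normal(loc=d * y, scale=1.0)
    cutoff = np.quantile(scores, 1 - screen_positive)
    return y[scores > cutoff].mean()

# Both settings from the text land near a PPV of 0.5:
ppv_a = ppv_at(0.75, 0.25, 0.25)  # prevalence and screen-positive both 25%
ppv_b = ppv_at(0.92, 0.05, 0.05)  # prevalence and screen-positive both 5%
```

Note that when the screen-positive proportion equals the prevalence, PPV equals sensitivity, which is why both quantities cross 50% together.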

This study has several strengths, including evaluation of multiple ML approaches in a well-characterized cohort with little missing data, reference standard determination of delirium, and inclusion of a measure of pre-operative cognitive function in the feature set, which sets our study apart from those relying solely on EHR data. This study also has a number of limitations. First, although the SAGES study is one of the largest studies of surgical patients with detailed pre-operative assessment, it is smaller than most datasets used for ML; a larger dataset would yield more stable parameter estimates and better replication across model training and testing. Second, delirium prediction may have been improved by inclusion of additional variables, including neuropsychological test scores, genetic information (e.g., APOE4), biomarkers that are not commonly evaluated clinically (e.g., C-reactive protein), or post-baseline and precipitating factors, especially medications. We elected not to include variables that would be difficult or time-consuming to collect prior to surgery in order to increase the clinical applicability of the results, including the potential to use these models for recruitment from clinical settings into large, multi-site clinical trials. Third, the absence of an external dataset for model validation, and our reliance on an internal hold-out sample for model testing, is not ideal. Fourth, we compared our machine learning models with only one statistical algorithm (backwards stepwise logistic regression); it is possible that other statistical algorithms or different parameters could have improved its performance compared with the machine learning models.

In conclusion, we developed prediction models for post-operative delirium that performed better than chance. This study supports the notion that, using available or minimally burdensome clinical data, either machine learning or more traditional stepwise logistic regression methods can be used to identify patients at high risk of developing delirium after surgery. These models could be used to identify high-risk persons for known delirium prevention interventions, or to optimize recruitment into clinical trials aimed at improving post-operative outcomes.