Introduction

Shunt-dependent hydrocephalus is a significant complication of aneurysmal subarachnoid hemorrhage (SAH) affecting 11 to 39.6% of patients with aneurysmal SAH [1, 8, 13, 17, 27, 29, 35, 39, 49]. Several studies have tried to identify potential predictors of shunt-dependency to estimate prognosis, to offer the best treatment strategy by preventing complications associated with unnecessary treatment or delayed surgical intervention, and to reduce hospitalization and rehabilitation length and costs [1, 5, 8, 9, 11,12,13, 16, 17, 21,22,23, 25, 27, 29, 31, 33, 35, 36, 39, 43, 46,47,48, 50]. Among these variables, some have been proposed, like the patient’s age and gender [22, 23, 33, 35, 36, 47, 48], the neurological status at presentation (Hunt & Hess and WFNS scales) [12, 20,21,22, 25, 27, 47, 48], the amount of cisternal blood on the first CT scan (Fisher and BNI scales) [12, 16, 21,22,23, 25, 35, 47], the presence of acute hydrocephalus on the first CT scan and the need for external ventricular drain (EVD) [1, 12, 21,22,23, 25, 27, 33, 36, 43, 47, 48], the duration of EVD treatment [25, 48], aneurysms location and size [9, 22, 25, 33, 36, 47], the type of treatment for aneurysm exclusion (endovascular or surgical) [9, 25, 35, 36, 50], the onset and duration of posttreatment complications (i.e., fever and/or infections) [25, 27, 36, 47, 50], the duration of blood clearance detected on serial CT scans [29], and altered values of blood or cerebrospinal fluid (CSF) markers [27, 31, 46]. Some meta-analyses [11, 45, 47, 50] have synthesized this information, and scores for risk stratification have been proposed to be used in the clinical practice [5, 12, 13, 21, 27], whose validity, however, has not yet been confirmed on other cohorts.

Despite consistent associations between some variables and the development of shunt-dependency across studies, results differ on the role of other potentially important items, with the likely effect of scores over-emphasizing some variables while neglecting relevant ones. This could represent a limitation when risk scores built on these premises are used in clinical practice.

Modern standards of data analysis and prediction models rely on machine learning (ML), a branch of statistical analysis that is gaining more and more consideration in the medical field due to its excellent results and, more recently, also in neurosurgery [6, 10, 18, 34, 41, 42, 44]. ML consists of algorithm-based models with the ability to learn and perform tasks that are not explicitly programmed, to improve the performances with experience (i.e., when the model analyzes new data), and to work with a large amount of data and nonlinear associations, where classical statistical methods can show some limitations [6, 18, 38].

We aimed at testing the ability of machine learning models to predict the development of shunt-dependent hydrocephalus in aneurysmal SAH patients, intending to develop a prognostic model based on current data analysis standards, in order to reduce omission of potentially relevant variables and allow for better individual risk estimation.

Methods

Data collection and variables selection

Electronic files and radiological data of patients undergoing surgical or endovascular treatment for aneurysmal SAH at our institution between January 2013 and March 2019 were retrospectively consulted to collect information on variables potentially related to shunt-dependency, according to the results of previous works on this topic [1, 5, 7,8,9, 11,12,13, 16, 17, 21,22,23, 25, 27, 29, 33, 36, 39, 43, 46,47,48, 50]. Also, we expanded this information by including quantitative information on the neurological and general clinical status of patients at the time of the acute event (SAH), such as the Karnofsky performance status (KPS), the ASA physical status classification system, the modified rankin scale (mRS), and the National Institute of Health Stroke Scale (NIHSS). The variables considered are summarized in Table 1.

Table 1 Variables retrieved

Patients were subdivided according to their GCS at presentation into three groups to train the models: 12–15, 8–11, and < 8. We subdivided treatment timing into very early (< 6 h), early (6–12 h), late (12–24 h), and delayed (> 24 h). Patients were further dichotomized according to the duration of EVD permanence and fever in ≤ 5 days and > 5 days. Fever onset was dichotomized in early and delayed onset (cutoff: 7 days after treatment). The bicaudate index was measured at the first CT after symptoms onset, immediately after treatment and at 14 days (or at the last CT performed within 14 days from the acute event). Additionally, information on follow-up duration and the onset of shunt-dependency were retrieved.

Treatment protocol

Patients with suspected aneurysmal SAH after skull CT and CT-angiography (CTA) referred to our institution were managed by a multidisciplinary team composed of neurosurgeons, endovascular neuroradiologists, and anesthesiologists who chose the most appropriate treatment on a case-by-case basis, taking into account the patient’s clinical status, age, and comorbidities, as well as the entity of SAH and the aneurysm location and morphology. In cases of imaging of insufficient quality of the CTA or unsatisfactory depiction, a digital subtraction angiography (DSA) of the intracranial vessels was performed.

After aneurysm occlusion (either endovascular or surgical), patients were transferred to the intensive care unit (ICU), the intermediate care (IMC) unit, or the neurosurgical ward according to the patients’ preoperative clinical status, age, comorbidities, entity of the subarachnoid hemorrhage, and performed treatment after the operating physician and the anesthesiologist had reached interdisciplinary consensus. In each case, patients underwent neurological and clinical monitoring for a minimum of 14 days after treatment, as well as routine transcranial Doppler studies for early detection of vasospasm. CT of the skull was performed immediately after treatment, whenever neurological deterioration occurred (pre- and postoperative), before sedation weaning (for intubated patients), before EVD removal, and before discharge.

EVD insertion, weaning, and indications for permanent shunts

Indication for EVD was posed in patients with an acute neurological deterioration associated with radiological findings of acute hydrocephalus. In patients who were referred intubated to our institution, EVD was placed when the immediate pre- or postoperative scan showed acute hydrocephalus. In both cases, the reservoir was placed at such a height to drain 10 ml of CSF/h. Once the clinical and neurological status was stable, weaning began by increasing by 2 cm of H2O every 24–48 h until the absence of CSF drainage. Then, the drain was kept closed for 24–48 h, and if neurological status remained stable, a CT scan was performed. If no ventricle dilatation was documented, the EVD was removed. If neurological deterioration occurred during EVD weaning, a CT scan was performed, and in case of evident or suspected ventricle dilatation, the reservoir was open again to drain 10 ml of CSF/h. A new attempt of weaning was made following the same protocol, and if neurological deterioration with ventricle dilatation occurred a second time, the patient was deemed shunt-dependent.

Patients with poor clinical conditions or low GCS, in which recognizing neurological deterioration would have been more challenging during EVD weaning, were treated following the same protocol, and a CT scan was performed prior to each change of the reservoir height.

After discharge, if clinical conditions remained stable, patients underwent a clinical and radiological follow-up with CTA at 3 months, MR-angiography at 6 months, and DSA at 12 months. Other investigations were performed when deemed necessary, in case of suspected incomplete exclusion of the aneurysm or when neurological changes occurred. If ventricle enlargement was detected in association with neurological deterioration during follow-up, a permanent ventricular shunt was indicated.

Statistical analysis, preprocessing, creation, and testing of models

Continuous variables are reported as mean with standard deviation, and categorical variables are expressed as percentages. Statistical analysis, data preprocessing, and graphics creation were performed with SPSS Statistics© 23 (IBM Corp. Armonk, NY, USA) and MATLAB R2020 (MathWorks Inc., Natick, MA, USA; https://www.mathworks. com). A Wilk-Shapiro test was used to assess normal distribution. We first conducted an exploratory statistical analysis, and a comparison of variables between shunt-dependent and non-shunt-dependent patients was performed with a t test for unpaired data for continuous variables and a χ2 test for categorical variables. A Bonferroni correction was used for multiple comparisons.

Before training machine learning models, missing variables for > 40% of patients were removed to avoid the significant influence of the imputation, as well as those patients with missing information on eight or more variables. Missing data were imputed with K-nearest neighbor imputation. Patients deceased before assessment of shunt-dependency were excluded, like those survived but missed on follow-up lacking information on the development of shunt-dependency.

ML models were trained using the open-source platform H2O (https://www.h2o.ai, Mountain View, CA, USA), which provides a package of scripts for ML algorithms whose parameter can be customized ad hoc. We used the web interface (H2O Flow) provided by the site running in Java™ (https://www.java.com, Oracle Corporation, Redwood, CA, USA). For our purposes, we tested four of the most frequently employed algorithms for supervised learning without knowing previously which one would be the most precise for our purposes: generalized linear modeling (GL), distributed random forest (DRF), gradient boosting machine (GBM), and deep learning (DL). The clean dataset was randomly split into training (75% of the patients) and validation set (25%). A 6-fold cross-validation was performed on the training set, before evaluating prediction performances on the validation set. Cross-validation is a resampling technique to obtain a more accurate and less biased estimate of how the model will score on previously unseen data. It consists of creating k samples (in our case, k = 6) of equal size from the training dataset, of which one is used as a validation set and the remaining as a training set. This process is repeated k times, using each of the subsamples once as a validation sample, and the results of all iterations are summarized by metrics mean and standard deviation. In our case, the area under the receiver operating characteristic curve (AUROC or AUC), accuracy, sensitivity, specificity, the positive predictive value (PPV), and the Matthews correlation coefficient were calculated. The Matthews correlation coefficient, or ϕ, is a measure of the quality of a binary classification used in machine learning, with scores ranging between + 1 identifying a perfect prediction and − 1 indicating total disagreement. A score equal to 0 means the model makes no better prediction than a random guess [4].

For each model, the algorithm parameters were customized and fine-tuned to obtain the optimized Matthews correlation coefficient. Also, the binarization threshold was chosen to maximize ϕ. Algorithms training was performed using logloss (logarithmic loss metric) as the stopping parameter: once the algorithm parameters are set by the operator, this procedure iterates the development of models of increasing complexity until the performance of the model decreases. The logloss evaluates how close the predicted values are to the actual ones. Values can be greater than or equal to 0, with 0 meaning that the model correctly predicts an event. For each model variable, importances were calculated, and recursive feature selection was performed by removing variables with lower coefficients stepwise until reaching optimal scores. Performances on the validation sets were synthesized in confusion matrices, and sensitivity, specificity, PPV and negative predictive value (NPV), and accuracy with 95% confidence intervals were calculated, along with the AUC and ϕ.

On both sets, calibration metrics were calculated as the Hosmer-Lemeshow goodness of fit test and as slope and intercept of the calibration curve.

Results

During the considered period, 479 patients underwent treatment of aneurysmal SAH at our institution (mean age: 59 ± 13 years, 320 females [66.8%], mean follow-up: 19.7 ± 12.6 months). Variables retrieved were available in the following proportions: GCS at admission in 393 patients (82% of cases), Fisher score in 377 (78.5%), Hunt-Hess in 395 (82.3%), BNI in 152 (31.6%), WFNS in 392 (81.6%), ICH in 385 (80.2%), IVH in 388 (80.8%), treatment timing in 376 (78.3%), SAH sum score in 150 (31.2%), IVH sum score in 357 (74.4%), BI preoperative in 141 (29.3%), postoperative in 358 (74.6%), at 14 days in 170 (35.4%), at the last CT scan in 249 (51.9%), mRS on admission in 397 (82.7%), ASA class in 390 (81.2%), KPS score in 392 (81.6%), NIHSS score in 392 (81.6%), presence/absence of acute hydrocephalus in 330 (68.7%), need for EVD placement in 392 (81.6%), EVD duration—if EVD present—in 150 (31.2%), rebleeding in 393 (81.9%), aneurysm location in 400 (83.3%), multiple aneurysm in 389 (81%), aneurysm max. diameter in 370 (77.1%), treatment modality in 404 (84.2%), need for posttreatment ICU in 394 (82.1%), DCI in 388 (80.8%), treatment complication in 390 (81.2%), postoperative fever in 390 (81.2%), fever timing in 375 (78.1%), days with fever in 319 (66.5%), meningitis in 382 (79.6%) or other infections in 383 (79.8%), and development of shunt-dependency in 390 (81.2%).

After removing patients deceased before evaluation of the development of shunt-dependency (n = 29 [6%]), those with more than eight missing variables or missing information on the development of shunt-dependency (n = 64, 13.4%), and after eliminating all variables missing for more than 40% of patients, the clean dataset comprised 386 patients and 32 variables (Table 2).

Table 2 Variables used for the models creation

The exploratory statistical analysis yielded significant association with the onset of shunt-dependency for the following variables: Fisher > 2 (p < 0.01), Hunt-Hess = 1 (p = 0.001, negative correlation) or 3 (p = 0.001, positive correlation), WFNS > 2 (p < 0.01), presence of IVH on admission (p < 0.001), higher preoperative IVH sum score (p = 0.001), higher postoperative BI (p = 0.001), lower KPS on admission (p = 0.004), acute hydrocephalus on admission (p < 0.001), need for EVD treatment perioperatively (p < 0.001), the permanence of EVD > 5 days (p < 0.001), the need for posttreatment ICU (p < 0.001), presence of postoperative fever lasting > 5 days (p < 0.001), and the postoperative development of meningitis (p = 0.001) or other infections (p < 0.001). See also Table 2 for details.

Performances after 6-fold cross-validation are summarized in Table 3. The highest accuracy and ϕ on the resampled training set was obtained with a DL algorithm with 31 variables (0.84 [± 0.07]) and 0.54 [± 0.1], respectively). However, the best performances on the validation set were reached with a DRF model including 21 variables (see Tables 4 and 5), with a correlation coefficient (ϕ) of 0.59 an AUC = 0.88, sensitivity and specificity of 0.73 (C.I.: 0.39–0.94) and 0.92 (C.I.: 0.84–0.97), respectively, PPV = 0.59 (0.38–0.77) and 0.96 (0.90–0.98). Accuracy was = 0.90 (0.82–0.95) (see also Table 4).

Table 3 Discrimination obtained after sixfold cross-validation on the training set (n = 296)
Table 4 Discrimination metrics on the validation set (n = 90). 95%-confidence intervals are reported in brackets
Table 5 Variables included in the DRF model after recursive feature elimination and their importance

Figure 1 and Table 6 depict the AUC and the confusion matrix obtained from this model on the validation set, while Fig. 2 is a calibration plot of the model on the training and validation set: for the test set, the calibration slope and intercept were 1.02 and 0.03, respectively, whereas the calibration slope and intercept for the validation set were 0.88 and 0.07. The Hosmer-Lemeshow goodness of fit test showed a good fit of the model on both the resampled training set (χ2 = 1.7, p = 0.99) and the validation set (χ2 = 1.02, p = 1) (see also Tables 6 and 7 and Supplementary Material).

Fig. 1
figure 1

ROC curve of the model with the best performances on the a resampled training and b validation set

Table 6 Confusion matrices of the model performance on the resampled training (n = 296) and validation set (n = 90) of the model with the highest accuracy and Matthews correlation coefficient, obtained with the distributed random forest algorithm and analyzing 21 variables
Fig. 2
figure 2

Calibration plot of the DRF model. Slope and intercept are 1.02 and 0.03 for the training frame and 0.88 and 0.07 for the validation frame

Table 7 Calibration metrics from the training and validation sets

Discussion

In this study, we trained different machine learning models to predict the occurrence of chronic shunt-dependent hydrocephalus after aneurysmal subarachnoid hemorrhage. We further identified a single model with the best performances on previously unseen data, analyzing all variables retrieved with a distributed random forest algorithm (see Supplementary Material for details on the model parameters and the model code).

The results are comparable to previously proposed predictive models (see Table 8), which, however, took into account a limited number of variables selected after a statistical analysis of association with shunt-dependency performed with traditional methods [5, 12, 13, 20, 21, 23].

Table 8 Variables considered and findings of previous works proposing prognostic scores for shunt-dependent hydrocephalus

In comparison with models and scores based on previous statistical concepts, however, ML models bear the advantage of allowing more precise and subject-based predictions by including a substantial amount of variables and analyzing complex nonlinear relationships, rather than fitting the subjects’ features into predetermined models with selected and weighted variables according to statistical significance. As our experience confirms, including items that did not show a significant association with shunt-dependency when creating ML models improved the overall performances. Moreover, in the final model, variables significantly associated with shunt-dependency did not improve the overall model accuracy when used as splitting nodes in the decision trees (see also Supplementary Material). This enables to perform a more flexible and updated prediction for each subject according to small as well as relevant changes in patients’ clinical and radiological conditions.

Additionally, ML models can improve and refine autonomously when new data are provided [37], providing a dynamic model that can increase accuracy with time.

The potentials of machine learning techniques in medicine and neurosurgery have been widely tested, and their employment in diagnostic and prognostic tasks is becoming more and more common given their abilities to outperform human capacity and traditional statistics [18, 38, 41, 42, 44]. Machine learning can be considered an evolution of traditional statistics, and there is no clear line dividing them [3]. The fundamental distinction between machine learning models and traditional statistical approaches is the ability of machine learning models to independently learn from examples rather than perform a pre-programmed task [37].

Classification or regression tasks can be accomplished by supervised learning algorithms. These algorithms work with known variables (input) and outcomes (output) to detect associations between them and, once trained, can generalize this information and predict the outcome when new inputs are provided. In contrast, unsupervised learning algorithms are used to detect unknown clusters or patterns among vast amounts of data [37].

Among four of the most diffused supervised algorithms, in our experience, the most accurate model was built on a distributed random forest algorithm and included 21 items (see Table 3, Table 4 and Supplementary Material). The concept behind the DRF algorithm is to build a set of decision trees, each taking into account a subgroup of randomly selected variables and then summarizing the results of all trees either be mean or by vote to obtain an overall prediction by majority [28]. For each tree, the algorithm identifies a set of decision rules that predict the outcome based on the given variables [24]. A detailed explanation of the other tested algorithm (GL, GBM, DL) can be found here [14, 15, 26, 30, 32].

We relied on the Matthews correlation coefficient (ϕ) to identify the single best model, a metric optimized for data imbalance that is commonly used in ML and bioinformatics [4]. When the sample size in the data classes are unevenly distributed (in our case, shunt-dependent vs. non-shunt-dependent), data imbalance occurs. This frequently happens in ML, resulting in classification models maximizing the accuracy by biasing toward the majority class and leading to poor generalization [40]. In this situation, the standard measures of performance, like accuracy, are no longer a proper measure of imbalanced data. A common way to address this issue is to over-/under-sample one of the two classes [41]. However, this strategy can alter the results when the number in the minority class is limited [4].

An additional issue with model building in ML is overfitting. This is a frequent problem occurring when a too sophisticated and accurate model learns from irrelevant information or randomness of the training dataset. As a result, the predictions on new datasets will be weak. To prevent this problem, we used cross-validation, early stopping, and features removal. Also, a so-called train-test split [19] can be a sign of overfitting: when a model performs with significantly better accuracy on the training set than on the validation set, overfitting is probably occurring. In our DRF model, no train-test split differences were observed under this respect (see Tables 3 and 4).

We have to stress some limitations: first, despite all information being recalled from clinical records (i.e., prospectively acquired), the data were collected retrospectively, and we cannot exclude related biases. Second, observations of radiological scans, for example, the bicaudate index, are highly operator-dependent, and having automatized measurements would make the data more reliable. Third, some potentially relevant variables were not included, like CSF markers, or specific surgical procedures like fenestration of the lamina terminalis: CSF markers are not routinely acquired at our institution, but it would be interesting to test them in future models. Finally, despite the good metrics shown by our final model, its ability to identify patients who will actually need a permanent shunt is less accurate than its capacity to correctly exclude subjects who will not develop chronic shunt-dependent hydrocephalus. The ability to predict the development of a disease is the actual goal of any prediction model. Still, correctly ruling out future negative patients can represent a significant support for clinical decision-making and follow-up planning as well as a tool to reduce hospitalization length and costs. Additionally, it is noteworthy that the positive and negative predictive values of a test are related to the prevalence of the condition to be predicted [2]. In our cohort, the prevalence of shunt-dependency was 12.9%, and we could reasonably expect the same model to show different positive predictive values in cohorts with a different proportion of positive subjects.

Conclusions

We trained and tested a distributed random forest model with 32 features, which reached an excellent sensitivity and specificity with ϕ = 0.59. Compared to previous models built on traditional statistical methods, it can analyze a larger amount of data and variables; work with complex nonlinear relationships; and offer a more flexible, subject-based, and accurate prognostic tool, which autonomously refines with the experience. Even though some limitations are present, prospectively testing this model performance could confirm its prognostic capacity.