Development of machine learning models to prognosticate chronic shunt-dependent hydrocephalus after aneurysmal subarachnoid hemorrhage

Background Shunt-dependent hydrocephalus significantly complicates subarachnoid hemorrhage (SAH), and reliable prognostic methods have been sought in recent years to reduce the morbidity and costs associated with delayed treatment or neglected onset. Machine learning (ML) encompasses modern data analysis techniques that allow accurate, subject-based risk stratification. We aimed to develop and test different ML models to predict shunt-dependent hydrocephalus after aneurysmal SAH. Methods We consulted the electronic records of patients with aneurysmal SAH treated at our institution between January 2013 and March 2019. Variables for the models were selected according to the results of previous works on this topic. We trained and tested four ML algorithms on three datasets: one containing binary variables, one considering variables associated with shunt-dependency after an explorative analysis, and one including all variables. For each model, we calculated the AUROC, specificity, sensitivity, accuracy, and PPV, and, on the validation set, also the NPV and the Matthews correlation coefficient (ϕ). Results Three hundred eighty-six patients were included. Fifty patients (12.9%) developed shunt-dependency after a mean follow-up of 19.7 (± 12.6) months. Complete information was retrieved for 32 variables, which were used to train the models. The best model, selected on the basis of performance on the validation set, was a distributed random forest model considering 21 variables, with ϕ = 0.59 and AUC = 0.88; sensitivity and specificity were 0.73 (C.I.: 0.39–0.94) and 0.92 (C.I.: 0.84–0.97), respectively; PPV = 0.59 (0.38–0.77); and NPV = 0.96 (0.90–0.98). Accuracy was 0.90 (0.82–0.95). Conclusions Machine learning prognostic models allow accurate predictions with a large number of variables and a more subject-oriented prognosis.
We identified a single best distributed random forest model, with an excellent prognostic capacity (ϕ = 0.58), which could be especially helpful in identifying patients at low risk of shunt-dependency. Electronic supplementary material The online version of this article (10.1007/s00701-020-04484-6) contains supplementary material, which is available to authorized users.

Despite consistent associations between some variables and the development of shunt-dependency across studies, results differ on the role of other potentially important items, with the likely effect of scores over-emphasizing some variables while neglecting relevant ones. This could represent a limitation when risk scores built on these premises are used in clinical practice.
Modern standards of data analysis and prediction models rely on machine learning (ML), a branch of statistical analysis that, owing to its excellent results, is gaining more and more consideration in the medical field and, more recently, also in neurosurgery [6,10,18,34,41,42,44]. ML consists of algorithm-based models with the ability to learn and perform tasks that are not explicitly programmed, to improve their performance with experience (i.e., when the model analyzes new data), and to work with large amounts of data and nonlinear associations, where classical statistical methods can show limitations [6,18,38].
We aimed at testing the ability of machine learning models to predict the development of shunt-dependent hydrocephalus in aneurysmal SAH patients, intending to develop a prognostic model based on current data analysis standards, in order to reduce omission of potentially relevant variables and allow for better individual risk estimation.

Data collection and variables selection
Electronic files and radiological data of patients undergoing surgical or endovascular treatment for aneurysmal SAH at our institution between January 2013 and March 2019 were retrospectively consulted to collect information on variables potentially related to shunt-dependency, according to the results of previous works on this topic [1, 5, 7-9, 11-13, 16, 17, 21-23, 25, 27, 29, 33, 36, 39, 43, 46-48, 50]. We expanded this information with quantitative measures of the neurological and general clinical status of patients at the time of the acute event (SAH), such as the Karnofsky performance status (KPS), the ASA physical status classification system, the modified Rankin Scale (mRS), and the National Institutes of Health Stroke Scale (NIHSS). The variables considered are summarized in Table 1.
Patients were subdivided according to their GCS at presentation into three groups to train the models: 12-15, 8-11, and < 8. Treatment timing was subdivided into very early (< 6 h), early (6-12 h), late (12-24 h), and delayed (> 24 h). Patients were further dichotomized according to the duration of EVD permanence and of fever into ≤ 5 days and > 5 days. Fever onset was dichotomized into early and delayed onset (cutoff: 7 days after treatment). The bicaudate index was measured on the first CT after symptom onset, immediately after treatment, and at 14 days (or on the last CT performed within 14 days of the acute event). Additionally, information on follow-up duration and on the onset of shunt-dependency was retrieved.
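As a minimal sketch, the groupings described above can be expressed with interval binning in Python. The helper name and the column names (`gcs`, `hours_to_treatment`) are hypothetical, not taken from the study dataset; only the cutoffs come from the text.

```python
import pandas as pd

def bin_presentation(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative binning of GCS at presentation and treatment timing,
    using the cutoffs described in the Methods."""
    out = df.copy()
    # GCS groups: < 8, 8-11, 12-15 (left-closed bins)
    out["gcs_group"] = pd.cut(df["gcs"], bins=[3, 8, 12, 16], right=False,
                              labels=["<8", "8-11", "12-15"])
    # Treatment timing: very early (<6 h), early (6-12 h),
    # late (12-24 h), delayed (>24 h)
    out["timing"] = pd.cut(df["hours_to_treatment"],
                           bins=[0, 6, 12, 24, float("inf")], right=False,
                           labels=["very early", "early", "late", "delayed"])
    return out
```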

Treatment protocol
Patients with suspected aneurysmal SAH after skull CT and CT angiography (CTA) referred to our institution were managed by a multidisciplinary team of neurosurgeons, endovascular neuroradiologists, and anesthesiologists, who chose the most appropriate treatment on a case-by-case basis, taking into account the patient's clinical status, age, and comorbidities, as well as the extent of the SAH and the aneurysm's location and morphology. In cases of insufficient CTA image quality or unsatisfactory depiction of the aneurysm, a digital subtraction angiography (DSA) of the intracranial vessels was performed.
After aneurysm occlusion (either endovascular or surgical), patients were transferred to the intensive care unit (ICU), the intermediate care (IMC) unit, or the neurosurgical ward according to the patient's preoperative clinical status, age, comorbidities, extent of the subarachnoid hemorrhage, and the treatment performed, after the operating physician and the anesthesiologist had reached an interdisciplinary consensus. In each case, patients underwent neurological and clinical monitoring for a minimum of 14 days after treatment, as well as routine transcranial Doppler studies for early detection of vasospasm. Skull CT was performed immediately after treatment, whenever neurological deterioration occurred (pre- and postoperatively), before sedation weaning (for intubated patients), before EVD removal, and before discharge.

EVD insertion, weaning, and indications for permanent shunts
The indication for EVD placement was made in patients with acute neurological deterioration associated with radiological findings of acute hydrocephalus. In patients referred to our institution already intubated, an EVD was placed when the immediate pre- or postoperative scan showed acute hydrocephalus. In both cases, the reservoir was set at a height such as to drain 10 ml of CSF/h. Once the clinical and neurological status was stable, weaning began by raising the reservoir by 2 cm H2O every 24-48 h until CSF drainage ceased. The drain was then kept closed for 24-48 h, and if the neurological status remained stable, a CT scan was performed. If no ventricular dilatation was documented, the EVD was removed. If neurological deterioration occurred during EVD weaning, a CT scan was performed, and in case of evident or suspected ventricular dilatation, the reservoir was opened again to drain 10 ml of CSF/h. A new weaning attempt was made following the same protocol, and if neurological deterioration with ventricular dilatation occurred a second time, the patient was deemed shunt-dependent.
Patients in poor clinical condition or with a low GCS, in whom recognizing neurological deterioration during EVD weaning would have been more challenging, were treated following the same protocol, and a CT scan was performed prior to each change of the reservoir height.
After discharge, if clinical conditions remained stable, patients underwent a clinical and radiological follow-up with CTA at 3 months, MR-angiography at 6 months, and DSA at 12 months. Other investigations were performed when deemed necessary, in case of suspected incomplete exclusion of the aneurysm or when neurological changes occurred. If ventricle enlargement was detected in association with neurological deterioration during follow-up, a permanent ventricular shunt was indicated.
Statistical analysis, preprocessing, creation, and testing of models

Continuous variables are reported as mean with standard deviation, and categorical variables are expressed as percentages. Statistical analysis, data preprocessing, and graphics creation were performed with SPSS Statistics© 23 (IBM Corp., Armonk, NY, USA) and MATLAB R2020 (MathWorks Inc., Natick, MA, USA; https://www.mathworks.com). A Shapiro-Wilk test was used to assess normal distribution. We first conducted an exploratory statistical analysis, comparing variables between shunt-dependent and non-shunt-dependent patients with an unpaired t test for continuous variables and a χ2 test for categorical variables. A Bonferroni correction was used for multiple comparisons. Before training the machine learning models, variables missing for > 40% of patients were removed to avoid an excessive influence of the imputation, as were patients with missing information on eight or more variables. Remaining missing data were imputed with K-nearest-neighbor imputation. Patients who died before the assessment of shunt-dependency were excluded, as were those who survived but were lost to follow-up and thus lacked information on the development of shunt-dependency.
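The preprocessing pipeline described above can be sketched in Python with scikit-learn's `KNNImputer`. This is an illustrative analogue only (the study used SPSS and MATLAB); the thresholds are those stated in the text, while the function name and defaults are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def preprocess(df: pd.DataFrame, col_thresh: float = 0.40,
               row_thresh: int = 8, k: int = 5) -> pd.DataFrame:
    """Drop variables missing for > col_thresh of patients, drop patients
    with row_thresh or more missing variables, then impute the remaining
    gaps with K-nearest-neighbor imputation."""
    # Remove variables missing for more than 40% of patients
    df = df.loc[:, df.isna().mean() <= col_thresh]
    # Remove patients with eight or more missing variables
    df = df.loc[df.isna().sum(axis=1) < row_thresh]
    # Impute remaining gaps from the k most similar patients
    imputed = KNNImputer(n_neighbors=k).fit_transform(df)
    return pd.DataFrame(imputed, columns=df.columns)
```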
ML models were trained using the open-source platform H2O (https://www.h2o.ai, Mountain View, CA, USA), which provides a package of scripts for ML algorithms whose parameters can be customized ad hoc. We used the web interface (H2O Flow) provided by the site, running in Java™ (https://www.java.com, Oracle Corporation, Redwood, CA, USA). We tested four of the most frequently employed supervised learning algorithms, without knowing in advance which one would be the most precise for our purposes: generalized linear modeling (GL), distributed random forest (DRF), gradient boosting machine (GBM), and deep learning (DL). The clean dataset was randomly split into a training set (75% of the patients) and a validation set (25%). A 6-fold cross-validation was performed on the training set before evaluating prediction performance on the validation set. Cross-validation is a resampling technique that yields a more accurate and less biased estimate of how the model will score on previously unseen data. It consists of creating k samples (in our case, k = 6) of equal size from the training dataset, of which one is used as a validation set and the remaining as a training set. This process is repeated k times, using each subsample once as the validation sample, and the results of all iterations are summarized by the metrics' mean and standard deviation. In our case, the area under the receiver operating characteristic curve (AUROC or AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and Matthews correlation coefficient were calculated. The Matthews correlation coefficient, or ϕ, is a measure of the quality of a binary classification used in machine learning, with scores ranging from +1 (perfect prediction) to −1 (total disagreement). A score of 0 means the model predicts no better than a random guess [4].
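As an illustration of the split and cross-validation scheme (the study used H2O Flow, not scikit-learn, so this is an analogue, not the actual pipeline), the same 75/25 split and 6-fold cross-validation can be sketched on synthetic data mimicking the cohort's size and class balance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the 386-patient dataset: 32 variables, ~13% positives
X, y = make_classification(n_samples=386, n_features=32, weights=[0.87],
                           random_state=0)

# 75% training / 25% validation split, stratified on the outcome
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            stratify=y, random_state=0)

# 6-fold cross-validation on the training set, scored with phi (MCC)
model = RandomForestClassifier(n_estimators=200, random_state=0)
cv_phi = cross_val_score(model, X_tr, y_tr, cv=6,
                         scoring=make_scorer(matthews_corrcoef))
print(f"cross-validated phi: {cv_phi.mean():.2f} ± {cv_phi.std():.2f}")

# Final evaluation on the held-out validation set
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation AUC: {auc:.2f}")
```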
For each model, the algorithm parameters were customized and fine-tuned to optimize the Matthews correlation coefficient. Likewise, the binarization threshold was chosen to maximize ϕ. Algorithm training was performed using logloss (logarithmic loss) as the stopping metric: once the algorithm parameters are set by the operator, this procedure iterates the development of models of increasing complexity until the performance of the model decreases. The logloss evaluates how close the predicted values are to the actual ones; values are greater than or equal to 0, with 0 meaning the model predicts every event correctly. For each model, variable importances were calculated, and recursive feature selection was performed by stepwise removal of the variables with the lowest coefficients until optimal scores were reached. Performances on the validation sets were synthesized in confusion matrices, and sensitivity, specificity, PPV, negative predictive value (NPV), and accuracy with 95% confidence intervals were calculated, along with the AUC and ϕ.
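Choosing the binarization threshold that maximizes ϕ amounts to a simple grid search over candidate cutoffs on the predicted probabilities. A minimal sketch (the grid size and function name are illustrative, not from the study):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_phi_threshold(y_true, y_prob, n_grid: int = 101):
    """Scan candidate binarization thresholds and return the one
    that maximizes the Matthews correlation coefficient (phi)."""
    thresholds = np.linspace(0.0, 1.0, n_grid)
    scores = [matthews_corrcoef(y_true, y_prob >= t) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Toy example with well-separated predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.70, 0.90])
t, phi = best_phi_threshold(y_true, y_prob)
```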
On both sets, calibration was assessed with the Hosmer-Lemeshow goodness-of-fit test and with the slope and intercept of the calibration curve.
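The Hosmer-Lemeshow statistic compares observed and expected event counts across risk groups. A minimal sketch of one common formulation (deciles of predicted risk, chi-square with g − 2 degrees of freedom); the implementation details here are generic, not the study's own code:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_groups: int = 10):
    """Hosmer-Lemeshow goodness-of-fit: split subjects into groups of
    roughly equal size by predicted risk and compare observed vs.
    expected events per group."""
    order = np.argsort(y_prob)
    groups = np.array_split(order, n_groups)
    stat = 0.0
    for g in groups:
        obs = y_true[g].sum()          # observed events in the group
        exp = y_prob[g].sum()          # expected events (sum of risks)
        n = len(g)
        mean_p = exp / n
        denom = n * mean_p * (1 - mean_p)
        # Skip degenerate groups with mean risk exactly 0 or 1
        if denom > 0:
            stat += (obs - exp) ** 2 / denom
    p_value = chi2.sf(stat, df=n_groups - 2)
    return stat, p_value
```

A perfectly calibrated model yields a statistic near 0 and a p value near 1, as reported for the final model in the Results.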

Results
During the considered period, 479 patients underwent treatment of aneurysmal SAH at our institution (mean age: 59 ± 13 years; 320 females [66.8%]; mean follow-up: 19.7 ± 12.6 months). After removing patients who died before the evaluation of shunt-dependency (n = 29 [6%]), those with more than eight missing variables or missing information on the development of shunt-dependency (n = 64, 13.4%), and all variables missing for more than 40% of patients, the clean dataset comprised 386 patients and 32 variables (see Table 2 for details).
Performances after 6-fold cross-validation are summarized in Table 3, and those on the validation set in Table 4. Figure 1 and Table 6 depict the AUC and the confusion matrix obtained from the best model on the validation set, while Fig. 2 is a calibration plot of the model on the training and validation sets: for the resampled training set, the calibration slope and intercept were 1.02 and 0.03, respectively, whereas for the validation set they were 0.88 and 0.07. The Hosmer-Lemeshow goodness-of-fit test showed a good fit of the model on both the resampled training set (χ2 = 1.7, p = 0.99) and the validation set (χ2 = 1.02, p = 1) (see also Tables 6 and 7 and Supplementary Material).

Discussion
In this study, we trained different machine learning models to predict the occurrence of chronic shunt-dependent hydrocephalus after aneurysmal subarachnoid hemorrhage. We further identified a single model with the best performance on previously unseen data, analyzing all variables retrieved with a distributed random forest algorithm (see Supplementary Material for details on the model parameters and the model code). The results are comparable to previously proposed predictive models (see Table 8), which, however, took into account a limited number of variables selected after a statistical analysis of association with shunt-dependency performed with traditional methods [5,12,13,20,21,23].
In comparison with models and scores based on traditional statistical concepts, however, ML models bear the advantage of allowing more precise and subject-based predictions by including a substantial number of variables and analyzing complex nonlinear relationships, rather than fitting subjects' features into predetermined models with variables selected and weighted according to statistical significance. As our experience confirms, including items that did not show a significant association with shunt-dependency improved the overall performance of the ML models. Moreover, in the final model, variables significantly associated with shunt-dependency did not improve the overall model accuracy when used as splitting nodes in the decision trees (see also Supplementary Material). This enables a more flexible and up-to-date prediction for each subject according to small as well as relevant changes in patients' clinical and radiological conditions. Additionally, ML models can improve and refine autonomously when new data are provided [37], yielding a dynamic model whose accuracy can increase with time.

Fig. 1 ROC curve of the model with the best performances on the (a) resampled training and (b) validation set. (Table footnote: values are determined according to how much the squared error over all trees improves after the single variable is selected for splitting on a decision tree.)
The potential of machine learning techniques in medicine and neurosurgery has been widely tested, and their employment in diagnostic and prognostic tasks is becoming more and more common given their ability to outperform human capacity and traditional statistics [18,38,41,42,44]. Machine learning can be considered an evolution of traditional statistics, and there is no clear line dividing them [3]. The fundamental distinction between machine learning models and traditional statistical approaches is the ability of machine learning models to independently learn from examples rather than perform a pre-programmed task [37].
Classification or regression tasks can be accomplished by supervised learning algorithms. These algorithms work with known variables (input) and outcomes (output) to detect associations between them and, once trained, can generalize this information and predict the outcome when new inputs are provided. In contrast, unsupervised learning algorithms are used to detect unknown clusters or patterns among vast amounts of data [37].
Among the four most widespread supervised algorithms we tested, the most accurate model was built on a distributed random forest algorithm and included 21 items (see Table 3, Table 4, and Supplementary Material). The concept behind the DRF algorithm is to build a set of decision trees, each considering a subgroup of randomly selected variables, and then to summarize the results of all trees, either by mean or by majority vote, to obtain an overall prediction [28]. For each tree, the algorithm identifies a set of decision rules that predict the outcome based on the given variables [24]. Detailed explanations of the other tested algorithms (GL, GBM, DL) can be found in [14,15,26,30,32].
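The random-forest idea behind DRF (many trees, each grown on a bootstrap sample with a random subset of variables, combined by majority vote) can be illustrated with a deliberately minimal sketch. This is a didactic toy, not H2O's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class TinyRandomForest:
    """Minimal illustration of the random-forest concept: many trees,
    each fitted on a bootstrap sample with a random variable subset,
    combined by majority vote."""

    def __init__(self, n_trees=50, max_features=0.5, seed=0):
        self.n_trees = n_trees
        self.max_features = max_features
        self.rng = np.random.RandomState(seed)
        self.trees = []

    def fit(self, X, y):
        n, d = X.shape
        k = max(1, int(d * self.max_features))
        for _ in range(self.n_trees):
            rows = self.rng.randint(0, n, n)             # bootstrap sample
            cols = self.rng.choice(d, k, replace=False)  # random variable subset
            tree = DecisionTreeClassifier(random_state=0)
            tree.fit(X[np.ix_(rows, cols)], y[rows])
            self.trees.append((tree, cols))
        return self

    def predict(self, X):
        # Each tree votes; the majority class wins
        votes = np.array([t.predict(X[:, c]) for t, c in self.trees])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```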
We relied on the Matthews correlation coefficient (ϕ) to identify the single best model, a metric suited to imbalanced data that is commonly used in ML and bioinformatics [4]. Data imbalance occurs when the classes in the data are unevenly represented (in our case, shunt-dependent vs. non-shunt-dependent patients). This frequently happens in ML and results in classification models that maximize accuracy by biasing toward the majority class, leading to poor generalization [40]. In this situation, standard measures of performance, like accuracy, are no longer appropriate for imbalanced data. A common way to address this issue is to over- or under-sample one of the two classes [41]; however, this strategy can distort the results when the minority class is small [4].

An additional issue in ML model building is overfitting, a frequent problem occurring when an overly sophisticated model learns from irrelevant information or randomness in the training dataset; as a result, its predictions on new datasets will be weak. To prevent this problem, we used cross-validation, early stopping, and feature removal. A marked train-test split difference [19] can also be a sign of overfitting: when a model performs with substantially better accuracy on the training set than on the validation set, overfitting is probably occurring. In our DRF model, no such train-test differences were observed (see Tables 3 and 4).
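The value of ϕ on imbalanced data is easy to demonstrate: a degenerate classifier that always predicts the majority class achieves high accuracy but ϕ = 0. A minimal sketch using the cohort's class counts (50 shunt-dependent of 386):

```python
import numpy as np

def matthews_phi(tp, fp, fn, tn):
    """Matthews correlation coefficient from a 2x2 confusion matrix.
    Returns 0 when any marginal is empty (degenerate prediction)."""
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# An "always negative" classifier on a 12.9%-prevalence cohort:
# accuracy looks high (336/386 ≈ 0.87), but phi is 0
tp, fp, fn, tn = 0, 0, 50, 336
accuracy = (tp + tn) / (tp + fp + fn + tn)
phi = matthews_phi(tp, fp, fn, tn)
```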
Some limitations must be stressed. First, despite all information having been recorded prospectively in the clinical records, the data were collected retrospectively, and we cannot exclude related biases. Second, radiological measurements such as the bicaudate index are highly operator-dependent, and automated measurements would make the data more reliable. Third, some potentially relevant variables were not included, such as CSF markers or specific surgical maneuvers like fenestration of the lamina terminalis: CSF markers are not routinely acquired at our institution, but it would be interesting to test them in future models. Finally, despite the good metrics shown by our final model, its ability to identify patients who will actually need a permanent shunt is less accurate than its capacity to correctly exclude subjects who will not develop chronic shunt-dependent hydrocephalus. The ability to predict the development of a disease is the actual goal of any prediction model; still, correctly ruling out future negative patients can represent significant support for clinical decision-making and follow-up planning, as well as a tool to reduce hospitalization length and costs. Additionally, it is noteworthy that the positive and negative predictive values of a test depend on the prevalence of the condition to be predicted [2]. In our cohort, the prevalence of shunt-dependency was 12.9%, and we could reasonably expect the same model to show different positive predictive values in cohorts with a different proportion of positive subjects.
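The dependence of predictive values on prevalence follows directly from Bayes' rule. As a sketch, plugging the model's reported operating point (sensitivity 0.73, specificity 0.92) into the standard formulas shows how PPV rises with prevalence while NPV falls; the exact values will not match the paper's empirical confidence intervals, since these are analytic approximations:

```python
def ppv_npv(sensitivity: float, specificity: float, prevalence: float):
    """Positive and negative predictive values from sensitivity,
    specificity, and prevalence (Bayes' rule)."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv

# Reported operating point, evaluated at increasing prevalences
for prev in (0.129, 0.25, 0.40):
    ppv, npv = ppv_npv(0.73, 0.92, prev)
    print(f"prevalence {prev:.3f}: PPV {ppv:.2f}, NPV {npv:.2f}")
```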

Conclusions
We trained and tested a distributed random forest model with 32 features, which reached excellent sensitivity and specificity with ϕ = 0.59. Compared with previous models built on traditional statistical methods, it can analyze a larger amount of data and variables; work with complex nonlinear relationships; and offer a more flexible, subject-based, and accurate prognostic tool that refines itself autonomously with experience. Although some limitations are present, prospectively testing this model's performance could confirm its prognostic capacity.
Funding Information Open Access funding provided by Università degli Studi di Firenze within the CRUI-CARE Agreement.

(Table footnote: for both sets, the slope and intercept of the calibration curve and the Hosmer-Lemeshow test χ2 and p are reported.)