External validation of prognostic models predicting outcome after chronic subdural hematoma

Background Several prognostic models for outcomes after chronic subdural hematoma (CSDH) treatment have been published in recent years. However, these models are not sufficiently validated for use in daily clinical practice. We aimed to assess the performance of existing prediction models for outcomes in patients diagnosed with CSDH.
Methods We systematically searched relevant literature databases up to February 2021 to identify prognostic models for outcome prediction in patients diagnosed with CSDH. For the external validation of prognostic models, we used a retrospective database, containing data of 2384 patients from three Dutch regions. Prognostic models were included if they predicted either mortality, hematoma recurrence, functional outcome, or quality of life. Models were excluded when predictors were absent in our database or available for < 150 patients in our database. We assessed calibration and discrimination (quantified by the concordance index C) of the included prognostic models in our retrospective database.
Results We identified 1680 original publications of which 1656 were excluded based on title or abstract, mostly because they did not concern CSDH or did not define a prognostic model. Out of 18 identified models, three could be externally validated in our retrospective database: a model for 30-day mortality in 1656 patients, and models for 2-month and 3-month hematoma recurrence, both in 1733 patients. The models overestimated the proportion of patients with these outcomes by 11% (15% predicted vs. 4% observed), 1% (10% vs. 9%), and 2% (11% vs. 9%), respectively. Their discriminative ability was poor to modest (C of 0.70 [0.63–0.77]; 0.46 [0.35–0.56]; 0.59 [0.51–0.66], respectively).
Conclusions None of the examined models showed good predictive performance for outcomes after CSDH treatment in our dataset. This study confirms the difficulty in predicting outcomes after CSDH and emphasizes the heterogeneity of CSDH patients. The development of high-quality models using unified predictors, relevant outcome measures, and appropriate modeling strategies is warranted.
Supplementary Information The online version contains supplementary material available at 10.1007/s00701-022-05216-8.


Introduction
Chronic subdural hematoma (CSDH) is a common condition in neurosurgical practice. CSDH is mainly diagnosed in older adults with an overall reported incidence ranging from 20.6 to 79.6 per 100,000 persons per year [2,6,24,33]. Burr-hole craniostomy is the most commonly performed and worldwide most accepted treatment option in symptomatic CSDH [26,41], most often with the insertion of closed-system drainage [4,7,8,17,36]. In CSDH, the multiplicity of (peri)operative options may influence the outcome after surgical treatment, in addition to the variety of outcome measures such as recurrence, mortality, functional outcome, and quality of life. However, the outcome of CSDH is not only influenced by treatment choices. The outcome can also be related to baseline characteristics such as age, sex, comorbidity, severity of symptoms, the use of medication, and the severity of abnormalities seen on baseline imaging. The contribution of various (peri)operative features to outcome is still under investigation in multiple randomized controlled trials [15].
Multivariable prognostic models are developed to predict the outcome based on baseline patient characteristics. Model-based outcome predictions can inform clinicians and patients and improve decision-making [29]. For instance, models can be used to predict the probability that a hematoma will require reoperation and hence inform the patients and their next-of-kin on what outcome to expect and which treatment option may be optimal [45]. Even if the same treatment strategy is implemented for all patients, a prognostic model can improve their management. For example, a patient with a higher probability of poor outcome can be invited for an earlier appointment or additional rehabilitation. Apart from clinical practice, prognostic models can be used for covariate adjustment in clinical trials and for standardized outcome comparisons between studies, countries, or centers [29,52].
However, prognostic models are developed in a specific patient population and do not have to be equally successful in making predictions in another setting. Before considering the implementation of a model in clinical practice, the model should show good performance in an independent population in a different place or time [13].
Over the years, several CSDH prognostic models [1, 3, 5, 10-12, 16, 20, 22, 23, 25, 27, 30, 31, 34, 37, 38, 40, 42, 46, 48, 54-58] have been published. The developers of CSDH prognostic models aim to predict and stratify patients' risk of mortality, recurrence, and/or functional outcome after surgical CSDH treatment. These models are developed in a specific patient population and have not been externally validated. External validation, that is, assessing the performance of a model in a sufficiently large cohort of patients in a different place or time, is essential before these prognostic models can be considered for guiding clinical decisions. Moreover, external validation and updating of existing models are preferred before developing new models.
This study aims to identify existing prognostic models for outcomes after CSDH treatment and to assess the performance in a large dataset of CSDH patients.

Literature search
Medline Ovid, Embase, Web of Science, Cochrane Central, and Google Scholar were systematically searched from their starting dates to February 2021 (See Supplemental Table 1 for search string). Titles and abstracts of these studies were screened by the first author (DCH) to identify all CSDH prognostic models after which the full text was screened. Any discrepancies were discussed (authors DCH, AM, RD, and HL) and resolved through consensus.

Selection of studies: inclusion and exclusion criteria
Studies were included if they contained at least one predictor of one of the outcomes of interest in patients with CSDH, that is mortality, recurrence, and functional status. Studies only describing possible predictors of outcome, without the development and presentation of a prediction model, were excluded. In addition, when predictors were absent in our data or were available for only a small number of patients (pre-specified minimum: 150 patients), these models were also excluded. We did not set specific quality criteria that the development studies needed to satisfy to be included.

Data extraction
From each paper, we extracted the number of patients, inclusion criteria, predictors, outcomes, the prediction model, and its discriminative ability in the development study (area under the curve, AUC).

Study population of the validation cohort
Independently from each other, three regions of the Netherlands (Amsterdam (AM), Rotterdam (RO), and North-East (NE)) collected retrospective data from 2384 consecutive patients who were treated for a CSDH in different time frames between 1991 and 2019. Amsterdam included 288 patients diagnosed between 2012 and 2018. In Rotterdam, two cohorts of patients were included: 509 patients diagnosed between 1991 and 2008 and 280 patients diagnosed between 2010 and 2015. North-East Netherlands included 1307 patients in this database, diagnosed between 2004 and 2019. Data were completely anonymized; all potentially identifying information was removed by the treating hospital and merged into a large retrospective database, which became the validation cohort for this external validation study.

Measurement of predictors and outcomes in the validation cohort
Patient characteristics were extracted from clinical records.
The preoperative CSDH volume was measured with different methods. Researchers in Amsterdam used Brainlab AG (Munich, Germany) and researchers in North-East Netherlands used the ABC/2 volume formula. This formula is fast and easy to apply and has good accuracy [53].
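As an illustration of the ABC/2 formula, the sketch below halves the product of three perpendicular hematoma diameters, which approximates the volume of an ellipsoid. The function name and the assumption that all three diameters are supplied in centimetres (so the result is in millilitres) are ours; the exact measurement protocol used in the validation cohort is not described here.

```python
def abc2_volume_ml(length_cm: float, width_cm: float, depth_cm: float) -> float:
    """Approximate hematoma volume (mL) with the ABC/2 formula.

    The three arguments are assumed to be perpendicular diameters in cm;
    how they are read from the scan is outside the scope of this sketch.
    """
    dims = (length_cm, width_cm, depth_cm)
    if any(d < 0 for d in dims):
        raise ValueError("diameters must be non-negative")
    # An ellipsoid has volume (pi/6) * A * B * C, which is close to A * B * C / 2.
    return length_cm * width_cm * depth_cm / 2.0


# Hypothetical measurements, not taken from the study data.
print(abc2_volume_ml(10.0, 2.0, 8.0))
```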
One of the prognostic models used the occurrence of septations within the CSDH. The presence or absence of septations was not always available in our database. Only if a patient was diagnosed with a "trabecular" or "membranous" CSDH, information on septations was present. In other hematoma types, we could not deduce the presence of septations from the name of the hematoma type only and therefore hematoma types other than "trabecular" or "membranous" were scored as not containing septations.
Mortality within 30 days (yes/no) was determined based on the time of death.
Hematoma recurrence was defined as receiving medical treatment (reoperation or retreatment with dexamethasone) for CSDH.

Statistical analyses
The performance of prediction models was evaluated in terms of calibration and discrimination. Calibration refers to the agreement between predicted and observed risk, and it was visualized by a calibration plot and quantified by calibration in the large (agreement between average observed and predicted outcomes and calibration intercept) and a calibration slope [50]. The calibration intercept expresses the difference between the average predicted risk and the average observed risk. An intercept > 0 indicates that predictions were on average too low, and an intercept < 0 indicates that predictions were on average too high. The calibration slope indicates if the average strength of the association between predictors and outcomes was correctly estimated. A slope < 1 indicates overfitting (overestimated associations), whereas a slope > 1 indicates underfitting (underestimated associations).
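The calibration measures described above can be sketched as follows. This is a minimal illustration, not the software used in this study: the calibration intercept is estimated with a logistic model that contains only an intercept and uses the logit of the predicted risk as an offset, and the calibration slope is the coefficient of the logit of the predicted risk in a logistic model. The function names and the small Newton-Raphson fitting routine are assumptions made for this example.

```python
import numpy as np


def _logit(p):
    # Clip to avoid infinite values for predictions of exactly 0 or 1.
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return np.log(p / (1 - p))


def _fit_logistic(X, y, offset, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1 - mu))[:, None])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def calibration_intercept(p_pred, y):
    """Intercept with the linear predictor as offset (slope fixed at 1)."""
    lp = _logit(p_pred)
    X = np.ones((lp.size, 1))
    return _fit_logistic(X, np.asarray(y, dtype=float), lp)[0]


def calibration_slope(p_pred, y):
    """Coefficient of the linear predictor in a logistic recalibration model."""
    lp = _logit(p_pred)
    X = np.column_stack([np.ones_like(lp), lp])
    return _fit_logistic(X, np.asarray(y, dtype=float), np.zeros_like(lp))[1]


# Toy data (not study data): every patient is given a 50% risk,
# but only 2 of 10 have the outcome, so predictions are too high.
y_toy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(calibration_intercept([0.5] * 10, y_toy))  # negative: risk overestimated
```

A negative intercept in the toy example reflects overestimation, in line with the interpretation given above.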
Discrimination describes the ability of a model to correctly separate patients with and without the outcome, and it was quantified by the concordance (C) index. The C-index estimates the probability that the risk prediction of a randomly selected patient with the outcome (e.g., with CSDH recurrence at 3 months) was higher than the risk prediction of a randomly selected patient without that outcome (e.g., without CSDH recurrence at 3 months).
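For a binary outcome, the C-index can be computed by comparing every patient with the outcome to every patient without it. The sketch below is an illustrative implementation under that definition, counting ties in predicted risk as half concordant; the function name and the toy values are assumptions for this example.

```python
import numpy as np


def c_index(p_pred, y):
    """Concordance index for a binary outcome (ties count as 0.5)."""
    p = np.asarray(p_pred, dtype=float)
    y = np.asarray(y)
    with_outcome = p[y == 1]
    without_outcome = p[y == 0]
    if with_outcome.size == 0 or without_outcome.size == 0:
        raise ValueError("need at least one patient with and one without the outcome")
    # Difference in predicted risk for every (with, without) pair.
    diff = with_outcome[:, None] - without_outcome[None, :]
    concordant = np.sum(diff > 0)
    tied = np.sum(diff == 0)
    return (concordant + 0.5 * tied) / diff.size


# Toy predictions, not study data: three of the four pairs are ordered correctly.
print(c_index([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]))
```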
To understand the influence of the slope and case-mix heterogeneity on the discriminative ability of a model, we calculated the model-based concordance (mbc) [51]. The mbc is only influenced by the case-mix heterogeneity and not by the validity of regression coefficients.
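To make the idea concrete, the sketch below computes a model-based concordance from the predicted risks alone, without using observed outcomes. It follows one formulation for binary outcomes, in which each ordered pair of patients is weighted by the model-based probability that the first has the outcome and the second does not. Whether this matches the exact estimator of [51] in every detail is an assumption, and the function name is ours.

```python
import numpy as np


def model_based_concordance(p_pred):
    """Expected C-index if the model's predicted risks were correct."""
    p = np.asarray(p_pred, dtype=float)
    pi = p[:, None]
    pj = p[None, :]
    # Model-based probability that patient i has the outcome and patient j does not.
    weight = pi * (1 - pj)
    # The pair is concordant when patient i also has the higher predicted risk.
    concordant = (pi > pj) + 0.5 * (pi == pj)
    off_diagonal = ~np.eye(p.size, dtype=bool)  # exclude pairing a patient with itself
    return np.sum(weight * concordant * off_diagonal) / np.sum(weight * off_diagonal)


# Toy predictions, not study data: a narrow spread of risks (homogeneous case-mix)
# gives a lower value than a wide spread.
print(model_based_concordance([0.08, 0.10, 0.12]))
print(model_based_concordance([0.02, 0.10, 0.60]))
```

Because only the distribution of predicted risks enters the calculation, a more homogeneous case-mix lowers the value even when the regression coefficients are valid.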
Models were validated in (a) patients who had relevant data available (complete case analysis) and (b) in patients with missing predictor values imputed (imputation analysis). In imputation analysis, if a predictor variable was not assessed in a certain region, values for all patients on that variable were imputed based on available data from other hospitals. The model for multiple imputations included predictor and outcome variables, hospital regions, and auxiliary variables (e.g., hematoma thickness, aphasia, midline shift). The results were averaged over 10 imputed datasets using Rubin's rules [35]. Missing outcomes were not imputed.
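The pooling step with Rubin's rules can be sketched as follows: the pooled estimate is the mean of the per-dataset estimates, and its variance combines the average within-imputation variance with the between-imputation variance, inflated by a factor (1 + 1/m) for m imputed datasets. The function name and the example numbers are hypothetical and are not results of this study.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool one parameter over m imputed datasets with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2 or u.size != m:
        raise ValueError("need matching estimates and variances from at least 2 imputations")
    pooled = q.mean()              # pooled point estimate
    within = u.mean()              # average within-imputation variance
    between = q.var(ddof=1)        # between-imputation variance
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates (e.g., a calibration slope) from 3 imputed datasets.
estimate, variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, variance ** 0.5)  # pooled estimate and its standard error
```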
If a model was developed for a specific population (e.g., older adults), the model was validated in all patients with CSDH and in that specific subgroup (e.g., older adults). The performance of models was assessed and presented for the pooled data of all hospital centers and for each of the three separate regions in the Netherlands.

Included publications
The initial search identified 3105 studies of which 1680 remained after the removal of duplicate references (Fig. 1).
One thousand six hundred fifty-eight records were excluded based on title/abstract because they did not concern CSDH and/or they only reported predictors of outcome, but did not develop a prognostic model. The remaining 22 articles were screened on the full text, of which 7 were excluded: three articles were excluded because they focused on moderate and severe traumatic brain injury and not on CSDH specifically, and four articles were excluded because the full text was not available, and therefore, no further explanation of the prognostic models could be found.
The remaining 15 articles were included but of these 12 could not be validated (Fig. 1, Table 1).
For eight articles, one or more predictors of the described models could not be found in our retrospective database (e.g., frailty scores, laboratory results, and pneumocephalus volume). For one article, fewer than 150 patients in the validation cohort had available predictor values (e.g., postoperative volume), and for two articles, one predictor could not be found in our retrospective database and another predictor had too many missing values. For one article, the main outcome measure was missing in our retrospective database. Finally, three papers (4 models) were included in the external validation (Table 2).

Models selected for validation
All selected models were developed for patients with unilateral hematoma. The Subdural Hematoma in the Elderly (SHE)-scoring model by Alford [3] was developed to predict 30-day mortality in older patients (> 65 years) based on age, admission Glasgow Coma Scale (GCS) score, and hematoma volume. The model by Jack [23] was developed to predict 2-month hematoma recurrence based on age, hematoma volume, and septations on CT. The preoperative prognostic model (model B) proposed by Andersen [5] aimed to predict 3-month recurrence based on hematoma volume, hematoma density, and history of hypertension. Andersen's postoperative model (model A) additionally included drainage time, drain type, and surgical complications (Table 1). The Andersen models, developed with Fine-Gray regression, were validated based on predictions derived from their nomograms (Supplemental Table 2).

Performance of models in the retrospective database
One thousand six hundred fifty-six patients with available information on 30-day mortality were selected for validation of Alford's model and 1733 patients with available information on 2-month and 3-month hematoma recurrence were selected for validation of Jack's model and Andersen's models, respectively.
The prognostic model of Alford predicted that 15% of patients would die within 30 days, whereas the observed proportion in our data was 4%. Thus, it overestimated the proportion of patients dying within 30 days by 11 percentage points (intercept = −1.51 [−1.77, −1.26]; Fig. 2a, Table 4).
The overestimation of the 30-day mortality rate was consistent for the patient selection (> 65 years) that was used for model development (16% predicted vs. 5% observed; intercept = −1.38 [−1.65, −1.12]; Supplemental Table 2). The slope (1.92 [0.99, 2.85]) indicated a stronger association between the predictors and the outcome in our data. Nevertheless, the discriminative ability (C = 0.70 [0.63-0.77]) was reduced by the more homogeneous case-mix in our study (mbc = 0.60 versus C = 0.80 in the development study).
The prognostic model by Jack (2-month hematoma recurrence) showed a negative calibration slope, indicating reverse predictor effects (−0.25 [−1.05, 0.56]). This indicated that higher predicted probabilities of recurrence by the model were associated with lower observed rates in our data. Additional analyses showed, for instance, that, in contrast with the model, age above 80 was associated with a lower likelihood of recurrence at 2 months in our data (Supplemental Fig. 1). The proportion of patients with recurrent hematoma by 2 months was estimated accurately (10% predicted vs. 9% observed; intercept = −0.15 [−0.33, 0.02]), but the discriminative ability of the prognostic model was extremely poor (C < 0.50; Table 4; Fig. 2b).
The performance of Andersen's postoperative model The results of complete case analyses were consistent with imputation analyses (Table 4). In addition, analyses per hospital region generally showed consistent results (Supplemental Table 3).

Discussion
We examined the performance of three published prognostic models for the prediction of outcomes in patients with unilateral CSDH using a retrospective database, which contains data from three regions in the Netherlands. None of the models showed both good discriminative ability and calibration in our data. The most likely explanations for the poor predictive performance of the models in our data concern suboptimal modeling strategies and differences in study populations.
The differences in the population (case-mix) and differences in the distribution of predictors (case-mix heterogeneity) between the development and validation study can affect model performance in the validation setting. The prognostic model by Alford [3] largely overestimated the percentage of patients who died within 30 days, which could be associated with the substantially different mortality rate between the development study and validation study. It is possible that the patient population was more severely affected in the development study, which was not captured by the predictors in the model; for instance, patients might have had more comorbidities. In addition, although this model was able to discriminate reasonably well between patients who died and did not die within 30 days based on age, hematoma volume, and GCS score, the discriminative ability was decreased by the more homogeneous case-mix in our data. The case-mix and case-mix heterogeneity of the validation data also differed compared to the study of Andersen; for instance, patients had a higher GCS score, a smaller hematoma volume, a lower percentage of drain placement, and a different distribution of hematoma density [5]. In our retrospective validation cohort, almost half of the patients were treated with dexamethasone; 43% of patients underwent surgery with additional dexamethasone and 3% received primary dexamethasone. In these patients, the recurrence rate might be lower, but the probability of a favorable outcome is also expected to be lower, and patients in the validation cohort might suffer from more adverse events and higher mortality in comparison to the cohorts used for model development [21]. Moreover, the effects of predictors differed between our study and development studies. For instance, whereas older age was predictive of 2-month recurrence in the model of Jack [23], in our dataset, age above 80 was associated with a lower recurrence rate. 
It is possible that older patients were more likely to die or to receive no treatment at all, in case of hematoma recurrence or in case of comorbidity or greater frailty scores in the validation study. However, frailty scores were not included in the retrospective database. In addition, different definitions of predictors could have contributed to observed differences in the effects of predictors. For example, the inter-rater variability concerning the classification of hematoma types is considered low [47], but assessing septations on a CT-scan is prone to inter-rater variability because membranes cannot always be clearly recognized on CT-scans [39]. If the predictor "septations" was not specifically scored in patients, trabecular hematomas were marked as "septations present." All other hematoma types (homogenous, mixed, and separated) were marked as "septations absent." This limitation in detecting septations in the validation cohort is expected to lead to an underestimation of septations in our population, because septations can also occur in homogenous, mixed, and separated hematoma types.
Finally, suboptimal modeling strategies have likely distorted the estimated effects of predictors and reduced model performance in a new setting (our data). A very small sample of older adults with CSDH was used for the development of the Alford model [3]. In addition, in the models of Alford [3] and Jack [23], continuous predictors were dichotomized/categorized (e.g., age, hematoma volume). Although categorization can make a model seem appealing and easier to use, it leads to a loss of information and usually poor performance in other cohorts [28,43]. Furthermore, the predictors were selected based on p-values and there was no internal validation, which leads to overfitting, meaning that predictor effects are overestimated, model performance in the development sample is overoptimistic, and performance in external validation is poor(er) [44]. The authors of the Andersen [5] model did apply shrinkage in the model development, an approach to prevent overfitting, but the models still showed weaker effects of predictors in our study, probably due to differences in case-mix. In addition, the discriminative ability of this model was also modest in the development study (C = 0.60). The generally limited discriminative ability obtained in both the development study and validation cohort suggests that other variables could be considered for the prediction of this outcome in future studies.
Besides considering other predictors, the strategies for developing models for predicting outcome after CSDH should therefore be improved. Future studies should comprise large samples and collaborative efforts. The predictors should not be primarily selected by p-values but based on level I evidence and clinical expertise. Also, internal validation should be applied in model development. In that way, the effects of predictors are less likely to be exaggerated leading to optimistic model performance [44]. The categorization of continuous variables should be avoided and missing values should be imputed using single or multiple imputation techniques [28,43]. Unified definitions of baseline data elements (predictors) and a unified core outcome set would also facilitate a more reliable establishment, validation, and clinical usefulness of models. In addition, when proposing a new prognostic model, all relevant information that indicates model performance and enables future external validation studies should be reported, such as the full model equation and discriminative ability.
Furthermore, the results also suggest that it is difficult to predict the outcome after CSDH. It is known that "there is significant heterogeneity in the data elements that are collected and reported as part of clinical studies examining outcomes for CSDH" [9]. Moreover, the disease CSDH itself is also heterogeneous. CSDH patients have in common that they are generally older and that most have a high GCS score on admission, but many other characteristics differ such as frailty and overall clinical status. From our experience, a more voluminous CSDH does not necessarily indicate a larger midline shift or more severe clinical symptoms. Also, a less voluminous hematoma does not always result in a rapid recovery without the occurrence of a recurrence. Moreover, the use of anticoagulants is not necessarily related to a more voluminous CSDH, and more severe symptoms at admission are not necessarily related to a poorer functional outcome. This heterogeneity in the data of CSDH patients makes prediction inherently challenging.

Limitations and future directions
In this study, we systematically searched for published models for the prediction of outcomes after CSDH and validated eligible models in our multicenter database. However, we neither performed a systematic review nor assessed the quality of published studies, since we considered the validation as "proof" of validity. Furthermore, since our retrospective database was originally not built to validate these prognostic models, a substantial number of models could not be validated in our data due to unmeasured predictors and outcomes and due to a large number of missing values. We nevertheless describe these models and encourage other studies with available data to validate all models identified by our search. Also, for the models we did validate, there was a substantial percentage of missing data. Although complete case and imputation analyses point in the same direction, this should be noted as a limitation. Finally, although we systematically searched the literature to identify existing models, we considered the number of outcomes and the quality of our data insufficient to develop a new model.
Even if we had used a prospective database, there are no well-established predictors and outcomes derived from level I evidence. Currently, there is no consensus on the definition of CSDH and no consensus on baseline data elements nor a core outcome set. The CODE-CSDH group established a Delphi survey to reach a consensus on a core outcome set and baseline data elements to be used in future CSDH studies [19]. Results of this survey are expected in the spring of 2022. It is expected that these results will be a first step in decreasing the heterogeneity and with that improving the quality of available CSDH data. The Dutch Subdural Hematoma Research group (DSHR) [14] is planning to establish a prospective, observational, multicenter registry. Once consensus is reached on the Delphi survey, the DSHR will incorporate the baseline data elements and core outcome set in their prospective database. In the future, this prospective registry can be used for the development of a new prognostic model. This future model should predict endpoints that are relevant for clinical practice. These endpoints will correspond to the core outcome set, as to be determined at the consensus meeting of the CODE-CSDH group.

Conclusion
Published models for the prediction of outcomes following CSDH did not perform well in our retrospective database. The study confirms the complexity of predicting outcomes in patients with CSDH and the need for the collection of standard baseline variables and a core outcome set and for improved modeling strategies, which will improve current prognostic models. This should be part of the focus of future large-scale data collections.

Acknowledgements
The authors wish to thank Wichor M. Bramer, information specialist at the Erasmus MC Medical Library in Rotterdam, The Netherlands, for developing and updating the search strategies.
Funding This study was funded by a grant from The Netherlands Organisation for Health Research and Development (ZonMw project number 843002824) and the Erasmus MC (Mrace, project number 2016-16118; EMC). Ana Mikolic was supported by the European Union 7th Framework programme (EC grant 602150), with the additional support obtained from the Hannelore Kohl Stiftung (Germany), OneMind (USA), Integra LifeSciences Corporation (USA), and Neurotrauma Sciences (USA). The sponsor had no role in the design or conduct of this research.

Declarations
Ethical approval This article does not contain any studies with human participants performed by any of the authors.
Informed consent For this type of study, formal consent is not required.

Conflict of interest The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.