Background

Brain arteriovenous malformation (bAVM) is a cerebrovascular disease characterized by direct shunts between arteries and veins and abnormal vascular masses [1]. The main presenting clinical symptoms are hemorrhage and epilepsy. Because bAVM rupture carries high mortality and disability, how to prevent and treat rupture has always been a focus of research. However, whether to intervene once a bAVM is found is still controversial [2,3,4]. In some patients, neither the rupture rate of the bAVM nor the risk of endovascular or surgical treatment (when radiosurgery is not appropriate) is low, so the risk of rupture should be assessed carefully before treatment.

The common method of developing a prediction model or a scoring system for disease risk is to build a mathematical model based on correlated clinical predictors. For binary outcomes, multivariate logistic regression (LR) is the conventional algorithm [5, 6]. With the development of computational algorithms, different machine learning methods have been introduced into this field [7]. Among them, random forest (RF) is considered a promising method, and previous studies on predicting disease risk have reported many successful applications of RF [8, 9].

In this study, we collected the clinical data of 353 patients with bAVMs and built prediction models with the LR and RF algorithms, using multiple random samplings and different training sample sizes. The area under the curve (AUC) was used to assess the performance of the models. The purpose of our study was to test and compare the stability and performance of the prediction models built by both algorithms and to investigate the deficiencies in these prediction models.

Methods

Case selection and data collection

All patients with bAVMs confirmed by digital subtraction angiography (DSA) from January 2013 to December 2019 were enrolled in our study. Patients with the following conditions were excluded: 1) concomitant brain injury or brain tumor; and 2) incomplete clinical data. Variables that were reported to be correlated with bAVM rupture in previous studies were collected [1, 6, 10]. General variables, including age and sex, were collected. Morphological variables pertaining to the bAVMs, including the location, size, associated aneurysm, draining type, and number of draining veins, were measured independently on DSA images by 2 neurosurgeons (Wengui Tao and Laochao Yan). Other variables, including rupture information, were also recorded.

All procedures in this retrospective study that involved human participants were approved by the ethical committee of Xiangya Hospital and performed in accordance with the institutional ethical standards, the 1964 Declaration of Helsinki and its later amendments, or comparable ethical standards.

Building prediction models by the LR algorithm and RF algorithm based on multiple repeated samplings and different sample sizes

RStudio (version 1.1.383; RStudio Inc.) was used to build the prediction models. Sex, location, associated aneurysm, draining type, and rupture were set as factor (categorical) variables, and age, size, and the number of draining veins were set as numeric (continuous) variables. Rupture was set as the dependent (response) variable, and the other 7 variables were set as independent (explanatory) variables. In the LR algorithm, the independent variables were selected by the stepwise method, and the significant variables were used in the final prediction formula. In the RF algorithm, default values were set for the "ntree" and "mtry" parameters (500 and 3).

According to the 10 events per variable (EPV) rule [11,12,13], we randomly drew training datasets of different sizes from all 353 cases, and each time the remaining cases formed the test dataset. The sample sizes of the training datasets were 140, 175, 210, 245 and 280, and the corresponding test datasets contained 213, 178, 143, 108 and 73 cases. For each pair of datasets, the random sampling was repeated 1, 10, 50, 100, 300, 600, 1200 and 2100 times.
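The sampling scheme described above can be sketched as follows. This is only an illustration in Python, not the R code used in the study; the function names and the fixed seed are assumptions introduced here, and only the case count, training sizes and repetition counts come from the text.

```python
import random

# Values taken from the text above.
TOTAL_CASES = 353
TRAIN_SIZES = [140, 175, 210, 245, 280]
REPEATS = [1, 10, 50, 100, 300, 600, 1200, 2100]


def random_split(n_cases, train_size, rng):
    """Randomly partition case indices 0..n_cases-1 into training and test sets."""
    indices = list(range(n_cases))
    rng.shuffle(indices)
    return sorted(indices[:train_size]), sorted(indices[train_size:])


def repeated_splits(n_cases, train_size, n_repeats, seed=0):
    """Yield n_repeats independent random train/test partitions.

    The seed is a hypothetical choice that makes the sketch reproducible.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        yield random_split(n_cases, train_size, rng)


# Size of the test dataset left over for each training size.
test_sizes = [TOTAL_CASES - n for n in TRAIN_SIZES]
```

Each partition returned by `repeated_splits` would then be used to fit one LR model and one RF model on the training indices and to score both on the test indices.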

Calculating AUCs to assess the performances of prediction models

AUCs were used to assess the performance of the prediction models and are reported as the mean ± standard deviation (SD).
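As a minimal sketch of this step, the AUC of one test dataset can be computed from the predicted scores by pairwise comparison (the Mann-Whitney formulation), and the AUCs of repeated samplings can then be summarized as mean ± SD. The function names and the toy labels and scores are hypothetical; the study itself used R, and the exact routine it used is not stated.

```python
from statistics import mean, stdev


def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case (label 1)
    scores higher than a randomly chosen negative case (label 0).
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Return (mean, sample SD) of a list of AUCs; SD is 0.0 for one value."""
    sd = stdev(aucs) if len(aucs) > 1 else 0.0
    return mean(aucs), sd


# Toy example with made-up labels (1 = ruptured) and predicted scores.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```

The pairwise loop is quadratic in the number of cases, which is acceptable for test datasets of at most a few hundred patients.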

Once the source code had been verified, the repeated sampling, model building, prediction, AUC calculation and plotting were run automatically by computer.

Statistical analysis

Paired-sample t-tests were used to compare the AUCs of the prediction models built by the LR and RF algorithms. A p value < 0.05 was considered statistically significant.
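For illustration, the paired-sample t statistic can be computed from the per-sampling AUC differences as sketched below. The function name and the AUC values are hypothetical and are not results from this study. The p value would be obtained by referring the statistic to a t distribution with n - 1 degrees of freedom, for example with `t.test(..., paired = TRUE)` in R or `scipy.stats.ttest_rel` in Python.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b.

    t = mean(d) / (sd(d) / sqrt(n)), where d holds the pairwise differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical AUCs from four samplings; each pair shares the same split.
lr_aucs = [0.72, 0.70, 0.71, 0.69]
rf_aucs = [0.66, 0.65, 0.67, 0.62]
t_value, dof = paired_t_statistic(lr_aucs, rf_aucs)
```

Pairing is appropriate here because, in each sampling, both models are trained and tested on the same split.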

Results

Demographics

The clinical data of 353 patients with ruptured and unruptured bAVMs are summarized in Table 1. Of all patients, 220 were male and 133 were female, with a mean age of 32.82 ± 15.77 years. A total of 264 (74.8%) bAVMs were located in the cerebral lobes (superficial), 40 (11.3%) in the corpus callosum, basal ganglia or lateral ventricle (deep), and 49 (13.9%) in the cerebellum or brain stem (infratentorial). Ten (5.4%) patients had aneurysms associated with the bAVM. The mean size of the bAVM nidus was 3.71 ± 2.15 cm. Seventy-four (21.0%) patients had only deep draining veins. A total of 198 (43.9%) patients had only a single draining vein. The bAVM was confirmed to be ruptured in 228 patients and unruptured in 125.

Table 1 Summary of the clinical data

*p value < 0.05: statistically significant

Univariate analysis

Univariate analysis showed that age, location, associated aneurysm, size and the number of draining veins were significantly different between patients with unruptured and ruptured bAVMs. All these variables were used in LR and RF analyses.

Performances of the prediction models

All the AUCs showed that the prediction models built by the LR algorithm performed better than those built by the RF algorithm (p < 0.001; see Fig. 1 and Table 2). As the training sample size increased, the AUCs of the LR models improved slightly, from 0.70 to 0.71 (> 100 sampling times), whereas the AUCs of the RF models decreased. The standard deviations (SDs) of the AUCs showed a maximum fluctuation range > 0.1 across different samplings, and different single samplings also reflected the unstable performance of the prediction models (see the first row of Fig. 1).

Fig. 1

AUCs (mean ± SD) by training sample size and number of sampling times. a-d The instability of the prediction models built by the LR algorithm (red line) and RF algorithm (blue line) based on different single samplings and sample sizes. a-l show that the prediction models built by the LR algorithm were better than those built by the RF algorithm. AUCs above 100 samplings showed that the performance of the prediction models built using the LR algorithm could be slightly improved as the training sample size increased, whereas the RF algorithm showed the opposite trend. The SDs of the AUCs from the prediction models built by both algorithms with different sample sizes displayed wide ranges. a-l correspond to the following sampling times: 1, 1, 1, 1, 5, 10, 50, 100, 300, 600, 1200, and 2100 (related data are shown in Table 2). AUC area under the curve, LR logistic regression, RF random forest, SD standard deviation

Table 2 AUCs of prediction models based on different training sample sizes and multiple sampling times

Discussion

BAVMs are an intracranial hemorrhagic disease. The annual rupture rate of bAVMs varies across reports [14,15,16,17,18]. For each patient and lesion, the risk of rupture should be assessed separately. Of patients who survive after the initial hemorrhage, approximately 20% die, and one-third remain moderately disabled after 3 months [1]. For patients with unruptured bAVMs, the psychological impact of the long-term fear of hemorrhage should not be underestimated [19]. Additionally, it is necessary to compare the risk of bAVM rupture with that of treatment. Together, these considerations show that predicting the hemorrhagic risk is important for unruptured bAVMs. Some studies have proposed predictors of hemorrhagic risk, such as female sex, deep location, deep draining veins, single draining veins, and associated aneurysm [20,21,22,23]. Based on these predictors, some authors have tried to develop prediction models or scoring systems for the hemorrhagic risk of bAVM [6]. A successful prediction model or scoring system would help clinicians find suitable and low-risk management options for patients.

For binary categorical clinical data, the LR algorithm is the conventional method for building prediction models [5]. In recent years, machine learning algorithms have been introduced in this field. The highly accurate results and simplified procedures that resulted from the introduction of these methods are impressive. Of these machine learning algorithms, the RF algorithm is considered most promising because of its better performance, especially for big data [24].

The common method for building a prediction model is to draw a training dataset from the whole dataset, either in chronological order or at random, and then to build a model in the form of a prediction formula (LR) or a prediction procedure hidden in a black box (machine learning). The remaining data are defined as the test dataset and used to test the model. The AUC is usually used to evaluate predictive performance. The size of the training dataset should meet the basic requirement of the 10 events per variable (EPV) rule [11,12,13].
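The arithmetic behind the EPV rule can be sketched as follows. The sketch assumes one common convention, in which the "events" are the cases in the less frequent outcome class and every candidate predictor counts as one variable; other conventions exist, and the convention applied in this study is not stated. The function names and the counts in the example are hypothetical.

```python
import math


def events_per_variable(n_events, n_nonevents, n_predictors):
    """EPV under the assumed convention: size of the smaller outcome class
    divided by the number of candidate predictors."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return min(n_events, n_nonevents) / n_predictors


def min_events_needed(n_predictors, epv=10):
    """Smallest size of the less frequent outcome class that satisfies the rule."""
    return math.ceil(epv * n_predictors)


# Hypothetical training dataset: 70 events, 130 non-events, 7 predictors.
example_epv = events_per_variable(70, 130, 7)
required = min_events_needed(7)
```

Under this convention, a training dataset meets the rule when `events_per_variable(...)` is at least 10.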

In this study, our original purpose was to build prediction models for the risk of bAVM rupture with the LR and RF algorithms and to compare the performance of those models. However, the results were not as expected, and the models displayed instability and uncertainty. When we performed multiple random samplings for the training dataset, the coefficients of the LR prediction formula varied and the AUC took different values; the same was true of the RF algorithm. To explore this problem further, we increased the number of sampling times, changed the ratio of the training sample size to the test sample size, and even changed the number of independent variables, while observing the changes in the AUCs and looking for regularities. Although the AUCs were widely dispersed across sample sizes and random samplings, they still displayed certain patterns. Being familiar with these patterns can help us understand the possible uncertainty and instability of prediction models, build optimal prediction models, and avoid pitfalls.

The independent (explanatory) variables used in this study are accepted by most researchers as risk factors for bAVM rupture [1, 6, 10], but their performance in predicting hemorrhage was not ideal in this study. These deficiencies did not change substantially regardless of the algorithm used, the number of sampling times, or the training sample size. We believe that obtaining an ideal prediction model for bAVM rupture may depend on identifying new, more valuable predictors.

In statistics, it is generally considered that the sample size should meet the 10 EPV rule for a regression analysis to yield a valid result. Our study showed that once the 10 EPV rule was met, further increasing the training sample size for the LR algorithm improved the predictive performance only slightly. This result indirectly supports the 10 EPV rule. Although the RF algorithm has shown advantages in many studies, its performance in this study was not better than that of the LR algorithm. This result suggests that without strongly predictive independent variables, it is also difficult for the RF algorithm to show its power.

In most previous studies on prediction models, the training dataset was almost always obtained by a single random sampling or by date order, and the number of sampling times was not reported [5, 6]. In our study, however, the SDs reflected the instability that resulted from different samplings.

This study was based on clinical data from 353 patients with bAVMs collected at a single center, and the limited sample size may affect the conclusions. The reliability and generalizability of the conclusions should be verified in a multicenter study.

Conclusions

Neither the LR-based nor the RF-based prediction model built on the current risk predictors is ideal. Compared with sample size and algorithm, meaningful predictors are more important in establishing an accurate and stable predictive model.