1 Introduction

Automated brain tumor segmentation promises to provide more reliable measurements for cancer diagnosis and assessment, establishing new possibilities for high-throughput analysis [1]. Segmentation enables clinicians to determine tumor location, extent, and subtype. Additionally, brain tumor segmentation on longitudinal MRI scans can facilitate monitoring of tumor growth or shrinkage. In current clinical practice, accurate segmentation of brain tumor regions is usually done manually by experienced radiologists, which is time-consuming and labor-intensive. Furthermore, manual labels may carry human bias, as they rely on the physician’s experience and subjective decision-making. In contrast, automated segmentation techniques can reduce labor and human bias and provide efficient, objective, and reproducible results for tumor diagnosis and monitoring.

The performance of automated brain tumor segmentation methods has improved rapidly over the past few years. This development is due to the growth of annotated datasets and the advent of deep learning models that can leverage large amounts of data [2]. Most methods are based on fully convolutional neural networks (FCN) [3], such as U-Net [4] and its variants [5]. Recently, Transformers and self-attention, originating from natural language processing (NLP), have also been applied to medical image segmentation [6].

Although the segmentation results of deep neural networks are reported to be close to or comparable to human performance [7], their robustness is low, and concerns remain about their clinical acceptability [8]. Possible reasons include the large variability in imaging properties, such as artifacts and magnetic field strength, as well as the inherent heterogeneity of brain tumors, which may not be covered by the training dataset. Furthermore, human bias in dataset annotations can cause models to inherit this bias. One possible direction to alleviate the reliability problem of deep neural networks is to use uncertainty estimation. The uncertainty reflects how confidently the network predicts the class labels. Confidence studies can help identify areas of data dominated by lack of annotations (epistemic uncertainty) or noisy annotations (aleatoric uncertainty). Additional information from uncertainty estimation can be used to quantify segmentation performance or as a post-processing step to correct automatic segmentation. Clinically, uncertainty estimates can feed back potential error regions to guide or automate corrections or be used for patient-level segmentation failure detection [1]. Therefore, reliably quantifying the uncertainty of segmentation performance is critical in clinical applications.

Popular methods for quantifying uncertainty in neural networks include quantile regression (QR) [9], Bayesian neural networks (BNN) [10,11,12], ensemble-based [13], dropout-based [14,15,16], and evidential deep learning (EDL) [17,18,19] methods. Simply interpreting the confidence scores of softmax/sigmoid outputs as event probabilities in a categorical distribution can lead to overconfident wrong predictions [20, 21]. The classical BNN aims to capture uncertainty by learning the weight distribution of the network and approximates the integral over parameters by variational inference or Laplace approximation to estimate the posterior predictive distribution [18]. However, most BNNs are challenging to implement and train since model parameters have to be explicitly modeled as random variables [22], which limits scalability in both architecture and data size [12]. Hence, subsequent approaches focused on reusing the training pipeline and maintaining scalability while providing reasonable uncertainty estimates. To this end, simpler and more intuitive methods, such as learning an ensemble of deterministic networks [15, 21] and introducing Monte Carlo dropout [13], have been proposed for brain tumor segmentation. On the downside, ensemble-based methods need to train multiple models from scratch, which is computationally expensive, and the introduction of dropout results in inconsistent outputs [23].

On the other hand, EDL has been gradually developed in recent studies, demonstrating more promising and reliable performance in uncertainty estimation. Based on the Dempster-Shafer Evidence Theory (DST) [24], EDL uses the Dirichlet distribution to model the categorical distribution of the output given the input to the network. This class of methods produces closed-form prediction distributions and outperforms BNNs in adversarial queries and out-of-distribution uncertainty quantification [18]. Compared to ensemble-based and dropout-based methods, EDL showed more robust results with lower computational costs [25]. However, most recent works focus on natural image classification and segmentation, so the application of EDL to medical image segmentation remains to be studied further.

In this paper, we propose a region-based EDL network for reliable brain tumor segmentation, which is robust to noise and corruption of images. The network learned the classification distribution by minimizing the region-based prediction error under the Dirichlet prior distribution. This enabled the proposed network to provide accurate segmentation results and reliable uncertainty estimates simultaneously, even under noise-corrupted inputs. Our method improved the mean squared error (MSE) loss used for simple natural image classification [17], making it more suitable for semantic segmentation of medical images. The main contributions of our work can be summarized as follows:

  • An EDL framework was adopted for accurate brain tumor segmentation, which can quantify the uncertainty of segmentation results and improve the reliability and robustness of segmentation with less computational complexity compared to ensemble-based and dropout-based methods.

  • A novel training loss was developed based on minimizing the region-based prediction error under the Dirichlet prior distribution to improve the segmentation accuracy of EDL. Theoretical properties are fully provided to guarantee the evidential learning of the model.

  • A new evaluation metric called soft uncertainty-error overlap (sUEO) was designed for uncertainty estimation to assess the model’s ability to localize segmentation errors more easily.

  • The robustness of the segmentation accuracy and uncertainty quantification of the proposed method is comparatively evaluated on the BraTS2020 dataset using image corruption techniques. The effectiveness and efficiency of the novel loss function were verified.

The rest of the paper is structured as follows: Sect. 2 briefly introduces EDL and its recent developments. Section 3 details our segmentation framework, including EDL and the novel loss functions. Section 4 describes the experimental setup and evaluation metrics, and the results are analysed and discussed in Sect. 5. The conclusion and future research directions are given in Sect. 6.

2 Related work

Although many uncertainty estimation methods exist, our proposed framework relies on arguably the most cutting-edge methodology, EDL. The rest of this section presents the principles of EDL (Sect. 2.1) and a brief overview (Sect. 2.2) of the few contributions in which EDL has been used for tumor segmentation.

2.1 Principles of EDL

Evidential deep learning (EDL) is based on the Dempster-Shafer Evidence Theory (DST) [24], which is a generalization of Bayesian theory to subjective probability. It assigns belief masses to subsets of a frame of discernment, which represents a set of exclusive potential states, such as the possible class labels for a voxel. A belief mass can be assigned to any subset of the frame. Assigning all belief mass to the entire frame represents the opinion that the truth can be any possible state, e.g. any label is equally likely.

The belief distribution of DST over the frame of discernment can be formalized as a Dirichlet distribution by Subjective Logic (SL) [26]. For a voxel i, the Dirichlet distribution \(Dir\left( \varvec{\alpha }_{i}\right)\) is parameterized by a vector of Dirichlet parameters \(\alpha _{ij}\) for K classes, where j denotes the j-th class. (The meanings of subscripts i and j hold for the entire paper.) The neural network collects evidence \(e_{ij}\) from the input data, a measure of support for classifying samples into class j. The belief mass distribution, i.e. subjective opinion, in [17] corresponds to a Dirichlet distribution with parameter \(\alpha _{ij}=e_{ij}+1\).

As a result, it is equivalent to placing a Dirichlet distribution on the predicted categorical distribution, allowing a single network to output different predictions. The output layer of an EDL-based neural network parameterizes a simplex distribution representing the probability distribution of class assignments. The softmax/sigmoid classification layer is replaced with a ReLU activation layer that outputs non-negative continuous values, resulting in \(e_{ij}\). The vector of predicted classification probabilities can be computed by:

$$\begin{aligned} {\hat{p}}_{ij}=\frac{\ \alpha _{ij}}{S_i}, \end{aligned}$$
(1)

where \(S_i=\sum _{j=1}^{K}\alpha _{ij}\) is called the Dirichlet strength. The class probability vector for voxel i given by \({\textbf {p}}_{i}\) is modeled as a random vector drawn from the Dirichlet distribution [18].
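The mapping from evidence to predicted probabilities in Eq. 1 can be sketched in a few lines. This is a minimal plain-Python illustration for a single voxel, not the implementation used in the experiments; the function name `dirichlet_from_evidence` and the example evidence values are hypothetical.

```python
def dirichlet_from_evidence(evidence):
    """Map non-negative evidence e_ij of one voxel to (alpha, S, p_hat)."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative (e.g. a ReLU output)")
    alpha = [e + 1.0 for e in evidence]       # alpha_ij = e_ij + 1, as in [17]
    strength = sum(alpha)                     # Dirichlet strength S_i
    p_hat = [a / strength for a in alpha]     # Eq. (1)
    return alpha, strength, p_hat


# Hypothetical evidence for a 3-class voxel: strong support for class 0.
alpha, strength, p_hat = dirichlet_from_evidence([8.0, 0.0, 0.0])
print(alpha, strength, p_hat)
```

With zero evidence for every class, all Dirichlet parameters equal one and the predicted probabilities are uniform.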

Let \({\textbf {y}}_{i}\) be the one-hot encoded labels with \(y_{ik}=1\) and \(y_{ij}=0\) for all \(j\ne k\). Treating the Dirichlet distribution \(Dir\left( {\textbf {p}}_{i} \mid \varvec{\alpha }_{i}\right)\) as a prior on the multinomial likelihood \(Mult({\textbf {y}}_{i} \mid {\textbf {p}}_{i})\), one can minimize the negative logarithm of the marginal likelihood:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {ML},i}&=-\log \left( \int {\prod _{j=1}^{K}{p_{ij}}^{y_{ij}}\frac{1}{\mathcal {B}(\varvec{\alpha }_{i})}\prod _{j=1}^{K}{p_{ij}}^{\alpha _{ij}-1}d{\textbf {p}}_{i}}\right) \\ {}&=\sum _{j=1}^{K}{y_{ij}\left( \log \left( S_i\right) -\log \left( \alpha _{ij}\right) \right) }, \end{aligned} \end{aligned}$$
(2)

where \(\mathcal {B}\) is the multinomial beta function [27]. Alternatively, one can minimize the Bayes risk of the cross-entropy loss:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {CE},i}&=\int {\left[ -\sum _{j=1}^{K}{y_{ij}\log (p_{ij})}\right] \frac{1}{\mathcal {B}(\varvec{\alpha }_{i})}\prod _{j=1}^{K}{p_{ij}}^{\alpha _{ij}-1}d{\textbf {p}}_{i}}\\&=\sum _{j=1}^{K}{y_{ij}\left( \psi \left( S_i\right) -\psi \left( \alpha _{ij}\right) \right) }, \end{aligned} \end{aligned}$$
(3)

or the mean squared error:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {MSE},i}&=\int {\Vert {\textbf {y}}_{i}-{\textbf {p}}_{i}\Vert ^2\frac{1}{\mathcal {B}(\varvec{\alpha }_{i})}\prod _{j=1}^{K}{p_{ij}}^{\alpha _{ij}-1}d{\textbf {p}}_{i}}\\ {}&=\sum _{j=1}^{K}{\left( y_{ij}-{\hat{p}}_{ij}\right) ^2+\frac{{\hat{p}}_{ij}\left( 1-{\hat{p}}_{ij}\right) }{\left( S_i+1\right) }}, \end{aligned} \end{aligned}$$
(4)

where \(\psi\) refers to the digamma function [28]. Sensoy et al. [17] observed that \(\mathcal {L}_{\text {ML},i}\) and \(\mathcal {L}_{\text {CE},i}\) produced excessively high belief masses and were less stable than \(\mathcal {L}_{\text {MSE},i}\). This can be attributed to the fact that these two loss functions encourage maximizing the correct likelihood.
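As an illustration of the closed forms in Eqs. 2 and 4, the following plain-Python sketch evaluates both losses for a single voxel. The function names are hypothetical, and the cross-entropy form of Eq. 3 is omitted here only because it requires a digamma routine that the Python standard library does not provide.

```python
import math


def ml_loss(alpha, y):
    """Eq. (2): negative log marginal likelihood for one voxel."""
    s = sum(alpha)
    return sum(y_j * (math.log(s) - math.log(a_j)) for a_j, y_j in zip(alpha, y))


def mse_loss(alpha, y):
    """Eq. (4): squared error of the mean plus the Dirichlet variance."""
    s = sum(alpha)
    total = 0.0
    for a_j, y_j in zip(alpha, y):
        p = a_j / s                                   # expected probability
        total += (y_j - p) ** 2 + p * (1.0 - p) / (s + 1.0)
    return total


y = [1, 0, 0]
for alpha in ([1.0, 1.0, 1.0], [9.0, 1.0, 1.0]):
    print(alpha, ml_loss(alpha, y), mse_loss(alpha, y))
```

In this example both losses decrease when evidence is added to the correct class.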

2.2 Related work of EDL

Sensoy et al. [17] used the MSE loss for natural image classification. They showed that the loss decreases as the correct class parameter grows and decreases when the largest incorrect parameter decays. Furthermore, they integrated the KL divergence loss to further shrink the parameters of the incorrect classes. However, the properties of the aggregated loss function were not shown, and the behavior of the loss was not studied for all parameters. Also for the image classification problem, [18] replaced the square-norm of the MSE loss with the max-norm and achieved higher performance, because the max-norm minimizes the highest class prediction error, whereas the square-norm minimizes the total sum of squares, which is more susceptible to outliers. However, this may not hold for tumor segmentation with severe class imbalance. The TBraTS network [25] attempted to apply EDL’s CE loss to brain tumor segmentation. In order to improve the segmentation accuracy, the network output was additionally passed through the softmax layer to calculate the soft Dice loss, which was added to the CE loss. However, this increases training costs and complexity, and an incomplete deployment of the EDL framework may cause the network to fail to produce true evidence values.

Different from these methods that employed MSE or CE loss and show inferior segmentation results, our approach minimized region-based prediction error (soft Dice loss) under the Dirichlet prior distribution, which significantly facilitated the segmentation performance of the EDL framework. The improvement of EDL in segmentation was statistically verified in the medical image dataset, paving the way for the clinical application of the EDL-based segmentation and uncertainty estimation framework.

3 Method

This section details our approach, a novel region-based EDL framework for 3D brain tumor segmentation (Sect. 3.1) and describes how we quantify the uncertainty (Sect. 3.2).

3.1 Region-based evidential deep learning

For semantic segmentation of medical images, it is important to consider the accuracy of segmented regions in addition to standard classification errors. Hence, we proposed a region-based objective to minimize the expected prediction error in the EDL framework while maintaining high segmentation accuracy. Unlike Zou et al. [25] who added the soft Dice (sDice) loss based on the result of softmax activation to \(\mathcal {L}_{\text {CE},i}\), we directly minimized the Bayes risk of sDice loss:

$$\begin{aligned} \begin{aligned} \text {sDice}&=\frac{1}{K}\sum _{j=1}^{K}{1-\frac{2\sum _{i}{y_{ij}p_{ij}}}{\sum _{i}{{y_{ij}}^2+{p_{ij}}^2}}}, \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {DICE}}&=\int {\left[ \text {sDice}\right] \frac{1}{\mathcal {B}(\varvec{\alpha }_{i})}\prod _{j=1}^{K}{p_{ij}}^{\alpha _{ij}-1}d{\textbf {p}}_{i}}\\&=\frac{1}{K}\sum _{j=1}^{K}\mathbb {E}\left[ 1-\frac{2\sum _{i}{y_{ij}p_{ij}}}{\sum _{i}{{y_{ij}}^2+{p_{ij}}^2}}\right] \\&=1-\frac{2}{K}\sum _{j=1}^{K}\frac{\sum _{i}{y_{ij}\mathbb {E}\left[ p_{ij}\right] }}{\sum _{i}{{y_{ij}}^2+\mathbb {E}\left[ {p_{ij}}^2\right] }}. \end{aligned} \end{aligned}$$
(6)

By using the identity:

$$\begin{aligned} \mathbb {E}\left[ {p_{ij}}^2\right] ={\mathbb {E}\left[ p_{ij}\right] }^2+\text {Var}(p_{ij}), \end{aligned}$$
(7)

the equation can be formulated in an easily interpretable form:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {DICE}}&=1-\frac{2}{K}\sum _{j=1}^{K}\frac{\sum _{i}{y_{ij}{\hat{p}}_{ij}}}{\sum _{i}{{\underbrace{{y_{ij}}^2+{\hat{p}}_{ij}^2}_{\text {sDiceDen}}}+{\underbrace{\frac{{\hat{p}}_{ij}\left( 1-{\hat{p}}_{ij}\right) }{\left( S_i+1\right) }}_{\text {Var}}}}}\\&=1-\frac{2}{K}\sum _{j=1}^{K}\frac{\sum _{i}{y_{ij}\frac{\ \alpha _{ij}}{S_i}}}{\sum _{i}{{y_{ij}}^2+\left( \frac{\ \alpha _{ij}}{S_i}\right) ^2+\frac{\alpha _{ij}\left( S_i-\alpha _{ij}\right) }{{S_i}^2\left( S_i+1\right) }}}. \end{aligned} \end{aligned}$$
(8)

By factoring out the denominator of sDice (sDiceDen) and variance (Var), the loss aims to achieve the joint goal of minimizing the region-based prediction error and variance of the Dirichlet experiments generated by the neural network for the training set.
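The closed form in Eq. 8 can be sketched as follows. This is a minimal plain-Python illustration over a flat list of voxels, written under the assumption that `alpha` and `y` are lists of N voxels with K values each; the function name `edl_dice_loss` is hypothetical and the sketch is not the implementation used in the experiments.

```python
def edl_dice_loss(alpha, y):
    """Eq. (8): Bayes risk of the soft Dice loss under the Dirichlet prior.

    alpha, y: lists of N voxels, each a list of K values (y is one-hot).
    """
    num_classes = len(alpha[0])
    total = 0.0
    for j in range(num_classes):
        numerator = 0.0
        denominator = 0.0
        for alpha_i, y_i in zip(alpha, y):
            s = sum(alpha_i)                      # Dirichlet strength S_i
            p = alpha_i[j] / s                    # expected probability, Eq. (1)
            var = p * (1.0 - p) / (s + 1.0)       # Dirichlet variance term
            numerator += y_i[j] * p
            denominator += y_i[j] ** 2 + p ** 2 + var   # sDiceDen + Var
        total += numerator / denominator
    return 1.0 - 2.0 * total / num_classes


labels = [[1, 0], [0, 1]]
weak = edl_dice_loss([[2.0, 1.0], [1.0, 2.0]], labels)
strong = edl_dice_loss([[50.0, 1.0], [1.0, 50.0]], labels)
print(weak, strong)
```

In this two-voxel example, adding evidence to the correct class of each voxel lowers the loss, which matches the joint goal described above.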

In order to ensure an effective EDL framework that allows the network to learn to generate subjective opinions from evidence correctly, the loss function needs to have the following properties.

Hypothesis 1

When the network optimises, the loss function prioritizes data fitting over variance estimation.

Hypothesis 2

The loss function has a tendency to fit the data.

Hypothesis 3

The loss function avoids generating evidence for all observations it cannot explain.

These properties of the proposed DICE loss (\(\mathcal {L}_{\text {DICE}}\)) can be guaranteed by the following theorems, each numbered one-to-one with the hypothesis. The proofs of all Theorems are presented in Appendix 1.

Theorem 1

For any \(\alpha _{ij}\ge 1\), the inequality \(sDiceDen > Var\) is satisfied.

Theorem 2

For a given voxel p with the correct label c, \(\mathcal {L}_{\text {DICE}}\) decreases when new evidence is added to \(\alpha _{pc}\) and increases when evidence is removed from \(\alpha _{pc}\).

Theorem 3

For a given voxel p with the correct label c, \(\mathcal {L}_{\text {DICE}}\) decreases when evidence is removed from all incorrect Dirichlet parameters \(\alpha _{pw}\) for all \(w\ne c\).

To summarise, Theorems 1 to 3 demonstrate that the proposed loss function can optimize the neural network to provide more evidence for the correct class of each voxel while avoiding misclassification by discarding misleading evidence. By accumulating evidence, the loss also tends to reduce the variance of its predictions on the training set, but only if the additional evidence leads to a better fit to the data.
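As an informal check of Theorem 1 (a sketch added for intuition, not the proof given in Appendix 1), note that \({y_{ij}}^2\ge 0\), so it suffices to compare \({\hat{p}}_{ij}^2\) with the variance term. Substituting \({\hat{p}}_{ij}=\alpha _{ij}/S_i\) gives:

$$\begin{aligned} {\hat{p}}_{ij}^2-\frac{{\hat{p}}_{ij}\left( 1-{\hat{p}}_{ij}\right) }{S_i+1}=\frac{\alpha _{ij}\left[ \alpha _{ij}\left( S_i+1\right) -\left( S_i-\alpha _{ij}\right) \right] }{{S_i}^2\left( S_i+1\right) }=\frac{\alpha _{ij}\left[ \left( \alpha _{ij}-1\right) S_i+2\alpha _{ij}\right] }{{S_i}^2\left( S_i+1\right) }. \end{aligned}$$

For \(\alpha _{ij}\ge 1\), the term \(\left( \alpha _{ij}-1\right) S_i\) is non-negative and \(2\alpha _{ij}\) is positive, so the difference is strictly positive for every voxel and class.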

To further minimize the contribution of parameters associated with incorrect classes, a KL divergence loss function is introduced to shrink their evidence to 0 as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {KL},i}=&\log \left( \frac{\Gamma \left( \sum _{j=1}^{K}{\widetilde{\alpha }}_{ij}\right) }{\Gamma \left( K\right) \prod _{j=1}^{K}\Gamma \left( {\widetilde{\alpha }}_{ij}\right) }\right) \\&+\sum _{j=1}^{K}\left( {\widetilde{\alpha }}_{ij}-1\right) \left[ \psi \left( {\widetilde{\alpha }}_{ij}\right) -\psi \left( \sum _{j=1}^{K}{\widetilde{\alpha }}_{ij}\right) \right] , \end{aligned} \end{aligned}$$
(9)

where \(\Gamma (\cdot )\) is the gamma function [28] and \({\widetilde{\varvec{\alpha }}}_{i}={\textbf {y}}_{i}+\left( \textbf{1}-{\textbf {y}}_{i}\right) \bigodot \varvec{\alpha }_{i}\) are the Dirichlet parameters after removal of the non-misleading evidence. The following theorem shows a desirable monotonicity property of this regularization loss as a supplement to [17].

Theorem 4

For a voxel i with the correct label c, \(\mathcal {L}_{\text {KL},i}\) increases in \(\alpha _{iw}\) for all \(w\ne c\).

Theorems 3 and 4 show that the strength of parameters associated with misleading results is expected to decrease during training. Since the parameters are all expected to be minimized, the preferred behavior of the proposed loss function results in a higher uncertainty of misclassification.
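The regularizer in Eq. 9 can be evaluated for a single voxel as sketched below. The standard library has `math.lgamma` but no digamma function, so the helper `digamma` here is an approximation (recurrence plus the usual asymptotic series); both function names are hypothetical and the sketch is not the implementation used in the experiments.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def kl_loss(alpha, y):
    """Eq. (9) for one voxel: KL(Dir(alpha_tilde) || Dir(1, ..., 1))."""
    k = len(alpha)
    # alpha_tilde = y + (1 - y) * alpha: the correct class is reset to 1.
    alpha_tilde = [y_j + (1.0 - y_j) * a_j for a_j, y_j in zip(alpha, y)]
    s = sum(alpha_tilde)
    log_norm = math.lgamma(s) - math.lgamma(k) - sum(math.lgamma(a) for a in alpha_tilde)
    psi_s = digamma(s)
    penalty = sum((a - 1.0) * (digamma(a) - psi_s) for a in alpha_tilde)
    return log_norm + penalty


y = [1, 0, 0]
for alpha in ([10.0, 1.0, 1.0], [10.0, 3.0, 1.0], [10.0, 5.0, 1.0]):
    print(alpha, kl_loss(alpha, y))
```

In this example the loss is zero when no evidence is assigned to incorrect classes and grows as the parameter of an incorrect class grows, in line with Theorem 4.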

The final loss function is defined as:

$$\begin{aligned} \mathcal {L}_{\text {EDL}}=\mathcal {L}_{\text {DICE}}+\lambda \mathcal {L}_{\text {KL,mean}}, \end{aligned}$$
(10)

where \(\mathcal {L}_{\text {KL,mean}}\) is the mean KL divergence loss over all voxels and \(\lambda\) is an annealing coefficient. The KL divergence loss is gradually introduced by \(\lambda\) for a stable training due to its strong regularization effect. The annealing scheme is set to reach a maximum \(\frac{1}{10}\) as: \(\lambda =\frac{1}{10}{\min \left( 1,\ \frac{n}{100}\right) }^2\) where n is the current epoch.
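The annealing scheme can be written as a one-line function. This is a minimal sketch; the function and argument names are hypothetical, and whether epochs are counted from 0 or 1 is an assumption not fixed by the text.

```python
def kl_annealing_coefficient(epoch, max_lambda=0.1, ramp_epochs=100):
    """lambda = max_lambda * min(1, n / ramp_epochs) ** 2 for epoch n."""
    return max_lambda * min(1.0, epoch / ramp_epochs) ** 2


for n in (0, 25, 50, 100, 200):
    print(n, kl_annealing_coefficient(n))
```

The quadratic ramp keeps the regularizer very small in the first epochs and holds it at the maximum of 1/10 from epoch 100 onward.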

In addition, the weighted sDice loss, \(\mathcal {L}_{\text {wDICE}}\), is also proposed to ease the class imbalance between tumor and background voxels. The weight for each segmentation class is one minus the ratio of foreground voxels to background voxels. Since the weights are all positive and class-wise, all theoretical properties of the loss function still hold.

Furthermore, the parameter of Dirichlet distribution in our framework is re-defined as:

$$\begin{aligned} \alpha _{ij}=\left( e_{ij}+1\right) ^2. \end{aligned}$$
(11)

Unlike [17], which defined the Dirichlet parameter as \(\alpha _{ij}=e_{ij}+1\), this alternative formula allows the network to output high Dirichlet parameters more easily. This avoids the defect that it is almost impossible for the network to express a high degree of uncertainty for a particular outcome since each outcome gives a minimal proof of one, i.e. \(\alpha _{ij}\ge 1\).
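As a small worked illustration (the evidence values are hypothetical and chosen only for the arithmetic), consider \(K=2\) and evidence \((e_{i1},e_{i2})=(9,0)\). Applying Eq. 1 with \(\alpha _{ij}=e_{ij}+1\) and with Eq. 11, respectively, gives:

$$\begin{aligned} \varvec{\alpha }_{i}=(10,1):\ {\hat{p}}_{i1}=\frac{10}{11}\approx 0.909, \qquad \varvec{\alpha }_{i}=(100,1):\ {\hat{p}}_{i1}=\frac{100}{101}\approx 0.990. \end{aligned}$$

With zero evidence, both definitions give \(\alpha _{ij}=1\) for every class and therefore the same uniform prediction.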

3.2 Uncertainty quantification

Calculating the predictive entropy (PE) is a common way to quantify uncertainty. Based on the information theory, PE uses confidence scores of predictions to calculate the total uncertainty for a voxel i, which is defined as:

$$\begin{aligned} \mathcal {H}({\textbf {p}}_{i})=-\sum _{j=1}^{K}{p_{ij}\log (p_{ij})}, \end{aligned}$$
(12)

where \({\textbf {p}}_{i}\) is the confidence score vector [29]. In order to better compare different methods, we normalized the PE by its maximum possible value as:

$$\begin{aligned} \mathcal {H}({\textbf {p}}_{i})=-\frac{1}{\log (K)}\sum _{j=1}^{K}{p_{ij}\log (p_{ij})}. \end{aligned}$$
(13)

As a result, the normalized predictive entropy (NPE) lies in [0, 1], where 1 implies maximum uncertainty and 0 implies an absolutely confident prediction.
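Eq. 13 can be sketched for a single voxel as follows. The function name is hypothetical, and the convention that terms with zero probability contribute nothing to the sum is an assumption made so that one-hot vectors can be evaluated.

```python
import math


def normalized_predictive_entropy(p):
    """Eq. (13): predictive entropy divided by its maximum, log(K)."""
    k = len(p)
    # Assumed convention: terms with p_j = 0 contribute 0 to the sum.
    entropy = -sum(p_j * math.log(p_j) for p_j in p if p_j > 0.0)
    return entropy / math.log(k)


print(normalized_predictive_entropy([0.25, 0.25, 0.25, 0.25]))
print(normalized_predictive_entropy([0.97, 0.01, 0.01, 0.01]))
```

A uniform prediction reaches the upper bound, while a prediction concentrated on one class approaches zero.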

4 Experiment setup

Experiments on the standard benchmark (BraTS 2020) were conducted to compare different techniques for uncertainty quantification and to evaluate qualitatively the produced segmentation along with the uncertainty associated with each voxel. We first describe data acquisition and processing (Sect. 4.1) and then introduce the models (Sect. 4.2) and evaluation metrics (Sect. 4.3) for comparative experiments.

4.1 Data acquisition and processing

The BraTS 2020 [7, 30, 31] dataset comprises brain MRI images of various scanners and protocols. The ground truth (GT) label includes the GD-enhancing tumor (ET), peritumoral edema (ED), and necrotic and non-enhancing tumor core (NCR + NET). The segmentation masks were evaluated on three tumor subregions: the ET, the tumor core (TC = ET + NCR + NET), and the whole tumor (WT = ET + NCR + NET + ED). Four MRI modalities of T1, T1ce, T2, and T2-FLAIR were co-registered with a size of 240 \(\times\) 240 \(\times\) 155. They were then interpolated to \(1\,\text {mm}^3\) and skull-stripped. Since GT labels are only available for the training set (369 cases), we split the original training set into a new training set of 236 cases, a validation set of 59 cases, and a test set of 74 cases.
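The composition of the three evaluated subregions can be sketched per voxel as below. The integer label values (0 background, 1 NCR + NET, 2 ED, 4 ET) are an assumption based on the usual BraTS annotation convention and are not stated in the text; the function name is hypothetical.

```python
# Assumed BraTS label convention: 0 background, 1 NCR + NET, 2 ED, 4 ET.
NCR_NET, ED, ET = 1, 2, 4


def to_subregions(label):
    """Return binary (ET, TC, WT) membership for one voxel label."""
    et = int(label == ET)
    tc = int(label in (ET, NCR_NET))          # TC = ET + NCR + NET
    wt = int(label in (ET, NCR_NET, ED))      # WT = TC + ED
    return et, tc, wt


for label in (0, NCR_NET, ED, ET):
    print(label, to_subregions(label))

# The stated split covers the whole original training set.
print(236 + 59 + 74)
```

Because the subregions are nested (ET within TC within WT), they overlap, which is why each can be treated as a separate binary segmentation target.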

All images are cropped to 160 \(\times\) 192 \(\times\) 128 to reduce computational waste in the background and are then preprocessed by intensity normalization. During the training, various data augmentation techniques were applied on-the-fly as in [32] to artificially increase the dataset size and minimize the risk of overfitting.

Fig. 1 Representative visual segmentation results of the proposed region-based EDL method on the BraTS 2020 test set. The labels are enhancing tumor (yellow), edema (green), and necrotic and non-enhancing tumor (red)

4.2 Model training and optimization

We chose the well-validated nnU-Net [33] as our Base network model and configured it as in our previous work [32, 34]. All softmax/sigmoid layers in the Base network were replaced with ReLU activation layers as described in Sect. 2.1. For comparison, we used different loss functions to optimize the network: \(\mathcal {L}_{\text {CE},i}\), \(\mathcal {L}_{\text {MSE},i}\), \(\mathcal {L}_{\text {DICE}}\), and \(\mathcal {L}_{\text {wDICE}}\). Since the evaluation would be based on more meaningful tumor subregions, the network was trained to segment each overlapping subregion separately. However, we also trained the network for multi-class segmentation of the basic labels using \(\mathcal {L}_{\text {DICE}}\) with \(K=4\) as an ablation study.

In addition, we employed the training strategies of Ensemble [15], Dropout [14], and TBraTS [25], all of which used soft Dice-based loss functions for fair comparison. For Ensemble, we trained five networks with different initialized weights, which has proven to be sufficient in practice [35]. Dropout layers (factor of 0.5) were added to the deepest three layers of the Base network to handle high-level features, which is the most efficient [16]. These layers were also active during inference, and the same images were passed 10 times to quantify the prediction uncertainty [14]. Previous research has found that a sampling rate of 10 is adequate for reasonable uncertainty estimation [14]. Moreover, we used the strategy of TBraTS, which combined existing losses for multi-label segmentation.

The adaptive moment estimator (Adam) optimizer was used to optimize all networks in 200 epochs, with a batch size of 1 and an initial learning rate of 0.0003. Experiments were implemented using PyTorch 1.10 on NVIDIA GeForce RTX 3090 GPUs.

4.3 Evaluation metrics

Our method was evaluated using the independent test set of BraTS 2020 (74 cases). The segmentation performance was evaluated using the Dice score, which is defined as:

$$\begin{aligned} \text {Dice} = \frac{2|\mathcal {X} \cap \mathcal {Y} |}{|\mathcal {X} |+|\mathcal {Y}|}, \end{aligned}$$
(14)

where \(\mathcal {X}\) and \(\mathcal {Y}\) are the sets of GT and predicted voxels. The Dice score measures the spatial overlap between the GT and the segmentation results, where a score of 1 indicates complete overlap.
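Eq. 14 can be sketched for two flattened binary masks as follows. The function name is hypothetical, and returning 1 when both masks are empty is an assumed convention that the text does not specify.

```python
def dice_score(ground_truth, prediction):
    """Eq. (14) for two equal-length binary masks (flattened)."""
    intersection = sum(1 for g, p in zip(ground_truth, prediction) if g and p)
    size_sum = sum(1 for g in ground_truth if g) + sum(1 for p in prediction if p)
    if size_sum == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * intersection / size_sum


print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
print(dice_score([1, 1, 0, 0], [1, 1, 0, 0]))
```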

In addition, the following metrics were utilized to evaluate uncertainty estimation: expected calibration error (ECE) [1], soft uncertainty-error overlap (sUEO), and BraTS score (BraS) [36]. ECE is defined by the absolute calibration error between the confidence interval and the accuracy interval (\(c_m\) and \(a_m\), where m is the m-th bin defined in the interval [0, 1]), weighted by the number of voxels (\(n_m\)) in the interval. ECE is given by

$$\begin{aligned} \text {ECE}=\sum _{m=1}^{M}\frac{n_m}{N}|c_m-a_m |, \end{aligned}$$
(15)

where N and M are the total numbers of voxels and bins, and the confidence is calculated as one minus the uncertainty. ECE ranges from 0 to 1, where lower values indicate better calibration. To reduce the effect of the large, confident, and accurate extracranial regions typically found in brain tumor MRI, we only considered voxels within the brain. Improving on the uncertainty-error overlap (UEO) [1], we proposed the soft uncertainty-error overlap (sUEO) that directly uses the uncertainty quantities (\(u_i\)) to measure the overlap:

$$\begin{aligned} \text {sUEO}=\frac{2\sum _{i}{y_iu_i}}{\sum _{i}{{y_i}^2+{u_i}^2}}. \end{aligned}$$
(16)

sUEO does not require thresholding the uncertainty map, which can save time optimizing the threshold over the validation set. It shows whether a model can precisely localize segmentation errors. Moreover, the comprehensive BraS is defined by:

$$\begin{aligned} \begin{aligned} \text {BraS}=&\frac{1}{3}\left[ \text {{AUC}}_\text {{Dice}}+(1-\text {{AUC}}_\text {{FTP}}) \right. \\&\left. +(1-\text {{AUC}}_\text {{FTN}})\right] , \end{aligned} \end{aligned}$$
(17)

where \(\text {{AUC}}_\text {{Dice}}\), \(\text {{AUC}}_\text {{FTP}}\), and \(\text {{AUC}}_\text {{FTN}}\) are the areas under three curves: 1) Dice vs. confidence threshold, 2) ratio of filtered True Positives (FTP) vs. confidence threshold, and 3) ratio of filtered True Negatives (FTN) vs. confidence threshold. The curves are plotted against the segmentation filtered by different confidence levels, in which only voxels with confidence greater than the threshold are retained. This metric rewards uncertainty estimates that yield high confidence for correct segmentations and low confidence for incorrect segmentations, and penalizes uncertainty measures that result in a higher percentage of under-confident correct segmentations.
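The two voxel-wise metrics in Eqs. 15 and 16 can be sketched as below. Several details are assumptions rather than facts from the text: the number of bins (10 equal-width bins), the interpretation of \(y_i\) in Eq. 16 as a binary indicator of a segmentation error at voxel i, and the value returned when the sUEO denominator is zero. The function names are hypothetical.

```python
def expected_calibration_error(confidence, correct, num_bins=10):
    """Eq. (15) with equal-width confidence bins on [0, 1]."""
    n = len(confidence)
    conf_sum = [0.0] * num_bins
    acc_sum = [0.0] * num_bins
    count = [0] * num_bins
    for c, ok in zip(confidence, correct):
        m = min(int(c * num_bins), num_bins - 1)   # c = 1.0 goes to the last bin
        conf_sum[m] += c
        acc_sum[m] += float(ok)
        count[m] += 1
    ece = 0.0
    for m in range(num_bins):
        if count[m]:
            c_m = conf_sum[m] / count[m]           # mean confidence in bin m
            a_m = acc_sum[m] / count[m]            # accuracy in bin m
            ece += count[m] / n * abs(c_m - a_m)
    return ece


def soft_ueo(error, uncertainty):
    """Eq. (16): soft overlap between an error map and an uncertainty map."""
    numerator = 2.0 * sum(e * u for e, u in zip(error, uncertainty))
    denominator = sum(e * e + u * u for e, u in zip(error, uncertainty))
    # Assumption: no errors and zero uncertainty count as perfect overlap.
    return numerator / denominator if denominator > 0 else 1.0


print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
print(soft_ueo([1, 1, 0, 0], [0.5, 0.5, 0.0, 0.0]))
```

Because `soft_ueo` consumes the continuous uncertainty values directly, no threshold has to be tuned on a validation set.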

5 Results and discussion

This section first evaluates the performance of the novel region-based EDL framework for brain tumor segmentation and uncertainty quantification through experiments on the original dataset (Sect. 5.1). It then examines its robustness by applying various image processing techniques to the test image data (Sect. 5.2).

5.1 Segmentation and uncertainty estimation

Our method generated segmentation results comparable to the GT labels, as visualized in Fig. 1. The quantitative results of our methods averaged over three tumor subregions are compared in Table 1. Although the proposed DICE and wDICE loss functions achieved the highest Dice scores (0.791 and 0.793) among all EDL-based methods, Ensemble and Dropout methods performed slightly more accurately in segmentation (0.807 and 0.804). The success of Ensemble and Dropout was attributed to the variance reduction obtained by combining predictions prone to various errors. However, the dominance of the proposed region-based losses among all EDL frameworks still proved their effectiveness in improving the segmentation performance of EDL. Compared to CE-based or MSE-based losses, the DICE-based losses significantly improved the Dice score by 0.01.

As for the uncertainty estimation, Ensemble and Dropout obtained the lowest ECE values of 0.009 and 0.010, which indicated that they were well-calibrated. However, our methods achieved the highest sUEO and BraS of 0.420 and 0.875, showing their ability to locate errors more precisely and estimate uncertainty. The EDL model optimized by the proposed wDICE loss generated the most accurate uncertainty map to indicate the potential false predictions. On the other hand, the proposed EDL (DICE) model made the most reliable uncertainty estimation, maintaining the lowest error while thresholding along the uncertainty dimension. The advantages of our region-based EDL methods are also shown in Fig. 2. The proposed methods had the most precise uncertainty map consistent with the error map. Ideally, a learned model should only give high uncertainty for a possibly erroneous prediction. Despite the high segmentation accuracy, both Ensemble and Dropout methods generated more uncertainty around mask boundaries and other correct regions, leading to inferior uncertainty estimation performance in terms of sUEO and BraS. It is also worth mentioning that EDL models trained to segment each tumor subregion separately outperformed the ones trained with multi-class labels (DICE-M).

Table 1 Quantitative comparisons of different uncertainty estimation methods on the BraTS 2020 test set

In addition, the inference runtimes of the uncertainty estimation methods on one sample are reported in Table 2. Runtimes of all EDL-based methods are lower than the others. This is because both Ensemble and Dropout use multiple sampling mechanisms at inference time to obtain uncertainty estimates.

Table 2 Inference runtimes of different uncertainty estimation methods for one image
Fig. 2 Representative visual results of the whole tumor (WT) produced by different uncertainty estimation methods on the BraTS 2020 test set. The right half of the figure was evaluated on the test images blurred by a Gaussian filter of sigma = 1.5

Table 3 Quantitative comparisons of different uncertainty estimation methods on preprocessed BraTS 2020 test set. (Bold numbers: best results)
Fig. 3 Representative visual results of the whole tumor (WT) produced by different uncertainty estimation methods on the BraTS 2020 test set. The left half of the figure was evaluated on the test images added with Gaussian noise of variance = 1.5. The right half was evaluated on the test images after Gamma correction of gamma = 5

5.2 Robustness experiment

To verify the robustness of the segmentation model, we applied several image processing techniques to simulate the low-quality acquisition that often occurs in practice. We first blurred the four modalities of the MRI images using a Gaussian filter with sigma = 1.5. Subsequently, we re-evaluated the performance of all methods for segmentation and uncertainty estimation, as shown in Table 3. We can observe that with the addition of Gaussian blur, the segmentation performance of all methods dropped significantly, especially for Ensemble and Dropout. Our method leaped to the highest Dice metric of 0.572 for blurry images, demonstrating its robustness. By comparing the segmentation results for the original input and the blurred input in Fig. 2, it can be seen that the EDL using our loss function segmented the WT region more accurately than all other methods. This is due to the evidence extracted from the data that produced these subjective opinions.

Furthermore, our method exhibited the most reliable uncertainty quantification on blurred images compared to the other uncertainty estimation methods. Unlike the uncertainty of the Ensemble and Dropout methods, which came only from the variance of the predictions, the uncertainty of the EDL method represented whether the prediction was supported by sufficient evidence. Therefore, the uncertainty estimates of EDL methods can more correctly indicate possible prediction errors or provide a better rationale for erroneous predictions, such as learning the wrong evidence or failing to identify the correct features. As shown in Table 3, all uncertainty evaluation metrics decreased, except for the sUEO of the EDL-based methods. This might stem from the robust evidence captured by the EDL segmentation framework, whose advantage is more noticeable with larger error regions. The proposed EDL (DICE) method achieved the top performance, especially in generating reliable uncertainty maps. Besides showing the robustness of our method, this again demonstrated the importance of a region-based loss for locating semantic segmentation errors. As shown in the right half of Fig. 2, the uncertainty map generated by the EDL (DICE) method is the most relevant to the error map. Unlike other methods, which only generate high uncertainty at the edges of the predicted mask, the proposed method can indicate potential error regions inside masks. Region boundaries should, a priori, carry higher uncertainty, but high uncertainty should not be assumed to be exclusive to these regions. Our proposed methods reflect this, as they assign high uncertainty to both boundary and non-boundary regions of the segmented image. This demonstrates the potential of our proposed method for clinical application: potential error regions fed back by the model can assist in automatic correction or quality quantification of tumor segmentation.
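To make the notion of evidence-based uncertainty concrete, the following sketch computes per-voxel belief, expected probability, and uncertainty from non-negative evidence. It assumes the standard subjective-logic formulation of EDL, in which the Dirichlet parameters are the evidence plus one and the uncertainty is the number of classes divided by the Dirichlet strength. The function name and the toy values are illustrative only.

```python
import numpy as np

def edl_uncertainty(evidence):
    """Belief, expected probability, and uncertainty from per-class evidence.

    evidence: non-negative array of shape (K, ...) with K classes on axis 0.
    Assumes alpha = evidence + 1 and uncertainty = K / sum(alpha).
    """
    num_classes = evidence.shape[0]
    alpha = evidence + 1.0              # Dirichlet parameters
    strength = alpha.sum(axis=0)        # Dirichlet strength S
    belief = evidence / strength        # belief mass per class
    prob = alpha / strength             # expected class probability
    uncertainty = num_classes / strength
    return belief, prob, uncertainty

# Two voxels, two classes: strong evidence for class 0 vs. no evidence at all.
evidence = np.array([[18.0, 0.0],
                     [0.0, 0.0]])
belief, prob, uncertainty = edl_uncertainty(evidence)
```

The first voxel has an uncertainty of 0.1 because its prediction is backed by evidence, while the second has an uncertainty of 1.0 even though its expected probabilities are a plain 0.5 per class. In both cases the belief masses and the uncertainty sum to one.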

To enrich the robustness experiment, we further applied Gaussian noise and Gamma correction to the original input to simulate the noise and contrast variability introduced by imaging or enhancement techniques. With Gaussian noise of variance = 1.5, the segmentation performance of all methods was better than on the blurred images, as shown in Table 3. The proposed EDL (wDICE) method remained the best in terms of the Dice metric (0.771), followed by the proposed EDL (DICE) with a Dice of 0.769. The uncertainty estimation metrics resembled those for blurred input, although the sUEO of EDL (wDICE) became 0.002 higher than that of EDL (DICE). For the results after Gamma correction with gamma = 5, however, the proposed EDL (DICE) method excelled in all metrics, which showed its robustness to unexpected contrast variation.
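The two additional perturbations can be sketched as below. This is one plausible reading rather than the exact procedure: it assumes intensities are normalized so that a noise variance of 1.5 is meaningful, and it applies Gamma correction to intensities rescaled to [0, 1] before mapping them back to the original range. The function names and the toy volume are hypothetical.

```python
import numpy as np

def add_gaussian_noise(volume, variance=1.5, seed=0):
    """Add zero-mean Gaussian noise with the given variance."""
    rng = np.random.default_rng(seed)
    return volume + rng.normal(0.0, np.sqrt(variance), size=volume.shape)

def gamma_correction(volume, gamma=5.0, eps=1e-8):
    """Gamma correction on min-max rescaled intensities (assumed convention)."""
    vmin, vmax = volume.min(), volume.max()
    scaled = (volume - vmin) / (vmax - vmin + eps)   # rescale to [0, 1]
    return scaled ** gamma * (vmax - vmin) + vmin    # map back to original range

# Toy volume: zero background with a normalized "brain" block in the centre.
rng = np.random.default_rng(0)
volume = np.zeros((8, 8, 8))
volume[2:6, 2:6, 2:6] = rng.normal(size=(4, 4, 4))

noisy = add_gaussian_noise(volume, variance=1.5)
gamma_img = gamma_correction(volume, gamma=5.0)
background_after_gamma = gamma_img[0, 0, 0]  # was exactly 0 before correction
```

Under this convention, a background that was zero before correction generally becomes non-zero afterwards whenever the volume contains negative intensities.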

These observations can also be visually inspected in Fig. 3. Compared to other methods, our methods provided more precise uncertainty maps. The Ensemble and Dropout models had trouble handling the boundaries, especially for Gamma-corrected input. The extracranial area is no longer zero after Gamma correction, which might cause problems when applying zero-padding. Moreover, their softmax/sigmoid predictions were overconfident on Gamma-corrected input, where the main error regions were not indicated in the uncertainty map. Non-region-based EDL methods also produced inaccurate uncertainty maps. Fig. 3 also exposes the shortcoming of using MSE loss in EDL to quantify the uncertainty of medical image segmentation: the resulting uncertainty was significantly biased by the interference.
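The visual comparison between uncertainty maps and error maps can also be expressed numerically. The sketch below uses a soft Dice-style overlap between a continuous uncertainty map and a binary error map. It is an assumed stand-in with the same intent as the sUEO metric reported in Table 3, not necessarily its exact definition, and the toy masks are illustrative.

```python
import numpy as np

def error_map(pred_mask, true_mask):
    """Binary map that is 1 where the prediction disagrees with the ground truth."""
    return (pred_mask.astype(bool) != true_mask.astype(bool)).astype(float)

def soft_overlap(uncertainty, error, eps=1e-8):
    """Soft Dice-style overlap between an uncertainty map and an error map."""
    num = 2.0 * (uncertainty * error).sum()
    den = (uncertainty ** 2).sum() + (error ** 2).sum()
    return (num + eps) / (den + eps)

true_mask = np.zeros((6, 6))
true_mask[1:5, 1:5] = 1
pred_mask = np.zeros((6, 6))
pred_mask[1:5, 1:3] = 1            # misses the right half of the tumor

err = error_map(pred_mask, true_mask)

u_region = 0.9 * err + 0.05        # high uncertainty inside the error region
u_edge = np.zeros((6, 6))
u_edge[1:5, 1] = 0.9               # high uncertainty only on one mask edge

score_region = soft_overlap(u_region, err)
score_edge = soft_overlap(u_edge, err)
```

In this toy case the map that flags the missed region scores close to one, while the map that is only uncertain along a correctly segmented edge scores close to zero.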

6 Conclusions and future research directions

In this paper, we proposed a region-based EDL framework to segment brain tumors and quantify their uncertainty reliably and robustly. We showed, through four theoretical properties, that the proposed region-based loss can generate reliable prediction confidence by gathering evidence in the output image. Our method produced voxel-level uncertainty maps for brain tumor segmentation, which provided additional information on segmentation confidence for cancer diagnosis. Extensive experiments showed that the proposed method is more robust than previous methods on the BraTS 2020 dataset and achieves the best performance in segmentation uncertainty estimation. Furthermore, the novel framework retained the low computational cost of EDL and can be easily integrated into any neural network.

Our method is currently slightly inferior to the Ensemble and Dropout methods in terms of ECE and Dice when segmenting raw images. Calibration methods such as temperature scaling can be applied to improve the ECE, while EDL frameworks with higher segmentation accuracy are worthy of further study. Moreover, tuning and optimizing the parameters of EDL to achieve faster inference is a known problem, especially the suitability of the Dirichlet prior, and will be addressed in follow-up studies. In addition, since the predictive uncertainty can be separated into epistemic and aleatoric uncertainty, future work can also focus on the inherent value that uncertainty estimation brings to automated diagnosis when differentiating between the two sources of uncertainty. A fourth direction is validating this framework in other diagnostic applications, possibly favoring the fusion of more multimodal information sources. We can then assess whether fusing different information modes decreases the overall uncertainty of the model as estimated by our EDL segmentation framework.
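As an illustration of the calibration step mentioned above, the following sketch computes the ECE and fits a single temperature by grid search on synthetic, deliberately overconfident logits. It shows temperature scaling in its usual softmax form. How it would be attached to an evidential output is not specified here, and the bin count, grid, and synthetic data are assumptions. In practice the temperature would be fitted on a held-out validation set.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted gap between accuracy and mean confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature that minimizes the negative log-likelihood."""
    grid = np.linspace(0.5, 5.0, 46) if grid is None else grid
    rows = np.arange(len(labels))
    def nll(t):
        return -np.log(softmax(logits / t)[rows, labels] + 1e-12).mean()
    return float(min(grid, key=nll))

# Synthetic data: labels follow softmax(base), but the "model" reports 3 * base.
rng = np.random.default_rng(0)
base = rng.normal(size=(5000, 3))
cdf = softmax(base).cumsum(axis=1)
labels = np.minimum((cdf < rng.random((5000, 1))).sum(axis=1), 2)
logits = 3.0 * base

t_hat = fit_temperature(logits, labels)
ece_before = expected_calibration_error(softmax(logits), labels)
ece_after = expected_calibration_error(softmax(logits / t_hat), labels)
```

Dividing the logits by the fitted temperature leaves the predicted class unchanged, so the Dice score is unaffected and only the confidence values move.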