Epistemic and aleatoric uncertainty reduction with rotation variation for medical image segmentation with ConvNets

Deep convolutional neural networks (ConvNets) achieve significant segmentation performance on medical images of various modalities. However, isolated errors across a large test set covering various tumor conditions are not acceptable in clinical practice. Such errors are usually caused by inadequate training and by noise inherent in data collection, which are recognized as epistemic and aleatoric uncertainty in deep learning-based approaches. In this paper, we analyze the two types of uncertainty in medical image segmentation tasks and propose a reduction method that trains models with data augmentation. Sheltered (occluded) zones in images are reduced by 2D imaging of surfaces at different angles of 3D organs. Rotation transformation and noise are estimated by Monte Carlo simulation with prior parameter distributions, and the aleatoric uncertainty is quantified in this process. Experiments on segmentation of computed tomography images demonstrate that overconfident incorrect predictions are reduced through uncertainty reduction and that our method outperforms prediction baselines based on epistemic and aleatoric estimation.


Introduction
As an essential task in many surgical applications, e.g., diagnosis, treatment and recovery [1], image segmentation is applied to recognize the boundaries of tumors in organs. Due to large variations in the segmentation target among patients, segmentation of images from organic bodies is still far from accurate and reliable [2]. Because of the high precision requirement, the application is not usable in clinical practice even if only isolated errors occur during segmentation. Therefore, reducing uncertainty is critical for accurate and reliable segmentation. With quantification of epistemic and aleatoric uncertainty, we can assess the reliability of the results and guide human intervention when needed.
Segmentation methods based on convolutional neural networks (ConvNets) have investigated the uncertainty estimation issue in image classification and recognition [3][4][5]. However, the output always consists of high-level predictions, e.g., image annotations and bounding box parameters. These methods do not provide uncertainty estimation at the pixelwise level, which is crucial in image segmentation [6]. Moreover, the training process requires a large quantity of labeled training data, which is difficult to obtain for medical images due to the specialization required for annotation. Consequently, the training sets for deep learning-based approaches in medical image segmentation are comparatively small [7]. The limited training samples introduce much uncertainty and, in turn, result in unreliable segmentation. One typical mechanism to address the issue is data augmentation [8], which involves flipping, cropping, rotating and scaling. However, in-depth studies of its implementation for medical image segmentation are still lacking.
Uncertainty estimation has been studied in deep neural networks for high-level image label prediction output and can be categorized into two major types: epistemic and aleatoric. In the field of medical image segmentation, the former depends on how much of the input data lies outside the training distribution and can be overcome by using enough accurately labeled training data. The latter is introduced by sensor accuracy and ambiguous label boundaries of organs or lesions, and is reducible only through an increase in measurement precision [9]. Recent works [10, 11] focus on test-time dropout-based uncertainty during training, which can be recognized as epistemic uncertainty estimation. Aleatoric uncertainty is transferred to epistemic uncertainty by mapping input data with a unified Bayesian deep learning framework [12]. However, the proposed method has not been utilized for segmentation with medical data, especially 3D images.
Motivated by the above observations, we propose a data augmentation method for medical image segmentation that utilizes images captured from different angles. An image captured from an organ is recognized as the result of an acquisition process that contains geometric transforms, in which the overlapped areas are identified. Images captured from other angles of the same area are fed into the training process to compensate for the overlapped area, providing additional data for predicting the segmentation distribution. The prediction is achieved through Monte Carlo (MC) sampling of the predictive output of the same trained neural network model. The augmentation can recover the miscaptured information, achieve better characterization during training and, in turn, predict better segmentation results.
The contributions of this paper can be summarized as threefold. First, we quantify the epistemic and aleatoric uncertainties that accumulate during training and segmentation. Second, we propose a data augmentation method using training images with different rotation variations, which facilitates uncertainty reduction, especially in medical image segmentation with its small number of annotated training samples. The segmentation performance is improved without additional labeled training data, which are expensive to acquire. Third, confidence-level information about the segmentation results can be calculated for human intervention purposes. The potentially missegmented area is shown to the user for better diagnostic results. This makes it possible to significantly reduce the workload of medical personnel and avoid the unnecessary misdiagnoses that machine learning alone would make.

Related work
A Bayesian network estimates the epistemic uncertainty by learning the posterior distribution of weight parameters in image segmentation. Due to the difficulty of implementation and the high computational cost, dropout at test time is cast as an approximation to represent the uncertainty [13]. Stochastic variational gradient descent (SVGD) is used in ConvNets [14] to estimate parameter uncertainty. However, it requires running multiple stochastic forward passes and averaging them to obtain the uncertainty of the weight parameters. Moreover, the SVGD implementations focus on image classification and annotation. Other approaches, such as Markov chain Monte Carlo (MCMC) [15], Monte Carlo batch normalization (MCBN) [16] and variational Bayesian methods [17, 18], have also been proposed to estimate the uncertainty. However, they have not been utilized in training networks for medical image segmentation. A scalable ensemble of multiple models is proposed to learn the epistemic uncertainty [19], but the paper does not provide any solution to uncertainty measurement in medical image segmentation.
Neural networks have relatively high variance, which makes it difficult to replicate results [25]. Bootstrap [26] and deep ensemble [19] methods train multiple neural network models to reduce the variance. A summary of related studies on uncertainty estimation with deep learning technologies is presented in Table 1. Every single network has its own characteristics, and the predicted error is applied to epistemic or aleatoric uncertainty estimation. The averaging operation reduces sensitivity to the training mechanism, parameter initialization and dataset details, achieving better performance than any single model. A unified Bayesian deep learning network has been proposed to map the input data to the uncertainty originating from variations in images for segmentation and classification [12]. However, the dataset used in their experiments consists of natural rather than medical images. As the first application to medical image analysis, super-resolution diffusion MR brain images were utilized in uncertainty modeling with ConvNets in 2016 [27]. The authors provided an in-depth discussion of the accuracy improvement from predictive-variance and MC-sample-variance uncertainty reduction. Later, a lung nodule detection system was proposed by leveraging MC sample variance [28]. It estimated the uncertainty in a two-stage network and achieved better segmentation results, but the uncertainty measurements were not properly analyzed. Recently, the literature [29, 30] further discussed MC dropout uncertainty as a potential technique for medical image analysis, and a dynamic threshold was learned through uncertainty estimation for high-confidence pseudolabels [31]. These works achieved better performance on small object segmentation, but did not analyze the necessity and impact of uncertainty estimation on medical image segmentation.

Materials and methods
The proposed method is described in detail in this section. In the analysis of epistemic and aleatoric uncertainties, we deduce the confidence of segmentation through a Bayesian neural network (BNN) [32] with Monte Carlo dropout. In the uncertainty reduction method, the training data are augmented to a larger size for better feature representation, e.g., of tumors in human organs. Specifically, we utilize images at different rotation angles to enrich the annotated data in data augmentation, which helps characterize the overlapped areas and may result in better segmentation performance. The overall flowchart of our method is illustrated in Fig. 1.

Epistemic uncertainty
A neural network model can be recognized as a conditional distribution model, which consists of an input distribution, an output prediction and weight parameters. A prior distribution is placed over the network parameters in BNN training. In our implementation, the output is a distribution over functions rather than a single function, and the confidence level of a result can be determined by consulting this distribution. The traditional method uses variational inference to calculate the posterior distribution. Although accurate, its training cost and computational complexity are significant in BNN training. We use MC dropout to approximate the Bayesian posterior and modify the loss function to determine the uncertainty.
MC dropout samples the training data for a limited number of iterations and generates an estimate of the posterior distribution. This approximate posterior indicates whether the input data lie within the learned distribution, and this piece of information is regarded as epistemic uncertainty.
According to the derivation in [33], dropout is used at both training and test time as a random generator, which means that some of the neurons are ignored during training and testing. To obtain the expected model output and its confidence, we simulate the network output for an input $x$ and treat the stochastic regularization technique (SRT) [34] as if it were active during training. Assume the data are fed to the network $T$ times; the average of the results is

$$\mu(x) = \frac{1}{T}\sum_{t=1}^{T} f_t(x), \quad (1)$$

where $\mu(x)$ is the predictive mean of our approximate predictive distribution. The predictive variance (the expected uncertainty) is obtained by taking the variance over the same samples:

$$\mathrm{Var}(x) \approx \sigma^2 + \frac{1}{T}\sum_{t=1}^{T} f_t(x)^{\mathrm{T}} f_t(x) - \mu(x)^{\mathrm{T}}\mu(x), \quad (2)$$

where $x$ denotes the input features from training images and the predictive mean $\mu(x) \approx \mathbb{E}[y]$ denotes the expected model output given the input $x$. The process is repeated over $T$ independently and identically distributed passes $\{f_1(x), \ldots, f_T(x)\}$, whose outputs are empirical samples from an approximate predictive distribution. $\sigma^2$ denotes the aleatoric uncertainty and will be discussed in the following section.
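As a minimal sketch of this MC-dropout procedure, the following NumPy toy model (the two-layer network and all names are illustrative assumptions, not the paper's actual architecture) draws T stochastic forward passes with dropout kept active and reports the sample mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(forward, x, T=20, p=0.5):
    """Run T stochastic forward passes with dropout kept active
    and return the predictive mean and per-output variance."""
    samples = np.stack([forward(x, p) for _ in range(T)])
    mean = samples.mean(axis=0)  # approximate predictive mean
    var = samples.var(axis=0)    # MC estimate of the model (epistemic) uncertainty
    return mean, var

# Toy two-layer "network": dropout on the hidden units, linear readout.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))

def forward(x, p):
    h = np.maximum(x @ W1, 0.0)      # ReLU hidden layer
    mask = rng.random(h.shape) > p   # Bernoulli dropout mask, resampled per pass
    h = h * mask / (1.0 - p)         # inverted-dropout scaling
    return h @ W2

x = rng.normal(size=(1, 4))
mean, var = mc_dropout_predict(forward, x, T=100)
```

In a real PyTorch model, the same effect is typically obtained by leaving the dropout layers in training mode while running inference.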

Aleatoric uncertainty
Unlike epistemic uncertainty, aleatoric uncertainty captures noise inherent in the data collection.
For segmentation tasks on 3D images, voxel-wise uncertainty estimation is more practical. The cost function for training can be derived following [33]:

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2\sigma(x_i)^2}\,\|y_i - f(x_i)\|^2 + \frac{1}{2}\log \sigma(x_i)^2, \quad (3)$$

where $\theta$ denotes the network parameters, $f$ denotes the model, and the corresponding output is $\{f(x_i), \sigma(x_i)^2\}$. The aleatoric uncertainty on data element $i$ is denoted by $\sigma(x_i)^2$, which plays roughly the same role as $\sigma^2$ in the previous section.
The deviation originally brought by the training data can be learned in an unsupervised manner. The ConvNet is regularized with MC dropout to avoid overfitting, which means the network tries to fit a trend instead of remembering every single element during training. The deviation $\|y_i - f(x_i)\|^2$ can be recognized as the noise originally brought by the training data, and it is divided by $\sigma(x_i)^2$ to offset this noise. However, if the cost function contained only this residual term, the network would tend to predict every $\sigma(x_i)^2$ to be infinitely large in order to minimize the cost. The regularizer $\frac{1}{2}\log \sigma(x_i)^2$ is added to the cost function to avoid this behavior: the network learns to output a high $\sigma(x_i)^2$ only when $\|y_i - f(x_i)\|^2$ is large enough. For every single training element, the model outputs a result $f(x_i)$ and the aleatoric uncertainty $\sigma(x_i)^2$, which is also used in calculating the epistemic uncertainty.
In our implementation, the network does not output the aleatoric uncertainty directly, since the cost function becomes ill-defined when an element carries no noise ($\sigma(x_i)^2 = 0$). Instead, we let the network output $s_i = \log \sigma(x_i)^2$, and the cost function becomes

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\exp(-s_i)\,\|y_i - f(x_i)\|^2 + \frac{1}{2}s_i. \quad (4)$$

The two kinds of uncertainty are usually added together to characterize both the deviation between training and testing sets and the noise contained within the training set.
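A hedged sketch of this log-variance trick (the function name and test values are ours, not from the paper) shows why predicting s = log σ² keeps the loss finite and why the per-element optimum of s tracks the squared residual:

```python
import numpy as np

def heteroscedastic_loss(y, f, s):
    """Aleatoric cost with the network predicting s = log(sigma^2):
    0.5 * exp(-s) * ||y - f||^2 + 0.5 * s, averaged over elements.
    Outputting s instead of sigma^2 avoids a division by zero when
    an element carries no noise (sigma^2 = 0)."""
    return np.mean(0.5 * np.exp(-s) * (y - f) ** 2 + 0.5 * s)

# Setting dL/ds = 0 per element gives s = log(residual^2): large
# residuals are explained away by a large predicted noise level.
y = np.array([1.0, 2.0])
f = np.array([0.0, 0.0])
s_opt = np.log((y - f) ** 2)
loss_opt = heteroscedastic_loss(y, f, s_opt)
```
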
A threshold on the uncertainty is set according to the expected reliability of segmentation, and results with uncertainty higher than the threshold are dropped to reduce the number of false segmentations. Although this may drop some true positive results in very rare cases, it is still helpful for improving segmentation accuracy.

Data augmentation
Motivated by 'mixup' technology [32], data augmentation is used to artificially increase the size of the training dataset for medical image segmentation and thereby deal with the issue of inadequate training samples. Specifically, in our implementation, a rotation variant is utilized to create training images that closely resemble one particular feature, e.g., tumors in human organs.
The uncertainty reduction method is explained in detail for training with computed tomography (CT) images. Typical training data for lung nodule analysis are slice-by-slice images captured at a certain spacing. For example, the LUNA16 dataset contains 888 CT images selected from the LiDC/IDRI dataset. Most of them have very similar details, since the equipment scans the human body one slice at a time in steps of 1.3 degrees rotated around the Z-axis.
We combine images from adjacent slices to create mixup samples. A mixup sample is a linear combination of two training pairs $(x_i, y_i)$ and $(x_j, y_j)$:

$$\tilde{x} = \lambda x_i + (1-\lambda)x_j, \quad \tilde{y} = \lambda y_i + (1-\lambda)y_j, \quad (5)$$

where the parameter $\lambda$ lies in $[0, 1]$ and follows the beta distribution $\mathrm{Beta}(\alpha, \alpha)$ with $\alpha \in (0, \infty)$. Instead of choosing $x_i$ and $x_j$ at random, we select features from images captured from adjacent slices. According to the data provided in [32], $\alpha \in [0.1, 0.4]$ leads to the best performance in data augmentation for medical image segmentation. A smaller value of $\alpha$ leads to overfitting in training, while a larger one results in underfitting.
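The mixing step above can be sketched as follows (a minimal NumPy version; the function name is our assumption, and in practice x_i and x_j would be adjacent CT slices rather than arbitrary pairs):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.4, rng=None):
    """Blend two (image, label) training pairs with
    lambda ~ Beta(alpha, alpha); alpha in [0.1, 0.4] per the text."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x_i + (1.0 - lam) * x_j  # mixed image
    y = lam * y_i + (1.0 - lam) * y_j  # mixed (soft) label
    return x, y, lam
```
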
A typical neural network relies on backpropagation during training to update the parameters of the loss function. Assume the effect of training from the i-th voxel's value is denoted by $p_i$; the problem then becomes investigating whether the equality in Eq. (6) is established. If it is, this suggests that mixing samples increases the effective batch size and, in turn, the generalization ability of the learned network.
We try different combinations until Eq. (6) is established and train the network with the highest generalization ability. As a result, we obtain the best characterization of the training samples for the same quantity of labeled training data, which is expected to yield the best segmentation performance.

Results and discussions
We carry out evaluations to validate the improvement from uncertainty reduction in terms of voxel-level and lesion-level accuracy by means of the true positive rate (TPR) and false detection rate (FDR). This is a widely accepted practice in medical image segmentation, since a patient's lesion count is indicative of disease presence and progression. Specifically, we compare our method with two state-of-the-art segmentation uncertainty estimation methods: probabilistic U-Net [21] and segmentation variability estimation [10]. The proposed method is also evaluated against a single prediction, which does not contain any uncertainty reduction mechanism, as the baseline. The segmentation accuracy of the proposed method is studied on publicly available benchmarks from potential pneumonia patients. The results are compared with the officially provided annotations and categorized into true positive (TP), false positive (FP), true negative (TN) and false negative (FN) results. These counts are used to calculate the true positive rate (TPR) and false detection rate (FDR) indexes as

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP}},$$

respectively. As a quantitative measurement, we also evaluate the results with the Dice accuracy index [35],

$$\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$

which is essentially a per-voxel detection score that evaluates the degree of overlap between the segmentation mask and the ground truth mask. The higher the Dice score, the better the performance. A perfect segmentation result has 100% coverage and yields a Dice score of 1.
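The three indexes can be computed directly from binary masks. The helper below is our sketch (not the paper's code) and follows the standard definitions TPR = TP/(TP+FN), FDR = FP/(TP+FP) and Dice = 2TP/(2TP+FP+FN):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Voxel-wise TPR, FDR and Dice from binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)    # predicted foreground, truly foreground
    fp = np.sum(pred & ~gt)   # predicted foreground, truly background
    fn = np.sum(~pred & gt)   # missed foreground
    tpr = tp / (tp + fn)                  # sensitivity / recall
    fdr = fp / (tp + fp)                  # fraction of detections that are false
    dice = 2 * tp / (2 * tp + fp + fn)    # overlap score in [0, 1]
    return float(tpr), float(fdr), float(dice)
```
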
Since the Dice index cannot directly reflect the effect of uncertainty estimation, we adopt the "uncertainty gap" metric proposed in [31], which measures the variance difference between right and wrong predictions. Generally, the expected behavior is low uncertainty values on correctly segmented voxels/lesions and high uncertainty on wrong ones. Therefore, the uncertainty gap is supposed to be positive, and the higher it is, the better.
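Under the definition above, a minimal version of the gap (our reading of the metric in [31]; the helper name is an assumption) is the mean uncertainty of wrong predictions minus that of right ones:

```python
import numpy as np

def uncertainty_gap(unc, pred, gt):
    """Mean uncertainty on wrong predictions minus mean uncertainty
    on right ones; a positive gap means errors carry higher
    uncertainty, which is the desired behavior."""
    wrong = pred != gt
    return float(unc[wrong].mean() - unc[~wrong].mean())
```
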

Data and implementation
We test our method and its competitors on two benchmarks: the Lung Nodule Analysis 16 (LUNA16) challenge [36] and the Liver Tumor Segmentation challenge (LiTS) [37]. For the segmentation network, we adopt U-Net [27] in probabilistic U-Net and in our method, and FCN [38] in segmentation variability estimation. Adaptive moment estimation (Adam) [39] is used to adjust the learning rate, which is initialized to $10^{-3}$ with a batch size of 5. In one training epoch with PyTorch, we choose $\alpha = 0.4$ in mixup to increase the generalization ability of the trained models.

Voxel-level analysis
Segmentation of 3D medical images yields a group of predicted voxels with a foreground or background attribute. This voxel-level metric provides an understanding of the neural network's voxel-by-voxel performance for each scan. Different from pixels in 2D images, voxels in 3D images contain thickness information. For example, the CT images in LUNA16 have a resolution of 512 × 512 and a voxel depth of 3 mm. The sigmoid output of the network is binarized for segmentation with a threshold, which is varied to explore the range of operating points that the network reaches. Then the voxel-wise counts of TP, FP and FN are accumulated for a given image and used to calculate the TPR/FDR data. To see whether the uncertainty estimates are useful in medical image segmentation, these metrics are plotted as receiver operating characteristic (ROC) curves for the voxels retained at different uncertainty thresholds. Voxels above a certain level of uncertainty are filtered out before calculating the TPR/FDR data, and a new ROC curve is plotted with these metrics. We try different uncertainty levels in filtering, which is a common approach in the related literature [10], and each threshold corresponds to the percentage of voxels involved in the metric calculation.
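The filtering step can be sketched as follows, assuming a per-voxel uncertainty map (all names are illustrative); the quantile-based cut-off mirrors selecting thresholds by the ratio of retained voxels:

```python
import numpy as np

def filter_by_uncertainty(pred, gt, unc, retain_frac):
    """Keep the retain_frac most certain voxels and recompute the
    TP/FP/FN counts on the retained set only."""
    thresh = np.quantile(unc, retain_frac)  # uncertainty cut-off
    keep = unc <= thresh
    p, g = pred[keep], gt[keep]
    tp = int(np.sum(p & g))
    fp = int(np.sum(p & ~g))
    fn = int(np.sum(~p & g))
    return tp, fp, fn
```

Sweeping retain_frac over a grid of values yields one (TPR, FDR) operating point per threshold, from which the ROC curves for the retained voxels can be plotted.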
The ROC curves for different uncertainty thresholds at the voxel level are illustrated in Fig. 2. The curves suggest that the TPR gradually increases as more and more segmented voxels with low certainty are dropped, which means the uncertainties in the predicted voxels are reduced by the augmentation of the training data. The TPR increases gradually while the FDR levels off toward 0.5, and an increase in TP results causes more FP detections. Specifically, the TPR increase is more obvious at the beginning of the test, but the slope of the curve decreases as more FP results accumulate. Compared with the baseline curve, the notably high TPR of the ROC curve is attributed to the significantly low voxel-level retention. This demonstrates that the network makes incorrect predictions on uncertain voxels, and it is a strong indicator that the uncertainty measures are useful. Since the threshold is selected based on the ratio of retained voxels, a higher threshold is expected to result in a lower TPR value. However, this is not always true in our implementation, because the FDR increases more slowly when a stricter rule is used while more and more voxels are being detected. The reason for this phenomenon is that the decrease in FP detections may be quicker than that in TPs as more detections are dropped, which is another indicator that uncertainty estimation is useful in image segmentation. However, we do not intend to set the threshold at a very low level. It needs to be set according to the characteristics of the training and testing samples as well as the expected segmentation reliability, striking a balance between accuracy and segmentation coverage.

Lesion-level analysis
As one of the most important applications of medical image segmentation, clinical diagnosis focuses on lesion detection, so we are particularly interested in the segmented voxels associated with lesions. The network's detection performance on lesions can be assessed with a lesion-level TPR/FDR. Following the de facto standard in clinical practice, lesions smaller than 3 voxels are ignored in the annotated ground truth when computing the lesion-level results. Using the widely accepted standard of [10], we recognize a lesion as a TP detection when the segmented voxel and its 18-connected neighborhood overlap at least 50% with the ground truth lesion voxels. A lesion is recognized as an FP when 3 or more voxels do not overlap with the ground truth, and insufficient overlap results in an FN. The calculation of TPR/FDR remains the same as in the voxel-level analysis. Currently, TPs, FPs and FNs from large and small lesion detections are weighted equally. A range of operating points is filtered to obtain the lesion-level ROC curves. As in the voxel-level analysis, the uncertainty level is calculated for each candidate lesion. The ROC curve reflects a higher TPR and lower FDR when the most uncertain lesions consist mainly of FP results.
The ROC curves for different uncertainty thresholds in lesion-level segmentation are illustrated in Fig. 3. The curves suggest that filtering out voxels with low certainty contributes to a reduction in FP and FN detections, which in turn improves lesion segmentation at all operating points, even when only a minimal proportion of uncertain predictions is removed. The neural network does not perform well on small lesions, while the uncertainty reduction filters out the most uncertain voxels in the segmentation, which lie on the lesion boundaries. This leads to a small overlap ratio when compared with the annotated ground truth in the benchmark. The curve for LUNA16 has a slightly higher TPR than that for LiTS because LiTS contains more small lesions, so removing falsely segmented lesions improves the overall performance. We can also see that the percentage of lesions retained in LiTS is lower than in LUNA16. Both datasets have fewer retained detections overall than in the voxel-level analysis because a stricter rule is used. The curves after uncertainty reduction have trends similar to the ROC curve without uncertainty reduction. This phenomenon shows that the lesion-level ROC curve represents the segmentation performance better than the voxel-level curve. Filtering isolated segmented voxels also removes some TP results, which partially offsets the improvement in the TPR value. According to a manual check, the lesion-level TPR improves by 0.5% when only 2% of the most uncertain results are filtered. The ROC curves have higher TPR and lower FDR across the uncertainty reduction range, which indicates more precise prediction and segmentation.

Quantitative result
As a quantitative measurement for validating medical volume segmentation, the Dice score is accumulated for all three methods with and without data augmentation. To study the effect of the number of MC simulation iterations on segmentation accuracy, we try different numbers of MC dropout iterations N. We found that the accuracy improves gradually as N increases during training and reaches its highest level at N = 20.
The quantitative evaluations of segmentation uncertainty from the different estimation methods are summarized in Table 2. To keep the annotation clear, the competitors are denoted by their corresponding network structures. It can be observed that the data augmentation reduces the epistemic uncertainty of the models trained with all three methods, raising the Dice score by 1-2% on average. Since the LUNA16 benchmark ignores lung nodules smaller than 3 mm, the overall performance on LUNA16 is better than that on LiTS. One possible explanation is that the trained models are more uncertain on lesion contours, and small lesions contain proportionally more boundary information, which causes more voxel removal. Consequently, the segmented results have less overlap with the part manually labeled by a radiologist and thus a lower degree of overlap between the segmentation mask and the ground truth mask. The uncertainty gap between correct and incorrect predictions is summarized in Table 3. Our method attains a higher uncertainty gap in both the LUNA16 and LiTS tests compared with segmentation variability estimation and probabilistic U-Net. This suggests that the prediction on uncertain areas is closer to the ground truth with our method. The prediction uncertainty values on the LUNA16 challenge are lower than those on LiTS for P-UNet and our method, which means that ConvNet segmentation carries fewer uncertainties on LUNA16 than on LiTS. One possible reason is that the removal of extremely small lesions makes the segmentation easier. The only exception is the uncertainty gap with FCN, where the uncertainties of correct and incorrect predictions are slightly lower on LiTS than on LUNA16. However, the trend in the uncertainty gap remains the same as for the other two methods. This suggests that the influence of dataset diversity is greater than that of the network models.

Qualitative result
Comparisons of segmentation results on the LiTS challenge are collected in this section to illustrate the results in a more intuitive way. Figure 4 compares several examples of uncertainty-reduced segmentation results from representative volumes with the corresponding ground truth.
The figure suggests that the segmented areas are noticeably smaller than the ground truth, which means the uncertain voxels are those on which the network makes incorrect predictions. In general, the lesion contours contain more uncertain voxels, where false positive detections are frequently made. This phenomenon is not beyond our expectations, since delineating lesion boundaries is a challenging task, and even expert radiologists may disagree on the precise boundaries of lesions and organs. Images in one row come from the same slice of the same volume; we can see that the segmented areas are smaller than the ground truth and that the green areas in the ground truth cover more space than the red mask in the middle column. The narrow boundary of the human liver covers a smaller region in the light red mask than in the deep red mask. The reason for this phenomenon is that segmented voxels with lower confidence are removed during uncertainty reduction. This removal is crucial for obtaining accurate lesion segmentation. Benefiting from our data augmentation method, the segmented organs match the ground truth better, especially in small axial image cross sections.

Conclusion
Focusing on the epistemic and aleatoric uncertainties in training and testing ConvNet models for medical image segmentation, we proposed a data augmentation method with rotation variation from images captured on different slices of organs. The two types of uncertainty are quantified, and a proper mechanism is adopted to enrich the training samples, which facilitates learning more texture information from the same quantity of labeled training data. Experiments demonstrate that the proposed method helps reduce the uncertainties produced during data collection and training. It provides a fundamental basis for diagnostic applications with medical image segmentation.
In the future, we will integrate the uncertainty measures obtained from MC dropout samples into a loss function and explore other epistemic uncertainty measurements in medical image segmentation. We will also investigate the task-specific advantages of different measures, which can provide significant advances in image segmentation for a given image modality.

Fig. 1
Fig. 1 Flowchart of epistemic and aleatoric uncertainties reduction method with rotation variation for medical image segmentation with ConvNets

Fig. 2
Fig. 2 ROC curves of voxel-level performance on the retained predictions for different uncertainty thresholds. The percentage in brackets shows the ratio of voxels retained at that particular threshold

Fig. 3
Fig. 3 ROC curves of lesion-level performance on the retained predictions for different uncertainty thresholds. The percentage in brackets shows the ratio of lesions retained at that particular threshold

Table 1
A summary of uncertainty estimation methods in deep learning pipelines for image classification, detection and segmentation

Originating from the LiDC-IDRI dataset, LUNA16 contains CT images with lung slices smaller than 3 mm. They can be recognized as pulmonary nodules at different rotations, which is ideal for data augmentation in the training of neural networks. Since LUNA16 is divided into 10 equally distributed subsets, we select one of them as the test set and the remaining 9 as training samples for cross validation. The final result is a combination of the 10 cross-validation results. The LiTS challenge contains 130 sets of training samples and 70 sets for testing. It is extracted from the 3D Image Reconstruction for Comparison of Algorithm Database (3Dircadb), which includes patient 3D medical images and the manual segmentation of the various structures of interest performed by clinical experts. It is known for its small number of training samples and large variations in data quality, appearance and spacing.

Table 2
Dice scores (%) accumulated with three methods in different network frameworks with and without data augmentation