1 Introduction

Optical Coherence Tomography (OCT) is a popular non-invasive imaging modality for retinal imaging. OCT provides volumetric scans of the retinal layers for the diagnosis and evaluation of diseases such as glaucoma and age-related macular degeneration (AMD). For example, [1] showed a correlation between outer retinal layer thickness and visual acuity in early AMD patients. It has also been shown that retinal layer features can be used to predict vision loss and progression [6].

The segmentation of retinal layers in OCT has been tackled in a number of ways, such as dynamic programming [13], graph-based shortest-path algorithms [4], graph-based minimum s-t cut formulations [8] and level sets [3, 14]. Machine-learning-based approaches have also been proposed, in which retinal layer and boundary probability maps are first estimated by a trained classifier; the final segmentation is then obtained by imposing a model such as active contours [20] or a minimum s-t cut framework [12] on the soft labels.

In the past few years, methods based on Convolutional Neural Networks (CNNs), such as U-net [15, 17] and the fully convolutional DenseNet (FC-DN) [10], have achieved remarkable performance gains in medical and natural image segmentation. These networks are trained end-to-end, pixels-to-pixels, and exceed previous state-of-the-art semantic segmentation methods without further machinery. [2, 16] used U-net-like networks to perform pixel-wise semantic segmentation of retinal layers. In another approach, [5] combined a CNN with a graph-search method for layer-boundary classification. Once trained, these methods act as black boxes, where one has to assume that the segmentation output is accurate, which is not always the case. For example, a model will produce incorrect segmentations when the test image differs from the distribution of images used to train it, which may happen when the model is trained on a limited number of images. Similarly, a model trained on normal images will produce inaccurate segmentations when pathologies are present in the test image or when the test image is noisy. Quantifying the uncertainty associated with the segmentation output is therefore important for identifying regions of incorrect segmentation: regions with higher uncertainty can either be excluded from subsequent analysis or highlighted for manual attention. Moreover, when the retinal layer segmentation map is used to diagnose diseases such as AMD and glaucoma, the uncertainty map can be used to determine the confidence of the final automatic or clinical diagnosis.

Previous works have explored uncertainty quantification in biomedical segmentation [9]; however, these approaches do not utilize the representative power of deep learning. Recent research has shown that Bayesian probability theory offers a mathematically grounded technique to reason about uncertainty in deep learning models [7, 11]. In this paper, we explore a Bayesian fully convolutional neural network for the segmentation and uncertainty quantification of retinal layers in OCT images. We experimentally demonstrate that, in addition to an uncertainty-based confidence measure, our method provides improved layer segmentation accuracy and robustness towards noise in the test images.

2 Methodology

We model two types of uncertainty for retinal layer segmentation: epistemic uncertainty and aleatoric uncertainty. Epistemic uncertainty captures uncertainty in the model parameters, e.g., when the model does not account for certain aspects of the training data; it can therefore be reduced by training the model on more images. Aleatoric uncertainty, on the other hand, captures the noise inherent in the images and therefore cannot be reduced with additional training images. We model the aleatoric uncertainty as an additional output variance of the network.

We enhance the fully convolutional DenseNet (FC-DN) [10] for segmentation and uncertainty quantification of retinal layers. FC-DN is a fully convolutional neural network with several dense blocks connected in an encoder-decoder architecture with skip connections, which effectively combines coarse semantic features with fine image details for pixel-wise semantic segmentation. Each layer in a dense block is connected to all preceding layers by iterative concatenation of their feature maps, so every layer has access to the feature maps of all its predecessors, which encourages heavy feature reuse. As a result, FC-DN uses fewer parameters and is less prone to over-fitting. The network is trained using the proposed class-weighted Bayesian loss function, which takes the output variance into account, as described in Sect. 2.1. Once the network is trained, in the test phase we use the dropout variational inference technique [7] to compute the epistemic uncertainty, as described in Sect. 2.2.
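To make the dense connectivity concrete, the following is a minimal PyTorch sketch of a dense block with dropout placed before each convolution, as described above. The growth rate, the BN-ReLU-dropout-conv ordering, and all names are illustrative assumptions, not the exact FC-DN configuration of [10].

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # One layer of a dense block: BN -> ReLU -> dropout -> 3x3 convolution,
    # with the dropout placed before the convolution as described above.
    def __init__(self, in_channels, growth_rate, drop_rate=0.4):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.drop = nn.Dropout2d(drop_rate)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.drop(torch.relu(self.bn(x))))

class DenseBlock(nn.Module):
    # Each layer receives the concatenation of all preceding feature maps
    # (iterative concatenation), encouraging heavy feature reuse.
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer(in_channels + i * growth_rate, growth_rate)
             for i in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # append new feature maps
        return x
```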

Let \(F_{\mathbf {W}}(X)\) be a FC-DN model parameterized by \(\mathbf {W}\), which takes an input image X and produces a logit vector \(\mathbf {z}\) for each pixel as \(\mathbf {z}=F_{\mathbf {W}}(X)\). The logit vector consists of one logit per class, \(\mathbf {z}=\left( z_{1},\cdots ,z_{C}\right) \), where C is the number of segmentation classes, i.e., the retinal layers plus background. The final probability vector for a pixel, \(\mathbf {y}=\left( y_{1},\cdots ,y_{C}\right) \), is computed by applying the softmax function over the logits, \(\mathbf {y}=\text {Softmax} (\mathbf {z})\). The softmax function gives the relative probabilities between classes but fails to measure the model's uncertainty.

2.1 Bayesian Fully Convolution Network

Here we present a method to convert FC-DN to output a pixel-wise uncertainty map in addition to the pixel-wise segmentation map. We name the proposed method Bayesian FC-DN (BFC-DN). In BFC-DN, we apply a \(1\times 1\) convolution to the feature maps of the last layer, followed by a softplus activation, to output a variance \(\mathbf {v}\) for each pixel in addition to the logit vector \(\mathbf {z}\), i.e., \((\mathbf {z},\mathbf {v})=F_{\mathbf {W}}(X)\). This variance gives the aleatoric uncertainty, which the network learns to predict during training. In addition, we include a dropout layer before every convolution layer, which allows us to compute the epistemic uncertainty described in Sect. 2.2.
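As a sketch, this output head can be implemented as two parallel \(1\times 1\) convolutions on the final feature maps, one for the logits and one for the softplus-constrained variance. Treating \(\mathbf {v}\) as a single per-pixel channel shared across classes is an assumption, as are the class and attribute names:

```python
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    # Illustrative head on the final FC-DN feature maps: one 1x1 convolution
    # produces the C class logits z, another produces the aleatoric
    # variance v through a softplus so that v >= 0.
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.logit_conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.var_conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.softplus = nn.Softplus()

    def forward(self, features):
        z = self.logit_conv(features)                # (B, C, H, W) logits
        v = self.softplus(self.var_conv(features))   # (B, 1, H, W) variance
        return z, v
```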

The output of the model is the Gaussian distribution \(\mathcal {{N}}\left( \mathbf {z},\mathbf {v}\right) \). Computing the categorical cross-entropy loss over this distribution is not analytically feasible, so we approximate it using Monte Carlo integration. Given a set of training images and corresponding ground-truth segmentation masks, \(D=\left\{ X_{n},Y_{n}\right\} _{n=1}^{N}\), the output logits for each sample in the mini-batch are perturbed T times with Gaussian noise \(\epsilon _{t}\sim \mathcal {{N}}\left( 0,\mathbf {v}\right) \) as \(\hat{\mathbf {z}}_{t}=\mathbf {z}+\epsilon _{t}\), and the final pixel-wise Bayesian loss is computed as:

$$\begin{aligned} L(W)=-\frac{1}{T}\sum _{t=1}^{T}\sum _{c=1}^{C}\beta _{c}\sum _{\forall Y_{c}}\log y_{c}^{t} \end{aligned}$$
(1)

where \(y_{c}^{t}\) is obtained by applying the softmax to the perturbed logit vector \(\hat{\mathbf {z}}_{t}\), \(Y_{c}\) denotes the pixel region of the \(c^{th}\) class in the ground truth Y, and the scale factor \(\beta _{c}=1/|Y_{c}|\) weights the contribution of each class to mitigate the class imbalance between the different OCT layers and the background, increasing the weight of under-represented classes while decreasing that of over-represented ones. The proposed Bayesian loss encourages the network to attenuate large losses by increasing the predicted variance and is therefore more robust towards noise.
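A minimal PyTorch sketch of Eq. 1 follows. The tensor shapes, the single shared variance channel, and the default value of T are assumptions; with \(\beta _{c}\) passed as per-class weights, the class-weighted cross entropy over the perturbed logits reproduces the inner sums of Eq. 1:

```python
import torch
import torch.nn.functional as F

def bayesian_loss(logits, var, target, class_weights, T=10):
    # logits:        (B, C, H, W) raw class scores z
    # var:           (B, 1, H, W) aleatoric variance v (softplus output, >= 0)
    # target:        (B, H, W) integer ground-truth labels
    # class_weights: (C,) tensor holding beta_c = 1 / |Y_c|
    std = var.sqrt()
    loss = 0.0
    for _ in range(T):
        # Perturb the logits with Gaussian noise eps_t ~ N(0, v)
        z_hat = logits + std * torch.randn_like(logits)
        # Class-weighted negative log-likelihood of the perturbed logits,
        # i.e. the inner sums over classes and pixel regions in Eq. 1
        loss = loss + F.cross_entropy(z_hat, target, weight=class_weights,
                                      reduction='sum')
    return loss / T  # average over the T Monte Carlo samples
```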

We train the proposed BFC-DN using the Bayesian loss of Eq. 1 for 40,000 iterations of mini-batch gradient descent with the Adam optimizer and a batch size of 2. The learning rate is set to \(10^{-5}\) and decreased by one tenth after 10,000 iterations. Data augmentation is an important step in training deep networks; we augment the training images and their corresponding label maps with mirror-image reflection and random rotation within \([-15, 15]\) degrees.
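These hyper-parameters might be wired up as in the sketch below; the function names are hypothetical, and applying the same geometric transform to image and label (with nearest-neighbour interpolation so class labels stay intact) is an implementation assumption:

```python
import random
import torch
import torchvision.transforms.functional as TF

def make_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    # Decrease the learning rate by one tenth after 10,000 iterations;
    # scheduler.step() is called once per training iteration
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[10_000], gamma=0.1)
    return optimizer, scheduler

def augment(image, label):
    # image and label are PIL images of a B-scan and its label map
    if random.random() < 0.5:  # mirror-image reflection
        image, label = TF.hflip(image), TF.hflip(label)
    # Random rotation within [-15, 15] degrees; the default nearest-neighbour
    # interpolation preserves the integer class labels
    angle = random.uniform(-15.0, 15.0)
    return TF.rotate(image, angle), TF.rotate(label, angle)
```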

2.2 Segmentation and Uncertainty Quantification

Epistemic uncertainty is generally computed by placing a distribution over the network weights, which allows the computation of a distribution of class probabilities rather than a point estimate [18]. Such methods require optimization over the weight distribution and are therefore computationally expensive [7]. We adopt the more practical approach introduced by [7], which is based on dropout variational inference. We train BFC-DN with a dropout layer before every convolution layer and use dropout in the test phase as well. Specifically, segmentation samples from the output predictive distribution are obtained by performing T stochastic forward passes through the network, i.e., \((\mathbf {z}^{t},\mathbf {v}^{t})=F_{\mathbf {\hat{W}}_{t}}(X),\ t=1,\cdots , T\), where \(\mathbf {\hat{W}}_{t}\) are the effective network weights after dropout. In each forward pass, a fraction of the network weights (given by the dropout rate) is disabled and the segmentation score is computed using only the remaining weights. The segmentation score vector \(\mathbf {\bar{y}}\) and the aleatoric variance \(\mathbf {\bar{v}}\) are obtained by averaging the T samples via Monte Carlo integration:

$$\begin{aligned} \mathbf {\bar{y}} = \frac{1}{T}\sum _{t=1}^{T}\text {Softmax}(\mathbf {z}^{t}) \end{aligned}$$
(2)
$$\begin{aligned} \mathbf {\bar{v}} = \frac{1}{T}\sum _{t=1}^{T}\mathbf {v}^{t} \end{aligned}$$
(3)

The average score vector contains the probability score for each retinal layer class, i.e., \(\mathbf {\bar{y}} = [\bar{y}_1,\cdots ,\bar{y}_C ] \). The overall segmentation uncertainty for each pixel can then be obtained as:

$$\begin{aligned} U(\mathbf {\bar{y}})=-\sum _{c=1}^{C}\bar{y}_{c}\log \bar{y}_{c} +\mathbf {\bar{v}} \end{aligned}$$
(4)

where the first term is the epistemic uncertainty, computed as the entropy of the average score vector over the T stochastic predictions (Eq. 2), and the second term is the aleatoric uncertainty predicted by the network itself (Eq. 3). We set the dropout rate to 0.4 and \(T=50\) to allow sufficient sampling of the network weights for the final prediction.
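The whole test-time procedure can be sketched in PyTorch as follows. The model interface (returning a logit map and a variance map) matches Sect. 2.1; keeping only the dropout modules stochastic while batch normalization stays in evaluation mode is an implementation assumption:

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, image, T=50):
    # image: (1, 1, H, W) input B-scan; model returns (logits, var) per pixel
    model.eval()
    # Re-enable only the dropout layers so each forward pass samples
    # a different effective set of weights W_t
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()

    probs, variances = [], []
    for _ in range(T):  # T stochastic forward passes
        logits, var = model(image)
        probs.append(torch.softmax(logits, dim=1))
        variances.append(var)

    y_bar = torch.stack(probs).mean(dim=0)       # (1, C, H, W), Eq. 2
    v_bar = torch.stack(variances).mean(dim=0)   # (1, 1, H, W), Eq. 3

    # Pixel-wise uncertainty: predictive entropy plus aleatoric variance (Eq. 4)
    entropy = -(y_bar * torch.log(y_bar + 1e-12)).sum(dim=1, keepdim=True)
    uncertainty = entropy + v_bar
    segmentation = y_bar.argmax(dim=1)           # (1, H, W) label map
    return segmentation, uncertainty
```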

For uncertain predictions, the network assigns high probability to different classes in different forward passes, resulting in high epistemic uncertainty in Eq. 4. For certain predictions, the network assigns high probability to the true class in every forward pass, resulting in low epistemic uncertainty. Since epistemic uncertainty relates to the model weights, it can be reduced by observing more data: as the network observes more data, it becomes robust to weight dropout in the test phase.

3 Experiments

The dataset [19] consists of 1487 images from 15 spectral-domain optical coherence tomography (OCT) volumes of unique normal subjects acquired on a Spectralis scanner. The size of each volume is \(512 \times 496 \times N_{slices} \), where \(N_{slices}\) differs per volume and ranges from 49 to 100. All scans have an axial resolution of \(3.87\,\upmu \)m. The ground truth was obtained by manual annotation of the nine boundaries delimiting eight retinal layers [12]. To facilitate pixel-wise semantic segmentation, we convert the layer boundaries into per-class region maps for the eight retinal layers and the background region; therefore, the number of classes is \(C=9\).
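The boundary-to-region conversion might look like the following NumPy sketch; the `boundaries` array layout (row position of each of the nine boundaries per image column) and the function name are hypothetical:

```python
import numpy as np

def boundaries_to_labels(boundaries, height):
    # boundaries: (9, W) array of row positions of the nine annotated
    # boundaries, ordered top to bottom, for one B-scan of given height.
    # Returns an (H, W) label map: 0 = background, 1..8 = retinal layers.
    num_boundaries, width = boundaries.shape
    labels = np.zeros((height, width), dtype=np.uint8)
    rows = np.arange(height)[:, None]            # (H, 1) row indices
    for k in range(num_boundaries - 1):
        # Pixels between boundary k and boundary k+1 belong to layer k+1
        inside = (rows >= boundaries[k]) & (rows < boundaries[k + 1])
        labels[inside] = k + 1
    return labels
```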

Out of the 1487 images, we select 1116 images from 12 volumes as the training set and the remaining 291 images from 3 volumes for validation. We compare our method with the baseline FC-DN [10], which does not take uncertainty into account, i.e., the network does not output the aleatoric variance and segmentation is performed in a single forward pass with dropout disabled in the test phase. To train this network, we use the non-Bayesian class-weighted cross-entropy loss, which can be derived by setting \(T=1\) and \(\mathbf {v}=0\) in Eq. 1.

Table 1. Performance of our proposed retinal layer segmentation method compared with the state-of-the-art Jégou et al. [10] and Lang et al. [12] segmentation methods.

Table 1 compares the average Dice coefficient (DC) between the ground truth and the predicted segmentation of the 8 layers for the proposed Bayesian method (BFC-DN) and the non-Bayesian method (FC-DN [10]). BFC-DN achieved its highest DC of 0.97 for the GCL+IPL layer and its lowest DC of 0.91 for the OPL and IS layers. Moreover, BFC-DN improved the segmentation of most layers in comparison to FC-DN. Table 1 also compares the average absolute error of our method for the 9 boundaries with that of [12]. BFC-DN yields lower errors than [12], which indicates that the proposed uncertainty-based method is effective in segmenting retinal layers.

Figure 1 shows examples of segmentation and uncertainty maps produced by our proposed method on images from the validation set. Our method produces a pixel-wise uncertainty map for the segmentation output in which high uncertainty correlates with inaccurate segmentation in the corresponding region. To validate the robustness of our method against noise, we evaluate its performance after adding random block noise to the test images, as shown in the last row of Fig. 1. BFC-DN performs much better than FC-DN in the presence of large noise levels, as shown in Fig. 2, demonstrating that BFC-DN is more robust towards noisy images than FC-DN.
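For reference, the block-noise perturbation can be reproduced with a sketch such as the one below; the block size and the uniform noise model are assumptions, since the paper only specifies that the number of noise blocks doubles at each level (Fig. 2):

```python
import numpy as np

def add_block_noise(image, num_blocks, block_size=32, rng=None):
    # Paint num_blocks randomly placed square blocks of uniform noise
    # onto a copy of the (H, W) B-scan; block size and noise model are
    # illustrative assumptions.
    rng = rng or np.random.default_rng()
    noisy = image.copy()
    h, w = image.shape
    for _ in range(num_blocks):
        r = rng.integers(0, h - block_size)
        c = rng.integers(0, w - block_size)
        noisy[r:r + block_size, c:c + block_size] = rng.uniform(
            image.min(), image.max(), size=(block_size, block_size))
    return noisy
```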

Fig. 1. Examples of retinal layer segmentation and uncertainty quantification using the proposed BFC-DN: (a) test images, (b) ground truth, (c) predicted segmentation maps from BFC-DN, (d) uncertainty maps (warmer colors denote regions with higher uncertainty). The last row shows an example of layer segmentation on a test image with added random block noise.

Fig. 2. Comparison of the average segmentation performance of the proposed BFC-DN with FC-DN [10] at different noise levels. The number of random block noise components at a given noise level is double that at the previous level.

The average execution time of retinal layer segmentation with BFC-DN is 2.5 s per image on a Tesla K40 GPU, considerably slower than the 300 ms of FC-DN. This is because our model requires T forward passes in the test phase, in contrast to the single forward pass of FC-DN.

4 Conclusion

In this paper, we proposed a Bayesian deep learning based method for retinal layer segmentation in OCT images. Our method produces a layer segmentation together with a corresponding uncertainty map depicting the pixel-wise confidence of the segmentation output. Experimental results demonstrate that our method compares favorably with non-Bayesian deep learning methods, particularly in the presence of noise, and outperforms a state-of-the-art boundary-based segmentation method. We have shown qualitatively that the resulting uncertainty maps correlate with inaccuracies in the segmentation output. The proposed method is applicable to determining the confidence of image analysis modules that use the segmentation output for downstream analysis. Such uncertainty visualization can also be useful in computer-assisted diagnostic systems, giving clinicians additional insight into the measurements generated by the system so that they can make the necessary adjustments and more informed decisions. The resulting uncertainty map can also be integrated within active learning systems to correct the segmentation output.