1 Introduction

Retinal vessels have significant diagnostic value, as they are routinely examined to evaluate and monitor various ophthalmological diseases. However, manual segmentation of retinal vessels is both tedious and time-consuming. To assist with this task, many approaches have been introduced over the last two decades to segment retinal vessels automatically. For example, Marin et al. employed gray-level vector and moment invariant features to classify each pixel using a neural network [8]. Nguyen et al. utilized a multi-scale line detection scheme to compute vessel segmentation [11]. Orlando et al. performed vessel segmentation using a fully-connected Conditional Random Field (CRF) whose configuration is learned with a structured-output support vector machine [12]. Existing methods such as these, however, lack sufficiently discriminative representations and are easily affected by pathological regions, as shown in Fig. 1.

Fig. 1. Retinal vessel segmentation results. Existing vessel segmentation methods (e.g., Nguyen et al. [11] and Orlando et al. [12]) are affected by the optic disc and pathological regions (highlighted by red arrows), while our DeepVessel deals well with these regions.

Deep learning (DL) has recently been demonstrated to yield highly discriminative representations that have aided many computer vision tasks. For example, Convolutional Neural Networks (CNNs) have brought substantial performance gains in image classification and semantic image segmentation. Xie et al. employed a holistically-nested edge detection (HED) system with deep supervision to resolve the challenging ambiguity in object boundary detection [16]. Zheng et al. reformulated the Conditional Random Field (CRF) as a Recurrent Neural Network (RNN) to improve semantic image segmentation [18]. These works inspire us to learn rich hierarchical representations with a DL architecture.

A DL-based vessel segmentation method was proposed in [9], which addresses the problem as pixel classification using a deep neural network. In [7], Li et al. employed a cross-modality data transformation from retinal image to vessel map, outputting the label map of all pixels for a given image patch. These methods have two drawbacks: first, they do not account for non-local correlations when classifying individual pixels/patches, which leads to failures caused by noise and local pathological regions; second, the classification strategy is computationally intensive in both the training and testing phases. In this paper, we address retinal vessel segmentation as a boundary detection task solved by a novel DL system called DeepVessel, which utilizes a CNN with side-output layers to learn discriminative representations, and a CRF layer that accounts for non-local pixel correlations. With this approach, our DeepVessel system achieves state-of-the-art performance on publicly available datasets (DRIVE, STARE, and CHASE_DB1) with relatively efficient processing.

2 Proposed Method

Our DeepVessel architecture consists of three main layers. The first is a convolutional layer used to learn a multi-scale discriminative representation. The second is a side-output layer that operates with the early layers to generate a companion local output. The last one is a CRF layer, which is employed to further take into account the non-local pixel correlations. The overall architecture of our DeepVessel system is illustrated in Fig. 2.

Fig. 2. Architecture of our DeepVessel system, which consists of convolutional, side-output, and CRF layers. The front network is a four-stage HED-like architecture [16], where a side-output layer is inserted after the last convolutional layer of each stage (marked in bold). Convolutional layer parameters are denoted as “Conv<receptive field size>-<number of channels>”. The CRF layer is represented as an RNN, as in [18]. ReLU activation functions are not shown for brevity. The red blocks exist only in the training phase.

Convolutional Layer is used to learn local feature representations based on patches randomly sampled from the image. Suppose \(\mathbf{L}_j^{(n)}\) is the j-th output map of the n-th layer, and \(\mathbf{L}_i^{(n-1)}\) is the i-th input map of the n-th layer. The output of the convolutional layer is then defined as:

$$\begin{aligned} \mathbf{L}_j^{(n)} = f\Big(\sum_i \mathbf{L}_i^{(n-1)} * \mathbf{W}_{ij}^{(n)} + b_j^{(n)}\mathbf{1}\Big), \end{aligned}$$
(1)

where \(\mathbf{W}_{ij}^{(n)}\) is the kernel linking the i-th input map to the j-th output map, \(*\) denotes the convolution operator, \(b_j^{(n)}\) is the bias element, and \(f(\cdot)\) is the activation function (ReLU in our network).
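
To make Eq. (1) concrete, the following is a minimal NumPy/SciPy sketch of a single convolutional layer; the map sizes, kernel contents, and function names are illustrative and not taken from our Caffe configuration.

```python
# Minimal sketch of Eq. (1): each output map is the sum of the input maps
# convolved with their linking kernels, plus a bias, passed through f (ReLU).
import numpy as np
from scipy.signal import convolve2d

def conv_layer(inputs, kernels, biases, f=lambda x: np.maximum(x, 0)):
    """inputs  : list of 2-D input maps  L_i^(n-1)
       kernels : kernels[i][j] is the kernel W_ij^(n) linking map i to map j
       biases  : biases[j] is the scalar bias b_j^(n)"""
    outputs = []
    for j in range(len(biases)):
        acc = sum(convolve2d(inputs[i], kernels[i][j], mode='same')
                  for i in range(len(inputs)))
        outputs.append(f(acc + biases[j]))   # L_j^(n) = f(sum_i ... + b_j 1)
    return outputs
```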

Side-output Layer acts as a classifier that produces a companion local output for early layers [6]. Suppose \(\mathbf {W}\) denotes the parameters of all the convolutional layers, and there are M side-output layers in the network, where the corresponding weights are denoted as \(\mathbf {w}=(\mathbf {w}^{(1)},...,\mathbf {w}^{(M)})\). The objective function of the side-output layer is given as:

$$\begin{aligned} \mathcal {L}_{s}(\mathbf {W}, \mathbf {w}) = \sum ^M_{m=1} \alpha _m L^{(m)}_s(\mathbf {W}, \mathbf {w}^{(m)}), \end{aligned}$$
(2)

where \(\alpha_m\) is the fusion weight of the loss for each side-output layer, and \(L^{(m)}_s\) denotes the image-level loss function, computed over all pixels of the training retinal image X and its vessel ground truth Y. In a retinal image, vessel and background pixels are highly imbalanced, so we follow HED [16] and use a class-balanced cross-entropy loss function:

$$\begin{aligned} L^{(m)}_s(\mathbf {W}, \mathbf {w}^{(m)}) = -\frac{|Y^-|}{|Y|} \sum _{j\in Y^+} \log \sigma ( a_j^{(m)}) -\frac{|Y^+|}{|Y|} \sum _{j\in Y^-} \log ( 1 - \sigma (a_j^{(m)})), \end{aligned}$$
(3)

where \(|Y^+|\) and \(|Y^-|\) denote the numbers of vessel and background pixels in the ground truth Y, and \(\sigma(a_j^{(m)})\) is the sigmoid function applied to pixel j of the activation map \(A_s^{(m)} \equiv \{a_j^{(m)}, j=1,...,|Y|\}\) of side-output layer m. The vessel prediction map of each side-output layer m is then obtained as \(\hat{Y}_s^{(m)} = \sigma(A_s^{(m)})\).
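
The class-balanced loss of Eq. (3) can be sketched directly from the definitions above; this NumPy version is for illustration only (the actual loss is computed inside Caffe), and the argument names are ours.

```python
# Sketch of Eq. (3): class-balanced cross-entropy between the side-output
# activation map a^(m) and the binary vessel ground truth Y.
import numpy as np

def class_balanced_bce(activation, gt, eps=1e-12):
    prob = 1.0 / (1.0 + np.exp(-activation))   # sigma(a_j^(m))
    n_pos = gt.sum()                           # |Y+|: vessel pixels
    n_neg = gt.size - n_pos                    # |Y-|: background pixels
    w_pos = n_neg / gt.size                    # weight of the vessel term
    w_neg = n_pos / gt.size                    # weight of the background term
    return -(w_pos * np.log(prob[gt == 1] + eps).sum()
             + w_neg * np.log(1.0 - prob[gt == 0] + eps).sum())
```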

Conditional Random Field (CRF) Layer is used to model non-local pixel correlations. Although the CNN can produce a satisfactory vessel probability map, it still has some problems. First, a traditional CNN has convolutional filters with large receptive fields and hence produces maps that are too coarse for pixel-level vessel segmentation (e.g., non-sharp boundaries and blob-like shapes). Second, a CNN lacks smoothness constraints, which may result in small spurious regions in the segmentation output. Thus, we utilize a CRF layer to obtain the final vessel segmentation result. Following the fully-connected CRF model of [5], every pixel is treated as a neighbor of every other pixel, which takes long-range interactions across the whole image into account. We denote \(\mathbf{v} = \{v_i\}\) as a labeling over all pixels of the image, with \(v_i = 1\) for vessel and \(v_i = 0\) for background. The energy of a label assignment \(\mathbf{v}\) is given by:

$$\begin{aligned} E(\mathbf v ) = \sum _i \psi _u (v_i) + \sum _{i<j} \psi _p (v_i, v_j), \end{aligned}$$
(4)

with:

$$\begin{aligned} \psi_u (v_i) = \frac{1}{M}\sum_{m=1}^M a_i^{(m)}, \quad \text{and} \quad \psi_p (v_i, v_j) = \mu (v_i, v_j) \sum_{d=1}^D h^{(d)} k^{(d)} (\mathbf{f}_i, \mathbf{f}_j), \end{aligned}$$
(5)

where \(\psi_u(v_i)\) and \(\psi_p(v_i, v_j)\) are the unary and pairwise terms, respectively, \(a_i^{(m)}\) is the value at pixel i in the activation map \(A_s^{(m)}\) of side-output layer m, \(\mu(v_i, v_j)\) is the label compatibility function, \(h^{(d)}\) is the weight of the d-th kernel, and \(k^{(d)}\) for \(d=1,...,D\) are Gaussian kernels applied to feature vectors. The feature vector of pixel i, denoted by \(\mathbf{f}_i\), is derived from image features such as spatial location and RGB values. An effective way to minimize the CRF energy \(E(\mathbf{v})\) in Eq. (4) is mean-field approximation [5]. In our system, we employ the implementation of [18], in which the CRF is reformulated as a Recurrent Neural Network (RNN) layer that can be used in an end-to-end DL architecture.
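
For intuition, the refinement performed by the CRF layer can be approximated offline with a standard fully-connected CRF solver; the sketch below uses the third-party pydensecrf package rather than our CRF-as-RNN layer, and the kernel widths and weights are placeholder values, not learned parameters.

```python
# Approximate the CRF of Eqs. (4)-(5) with mean-field inference in a
# fully-connected CRF (pydensecrf); the unary term is the average of the
# side-output activation maps, the pairwise terms are Gaussian kernels on
# position and on position+colour.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(side_activations, rgb_image, n_iters=5):
    """side_activations: list of H x W activation maps (one per side-output)
       rgb_image       : H x W x 3 uint8 fundus image"""
    p_vessel = 1.0 / (1.0 + np.exp(-np.mean(side_activations, axis=0)))
    probs = np.stack([1.0 - p_vessel, p_vessel])           # (2, H, W)
    h, w = p_vessel.shape
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs).astype(np.float32))
    d.addPairwiseGaussian(sxy=3, compat=3)                 # spatial kernel
    d.addPairwiseBilateral(sxy=50, srgb=13,                # appearance kernel
                           rgbim=np.ascontiguousarray(rgb_image), compat=5)
    q = d.inference(n_iters)                               # mean-field steps
    return np.array(q)[1].reshape(h, w)                    # vessel marginal
```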

Fig. 3. The vessel prediction map for each side-output layer in our architecture.

Our DeepVessel Architecture is an end-to-end system illustrated in Fig. 2, which contains four CNN stages and one CRF stage. Each CNN stage includes multiple convolutional and ReLU layers, and one side-output layer. The side-output layer is connected to the last convolutional layer in each stage to support deep layer supervision. The objective function of the whole system is:

$$\begin{aligned} (\mathbf{W}, \mathbf{w}, \mathbf{h})^{\star} = \arg \min \left( \mathcal{L}_s(\mathbf{W}, \mathbf{w}) + L^{CRF}_s (\mathbf{W}, \mathbf{w}, \mathbf{h}) \right), \end{aligned}$$
(6)

where \(\mathbf{h}\) denotes the CRF layer parameters, \(\mathcal{L}_s\) is the CNN-stage loss function in Eq. (2), and \(L^{CRF}_s\) is the CRF layer loss function, specifically the class-balanced cross-entropy loss of Eq. (3). We minimize the objective function via standard stochastic gradient descent. In our DeepVessel architecture, we employ only four CNN stages with side-output layers. The main reason is that retinal vessels in fundus images differ from general object edges in natural images. An object edge separates two regions of different appearance, which allows the boundary to remain detectable even at deeper layers. By contrast, a retinal vessel appears merely as a thin curved line, which is too thin to produce a response in layers with larger strides. Thus, we employ only four side-output layers. An example vessel prediction map for each side-output layer is shown in Fig. 3, where earlier side-output layers have smaller receptive fields and respond to local details, while deeper layers capture appearance at a larger scale.
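
As a rough sketch, the objective of Eq. (6) sums the weighted side-output losses of Eq. (2) and the CRF-layer loss, both instances of the class-balanced cross-entropy of Eq. (3); the snippet below reuses the class_balanced_bce sketch above, and the fusion weights alphas are placeholders (in practice, Caffe's SGD solver performs the optimization).

```python
# Illustrative combined objective of Eq. (6).
def total_loss(side_activations, crf_activation, gt, alphas):
    side_loss = sum(a_m * class_balanced_bce(act, gt)          # Eq. (2)
                    for a_m, act in zip(alphas, side_activations))
    return side_loss + class_balanced_bce(crf_activation, gt)  # + L_s^CRF
```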

3 Experiments

We implement our framework using the Caffe library, building on the implementation of HED [16]. The model parameters follow the configuration used in [16]. We employ a two-step fine-tuning approach that first uses the ARIA dataset [2] to fine-tune the initial parameters, and then the DRIVE training set [15] to obtain the final parameters. We augment the training data by rotating each image to eight different angles, and we rescale the ARIA images to the same size as the DRIVE images. The whole fine-tuning phase takes about two days on a single NVIDIA K40 GPU (10,000 iterations). For a 565 \(\times\) 584 image, it takes about 1.3 s to generate the final vessel map.
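
The data preparation described above can be sketched as follows; the exact rotation angles are not specified in the text (we assume multiples of 45°), and the helper names and use of PIL are our own.

```python
# Hypothetical augmentation step: rotate each training image to eight angles
# and rescale ARIA images to the DRIVE resolution (565 x 584).
from PIL import Image

DRIVE_SIZE = (565, 584)   # width x height of DRIVE images

def augment(image_path, angles=range(0, 360, 45), resize_to=None):
    img = Image.open(image_path)
    if resize_to is not None:                    # pass DRIVE_SIZE for ARIA
        img = img.resize(resize_to, Image.BILINEAR)
    return [img.rotate(a) for a in angles]       # eight rotated copies
```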

3.1 Experimental Results

We evaluate our method on three publicly available datasets: DRIVE [15], STARE [4], and CHASE_DB1 [3]. These datasets provide two manual segmentations generated by two different experts for each image. Following the literature, the segmentation by the first observer is used as ground truth for performance evaluation. We report performance in terms of Accuracy (\(Acc = \frac{TP+TN}{TP+FN+TN+FP}\)) and Sensitivity (\(Sen = \frac{TP}{TP+FN}\)), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Note that the STARE and CHASE_DB1 datasets do not include a training set, so we use only the DRIVE training set to fine-tune the final parameters.
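
The two metrics follow directly from their definitions; a minimal computation from binary prediction and ground-truth maps is shown below (array names are ours).

```python
# Accuracy and Sensitivity from binary (0/1) prediction and ground-truth maps.
import numpy as np

def evaluate(pred, gt):
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    acc = (tp + tn) / (tp + fn + tn + fp)
    sen = tp / (tp + fn)
    return acc, sen
```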

Table 1. Performance of different segmentation methods on three datasets.

We compare our method with several state-of-the-art vessel segmentation methods, and also report the ground truth labeling of the second observer as the performance of a human observer. Our DeepVessel system outputs a probability map, and Otsu’s thresholding method [13] is employed to obtain the binary labeling automatically in the experiments. Table 1 lists the performances on the three datasets, using the scores reported in the original papers. Our method obtains the best Accuracy score among the compared methods, including the other DL method [9], on the DRIVE dataset. It also achieves Accuracy similar to the human observer on the CHASE_DB1 dataset and better Accuracy on the other two datasets.
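
Binarizing the probability map with Otsu's method can be done with any standard implementation; the sketch below uses scikit-image, which is an assumption about tooling rather than the implementation used in our experiments.

```python
# Otsu thresholding of the DeepVessel probability map [13].
from skimage.filters import threshold_otsu

def binarize(prob_map):
    t = threshold_otsu(prob_map)          # data-driven global threshold
    return (prob_map >= t).astype('uint8')
```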

We report the results of the individual side-output layers and of their average fusion in Table 1, as well as our results without side-output layers (DeepVessel w/o S). We observe that the second and third side-output layers perform better than the other two, which is also visible in Fig. 3. The side-output fusion combines all the side-output layer outputs and generally performs better than any individual layer and the version without side-output layers. Figure 4 displays some results. Our DeepVessel with the CRF layer produces a clearer vessel segmentation than the fusion of the side-output layers alone, especially in pathological regions, as shown in the second row of Fig. 4.

Fig. 4. Examples of results. From top to bottom: fundus images from the DRIVE, STARE, and CHASE_DB1 datasets. From left to right: (A) fundus images, (B) ground truth, (C) fusion results of the side-output layers, (D) our DeepVessel results, (E) thresholded DeepVessel results.

4 Conclusion

In this paper, we have developed a retinal vessel segmentation method, called DeepVessel, based on a novel deep learning architecture. A discriminative representation is learned by a CNN with side-output layers, and a high quality vessel probability map is produced using a CRF layer. We have demonstrated that our system produces state-of-the-art results on three publicly available datasets.