
1 Introduction

Recent studies have shown that DL models are able to detect and diagnose various retinal diseases by interpreting ocular data derived from different diagnostic modalities, including digital photographs, optical coherence tomography (OCT), and visual fields [1]. DL systems can already be applied in teleophthalmology programs to identify abnormal retinal images, reducing the clinic workload for disease screening. Furthermore, DL tools could enable ophthalmic self-monitoring by patients via smartphone retinal photography. Most of the initial studies have centered around the automatic detection of diabetic retinopathy, age-related macular degeneration, and glaucoma [2,3,4,14], while only a few methods have been developed for the automatic diagnosis of genetically heterogeneous retinal disorders. Many of these genetic eye disorders lead to blindness, and an early diagnosis, even by means of a simple ophthalmoscopy, can reduce preventable vision loss. Automatic diagnostic systems are able to analyze ocular data and could also be used by non-ophthalmologists to screen patients who do not yet show signs of reduced visual acuity. The aim of this work is to investigate a DL system for segmenting pigment signs (PSs), which are a symptom of Retinitis Pigmentosa (RP). RP is one of the most common diseases caused by genetic eye disorders and leads to night blindness and a progressive constriction of the visual field from the periphery to the center. Progression leads to central acuity loss and legal blindness in most patients. At present, no cure exists to stop the progression of the disease, but if an early diagnosis of RP is available, the progressive degeneration can be delayed through the intake of vitamin A and other nutritional interventions [5]. Clinical diagnosis is possible through fundus examination revealing the presence of PSs, arteriolar attenuation, and pallor of the optic disc. In Fig. 1, a healthy and a severely degenerated retina are shown.

Fig. 1.

Fundus images of a healthy retina (left) and a retina with Retinitis Pigmentosa (right). In the image of the diseased eye, peripheral pigment signs, attenuated retinal arterioles and optic-disc pallor are evident.

PSs are a consequence of the degeneration of the photoreceptors and accumulate over years, so they may not be present in younger individuals. However, PSs are the signs most easily identifiable by a non-ophthalmologist on a retinal fundus image, and patients presenting them should be promptly referred. Further detailed ophthalmic examinations (visual tests, OCT, electroretinography, and fundus autofluorescence) are adopted to determine the severity of the disease and to monitor its progression [6]. Many automatic methods to quantify RP and to track its progression are based on the analysis of OCT [3, 7, 8]. Diagnosis by fundus camera represents the best solution for RP screening in resource-limited settings, since fundus images can be acquired with inexpensive devices.

Though many approaches have been developed to automatically analyze the retinal vessel structure and the pallor of the optic disc [9,10,11,12,13], the literature on the automatic detection of PSs in fundus images is extremely limited [14,15,16]. In our previous work [16], we proposed a supervised method to segment PSs in fundus images, which extracts pixel-wise/region-wise hand-crafted features that are fed to machine learning techniques (i.e., Random Forests and AdaBoost.M1) to discriminate between PSs and normal fundus. Furthermore, we made publicly available a dataset of Retinal Images for Pigment Signs (RIPS) [17] for evaluating the performance of PS segmentation algorithms.

DL-based segmentation is a hot topic and has gained increasing attention, as deep neural networks learn a hierarchy of feature maps directly from data without requiring any hand-crafted features. Most of the early DL approaches for segmentation translate the segmentation task into a pixel-wise classification problem. However, to solve image classification problems, DL models require a large number of training images. Moreover, classifying all the pixels of a test image is carried out by sliding a window over the image and classifying the current central pixel, which entails slow prediction. Other DL architectures specifically devoted to segmentation are based on an encoder-decoder scheme that learns to decode low-resolution feature maps into pixel-wise predictions. In this work, the DL model adopted to segment PSs is a U-Net based convolutional neural network, an encoder-decoder network for pixel-wise prediction [18].

2 The Proposed Model

The proposed deep model is based on U-Net, which has been successfully used for segmenting medical images in several contexts [9, 14]. This model is an encoder-decoder network implementing a contracting/expanding path consisting of convolutional, downsampling, and upsampling layers.

In this work, the network has been modified with respect to its original architecture: two of the five blocks (i.e., their convolutions and pooling/upsampling layers) have been removed, and the number of filters was halved. The architecture of the network is shown in Fig. 2.

Fig. 2.

Architecture of the proposed U-Net based network.

In the encoding part of the network, each feature map is downsampled by a pooling operation in order to spatially reduce the input, as well as the number of parameters to be learned in the following layer. In our case, max-pooling has been adopted for all downsampling layers. Upsampling layers increase the dimension of the feature maps by learning to deconvolve them. The decoder feature maps and the corresponding encoder feature maps are concatenated to produce the output. To stabilize the learning process and reduce the number of training epochs, we also introduce batch normalization, while dropout (0.2) is introduced to prevent over-fitting.

In more detail, both the encoder and the decoder include five convolutional layers, whose filters have size 3 × 3 and a stride of 1, followed by the rectified linear unit (ReLU) non-linearity and batch normalization. Moreover, dropout of 0.2 is applied to alternate (odd) convolutional layers. In the contracting path, the second convolution of the first two blocks feeds a max-pooling layer computed on a window of size 2 × 2 with a stride of 2. In the expanding path, the first convolution of the last two blocks is preceded by an upsampling layer, which doubles the size of the feature map and concatenates it with the corresponding feature map from the contracting path.
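The shape bookkeeping of this contracting/expanding path can be traced with a minimal NumPy sketch. The channel counts (32/64/128) are illustrative assumptions, since the text states only that the original U-Net filter numbers were halved, and the nearest-neighbor upsampling stands in for the network's learned upsampling layers.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2, as in the contracting path."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbor upsampling that doubles the spatial size
    (a stand-in for the learned upsampling of the expanding path)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
e1 = rng.random((48, 48, 32))   # encoder block 1 feature maps
e2 = rng.random((24, 24, 64))   # encoder block 2 (computed after pooling e1)
b  = rng.random((12, 12, 128))  # bottleneck (computed after pooling e2)

# Expanding path: upsample, then concatenate with the encoder skip maps.
d2 = np.concatenate([upsample_2x(b), e2], axis=-1)   # -> (24, 24, 192)
d1 = np.concatenate([upsample_2x(rng.random((24, 24, 64))), e1],
                    axis=-1)                          # -> (48, 48, 96)
```

The concatenation along the channel axis is what lets the decoder recover the spatial detail lost by pooling.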

At the end, a soft-max classifier computes, for each pixel, the probability of being a PS (foreground) or background.

Given the nature of PSs, choosing the right metric to be optimized represents a crucial aspect. Indeed, PSs represent a small percentage of the pixels of the image, which translates into very few positive pixels and a high number of true negatives in the segmented image. Most works in the literature consider accuracy as the metric to be optimized during training. However, accuracy can be heavily inflated by a large number of true negatives, thus the F1-score may be a better measure when one needs to balance precision and recall under an uneven class distribution. For this reason, the F-measure has been adopted in our model. Furthermore, to increase the robustness of the training process, the Adadelta [19] optimizer has been selected, as it does not require manual tuning of the learning rate and has been shown to be robust to noisy gradient information and different model architectures.
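As a minimal illustration of why the F-measure suits this imbalanced task, a "soft" F1 loss can be computed directly on predicted probabilities. The sketch below is a plain-NumPy stand-in for what would, in practice, be implemented with the DL framework's tensor operations; the 1% foreground rate is illustrative, not taken from the dataset.

```python
import numpy as np

def soft_f1_loss(y_pred, y_true, eps=1e-7):
    """1 - F1 computed on predicted probabilities ("soft" F1), so the few
    foreground pixels drive the loss instead of the many true negatives."""
    tp = np.sum(y_pred * y_true)
    precision = tp / (np.sum(y_pred) + eps)
    recall = tp / (np.sum(y_true) + eps)
    return 1.0 - 2 * precision * recall / (precision + recall + eps)

# Toy mask with 1% foreground: a perfect prediction gives a loss near 0,
# while predicting all-background gives a loss near 1 despite 99% accuracy.
y_true = np.zeros(10_000)
y_true[:100] = 1.0
loss_perfect = soft_f1_loss(y_true.copy(), y_true)
loss_empty = soft_f1_loss(np.zeros_like(y_true), y_true)
```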

3 Experiments

3.1 Materials and Methods

The experiments have been performed on the Retinal Images for Pigment Signs (RIPS) dataset. This dataset consists of 120 retinal fundus images with a resolution of 1440 × 2160 pixels, captured from four patients who underwent three different acquisition sessions. During each session, five images per eye were acquired, covering different regions of the fundus. The time lapse between two consecutive sessions is at least six months, while the time interval between the first and last session always exceeds one year. Images were acquired using the Canon CR4-45NM digital retinal camera (Canon UK, Reigate, UK) and show high variability in terms of color balancing, contrast, and sharpness/focus, even for the same patient. Two binary masks are associated with each image, in which the foreground representing PSs has been marked by two experts in the field of ophthalmology. Moreover, for each image, a mask delineating the field of view (FOV) is provided.

3.2 Training Strategy and Image Prediction

The resolution of the retinal images makes it unfeasible to train the existing DL architectures on the whole image. The most common approaches to cope with this problem are either to reduce the image resolution or to partition the image into patches. The main drawback of severe image downsampling is that small PSs could disappear; on the other hand, partitioning a high-resolution image produces a high number of large patches. For this reason, we adopted a compromise, reducing the image size by a factor of 0.5, which allows patches of small size to be extracted that are still representative enough. Only a subset of patches randomly extracted from each image is included in the training set. Indeed, PSs appear only in some regions of the image and can be very small, so considering only patches that include at least one pixel marked as PS in the corresponding mask yields a training set that is sufficiently representative of the background as well.
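This sampling strategy can be sketched as follows. The function name, default parameters, and the naive stride-2 downsampling are illustrative assumptions, not taken from the actual implementation.

```python
import numpy as np

def sample_training_patches(image, mask, patch=48, n=100,
                            max_tries=10_000, seed=0):
    """Randomly sample up to n patches from the half-resolution image,
    keeping only those whose mask contains at least one PS pixel."""
    # Naive factor-0.5 downsampling sketch (a real pipeline would use a
    # proper interpolation method).
    image, mask = image[::2, ::2], mask[::2, ::2]
    h, w = mask.shape
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(max_tries):          # guard against masks with no PSs
        if len(patches) == n:
            break
        y = int(rng.integers(0, h - patch + 1))
        x = int(rng.integers(0, w - patch + 1))
        m = mask[y:y + patch, x:x + patch]
        if m.any():                     # at least one PS pixel required
            patches.append((image[y:y + patch, x:x + patch], m))
    return patches
```

Because accepted patches must touch a PS region but are positioned randomly, each one still contains plenty of background pixels, which is what keeps the training set representative of both classes.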

In the testing process, the input image is downsampled by a factor of 0.5. Image prediction is performed by extracting patches with a window sliding with a stride s > 0. Patches are fed to the network, and each pixel is assigned a probability of being a PS. For values of s smaller than the window size, patches overlap, so that each pixel receives multiple predictions, one for each patch it belongs to. The global score of a pixel is computed by summing up all its predictions. Scores are normalized to the range [0, 1], and the foreground image is obtained by applying a threshold of 0.5.
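A minimal sketch of this prediction scheme follows, where `predict_patch` stands in for the trained network (not reproduced here) and the per-pixel normalization by overlap count is one plausible reading of the [0, 1] rescaling.

```python
import numpy as np

def predict_full_image(image, predict_patch, patch=48, stride=6):
    """Slide a window with the given stride, sum the per-pixel probability
    maps of all overlapping patches, normalize each pixel by its number of
    predictions, and threshold at 0.5 to obtain the binary foreground."""
    h, w = image.shape[:2]
    scores = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            scores[y:y + patch, x:x + patch] += predict_patch(
                image[y:y + patch, x:x + patch])
            counts[y:y + patch, x:x + patch] += 1
    scores /= np.maximum(counts, 1)   # pixels never covered keep score 0
    return scores > 0.5
```

Averaging over overlapping predictions acts as a simple ensemble, smoothing out errors made on any single patch position.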

3.3 Experimental Setup and Results

In the experiments, a per-patient cross-validation protocol was applied, using the samples from three of the four patients for training and the data of the fourth patient for validation. The number of training epochs was set to 30, and the batch size to 32. To train our network, we used an NVIDIA GeForce GTX 1050 with 4 GB of RAM.
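This per-patient protocol amounts to leave-one-patient-out cross-validation, which can be sketched as:

```python
def leave_one_patient_out(patients):
    """Per-patient cross-validation: each fold trains on all but one
    patient and validates on the held-out one, so no patient's images
    appear in both the training and validation sets."""
    return [([p for p in patients if p != held_out], held_out)
            for held_out in patients]

folds = leave_one_patient_out(["P1", "P2", "P3", "P4"])
```

Splitting by patient rather than by image avoids the optimistic bias of having near-duplicate images of the same eye in both sets.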

The first experiment aims to verify that the F-measure provides better performance than accuracy when used as the loss function to train the network. In this experiment, the patch size is fixed to 48 × 48 pixels and the stride of the sliding window is set to 6 in the prediction process. Results are reported in Table 1.

Table 1. Performance measures of the proposed U-Net based network when F-measure or accuracy is used as loss function.

The values in Table 1 show that the accuracy is very high in both cases, while the precision increases considerably when the F-measure is used as the loss function. This is because accuracy is heavily influenced by the number of true negatives, so it approaches 1.0 even when the precision is very low. The experimental results confirm that the F-measure outperforms accuracy for the PS segmentation task.

In the second experiment, we analyzed the improvements obtained in terms of F-measure as the patch size increases. The proposed U-Net based architecture has been tested with three different patch sizes, namely 48 × 48, 72 × 72, and 96 × 96, and the numerical results are reported in Table 2.

Table 2. Performance measures of the proposed model for different patch sizes.

The results in Table 2 mainly highlight that the F-measure increases with the patch size. In particular, we have observed that the larger the patch, the better the model performs in discriminating PSs from blood vessels.

Figure 3 shows the segmented images produced by the proposed model when it is trained with patches of size 48 × 48, 72 × 72, and 96 × 96 pixels, respectively. In Fig. 4, one image for each of the four patients is shown together with the corresponding ground truth and the segmented image produced by our model when patches of 96 × 96 pixels are considered.

Fig. 3.

Segmented images obtained for different patch sizes: input image (top-left), 48 × 48 pixels (top-right), 72 × 72 pixels (bottom-left), and 96 × 96 pixels (bottom-right).

Fig. 4.

Results from the RIPS dataset. Top to bottom: patients from 1 to 4. Left to right: the original image, the ground truth, the result of the proposed model.

The performance of the proposed U-Net based model has been compared with state-of-the-art approaches. In particular, the machine learning based approach proposed in [16] was considered. Numerical results are reported in Table 3.

Table 3. Performance measures of different methods.

4 Conclusions

In this study, a deep learning based approach for segmenting PSs in retinal fundus images has been presented. The segmentation is performed in an end-to-end way by a DL model. We have proposed a U-Net based model, since U-Net has been widely used for segmenting medical images in several contexts and, in particular, has been successfully used to segment structures in retinal fundus images. We have modified the original architecture of U-Net to reduce the number of parameters, and consequently the computation time and memory requirements: the number of blocks has been reduced from five to three and the number of filters per block has been halved. The model implements a patch-based strategy for both training and testing. The performance of the proposed model has been assessed on the publicly available RIPS dataset. Several experiments have been performed, varying the size of the extracted patches and using different loss functions for the training phase. Experimental results show that using the F-measure in place of accuracy improves the quality of the segmentation. Moreover, the segmentation quality increases with the patch size. The proposed model also outperforms a pixel-based machine learning method from the literature, producing an increase of 15% in terms of F-measure.