1 Introduction

The introduction of in vivo confocal microscopy (IVCM) as a technique for acquiring corneal images has revolutionized the study and analysis of the structures of this tissue [1, 2]. In particular, thanks to this tool, it is possible to quickly and noninvasively acquire images of the sub-basal plexus (a specific layer within the epithelium), allowing the visualization of the corneal nerve fibers.

As the cornea is the most innervated tissue in the human body [3], analysis by confocal microscopy has greatly expanded the knowledge of corneal structures and has deepened our understanding of the relationship between the characteristics of the corneal nerves and several pathologies, both ocular (dry eye, keratoconus, herpes keratitis) and systemic (diabetes, etc.) [4,5,6,7,8,9].

Moreover, several investigators have examined these structures and verified that they provide important clinical information related to age, prolonged use of contact lenses, surgery (such as LASIK or PRK), and transplantation [9,10,11,12].

However, quantitative analysis of corneal nerve fibers remains complex for daily clinical practice due to the execution time required and the difficulty of manual or semi-automatic analysis.

Additionally, manual analysis varies with the clinician who performs it and with his or her experience. This subjectivity in manual analysis means that the clinical parameters derived from the tracing are also subjective and prone to errors. In the literature, some studies have been presented for the automatic analysis of images of the sub-basal plexus and, in particular, for the tracing of the nerve fibers. To the best of our knowledge, none of them has been included in the diagnostic process yet.

Scarpa et al. [13] proposed a method based on improving contrast through Gabor filters, followed by thresholding and clustering with the fuzzy c-means technique. Subsequently, Dabbah et al. [14] presented an extension of some previous enhancement techniques, promoting a multi-scale approach, followed by an artificial neural network to classify each pixel as foreground or background. Building on the latter, Chen et al. [15] identified the best configuration to obtain nerve segmentation. Poletti and Ruggeri [16] proposed a new approach based on a sparse tracking scheme. Annunziata et al. [17] presented a hybrid segmentation method, which combines an appearance model, based on a scale- and curvature-invariant ridge detector (SCIRD), with a context model, including multi-range learned context filters. Al-Fahdawi et al. [18] proposed a three-step segmentation model: anisotropic diffusion filtering (to enhance nerves and remove noise), morphological operations (to remove unwanted objects, such as epithelial cells and small nerve segments), and edge detection (to detect all nerves in the input image). Afterward, Guimarães et al. [19] proposed another automatic method that processed the images with a filter bank (which includes the log-Gabor filter); hysteresis thresholding was used to isolate candidate nerve pixels, and each pixel was then classified through a support vector machine.

One of the major problems with the methods presented is their execution time (up to hundreds of seconds for a single image), together with a correspondence with manual analysis that is not always excellent. Dehghani et al. [20] analyzed the ability of automatic, semi-automatic, and manual methods to detect the decrease in corneal nerve length in patients with diabetes. They demonstrated that all three methods could distinguish healthy subjects from diabetic ones, but the manual method was found to distinguish subjects better, especially in low-contrast images.

Thanks to new deep learning techniques, it is possible to significantly reduce analysis time and improve the correspondence with manual analysis, compared to previous algorithms. In our previous work [21], this technique was used for the first time for the analysis of corneal nerves, implementing a convolutional neural network (CNN) based on a U-shaped architecture. Zhang et al. [22] also used a UNet to segment nerve fibers, but they first implemented pre-processing steps to obtain image normalization. Mehrgardt et al. [23] presented a new multi-step approach, called UNet segmented adjacent angle detection (USAAD), for nerve fiber segmentation (also based on UNet) and automatic tortuosity estimation.

In this work, we continued our previous investigation (presented at the MICCAI 2018 conference [21]) on corneal nerve fiber segmentation based on a convolutional neural network. In that article, as mentioned above, we mainly focused on the classical UNet encoder-decoder architecture [24] and demonstrated the ability of that network to identify and trace the corneal nerves in IVCM images. Here, we investigate whether some architectural convolutional neural network blocks added to the simple UNet could improve performance. Moreover, we investigate how the use of different loss functions influences the results, and we introduce a new loss function, which aims to consider a margin of tolerance in the tracing of nerve fibers.

2 Materials and methods

2.1 Materials

In this investigation, we used the same dataset presented by Colonna et al. [21] for training all networks. It consists of 8909 confocal images of the sub-basal nerve plexus from healthy and pathological (with type 1 or 2 diabetes) subjects. Each image covers a field of 400 \(\times \) 400 \(\upmu \)m (384 \(\times \) 384 pixels) and was acquired using the Heidelberg retina tomograph (HRT-II) with the Rostock cornea module (Heidelberg Engineering GmbH, Heidelberg, Germany). The acquisition was carried out at different clinical centers, and all data have been anonymized. Due to the difficulty and execution time required for manual analysis of such a large dataset, we decided to obtain the labeled images for training by using the algorithm proposed by Guimarães et al. [25].

For the testing phase, we used a dataset composed of 90 images, acquired with the same tool, from healthy and pathological subjects. Each image was manually analyzed using the NeuronJ [26] tracing plug-in for ImageJ (the software is available in the public domain at http://imagescience.org/meijering/software/neuronj/).

2.2 Convolutional neural network modules

As a baseline for this work, we chose the UNet [24], which has proven to be an optimal framework for the semantic segmentation of biomedical images in general and IVCM images in particular. It is composed of an encoding unit and a corresponding decoding unit, each with a four-level depth. The architecture has been designed so that the input and output sizes equal the original image dimensions (384 \(\times \) 384 pixels).

A brief description of the encoding and decoding blocks follows. The encoding path has a four-level depth, each level, in turn, made up of two convolutional blocks (each consisting of a convolutional layer to learn feature maps, followed by batch normalization to increase the CNN stability and a ReLU as a non-linear activation function). Each pair of convolutional blocks is followed by a max-pooling operation, performed with a generic 2 \(\times \) 2 window, which down-samples the feature maps to half of their original xy size.

After the encoding path, there is a block called the ‘bridge block,’ composed of two convolutional blocks and followed by the start of the decoding part. The decoding path is likewise made up of four levels. A transposed convolution doubles the xy dimensions of the feature maps received as input (it therefore acts as the opposite of the max-pooling in the encoding part); the output of the transposed convolution is concatenated with the corresponding feature maps from the encoding path. The concatenated features are subjected to two convolutional blocks (such as the ones described above in the encoding part) and then again to a transposed convolution. The structure is repeated in the same way until the xy dimensions equal the input dimensions. Finally, the output of the decoding path is subjected to a 1 \(\times \) 1 convolution with a sigmoid activation function, obtaining a probability of ‘how likely it is, for each pixel, to be part of a nerve fiber.’
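To make the structure concrete, the following is a minimal PyTorch sketch of the encoder-decoder just described. The channel progression (64 up to 1024 in the bridge), the single-channel grayscale input, and all helper names are illustrative assumptions, not the authors' exact implementation; for brevity, the two convolutional blocks of each level are grouped into a single `conv_block` helper.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two convolutional layers, each followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                     # halves the xy size
        self.encoders = nn.ModuleList()
        in_ch = 1                                       # grayscale IVCM input
        for ch in channels[:-1]:
            self.encoders.append(conv_block(in_ch, ch))
            in_ch = ch
        self.bridge = conv_block(channels[-2], channels[-1])
        self.up, self.decoders = nn.ModuleList(), nn.ModuleList()
        for ch in reversed(channels[:-1]):
            self.up.append(nn.ConvTranspose2d(ch * 2, ch, kernel_size=2, stride=2))
            self.decoders.append(conv_block(ch * 2, ch))
        self.head = nn.Conv2d(channels[0], 1, kernel_size=1)  # final 1x1 convolution

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                             # kept for the skip connections
            x = self.pool(x)
        x = self.bridge(x)
        for up, dec, skip in zip(self.up, self.decoders, reversed(skips)):
            x = up(x)                                   # doubles the xy size
            x = dec(torch.cat([skip, x], dim=1))        # concatenate encoder features
        return torch.sigmoid(self.head(x))              # per-pixel nerve probability
```

With a 384 \(\times \) 384 input (divisible by \(2^4\)), the output probability map has the same size as the input, as required.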

In addition to the UNet architecture described above, we investigated the ability of some architectures, which use the same UNet as a baseline, to increase performance. We modified the UNet by adding residual connections, the atrous-spatial pyramid pooling (ASPP) module, and attention modules.

2.2.1 Residual connections

Several studies have added residual connections to classic CNN architectures [27,28,29,30]. This insertion has been shown to counteract the vanishing gradient problem and the degradation of accuracy [27], while also increasing the performance of the network [28,29,30]. As shown in Fig. 1, the residual connection was introduced between the input and the output of each convolutional block (both in the encoding and decoding paths). Since the number of features changes in each convolutional block (doubling in the encoding part and halving in the decoding part), a 1 \(\times \) 1 convolution has been added to each residual connection to adjust the number of features.

Fig. 1

Residual connection: on the left, a graph of how the residual connection is inserted with respect to the convolutional block; on the right, the detailed implementation of the connection
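As a minimal sketch of this residual scheme (reusing the hypothetical `conv_block` helper from the UNet sketch above; the exact placement of the addition is our assumption):

```python
import torch.nn as nn

# Convolutional block with a residual connection: the 1x1 convolution on the
# shortcut adjusts the number of features so that the element-wise addition
# between the input and the output of the block is well defined.
class ResidualConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = conv_block(in_ch, out_ch)                    # main path
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # feature adjustment

    def forward(self, x):
        return self.block(x) + self.shortcut(x)                  # residual addition
```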

2.2.2 Atrous-spatial pyramid pooling module

The atrous-spatial pyramid pooling (ASPP) module is used to obtain information on the multi-scale context, through the use of parallel convolutions with different dilation rates. As proposed by Chen et al. [31], we decided to implement this module as shown in Fig. 2. Four parallel branches are created: in the first three branches, atrous convolutions are carried out with different dilation rates, while, in the last branch, an image-level feature is extracted through global average pooling. After applying all the operations in parallel, the outputs are concatenated, and a 1 \(\times \) 1 convolution is applied to obtain the overall output of the module.

Fig. 2

Atrous-spatial pyramid pooling (ASPP) module: on the left, a graph of how it is inserted with respect to the bottom of the UNet; on the right, the detailed implementation of the module
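A minimal sketch of this module follows; the dilation rates (6, 12, 18) are an assumption borrowed from typical configurations in [31], since the rates used here are not stated in this passage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ASPP sketch: three parallel atrous (dilated) convolutions plus an image-level
# branch, concatenated and fused by a 1x1 convolution.
class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.image_level = nn.Sequential(               # global average pooling branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        img = F.interpolate(self.image_level(x), size=(h, w),
                            mode='bilinear', align_corners=False)
        return self.fuse(torch.cat(feats + [img], dim=1))
```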

2.2.3 Attention module

The attention module was first proposed for the classification task [32]. In particular, its purpose was to perform operations that allow the model to analyze with greater attention those regions of the image that most affect its prediction.

Subsequently, this idea was adapted and generalized to improve performance on the segmentation task [33, 34], amplifying relevant spatial information and reducing the weight of background features.

This module, schematized in Fig. 3, was developed as described by Schlemper et al. [34]. The attention module is inserted within each of the skip connections deriving from the encoding path. The module takes as input the skip connection and the output of the corresponding transposed convolution of the decoding branch, while its output is concatenated with the up-sampled data.

Fig. 3

Attention module (AM): on the left, a graph of how it is inserted with respect to the skip connections of the UNet; on the right, a detailed implementation of the module
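A minimal sketch of an additive attention gate in this style is reported below; the intermediate channel count and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Attention gate sketch: the skip-connection features x are re-weighted by
# coefficients computed from x and the gating signal g coming from the decoder
# (assumed to have the same xy size as x after the transposed convolution).
class AttentionGate(nn.Module):
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)   # project skip features
        self.w_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)   # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)      # collapse to one map

    def forward(self, x, g):
        alpha = torch.sigmoid(self.psi(F.relu(self.w_x(x) + self.w_g(g))))
        return x * alpha                                      # amplify relevant regions
```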

2.3 Loss function

The learning process of a deep learning algorithm is strongly reliant on the loss/objective function chosen during the design of the architecture. For quick and accurate learning, the loss function must be able to mathematically represent the target, even in its borderline cases. In this work, we chose to compare the performance of four loss functions widely used in semantic segmentation. We also propose a new tolerance Dice loss function and showcase its effectiveness.

2.3.1 Balanced binary cross-entropy loss

Balanced binary cross-entropy is a variant of the well-known cross-entropy, which is defined as the difference between two probability distributions for a given random variable [35]. Cross-entropy is widely used in the field of classification and semantic segmentation via deep learning, but its balanced variant proves more effective when working with unbalanced data (as in the case of corneal nerves, where the pixels belonging to the structures are far fewer than those corresponding to the background). The balanced binary cross-entropy assigns a weight to both positive (\(\beta \)) and negative (1-\(\beta \)) examples, and is defined as follows:

$$\begin{aligned} \displaystyle L_{BCE}=-\beta y\log (\hat{y})-(1-\beta )(1-y)\log (1-\hat{y}) \end{aligned}$$
(1)

where y represents the true value and \(\hat{y}\) the predicted value. In this work, \(\beta \) is derived from the frequency of the true value in the image. The limitation of cross-entropy (and therefore also of its variants) is that it computes the loss as the average of the per-pixel losses, without considering the adjacent pixels and therefore without considering the continuity of the object to segment.
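A direct sketch of Eq. (1) follows; setting \(\beta \) to the background-pixel fraction of each reference image is our reading of one common convention, since the exact derivation of \(\beta \) is not detailed here.

```python
import torch

# Sketch of Eq. (1): balanced binary cross-entropy averaged over all pixels.
# Assumption: beta equals the fraction of background pixels, so the rare
# positive (nerve) pixels receive the larger weight.
def balanced_bce_loss(y_pred, y_true, eps=1e-7):
    beta = 1.0 - y_true.mean()                       # background frequency
    y_pred = y_pred.clamp(eps, 1.0 - eps)            # numerical stability
    loss = -(beta * y_true * torch.log(y_pred)
             + (1.0 - beta) * (1.0 - y_true) * torch.log(1.0 - y_pred))
    return loss.mean()
```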

2.3.2 Dice loss

Dice loss is another loss function widely used in the semantic segmentation task [36, 37]. It is based on the coefficient of the same name [38], which calculates the similarity between two images by considering the overlap between the two samples. The Dice coefficient is defined as

$$\begin{aligned} \displaystyle Dice=\frac{2\Vert Y \cap \hat{Y}\Vert }{\Vert Y\Vert +\Vert \hat{Y}\Vert } = \frac{2TP}{2TP+FP+FN} \end{aligned}$$
(2)

where \(\Vert Y\Vert \) (respectively, \(\Vert \hat{Y}\Vert \)) represents the number of object pixels (nerve fibers in our case) in the true set Y (respectively, the predicted set \(\hat{Y}\)), and \(\Vert Y \cap \hat{Y}\Vert \) represents the pixels common to the two sets. We also report an equivalent definition of the coefficient, which highlights that false positives (FP) and false negatives (FN) are weighted equally, and consequently precision and recall are also weighted in the same way. The Dice loss function is derived from this coefficient and is calculated as

$$\begin{aligned} \displaystyle L_{Dice}=1-Dice \end{aligned}$$
(3)
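A soft (differentiable) sketch of Eqs. (2)-(3), computed on predicted probabilities rather than hard labels so that it can be used during training, could look as follows:

```python
# Sketch of Eqs. (2)-(3): soft Dice loss on torch tensors of probabilities.
def dice_loss(y_pred, y_true, eps=1e-7):
    intersection = (y_pred * y_true).sum()
    dice = (2.0 * intersection + eps) / (y_pred.sum() + y_true.sum() + eps)
    return 1.0 - dice
```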

2.3.3 Tversky loss

The Tversky loss function [39] is derived from the homonymous coefficient [40]. The Tversky coefficient can be considered a generalization of the Dice coefficient described above and allows balancing FP and FN. The coefficient is defined as follows:

$$\begin{aligned} \displaystyle Tversky =\frac{TP}{TP+\alpha FP+\beta FN} \end{aligned}$$
(4)

The tuning of \(\alpha \) and \(\beta \) can put more emphasis on FPs or FNs, respectively (e.g., increasing the \(\beta \) value increases the emphasis associated with FNs and leads the training process to try to decrease them, thus increasing recall). As for the Dice loss, also in this case the Tversky loss function is derived from the coefficient as

$$\begin{aligned} \displaystyle L_{Tversky}=1-Tversky \end{aligned}$$
(5)
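A soft sketch of Eqs. (4)-(5) follows; the default \(\alpha \) and \(\beta \) values are illustrative (with \(\alpha =\beta =0.5\) the index reduces to Dice), since this passage does not fix them.

```python
# Sketch of Eqs. (4)-(5): soft Tversky index and the corresponding loss.
def tversky_index(y_pred, y_true, alpha=0.5, beta=0.5, eps=1e-7):
    tp = (y_pred * y_true).sum()
    fp = (y_pred * (1.0 - y_true)).sum()
    fn = ((1.0 - y_pred) * y_true).sum()
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(y_pred, y_true, alpha=0.5, beta=0.5):
    return 1.0 - tversky_index(y_pred, y_true, alpha, beta)
```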

2.3.4 Focal Tversky loss

The focal Tversky loss [41] is a variant of the Tversky loss described above: in this case, a focal parameter \(\gamma \) is applied. This parameter focuses learning on hard examples (i.e., examples with small regions of interest). The focal Tversky loss is defined as

$$\begin{aligned} \displaystyle L_{FT\ }=(1-Tversky)^\frac{1}{\gamma } \end{aligned}$$
(6)

A value of \(\gamma <1\) increases the focus on learning hard examples, while \(\gamma =1\) reduces the loss function to the Tversky loss.
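Reusing the `tversky_index` sketch above, Eq. (6) becomes (the default \(\gamma \) is an illustrative assumption):

```python
# Sketch of Eq. (6): with gamma < 1 the exponent 1/gamma > 1 shrinks the loss
# of easy examples (small 1 - Tversky), keeping the focus on hard ones.
def focal_tversky_loss(y_pred, y_true, alpha=0.5, beta=0.5, gamma=0.75):
    return (1.0 - tversky_index(y_pred, y_true, alpha, beta)) ** (1.0 / gamma)
```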

2.3.5 Dice with tolerance loss

In the semantic segmentation of corneal nerves, interest is mainly related to the recognition of the central line that identifies the nerve fiber. In tracing this line, it is reasonable to set a tolerance margin within which a nerve fiber can be recognized (i.e., a pixel classified as nerve is accepted as a TP if it lies within the tolerance margin of the reference nerve). Since the purpose of the study of the corneal nerves is to recognize and segment the nerve fibers present in the images, trying to maximize both precision and sensitivity, we decided to adapt the Dice loss to our task: to obtain the Dice with tolerance coefficient, we calculated TP, FP, and FN considering the tolerance margin. The loss is calculated as follows:

$$\begin{aligned} \displaystyle L_{DL}=1-\frac{2 \widetilde{TP}}{2\widetilde{TP}+\widetilde{FP}+\widetilde{FN}} \end{aligned}$$
(7)

where \(\widetilde{TP}\), \(\widetilde{FP}\), and \(\widetilde{FN}\) indicate the corresponding indices calculated with tolerance.
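One plausible differentiable realization of Eq. (7) is sketched below, under the assumption that the tolerance margin is implemented by dilating each map with a max-pooling of radius `tol`; the authors' exact construction of the tolerant counts is not detailed here and may differ.

```python
import torch.nn.functional as F

# Sketch of Eq. (7) with an assumed dilation-based tolerance: a predicted pixel
# within `tol` pixels of a reference nerve contributes to TP~, and a reference
# pixel with no prediction within `tol` pixels contributes to FN~.
def dilate(x, tol):
    # x: (N, 1, H, W) probability/label map; grayscale dilation via max-pooling
    return F.max_pool2d(x, kernel_size=2 * tol + 1, stride=1, padding=tol)

def dice_with_tolerance_loss(y_pred, y_true, tol=3, eps=1e-7):
    tp = (y_pred * dilate(y_true, tol)).sum()           # predictions near a true nerve
    fp = (y_pred * (1.0 - dilate(y_true, tol))).sum()   # predictions far from any nerve
    fn = (y_true * (1.0 - dilate(y_pred, tol))).sum()   # nerves with no nearby prediction
    return 1.0 - (2.0 * tp + eps) / (2.0 * tp + fp + fn + eps)
```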

3 Experiments

As mentioned in the previous sections, we conducted experiments on several models, applying several loss functions. The main aim of our work was to investigate the performance of each architecture and loss function in the tracing of corneal nerve fibers. Starting from the modules described above (in the section ‘Convolutional neural network modules’) and using UNet as the baseline for each architecture, we decided to investigate the following convolutional neural networks:

  • Simple UNet (the one used as baseline).

  • UNet with attention modules (AM-UNet).

  • UNet with residual connections and attention modules (AM-ResUnet).

  • UNet with the ASPP module (ASPP-UNet).

  • UNet with attention modules and the ASPP module (AM-ASPP-UNet).

We trained each model for 60 epochs, each time with one of the five loss functions presented above.

4 Results

As stated at the beginning of the previous sections, the models were trained on 8909 images, labeled using the algorithm proposed by Guimarães et al. [25], and tested on 90 manually analyzed images.

For each trained model, we analyzed its performance taking into account two indices:

  • The true positive rate (TPR, or recall), which corresponds to the proportion of nerves correctly identified by the proposed algorithm (the higher, the better) and is calculated as \(\widetilde{TPR}=\frac{\widetilde{TP}}{\widetilde{TP}+\widetilde{FN}}\)

  • The false discovery rate (FDR, or 1 \(-\) precision), which corresponds to the proportion of nerves mistakenly identified as such (the lower, the better), calculated as \(\widetilde{FDR}=\frac{\widetilde{FP}}{\widetilde{TP}+\widetilde{FP}}\)

Both indices were calculated considering a tolerance margin of 3 pixels (i.e., a pixel classified by the algorithm as nerve is considered a TP if it lies within 3 pixels of the reference nerve). The obtained TPR and FDR values for each model and each loss function tested are shown in Tables 1 and 2.
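As an illustration, the two tolerance-based indices might be computed as in the following sketch, assuming binary tracing maps and a square tolerance neighborhood (the evaluation code actually used is not shown here):

```python
import numpy as np
from scipy.ndimage import binary_dilation

# A predicted nerve pixel is a TP if it lies within `tol` pixels of the manual
# tracing; a manual nerve pixel with no prediction within `tol` pixels is an FN.
def tpr_fdr_with_tolerance(pred, ref, tol=3):
    # pred, ref: boolean 2D arrays (automatic and manual tracings)
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)  # square neighborhood
    ref_area = binary_dilation(ref, struct)
    pred_area = binary_dilation(pred, struct)
    tp = np.logical_and(pred, ref_area).sum()
    fp = np.logical_and(pred, ~ref_area).sum()
    fn = np.logical_and(ref, ~pred_area).sum()
    return tp / (tp + fn), fp / (tp + fp)                     # (TPR, FDR)
```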

5 Discussion

Taking the UNet trained with the binary cross-entropy loss function as the baseline for comparison, it can be seen that performance improves both by modifying the structure and by using different losses.

In particular, it should be noted that the introduction of a module (or a combination of them) brings an improvement in the evaluation indices. Only in one case (ASPP-UNet) does the TPR decrease, but this is accompanied by a significant decrease in the FDR as well. Moreover, after training with cost functions other than binary cross-entropy, we observe (especially for Dice, Tversky, and focal Tversky) a decrease in the TPR values and, at the same time, a significant decrease in the FDR index. In the case of the proposed loss function, the TPR values obtained are very similar to those obtained with the binary cross-entropy, while the FDR values are considerably reduced (by up to 50% in the case of the UNet).

Table 1 TPR score obtained from the evaluation
Table 2 FDR score obtained from the evaluation

Since it is not easy to assess whether a simultaneous variation in TPR and FDR leads to a better result, we decided to introduce an additional performance evaluation index. Starting from the two indices, the value of the Dice with tolerance index is obtained as:

$$\begin{aligned} \scriptstyle DICE_{tol}=\frac{2\widetilde{TP}}{2\widetilde{TP}+\widetilde{FP}+\widetilde{FN}}=2\frac{\widetilde{TPR}*(1-\widetilde{FDR})}{\widetilde{TPR}+(1-\widetilde{FDR})} \end{aligned}$$
(8)

This index tends to 1 when \(\widetilde{TPR}\) is 1 and \(\widetilde{FDR}\) is zero (the case of perfect correspondence between manual and automatic analysis); it gradually decreases as \(\widetilde{TPR}\) decreases and \(\widetilde{FDR}\) increases, down to 0 (the case in which manual and automatic analyses are complementary).
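Equivalently, as Eq. (8) shows, the index is the harmonic mean of \(\widetilde{TPR}\) and the tolerant precision \(1-\widetilde{FDR}\), and can be computed directly from the two tabulated indices:

```python
# Sketch of Eq. (8): Dice with tolerance from the TPR and FDR values.
def dice_tol(tpr, fdr):
    precision = 1.0 - fdr
    return 2.0 * tpr * precision / (tpr + precision)
```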

Table 3 Dice with tolerance score obtained from the evaluation of the various models examined

The values of this index for each network tested are shown in Table 3. It can be observed that the decrease in TPR mentioned previously is indeed justified by an evident decrease in the FDR values, as it leads to a higher \(DICE_{tol}\) value. To further compare the results obtained, boxplots were created, from which it is possible to observe the variability of the indices between single images: These results are shown in Figs. 4 and 5. From the box plots representing the TPR values, it can be observed that the proposed loss function presents, for almost every model, a reduced variability compared to the other losses tested. Furthermore, the TPR values of single images never drop below 60%. Regarding the FDR values, it is evident that training with the Tversky loss and the focal Tversky loss decreases both the values and their variability with respect to the baseline; this is linked to the choice of the values of \(\alpha \), \(\beta \), and \(\gamma \). By observing the values relating to the proposed loss, it can be seen that the inclusion of the ASPP module considerably reduces the variability of the index.

Fig. 4

Boxplot comparison of models’ performance in terms of TPR. Each box shows the score of different architectures, comparing the different loss functions under examination

Fig. 5

Boxplot comparison of models’ performance in terms of FDR. Each box shows the score of different architectures, comparing the different loss functions under examination

The reduction in FDR is linked to the reduction in FP, and this is most evident in the more complex images, where the nerve fibers are very thin and have relatively low contrast and where the UNet taken as baseline struggles to maintain continuity in the segmentation. With the introduction of the ASPP module, better results are obtained: the pixels classified as FP are reduced. Figure 6 shows the results obtained with two of the analyzed models: the baseline and the one chosen as the best model (highest Dice with tolerance score). The image shows thin nerve fibers, poor contrast in some areas, and the presence of dendritic cells. The baseline mistakes dendritic cells for nerve fibers and, at the same time, struggles to trace the nerve fibers that appear thinner and with lower contrast. On the other hand, the UNet with attention modules and the atrous-spatial pyramid pooling module recognizes the low-contrast fibers and discards the luminous patterns that belong to the dendritic cells. Moreover, it seems to recognize some fibers that were not traced manually, such as the one in the lower part of the image (traced in red since it corresponds to an FP in the analysis).

Fig. 6

From left to right: original corneal confocal image, manual tracing, automatic tracing with the baseline (UNet trained with the binary cross-entropy loss function), and automatic tracing with the best model (AM-ASPP-UNet trained with the Dice with tolerance loss proposed in this work). Green represents true positives (TP), blue represents false negatives (FN), and red represents false positives (FP)

6 Conclusion

In vivo confocal microscopy is a technique that allows the acquisition of images of the corneal layers rapidly and noninvasively. The acquisition of images of the corneal sub-basal plexus allows the analysis of the nerve fibers present in it, which are strictly correlated to the presence of ocular or systemic diseases. It is important to note that all clinical parameters (useful for the diagnostic process) depend on identifying and tracing the nerve fibers.

The analysis of IVCM images in clinical practice is complicated: The manual tracing phase is very time-consuming, and, to the best of our knowledge, there is still no universally accepted automatic technique for performing it.

In recent years, thanks to new deep learning techniques, better results have been obtained in tracing structures present in images: with an adequate dataset and an appropriate training process, it is possible to obtain excellent results. In this paper, we presented an extension and improvement of our previous work on tracing the corneal nerve fibers in IVCM images of the sub-basal plexus. We investigated the ability to improve tracing through architectural improvements to the baseline model (UNet). We improved the architecture by adding residual connections, the atrous-spatial pyramid pooling (ASPP) module, and attention modules (AM).

To boost the prediction performance, we also investigated four different loss functions and proposed a new tolerance Dice loss function. We trained all the architectures (with all five loss functions) using the automatic nerve tracings obtained by Guimarães et al. [25], which may present errors (missing or misclassified nerves). To evaluate the performance of all the models and loss functions under investigation, the true positive rate (TPR) and false discovery rate (FDR) with a tolerance margin were calculated. Furthermore, to better compare the performances, a third index was obtained from these last two: the Dice with tolerance index (\(DICE_{tol}\)). In almost all loss function cases, the introduction of the new structures outperformed the UNet (baseline architecture): this is clear looking at the \(DICE_{tol}\) index table.

Looking at the same architecture and examining the results obtained using different loss functions during training, the proposed loss function presents a higher Dice with tolerance score in almost all cases. The TPR index is also almost always the best, while the FDR improves only with respect to the binary cross-entropy case. Furthermore, the use of the proposed loss function makes the continuity of the nerve fibers even more evident, giving a better result.

For future work, it will be interesting to analyze the data used during the training phase, keeping only those images whose results are acceptable from a clinical point of view (by reducing the error in the ground-truth dataset, the performance should improve). Selecting images already analyzed automatically, or improving an automatic analysis, is less time-consuming than a completely manual analysis.