Joint optic disc and cup segmentation based on multi-scale feature analysis and attention pyramid architecture for glaucoma screening

Automatic segmentation of the optic disc (OD) and optic cup (OC) is an essential task in analysing colour fundus images. In clinical practice, accurate OD and OC segmentation assists ophthalmologists in diagnosing glaucoma. In this paper, we propose a unified convolutional neural network, named ResFPN-Net, which learns the boundary features and the inner relation between OD and OC for automatic segmentation. The proposed ResFPN-Net is mainly composed of a multi-scale feature extractor, a multi-scale segmentation transition and an attention pyramid architecture. The multi-scale feature extractor encodes the features of fundus images and captures the boundary representations. The multi-scale segmentation transition is employed to retain the features of different scales. Moreover, an attention pyramid architecture is proposed to learn rich representations and the mutual connection between the OD and OC. To verify the effectiveness of the proposed method, we conducted extensive experiments on two public datasets. On the Drishti-GS database, we achieved Dice coefficients of 97.59% and 89.87%, accuracies of 99.21% and 98.77%, and Averaged Hausdorff distances of 0.099 and 0.882 for OD and OC segmentation, respectively. On the RIM-ONE database, we achieved Dice coefficients of 96.41% and 83.91%, accuracies of 99.30% and 99.24%, and Averaged Hausdorff distances of 0.166 and 1.210 for OD and OC segmentation, respectively. Comprehensive results show that the proposed method outperforms other competitive OD and OC segmentation methods and appears more adaptable in cross-dataset scenarios. The introduced multi-scale loss function achieved significantly lower training loss and higher accuracy compared with other loss functions. Furthermore, the proposed method was validated on the OC to OD ratio calculation task and achieved the best MAE of 0.0499 and 0.0630 on the Drishti-GS and RIM-ONE datasets, respectively. Finally, we evaluated the effectiveness of glaucoma screening on the Drishti-GS and RIM-ONE datasets, achieving AUCs of 0.8947 and 0.7964, respectively. These results show that the proposed ResFPN-Net is effective in analysing fundus images for glaucoma screening and can be applied to other related biomedical image segmentation applications.


Introduction
Glaucoma is the second leading cause of blindness in the world (after cataracts) and the leading cause of irreversible blindness [26]. It is estimated that glaucoma will affect over 111.8 million people by 2040 [40]. As a chronic disease, glaucoma affects the physiological structure of patients' eyes, causing thinning of the ganglion cell-inner plexiform layer (GCIPL), an increase in the cup-to-disc ratio, and narrowing of the optic disc rim [15]. Normally, no evident symptoms appear in the early stage of glaucoma, so many patients are diagnosed only in the late stage, when the damage to visual function is irreversible. Therefore, early screening is essential for treating glaucoma and preventing vision loss.
Currently, colour fundus imaging and optical coherence tomography (OCT) are the most widely used imaging techniques in the early screening of glaucoma. Compared with OCT, colour fundus imaging is less expensive and more frequently used for detecting glaucoma. The optic cup (OC) to optic disc (OD) ratio (CDR) measured in fundus images is an important indicator in the screening and diagnosis of glaucoma [9]. As shown in Fig. 1, the CDR of healthy eyes is generally between 0.3 and 0.4. When the CDR reaches 0.65, the eye is clinically considered glaucomatous. Manually delineating the OD and OC is time-consuming: it takes a professional ophthalmologist about 8 minutes on average to completely segment the OD and OC in a fundus image [21]. Hence, developing automatic algorithms to segment the OD and OC from fundus images is important for lightening the burden on ophthalmologists and promoting large-scale glaucoma screening.
Most of the early OD and OC segmentation methods are based on hand-crafted features (e.g. colour, gradient and texture features), including adaptive threshold-based methods [2,27], region growing methods [28] and wavelet transform-based segmentation methods [6]. However, these hand-crafted features are easily affected by the physiological structure of the fundus images.
In recent years, deep learning has achieved excellent performance in tasks such as image classification [16], object detection [30], and image segmentation [24]. A large number of OD and OC segmentation methods based on deep learning have been proposed [12,34,36]. Due to the uncertainty of the boundary of the OD and OC in the fundus image, the accurate segmentation of OD and OC is still a challenging task. Most of the existing methods divide the segmentation of OD and OC into two stages or only conduct OD segmentation, which overlooks the inner connection between OD and OC. Moreover, most methods only use a single scale to process the image, which cannot fully capture the detailed features of the OD and OC, especially edge information.
In this paper, we propose a convolutional neural network, named ResFPN-Net, for joint OD and OC segmentation. The main contributions of our work can be summarized as follows: (1) A segmentation network for joint OD and OC segmentation: through multi-scale loss supervision, the network can accurately segment the OD and OC from fundus images by fully exploiting the internal relationship between them. (2) A multi-scale feature extractor: it takes images of different scales as input and merges information from various feature maps, which adequately expresses the feature information of the fundus image and preserves the edge features. (3) An attention pyramid structure: this structure combines the attention mechanism with the feature pyramid architecture to enhance the representation of the OD and OC in the fundus image, which improves the segmentation performance of the network.

Related works
In the early stage, most research on OD and OC segmentation was based on hand-crafted features. These features mainly include colour, texture, contrast, and gradient information. Abdel-Ghafar et al. [1] proposed a threshold-based segmentation method to segment the OD. This method utilizes the Sobel operator to enhance the fundus image; subsequently, the image is processed by local thresholding, and the Hough transform is applied to obtain the OD region. Osareh et al. [29] proposed an OD localization method based on colour channels. Juneja et al. [39] applied the fuzzy C-means clustering method to segment the OD and OC, with the Canny operator employed for post-processing. In OD and OC segmentation methods, edge detection algorithms such as the Sobel operator and the Canny operator can improve the accuracy of segmentation. Different from the edge detection operators, pixel classification-based methods transform the edge detection problem into a pixel classification problem and achieve satisfactory results. Cheng et al. [8] proposed a superpixel classification method to segment the OD and OC and applied histograms and centre-surround statistics to classify each superpixel as disc or non-disc. In [42], a deformation-based method is proposed to locate the OD and OC. In addition, template-based methods [20] and reconstruction-based learning methods [41] are also widely used in OD and OC segmentation. However, these methods rely heavily on hand-crafted features, which limits their performance.
Recently, deep learning has made great achievements in natural image segmentation and medical image segmentation, with methods such as Mask-RCNN [13] and U-Net [31]. Many OD and OC segmentation methods based on deep learning have also emerged. In [34], a modified U-Net architecture is proposed to segment the OD and OC, which achieves a shorter prediction time than traditional convolutional networks. In [18], an end-to-end convolutional neural network, named JointRCNN, is proposed to segment the OD and OC, which applies atrous convolution to boost segmentation performance. However, these methods treat OD and OC segmentation as separate tasks. Gu et al. [12] proposed CE-Net to capture more high-level information and retain spatial information for segmenting the OD. Motivated by the conventional U-Net architecture, Baid et al. [5] proposed a ResUnet architecture to segment the OD. Al-Bander et al. [33] used VGG as the backbone together with transfer learning to solve the problem of OD segmentation. However, these methods segment only the optic disc region and therefore ignore the close relationship between the OD and OC. Subsequently, Stack-U-Net [35] was proposed, which takes U-Net as the backbone and trains the network following the idea of iterative refinement. In [43], a modified U-Net architecture using ResNet-34 as the encoding layer was proposed for the segmentation of the OD and OC. Al-Bander et al. [3] proposed a new segmentation network that combines DenseNet with a fully convolutional network. Fu et al. [10] used a polar transformation to flatten the image around the OD centre and applied interpolation to enlarge the cup region. However, the polar coordinate transformation makes the edges of the OD less smooth.

Methodology
Inspired by RetinaNet [23], we propose ResFPN-Net, as shown in Fig. 2. The framework has four components: multi-scale feature extractor, multi-scale segmentation transition, attention pyramid architecture, and multi-scale loss supervision. The multi-scale feature extractor receives fundus images at various scales as input. The multi-scale segmentation transition fuses multi-level feature maps and preserves feature maps of different scales. The feature maps are then passed to an attention pyramid structure to capture the inner connection between the OD and OC. Finally, the segmentation result of the OD and OC is obtained. The entire network is trained with multi-scale loss supervision. The following sub-sections introduce the details of this architecture.

Multi-scale extractor
The extraction of OD and OC edge information in fundus images can improve segmentation accuracy. However, in the fundus image, the boundaries of the OD and OC are usually not clear, so it is difficult to retain the details at a single scale. In general, convolution with a large receptive field is suitable for large objects, while convolution with a small receptive field can capture detailed information. Therefore, we take multi-scale fundus images as input to construct various receptive fields and learn the edge features more completely. As shown in Fig. 2, we modify ResNet [14], an efficient residual network for image classification, as our feature extractor. Specifically, all fundus images are resized to 512 × 512, 256 × 256, 128 × 128, and 64 × 64 pixels. We first apply a convolution with a kernel size of 7 × 7 to the 512 × 512 fundus images. Batch normalization (BN) and the ReLU activation function are then applied to derive the feature map, denoted as $s_2$. We then construct separate convolution layers to receive the fundus images at the other three scales, with a kernel size of 3 × 3 and channel numbers of 64, 128, and 256, respectively. Each convolution is followed by a rectified linear unit (ReLU). The feature maps derived from the fundus images of the other three scales are denoted as $s_3$, $s_4$, and $s_5$.
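The four input scales can be illustrated with a small sketch. It is not the authors' preprocessing: the text only says the images are resized, so the 2 × 2 block averaging used here, and the helper names `downsample_2x` and `build_pyramid`, are assumptions chosen to keep the example dependency-free. An 8 × 8 toy array stands in for the 512 × 512 image.

```python
def downsample_2x(image):
    """Halve height and width by averaging each 2x2 block (one possible resize)."""
    h, w = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def build_pyramid(image, levels=4):
    """Return the image at full, 1/2, 1/4 and 1/8 resolution (for levels=4)."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


# Toy stand-in for a 512 x 512 fundus image: an 8 x 8 intensity ramp.
toy = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = build_pyramid(toy)
sizes = [(len(p), len(p[0])) for p in pyramid]
print(sizes)  # [(8, 8), (4, 4), (2, 2), (1, 1)]
```

With a 512 × 512 input the same function would yield the 512, 256, 128 and 64 pixel scales listed above.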

Multi-scale segmentation transition
The encoder-decoder structure is generally employed in many frameworks for image segmentation. In this paper, our segmentation architecture is also based on this structure. In an encoder-decoder structure, the encoder is used to compress and encode the feature information of the image; the decoder is deployed to restore the encoded information. However, some segmentation methods [4,45] based on encoder-decoder structure do not fully preserve multi-scale feature information. In our segmentation task, the multi-scale input is integrated into the decoder layer to broaden the network width of the decoder path.
To transfer the detailed features and the multi-scale information to the decoder, we generate a set of feature maps from the multi-scale inputs as information transitions between the encoder and decoder. Specifically, the feature map $s_2$ is fed to a residual block, which consists of a set of convolution and downsampling operations. The feature map derived from the residual block is denoted as $c_2$. However, there are significant gaps between the features extracted from fundus images of different sizes. Directly merging these features can weaken the representation of the multi-scale image. In this paper, we propose a fusion attention module to alleviate the gaps among these feature maps, as shown in Fig. 3. Firstly, we merge two feature maps by channel-wise concatenation followed by a convolution layer and BN, which produces the merged feature map V.
Then, we collect global contextual information by global average pooling. We apply a 1 × 1 convolution and the Softmax activation function to derive the attention matrix from the global context information. The attention matrix is then multiplied with V to obtain the fused feature map. Finally, the fused feature map is forwarded to the corresponding residual block. Following the above procedure, the multi-level features built by the fusion attention module and residual blocks are denoted as {$c_2$, $c_3$, $c_4$, $c_5$}, whose channel numbers are {256, 512, 1024, 2048}, as shown in Fig. 2.
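The pooling, 1 × 1 convolution, Softmax and multiplication steps can be sketched as a channel re-weighting. This is only one reading of the description, not the authors' module: it assumes the Softmax runs across channels, it models the 1 × 1 convolution on the pooled vector as a C × C matrix product, and the names `fusion_attention` and `weight` are hypothetical.

```python
import math


def global_average_pool(v):
    """v is a list of C channels, each an H x W list of lists; returns C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in v]


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_attention(v, weight):
    """Re-weight the channels of v with a softmax attention vector.

    weight is a C x C matrix standing in for the 1 x 1 convolution, which on a
    pooled C-vector reduces to a matrix-vector product (assumption).
    """
    context = global_average_pool(v)
    logits = [sum(w * x for w, x in zip(row, context)) for row in weight]
    attention = softmax(logits)
    fused = [[[a * px for px in row] for row in ch] for a, ch in zip(attention, v)]
    return attention, fused


v = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]  # C=2, H=W=2
identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights, not learned values
attention, fused = fusion_attention(v, identity)
print([round(a, 4) for a in attention])  # [0.1192, 0.8808]
```

In this toy case the channel with the larger mean activation receives most of the weight, which is the intended effect of deriving attention from global context.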

Attention pyramid architecture
We collect four feature maps of different scales through the multi-scale segmentation transition: {$c_2$, $c_3$, $c_4$, $c_5$}. Then, we utilize the Feature Pyramid Network (FPN) [22] to explore features at different scales. The FPN was originally employed in the object detection task to solve the problem of multi-scale object detection. It adds different feature maps through a top-down pathway and lateral connections to aggregate multi-scale features. However, there are significant differences among these four feature maps. Specifically, the feature maps in the deeper layers are spatially coarser but have more semantic information. In contrast, the feature maps in the lower layers contain rich location information but fewer semantic features. We believe that this simple addition will weaken the expression of some features and cannot fully learn the close relation between the OD and OC. More importantly, fundus vessels in the OD and OC region make it difficult to segment the OD and the OC accurately.
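For reference, the plain FPN addition that this paragraph argues against can be sketched as follows. It is a simplified illustration rather than the FPN implementation of [22]: each level is a single-channel map, the lateral 1 × 1 convolutions are omitted, and nearest-neighbour upsampling is assumed.

```python
def upsample_nearest_2x(fmap):
    """Double height and width by repeating each value in a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(laterals):
    """laterals: maps ordered fine -> coarse, each half the size of the previous.

    Returns merged maps in the same order, where every level is its own lateral
    map plus the upsampled merged map from the next coarser level.
    """
    merged = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        up = upsample_nearest_2x(merged[0])
        merged.insert(0, [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)])
    return merged


c_fine = [[1.0] * 4 for _ in range(4)]
c_mid = [[10.0] * 2 for _ in range(2)]
c_coarse = [[100.0]]
p_fine, p_mid, p_coarse = top_down_merge([c_fine, c_mid, c_coarse])
print(p_mid[0][0], p_fine[0][0])  # 110.0 111.0
```

The toy values show how the coarse map dominates the sum at every finer level, which is one way to picture the concern that simple addition can weaken some features.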
In this paper, we propose an attention pyramid mechanism that concatenates multi-scale features to solve the above problems. In this architecture, an attention module integrates the high-level feature map and the low-level feature map, which bridges the gaps between the deeper and the lower feature maps. In addition, each region of the input image is given a different weight to extract more critical information and help the model distinguish between the target region and the background. Specifically, the feature maps obtained by the multi-scale transition, {$c_2$, $c_3$, $c_4$, $c_5$}, are fed to the corresponding convolution layers of the pyramid network. Subsequently, the attention module concatenates high-level features with low-level features to achieve feature fusion, as shown in Fig. 4.
Our attention module is based on CBAM [32] and is shown in Fig. 5, where $p_i$ and $p_j$ represent the feature maps from different convolution layers. We first upsample $p_j$ with bilinear interpolation and add it to $p_i$ to produce the intermediate feature map f. Then, Adaptive Average Pooling, Adaptive Max Pooling and a 3 × 3 convolution layer, followed by ReLU and Sigmoid activation functions, are applied to generate two new feature maps $S \in \mathbb{R}^{C \times H \times W}$ and $L \in \mathbb{R}^{C \times H \times W}$, where C indicates the number of channels, and H and W are the height and width of the feature map. Finally, these two feature maps are added together to obtain the final feature map $O \in \mathbb{R}^{C \times H \times W}$.

Loss function
The OD and OC segmentation is formulated as a multi-label problem in our task. In the original fundus image, the background region occupies a much larger proportion than the OD and OC. This class imbalance affects the performance of the network during training. Therefore, we use focal loss [23] as the loss function for multi-class segmentation, which balances the proportion of the target region and background region by adding weights to the corresponding loss of each sample.
To adequately train the network, we introduce sub-output layers to construct a multi-scale loss. The advantage of the multi-scale loss is that it prevents the gradient from vanishing during training. For each sub-output, the segmentation loss between the ground-truth mask and the prediction is the focal loss of [23], given by Eq. (2):

$$L_{seg}(P_t) = -\alpha \, (1 - P_t)^{\gamma} \log(P_t) \qquad (2)$$

where $P_t$ is the predicted probability of the true class, $\alpha$ is a balancing variable that balances the number of positive and negative samples, and $\gamma$ is a hyperparameter used to focus the model on samples that are difficult to classify during training. Besides, we integrate the sub-outputs to calculate the fusion loss ($L_{fusion}$). There are four sub-outputs in our task, denoted as $O_1$, $O_2$, $O_3$, $O_4$; they are fused into a single output O, on which $L_{fusion}$ is computed. Finally, the multi-scale loss function of the segmentation network combines the losses of the sub-outputs with $L_{fusion}$, where N represents the number of sub-outputs.
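A per-pixel focal loss in the standard form of [23], $-\alpha (1 - P_t)^{\gamma} \log(P_t)$, can be sketched as below. The function names are hypothetical, the defaults 0.25 and 3 follow the values reported under Implementation details, and the sketch covers only the per-pixel term, not the fusion or multi-scale combination.

```python
import math


def focal_loss(p_t, alpha=0.25, gamma=3.0):
    """Focal loss for one pixel, where p_t is the probability of the true class."""
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, alpha=0.25, gamma=3.0):
    """Average the per-pixel focal loss over a list of true-class probabilities."""
    return sum(focal_loss(p, alpha, gamma) for p in probs) / len(probs)


easy, hard = focal_loss(0.9), focal_loss(0.1)
print(easy < hard)  # True
```

A confidently classified pixel (0.9) contributes several orders of magnitude less than a hard one (0.1), which is how the loss keeps the abundant background pixels from dominating training. With `alpha=1` and `gamma=0` the function reduces to ordinary cross-entropy.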

Datasets and evaluation method
Experiments are conducted on two public datasets. The first dataset is the Drishti-GS dataset [37], collected by Aravind Eye Hospital, Madurai, India. It contains 101 colour fundus images, which are divided into a training set and a testing set. The training set contains 50 images with ground truth for OD and OC segmentation. The remaining 51 images are used for the testing. The second database is RIM-ONE [11]. It contains 159 fundus images, including 85 images from healthy eyes as well as 74 images from eyes with glaucoma at different stages. RIM-ONE database provides pixel-level segmentation of OD and OC labelled by two ophthalmologists as the ground truth.
Three evaluation metrics are adopted to evaluate our proposed algorithm: Dice coefficient (DC), accuracy (acc) and Hausdorff distance (HD).
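The three metrics can be sketched on binary masks as follows. The definitions are assumptions where the text is silent: in particular, the averaged Hausdorff distance is taken here as the sum of the two directed mean nearest-neighbour distances between boundary point sets, which is one common convention and may differ from the one used for the reported numbers.

```python
import math


def dice_coefficient(pred, truth):
    """Dice = 2|A and B| / (|A| + |B|) for two binary masks (rows of 0/1)."""
    inter = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 1.0 if total == 0 else 2.0 * inter / total


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)


def averaged_hausdorff(points_a, points_b):
    """Mean nearest-neighbour distance A->B plus B->A (one common definition)."""
    def directed(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)
    return directed(points_a, points_b) + directed(points_b, points_a)


pred = [[0, 1, 1], [0, 1, 0]]
truth = [[0, 1, 0], [0, 1, 0]]
print(dice_coefficient(pred, truth))  # 0.8
```

Here the prediction has one false-positive pixel, so two of its three foreground pixels overlap the two ground-truth pixels, giving 2 · 2 / (3 + 2) = 0.8.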

Implementation details
The network was implemented in PyTorch, and the Adam optimization algorithm [19] was used to train it. The network was trained on an NVIDIA GeForce 3090 Super GPU with 24 GB of graphics memory. Our multi-scale extractor is initialized with parameters pre-trained on ImageNet. During training, we set the initial learning rate to 0.0001 and used cosine decay to adjust the learning rate. In our implementation, we set $\alpha$ and $\beta$ to 0.25 and $\gamma$ to 3. We set the mini-batch size to 8 for all training and trained the network for 300 iterations. To improve the performance of the model, all images were cropped to 800 × 800 pixels centred on the OD. We used various transformations to augment the training set, including rotations of 90, 180, and 270 degrees.
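A cosine decay schedule starting from the stated learning rate of 0.0001 might look like the sketch below. The floor `min_lr=0.0`, the absence of warm-up, and the choice to decay over all 300 steps are assumptions, since the text does not give those details.

```python
import math


def cosine_decay_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step`, annealed from base_lr to min_lr over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 150, 300):
    print(step, cosine_decay_lr(step, total_steps=300))
```

The rate starts at the base value, passes through roughly half of it at the midpoint, and reaches the floor at the final step.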

Comparison of loss functions
Different loss functions were compared on the Drishti-GS dataset. In addition to the proposed multi-scale loss, Cross-Entropy loss, Lovasz Softmax loss, and Dice loss were each used to train our network. The model was trained with an initial learning rate of 0.0001. As displayed in Fig. 6, when the multi-scale loss is used to train the network, the model converges to a loss of 0.008 at around 300 epochs. With Dice loss, the loss converges to about 0.06. However, the convergence of Lovasz Softmax loss and cross-entropy loss is not satisfactory: they only converge to about 0.21 after 300 iterations. Therefore, the proposed multi-scale loss is more suitable for training the OD and OC segmentation network.

Segmentation results
Extensive experiments were conducted on two public databases, and the segmentation results are shown in Table 1. Based on the OD and OC segmentation results, the corresponding CDR values can be further calculated, which can be used to assist ophthalmologists in the diagnosis of glaucoma. We use the mean absolute error (MAE) to evaluate the accuracy of CDR estimation, which calculates the average absolute error over all samples:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| CDR_G^{i} - CDR_S^{i} \right|$$

where N represents the number of test samples, and $CDR_G$ and $CDR_S$ represent the ground-truth CDR provided by trained clinicians and the CDR calculated from the OD and OC segmentation results, respectively. Our proposed method achieves MAEs of 0.0499 and 0.0630 on the Drishti-GS and RIM-ONE datasets, respectively.
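The path from segmentation masks to a CDR value and then to the MAE can be sketched as below. The text does not say how the CDR is measured, so the sketch assumes the vertical cup-to-disc diameter ratio and approximates each vertical diameter by the number of mask rows containing foreground, which holds only for a single connected region. All function names are hypothetical.

```python
def vertical_diameter(mask):
    """Number of rows that contain at least one foreground pixel."""
    return sum(1 for row in mask if any(row))


def cup_to_disc_ratio(cup_mask, disc_mask):
    """Vertical CDR (assumed definition): cup height divided by disc height."""
    disc = vertical_diameter(disc_mask)
    if disc == 0:
        raise ValueError("disc mask is empty")
    return vertical_diameter(cup_mask) / disc


def mean_absolute_error(cdr_true, cdr_pred):
    """Average of |CDR_G - CDR_S| over all test samples."""
    return sum(abs(g - s) for g, s in zip(cdr_true, cdr_pred)) / len(cdr_true)


disc = [[0, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 1], [0, 1, 1, 0]]
cup = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
print(cup_to_disc_ratio(cup, disc))  # 0.5
```

In the toy masks the cup spans two of the disc's four rows, so the ratio is 0.5.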

Accuracy analysis results
The performance comparison with the state-of-the-art approaches on two public databases is shown in Table 1. Considerable improvement has been achieved for OD and OC segmentation on the Drishti-GS and RIM-ONE datasets: compared with the state-of-the-art approaches, the proposed method is superior in all three metrics.
To compare the adaptability of the model on different databases, we provide a comprehensive cross-dataset performance analysis. Firstly, we used the Drishti-GS training set to train the model and directly evaluated it on the RIM-ONE testing set. We also used the RIM-ONE training set to train the model and tested it on the Drishti-GS dataset. Since the first two methods in Table 2 do not report cross-dataset performance and do not disclose their implementation, we cannot obtain their cross-dataset performance. As shown in Table 2, the proposed method remarkably outperforms the U-Net, M-Net [10], AGNet [44], and CCNet models, indicating a solid generalization ability. On the RIM-ONE database, compared with AGNet, the proposed method achieved 7.27% and 1.95% improvements in Dice and acc for OD segmentation, and 22.23% and 2.86% improvements in Dice and acc for OC segmentation. On the Drishti-GS database, compared with CCNet, the Dice increases by 6.82% and the acc increases by 3.04% for OD segmentation. For OC segmentation, the Dice increases by 0.99% and the acc increases by 2.83%. This improvement can also be witnessed for the HD metric. The confusion matrices of the segmentation results achieved by other competitive methods and our proposed method are shown in Fig. 7. Compared with other methods, our method better distinguishes the target region from the background and does not misclassify the OC region as background. Moreover, the number of misclassified pixels in the OD and OC regions is lower than that of other methods.

Visual analysis results
We show some typical OD and OC segmentation results in Fig. 8 to visually compare the proposed method with competitive methods, including M-Net, AGNet and CCNet. The comparison shows that our method generates more accurate segmentation results than the other approaches. We constructed a multi-scale feature extractor to capture the edge information of the OD and OC. Compared with previous methods (such as M-Net and CCNet), our method depicts the edges of the OD and OC more accurately. Meanwhile, our method uses the attention pyramid architecture to correlate the OD and OC segmentation tasks, which can implicitly learn the relationship between them. As can be seen from Fig. 8, the proposed method locates the OD and OC more accurately than the other approaches.
We also conducted experiments on CDR calculation. Scatterplots of the CDR values calculated from the OC and OD segmentation results of our proposed method and other competitive methods are shown in Fig. 9. It can be observed that the CDR calculated by the proposed method has the highest correlation with the ground truth. On the Drishti-GS database, M-Net achieved an MAE of 0.1003, and AGNet achieved an MAE of 0.0816. In comparison, the proposed method achieved an MAE of 0.0499, which is an absolute reduction of 0.0111 from the 0.0610 achieved by CCNet. On the RIM-ONE dataset, M-Net achieved an MAE of 0.0995, and AGNet achieved an MAE of 0.0813. The proposed method achieved an MAE of 0.0630, which is an absolute reduction of 0.0133 from the 0.0763 achieved by CCNet. Compared with other methods, the proposed method achieved the highest accuracy on CDR calculation.

Glaucoma screening
In this section, we evaluate the proposed method on glaucoma screening using the CDR values calculated on the Drishti-GS and RIM-ONE datasets. We use the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) as the metrics of diagnostic accuracy, as shown in Fig. 10. From the ROC curves and AUC scores, it can be seen that the proposed method achieved the best performance on both public datasets. Compared with CCNet, the AUC score increased from 0.8725 to 0.8947 on the Drishti-GS dataset. On the other database, compared with the second-best method, M-Net, the AUC score increased by
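Using the CDR as a screening score, the AUC can be computed directly from its rank interpretation: the probability that a glaucomatous eye receives a higher CDR than a healthy one. The sketch below is a generic pairwise computation with made-up scores and labels, not the evaluation code or data behind Fig. 10.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


cdr_values = [0.35, 0.42, 0.71, 0.66, 0.58, 0.38]  # made-up CDR scores
is_glaucoma = [0, 0, 1, 1, 0, 1]                   # made-up labels
auc = roc_auc(cdr_values, is_glaucoma)
print(round(auc, 4))
```

In this toy set, seven of the nine positive-negative pairs are ordered correctly, so the AUC is 7/9. The pairwise loop is quadratic in the number of samples, which is fine for test sets of the size used here.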

Ablation experiments
Ablation experiments were conducted on the Drishti-GS dataset. For ease of description, we use ME, MT, AP and MF to denote the multi-scale extractor, multi-scale segmentation transition, attention pyramid architecture and multi-loss function, respectively. The results achieved with different components of the model are shown in Table 3. We used the ResNet50+FPN network as the baseline model and adopted focal loss to train it. As ME, MT, AP and MF were gradually added to the segmentation model, all the evaluation metrics continually improved. Hence, the contribution of each improvement of the proposed model is verified. The ME module captures multi-scale features to preserve the boundary and other detailed information, which brings significant benefits to OD and OC segmentation. Compared with the baseline, the Dice increased by 1.30%, the acc increased by 0.44% and the Avg. HD decreased by 0.105 for OD segmentation. For OC segmentation, the Dice increased by 4.96%, the acc increased by 0.70% and the Avg. HD decreased by 0.725. The MT module is integrated into the network to retain the multi-scale feature maps and reduce the burden on the decoder. From Table 3, it can be seen that the MT module contributes greatly to the improvement of segmentation accuracy. The AP module not only bridges the semantic gaps between different levels but also implicitly learns the internal relationship between the OD and OC. When the AP module replaces the corresponding module in the baseline model, the segmentation accuracy also improves to varying degrees. Finally, we showed that MF supervision could improve the segmentation accuracy of the OD and OC. Experiments showed that, by combining these components and training with the MF, the network can achieve excellent segmentation results. Therefore, the MF is useful for our segmentation task.

Conclusion
In this work, we proposed a novel deep learning architecture that can achieve OD and OC segmentation simultaneously. The proposed ResFPN-Net is trained under multi-loss supervision and converges quickly in a limited time. We have evaluated our method on two public datasets, i.e. Drishti-GS and RIM-ONE. Comprehensive experiments demonstrated the benefit of each improvement and showed that our method could accurately segment the OD and OC and outperformed other methods. The proposed multi-scale loss function converges much more quickly and reaches a significantly lower training loss than the compared loss functions. By sharing the features of the OD and OC across the segmentation tasks, the proposed one-stage OD and OC segmentation network achieved both high accuracy and high efficiency. Cross-dataset experiments demonstrated the generalization performance of the network. Ablation experiments confirmed the contribution of each improvement of the proposed method. Based on the OD and OC segmentation results derived by the proposed ResFPN-Net, a more accurate CDR can be calculated, which can provide key support for glaucoma diagnosis. The proposed framework also has strong potential for other related biomedical image segmentation tasks.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.