Introduction

Diabetic retinopathy (DR) is a common complication of diabetes mellitus (DM) [1] and the leading cause of avoidable blindness among working-age individuals in many nations [2, 3]. Owing to the extensive epidemic of DM [4, 5], the prevalence of DR has reached an alarming level and continues to rise [2, 3]. It is estimated that more than 160 million people worldwide will suffer from DR by 2045, 44.82 million of whom will have vision-threatening disease [6], imposing huge medical and economic burdens. In the pathogenesis of DR, elevated plasma glucose levels trigger changes in the biochemical environment that lead to microvascular damage. One prominent clinical lesion is the non-perfusion area (NPA) in the retina [7]. An NPA is a patch of non-perfused capillaries caused by shunting and changes in blood viscosity secondary to vascular wall damage [8, 9], manifested as the occlusion or closure of local capillaries and dilation of adjoining ones [9].

Current NPA detection relies on fundus fluorescein angiography (FFA) or optical coherence tomography angiography (OCTA). Although OCTA has emerged as a non-invasive examination, its application has been limited by high cost and restrictions such as artifact correction [10], so FFA remains the gold standard. On FFA, fluorescein injected into the bloodstream allows direct visualization of the retinal vasculature, and NPA appears as hypofluorescent dark areas surrounded by hyperfluorescent dilated vessels [11], presenting as scattered, irregularly bordered patches in DR patients. Quantification of NPA has been established as a biomarker for DR assessment and follow-up [12, 13], with important implications for quantifying disease severity [13], predicting progression [13,14,15], and even guiding treatment schemes [14, 16]. However, manual quantification of NPA is time-consuming and labor-intensive, so automatic quantification is essential to make it practical in real-world clinics.

Previous studies have proposed algorithms based on image processing [17, 18] for automatic NPA detection. These methods characterized NPA by its lower grayscale intensity and more monotonic texture compared with regularly perfused regions on FFA [18,19,20]. They inevitably depend on illumination correction and noise removal to minimize disturbances introduced during image capture, build models by over-segmenting primary regions [19] or by using topographic characteristics to designate NPA as a valley [20] or pool [18], and achieve results comparable to manual labels, with an area under the curve (AUC) of around 0.8 [18, 20]. However, these methods require manual feature extraction and empirical parameter selection, and they lack robustness in practice.

In recent years, deep learning (DL) techniques have made breakthroughs in a variety of fields, such as computer vision, natural language processing, and speech recognition, owing to their powerful feature representation capabilities [21]. In the assisted diagnosis and screening of ophthalmic diseases, DL has shown excellent performance in detecting clinical features such as hemorrhage and exudation [22], diagnosing DR [23], and grading severity [24, 25]. Meanwhile, DL has also been applied to automatically detect quantifiable indicators, such as vascular segmentation [26] and fluid quantification [27], and a few studies have produced promising outcomes for NPA detection on FFA [28, 29]. Tang et al. [30] proposed a deep learning model for segmenting non-perfusion regions from FFA images that uses dense atrous and deformable convolution (DADC), a dense atrous convolution (DAC) block, and residual multi-kernel pooling to learn better feature representations. Jin et al. [29] explored different deep learning models (DenseNet, ResNet50, and VGG16) to simultaneously identify NPAs, microaneurysms, and leaks in FFA images. Despite the promising performance achieved, accurate identification of non-perfusion areas from FFA images remains challenging, mainly for two reasons: (1) the shape and size of non-perfusion areas in FFA images are usually irregular and diffuse, so small regions with fuzzy boundaries are often missed by the model; and (2) the contrast between the non-perfusion region and the surrounding area is low, making it harder for the model to identify the region accurately.

To address the above issues, in this paper we propose a new deep learning model, NPA-Net, for the accurate identification of non-perfusion regions from FFA images. Firstly, considering the low contrast between the non-perfusion region and its surroundings, we use the Contrast Limited Adaptive Histogram Equalization (CLAHE) technique [31] to enhance image contrast and improve the recognition performance of the model. Secondly, as the shape of the non-perfusion regions varies, incorporating multi-scale features and contextual information during training can help the model better cope with irregular non-perfusion regions. To this end, we use an adaptive encoder feature fusion module (AEFF), a multilayer deep supervised loss, and an atrous spatial pyramid pooling module (ASPP) to adaptively fuse multi-scale features and contextual information during segmentation, further improving the recognition performance of the model for non-perfusion regions. We have conducted extensive experiments on a stitched FFA image dataset acquired from a clinical setting, and the results show that NPA-Net outperforms other traditional and deep learning methods by a large margin.

Fig. 1 The overall framework of NPA-Net

Methodology

The overall structure of NPA-Net is shown in Fig. 1. NPA-Net is a U-shaped encoder-decoder network containing four encoders and four corresponding decoders, followed by a classification layer and a softmax function that outputs the predictions of the model. Each module contains a convolution layer, a batch normalization layer, and a dropout layer with a dropout rate of 0.2; the dropout and batch normalization layers reduce overfitting and accelerate convergence, respectively. The numbers of channels in the four encoders are 64, 256, 512, and 1024, respectively. Each encoder is followed by a \(2 \times 2\) pooling layer that reduces the feature map, while an upsampling layer enlarges the feature map during decoding. In addition, skip connections combine the feature maps extracted by each encoder with those of the corresponding decoder.
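To make the architecture concrete, the following is a minimal PyTorch sketch of this backbone under stated assumptions: the block names (`ConvBlock`, `NPANetBackbone`), the 3 × 3 kernel size, the ReLU activations, the bilinear upsampling, and the decoder channel widths are illustrative choices, not details taken from the released implementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution + batch normalization + dropout (rate 0.2), as described above."""
    def __init__(self, in_ch, out_ch, p=0.2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p),
        )

    def forward(self, x):
        return self.block(x)

class NPANetBackbone(nn.Module):
    """U-shaped network: four encoders, four decoders, skip connections."""
    def __init__(self, in_ch=1, num_classes=2, chs=(64, 256, 512, 1024)):
        super().__init__()
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoders.append(ConvBlock(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)  # 2x2 pooling after each encoder
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # decoder widths are an assumption; each decoder consumes the upsampled
        # features concatenated with the corresponding encoder skip connection
        dec_chs = (512, 256, 64, 64)
        self.decoders = nn.ModuleList()
        prev = chs[-1]
        for skip_c, out_c in zip(reversed(chs), dec_chs):
            self.decoders.append(ConvBlock(prev + skip_c, out_c))
            prev = out_c
        self.classifier = nn.Conv2d(dec_chs[-1], num_classes, kernel_size=1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)   # saved for the skip connection
            x = self.pool(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = self.up(x)
            x = dec(torch.cat([x, skip], dim=1))
        logits = self.classifier(x)
        return torch.softmax(logits, dim=1)  # per-pixel class probabilities
```

For example, `NPANetBackbone()(torch.randn(1, 1, 64, 64))` returns a `(1, 2, 64, 64)` probability map for a single 64 × 64 patch.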

To improve the recognition performance of the model for non-perfusion regions, we introduce three key components: an adaptive encoder feature fusion module (AEFF), a multilayer deep supervised loss, and an atrous spatial pyramid pooling module (ASPP).

Adaptive encoder feature fusion module

Fig. 2 The structure of the adaptive encoder feature fusion module

In FFA images, non-perfusion regions are usually irregular and of varying shape and size. Traditional deep learning segmentation models such as U-net cannot effectively handle such irregularly shaped regions. We believe that introducing multi-scale features and contextual information can help the model better identify non-perfusion regions. To this end, we propose an adaptive encoder feature fusion module that fuses the multi-scale feature maps generated by different encoder layers and adaptively learns the weights of the different scales (Fig. 2). Specifically, we first obtain the output \(E_{l} (l \in \{1,2,3,4\})\) of each encoder layer, corresponding to the multi-scale deep feature representations extracted by encoder 1 to encoder 4. As the resolution and scale of the feature representations extracted by different encoders differ, we scale the feature maps of encoders 1, 2, and 3 to the size of the encoder 4 feature map through convolution and pooling layers. We then introduce an adaptive weighting layer to fuse the feature representations at different scales, and the fused feature representation is:

$$\begin{aligned} E_{fused}=w_{1} \cdot E_{1}+w_{2} \cdot E_{2}+w_{3} \cdot E_{3}+w_{4} \cdot E_{4} \end{aligned}$$
(1)

where \(w_{1}, w_{2}, w_{3}, w_{4}\) represent scalar variables used to weight the feature representations at different scales and \(w_{1}+w_{2}+w_{3}+w_{4}=1\). These weights can be calculated by:

$$\begin{aligned} w_{m}=\frac{e^{\lambda _{m}}}{e^{\lambda _{1}}+e^{\lambda _{2}}+e^{\lambda _{3}}+e^{\lambda _{4}}} \end{aligned}$$
(2)

where \(\lambda _{1}, \lambda _{2},\lambda _{3},\lambda _{4}\) are learnable parameters that we update by backpropagation.
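As an illustration, the following PyTorch sketch implements Eqs. (1) and (2): four learnable scalars \(\lambda _{m}\) pass through a softmax so the fusion weights stay positive and sum to one. The projection of each encoder output to a common channel count and resolution (here a 1 × 1 convolution followed by adaptive average pooling) is an assumed stand-in for the convolution and pooling layers mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AEFF(nn.Module):
    def __init__(self, enc_chs=(64, 256, 512, 1024), out_ch=1024):
        super().__init__()
        # project each encoder output to the channel count of encoder 4
        self.projs = nn.ModuleList(
            nn.Conv2d(c, out_ch, kernel_size=1) for c in enc_chs
        )
        # lambda_1..lambda_4 of Eq. (2), updated by backpropagation
        self.lambdas = nn.Parameter(torch.zeros(len(enc_chs)))

    def forward(self, enc_feats):
        # enc_feats: [E_1, E_2, E_3, E_4], finest resolution first
        target = enc_feats[-1].shape[-2:]
        w = torch.softmax(self.lambdas, dim=0)  # Eq. (2): weights sum to 1
        fused = 0.0
        for w_m, proj, feat in zip(w, self.projs, enc_feats):
            feat = F.adaptive_avg_pool2d(proj(feat), target)  # match encoder-4 size
            fused = fused + w_m * feat          # Eq. (1): weighted sum
        return fused
```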

Multilayer deep supervised loss

To further exploit multi-scale features and improve the segmentation performance of the model, we introduce a multilayer deep supervised loss. Specifically, as shown in Fig. 1, we insert a prediction branch after each decoder layer, so that different decoder layers generate segmentation results at different scales. The overall segmentation loss is thus defined as:

$$\begin{aligned} {\mathbb {L}}=\sum _{i=1}^{5}\left( {\mathbb {L}}_{Bce}\left( Y_{i},Y_{i}^{\prime }\right) +{\mathbb {L}}_{Dice}\left( Y_{i},Y_{i}^{\prime }\right) \right) \end{aligned}$$
(3)

where \({\mathbb {L}}_{Bce}\) denotes the binary cross-entropy loss and \({\mathbb {L}}_{Dice}\) denotes the Dice loss, used to mitigate the class imbalance problem. \(Y_{i}\) and \(Y_{i}^{\prime }\) denote the ground truth labels and model predictions at scale \(i\), respectively; the ground truth labels at the other scales are obtained by downsampling the original ground truth labels.
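A minimal sketch of this loss is given below, assuming each prediction branch outputs per-pixel foreground probabilities; the Dice smoothing constant and the nearest-neighbor downsampling of the labels are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def dice_loss(prob, target, eps=1e-6):
    # soft Dice loss over the whole batch
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def deep_supervised_loss(side_probs, gt):
    """side_probs: predictions Y'_i from the five branches, at their own scales;
    gt: full-resolution binary labels of shape (N, 1, H, W)."""
    total = 0.0
    for prob in side_probs:
        # labels at other scales: downsample the original ground truth (Eq. 3)
        y = F.interpolate(gt, size=prob.shape[-2:], mode="nearest")
        total = total + F.binary_cross_entropy(prob, y) + dice_loss(prob, y)
    return total
```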

Atrous spatial pyramid pooling module

Fig. 3 The structure of the atrous spatial pyramid pooling module

We also introduce an atrous spatial pyramid pooling module to further extend the receptive field of the model and extract multi-scale feature representations, a design that has achieved significant segmentation performance improvements on natural images [32]. As shown in Fig. 3, for an input feature map we apply four convolution layers with different dilation rates to extract feature representations at different scales, and then fuse them to obtain the final output. We add this module to the last layer of the segmentation model to produce the prediction results.
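A sketch of such a module follows; the four dilation rates (1, 6, 12, 18) follow common ASPP practice [32] and, like the concatenation-plus-1 × 1-convolution fusion, are assumptions rather than details of our released code.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # one 3x3 convolution branch per dilation rate
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        )
        # fuse the multi-scale branches back to out_ch channels
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```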

Experiments and results

To validate the segmentation performance of the proposed deep learning algorithm, we conduct experiments on a clinical FFA dataset.

Dataset

FFA images from DR patients with type II diabetes who were referred to the ophthalmology department at Beijing Tsinghua Changgung Hospital between February 2015 and August 2022 were randomly selected. All images were collected with a Heidelberg imaging system (SN: Spec-CAM-07889-S1600). The imaging field is \(55 \times 55\) degrees, and mosaic images were stitched manually (Photoshop, version 22.4.0, Adobe Inc.) from images in the venous phase (45 s to 3 min). The NPA in the FFA images was annotated by two ophthalmologists and revised by one retinal expert. Mosaic images labeled with NPA were collected as the database for this study, and demographic information was collected from electronic medical records, with no personal information accessed. Approval for retrospective anonymized data collection and analysis was obtained from the institutional review board of Beijing Tsinghua Changgung Hospital. The study was conducted in accordance with the tenets of the Declaration of Helsinki.

A total of 163 eyes from 130 patients were included in this study, among which 116 eyes were randomly selected as the training set and 47 eyes as the test set. The demographic characteristics are shown in Table 1.

Table 1 Demographic characteristics for the training and test sets

Pre-processing

We perform a series of pre-processing steps on the FFA images. Firstly, we normalize each original FFA image to zero mean and unit standard deviation, and then map the normalized pixel values to the [0, 255] range. Secondly, we apply the Contrast Limited Adaptive Histogram Equalization (CLAHE) technique [31] to each FFA image to enhance its contrast.

Fig. 4 Visualization results of the CLAHE algorithm. a original FFA image; b FFA image after CLAHE processing

Specifically, we first divide each FFA image into \(8 \times 8\) blocks and compute the histogram of each block. If any histogram bin exceeds the clip limit of 2, the excess is clipped and redistributed evenly among the other bins before histogram equalization is applied. Compared with traditional histogram equalization, CLAHE operates on local areas and avoids over-amplifying background noise. Example visualization results of the CLAHE algorithm are shown in Fig. 4: after CLAHE, the contrast of the regions of interest in the FFA image is enhanced, which facilitates more accurate recognition by the subsequent segmentation model.
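A minimal sketch of this pre-processing, assuming OpenCV (`cv2.createCLAHE` with a clip limit of 2 and an 8 × 8 tile grid, matching the settings above):

```python
import cv2
import numpy as np

def preprocess(img: np.ndarray) -> np.ndarray:
    """Normalize a grayscale FFA image, map to [0, 255], then apply CLAHE."""
    img = (img.astype(np.float32) - img.mean()) / (img.std() + 1e-8)
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(img)
```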

Finally, given the large size of the original FFA images, feeding them directly into the model would exhaust memory, while scaling them down would discard a large amount of detail, so we adopt a patch-based training and evaluation strategy. Specifically, we randomly sample 100,000 \(64 \times 64\) patches from the training set and feed them into the model for training. In the test phase, we likewise use a patch-based evaluation strategy: we extract patches from the test FFA images using a \(64 \times 64\) sliding window with a stride of 32 and feed these patches into the model to obtain segmentation results. As the same pixel may appear in multiple patches, we average the prediction probabilities over the patches to obtain the final prediction probability for each pixel. A similar training and evaluation approach has been used for vessel segmentation to increase the number of training images [33].
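The evaluation step can be sketched as follows, assuming a `model` that maps a batch of normalized 64 × 64 patches to per-pixel foreground probabilities (border handling is simplified here):

```python
import numpy as np
import torch

def sliding_window_predict(model, image, patch=64, stride=32):
    """Average per-pixel probabilities over all windows covering each pixel."""
    h, w = image.shape
    prob_sum = np.zeros((h, w), dtype=np.float64)
    counts = np.zeros((h, w), dtype=np.float64)
    model.eval()
    with torch.no_grad():
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                tile = torch.from_numpy(image[y:y + patch, x:x + patch]).float()
                prob = model(tile[None, None]).squeeze().numpy()  # (64, 64)
                prob_sum[y:y + patch, x:x + patch] += prob
                counts[y:y + patch, x:x + patch] += 1
    return prob_sum / np.maximum(counts, 1)  # final probability per pixel
```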

Experimental setup and implementation details

We run all experiments with the PyTorch deep learning framework. We use the stochastic gradient descent (SGD) optimizer to train the segmentation model, with an initial learning rate of 0.001 and a weight decay of 0.0003. The number of iterations is set to 100,000, the batch size is 128, and we multiply the learning rate by 0.1 every 10,000 iterations. During training, we also use data augmentation strategies such as random rotation and random horizontal/vertical flipping to enhance the model's robustness and generalization. For performance evaluation, we use the following metrics: area under the ROC curve (AUC), accuracy (ACC), sensitivity (SEN), specificity (SPE), intersection over union (IOU), and Dice coefficient (Dice).
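For reference, this optimization setup can be sketched as follows; `model`, `train_loader` (a hypothetical iterator yielding batches of 128 patches with labels), and `deep_supervised_loss` refer to the illustrative sketches above, and the momentum value is an assumption since it is not stated in the text.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=3e-4)
# multiply the learning rate by 0.1 every 10,000 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.1)

for it in range(100_000):
    patches, labels = next(train_loader)  # batch of 128 random 64x64 patches
    loss = deep_supervised_loss(model(patches), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                      # per-iteration schedule
```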

Comparison with the state-of-the-art methods

To verify the superiority of the proposed model NPA-Net, we compare it with several existing segmentation models. Specifically, we implement a traditional segmentation method based on Graph Cuts [34]. In addition, we implement five deep learning segmentation models: U-net [35], CE-net [36], DeepLab [37], ConvNeXt [38], and InternImage [39]. U-net is the most classical deep learning segmentation model and has been successfully applied to several medical image analysis tasks; CE-net uses a dense atrous convolution (DAC) block and a residual multi-kernel pooling (RMP) block to exploit contextual spatial information and improve segmentation performance. ConvNeXt uses a more advanced backbone, and DeepLab uses atrous convolution to capture multi-scale features. InternImage [39] uses deformable convolution as its core operator and introduces long-range dependencies and adaptive spatial aggregation to learn stronger, more robust feature representations. Table 2 shows the segmentation performance of the different algorithms. The traditional semi-automatic Graph Cuts algorithm performs worst, mainly because it still requires manual feature design and lacks robustness and generalization in practical applications. In contrast, the deep learning methods all achieve better segmentation performance, indicating more powerful feature extraction and better generalization. Finally, our model achieves the best segmentation performance. Compared with the best competing algorithm, InternImage, our model attains higher sensitivity and specificity, indicating its ability to make full use of the contextual information and multi-scale features in FFA images to improve the recognition of non-perfusion regions.

Table 2 Segmentation performance of different algorithms on the test set
Fig. 5 Visualization of segmentation results of different algorithms

Figure 5 shows the segmentation results of the different algorithms. The traditional method struggles to accurately identify the non-perfusion regions in the FFA images, whereas the deep learning models achieve better recognition thanks to automatic feature extraction. Our segmentation model NPA-Net performs best, greatly reducing the number of false positives and the probability of missing non-perfusion regions, which indicates that it is more suitable for segmenting non-perfusion regions from FFA images and is promising for clinical application.

Figure 6 compares the area of the non-perfusion region predicted by NPA-Net with that labeled by the doctors on the test set images. NPA-Net accurately identifies non-perfusion regions in global FFA images, with predicted areas closely matching those annotated by doctors, demonstrating its potential for automatic identification of non-perfusion regions from FFA images in clinical applications.

Fig. 6 Comparison of the area of the non-perfusion region (number of pixels) predicted by the segmentation model NPA-Net and those labeled by the doctor on the test set images

We perform ablation experiments to verify the effectiveness of the different components. As shown in Table 3, AEFF denotes the adaptive encoder feature fusion module, ASPP denotes the atrous spatial pyramid pooling module, and MDS denotes the multilayer deep supervised loss. Each of these components improves segmentation performance, and our full model, which integrates all of them, achieves the best results.

Table 3 Ablation experiments of different components

Discussion

Non-perfusion area (NPA) is an important clinical feature of DR and a key component of its pathogenesis. With the availability and advancement of examination techniques, numerous studies have confirmed that NPA is closely related to the severity of DR [7, 40]. In cross-sectional studies, larger NPA was frequently observed in more severe eyes, with a more pronounced difference in peripheral retinal regions [41, 42]. Using manually labeled NPA, Antaki et al. [43] reported that increased NPA was associated with macular thickening and visual deterioration, indicating the prognostic value of NPA; this was further supported by a longitudinal study [14] in which eyes with larger NPA were reported to have a higher risk of DR exacerbation, with posterior NPA contributing a higher hazard ratio for disease worsening.

As a key event in the progression of the disease to the proliferative phase, retinal neovascularization tends to occur when large areas of non-perfusion are present [44, 45]. In addition, NPA size varied among groups with different neovascularization locations [13], with larger NPAs found in individuals with optic disc neovascularization (a risk factor for severe visual loss [46]) than elsewhere [13, 47]. Mechanistically, NPA leads to local retinal hypoxia, increasing oxidative stress, tissue inflammation, and cytokine release [44]; the levels of pro-angiogenic factors such as VEGF, which promote neovascularization, have also been found to correlate with the degree of NPA. Anti-VEGF therapy is the current first-line DR treatment, and trials including RISE/RIDE [48], VISTA [49], PERMEATE [49], and RECOVERY [16] showed increased best-corrected visual acuity (BCVA) paralleled by dose-dependent delays in NPA development, providing a theoretical basis for using NPA to guide precise, individualized anti-VEGF scheduling and adjustment during follow-up.

Although anti-VEGF is more effective in preserving the retina and can better maintain peripheral and night vision [50], some patients respond poorly to it [51]. At the same time, considering the economic cost of anti-VEGF treatment and its strong dependence on follow-up, laser photocoagulation remains an irreplaceable and effective treatment option, with visual benefits not significantly different from those of anti-VEGF injections [50]. Laser photocoagulation is usually applied when retinal neovascularization is present, but since the appearance of new vessels marks the irreversible progression of DR to a severe stage, performing laser only after neovascularization is observed forfeits a head start in preventing vision loss. Experienced ophthalmologists often perform laser treatment to control the further progression of DR when large areas of non-perfusion are present in the retina. However, there is no specific answer as to what constitutes a "large area", making the timing of laser treatment a highly subjective decision. With this in mind, NPA quantification, as a biomarker that sensitively and effectively reflects the state of retinal ischemia, will undoubtedly provide a strong, actionable, and objective indicator for these decisions.

In recent years, deep learning techniques have developed rapidly and achieved great success in several fields. Convolutional neural networks are able to automatically extract deep feature representations from the original image and then perform recognition and prediction, avoiding tedious manual feature extraction and integrating feature extraction and recognition into a unified framework that can be trained end-to-end. In the past few years, several automatic segmentation models for NPA have been proposed, but most of them operate on local (single-field) FFA images and do not effectively address some key challenges in NPA recognition [28,29,30]. First, in FFA images the contrast between the NPA and the surrounding regions is low and easily affected by background noise, and stitching introduces additional disturbances, e.g., inconsistent illumination, making the segmentation task more challenging. Second, NPAs are usually irregular and diffuse, and traditional segmentation models find it difficult to handle NPAs of different shapes and sizes. In this work, we therefore propose a new deep learning segmentation model for the automatic identification of NPAs from stitched mosaic FFA images, with dedicated techniques for each of the above problems. Specifically, to address the low contrast of NPAs in FFA images, we use the CLAHE technique to enhance image contrast and improve the model's ability to recognize NPAs. Then, considering that NPAs are usually irregular and diffuse, introducing contextual information and multi-scale features into the segmentation model can improve its recognition of NPAs of different shapes and sizes. To this end, we propose three modules on top of the U-net backbone: AEFF, MDS, and ASPP, which make full use of multi-scale features and contextual information, greatly improving the segmentation performance for NPAs and effectively reducing false positives while avoiding missing small, dispersed NPAs.

We conducted extensive experiments on a dataset of FFA images acquired from a clinical setting and compared our approach with traditional methods and some of the latest deep learning models. The experimental results in Table 2 show that our model NPA-Net greatly outperforms the other algorithms, with an AUC of 0.9752, accuracy of 0.9431, sensitivity of 0.8794, specificity of 0.9459, IOU of 0.3876, and Dice of 0.5686 on the test set. As can be seen from the segmentation results in Fig. 5, NPA-Net mitigates false-positive predictions, avoids interference from background noise, and identifies small, diffuse NPAs well. In addition, we verified the effectiveness of the different modules through the ablation experiments in Table 3. We also quantified the NPAs predicted by the model: as shown in Fig. 6, the NPA area predicted by NPA-Net is very close to that labeled by physicians, demonstrating its ability to automatically segment NPA. Future work will validate the effectiveness of NPA-Net on clinical datasets from additional medical institutions.

Conclusion

In this work, a new DL model, NPA-Net, was developed to detect NPA in FFA images. We introduced three modules, adaptive encoder feature fusion (AEFF), multilayer deep supervised loss, and atrous spatial pyramid pooling (ASPP), to incorporate multi-scale features and contextual information from different perspectives, effectively enhancing the model's ability to recognize NPAs of different sizes. This NPA segmentation model is expected to automatically identify the biomarker NPA from FFA images, provide a reference for the clinical diagnosis, grading, and follow-up of DR patients, and support the evaluation and formulation of treatment plans such as anti-vascular endothelial growth factor therapy and laser photocoagulation.