Introduction

The retina is one of the most delicate parts of the human body and has a very complicated structure. It can be captured by a fundus camera. The captured images provide a significant amount of information about pathological structural changes and can be used for the prediction of ocular diseases such as diabetic retinopathy (DR), cataract, hypertension, etc. [1, 2]. These diseases often alter the blood vessel structure. For example, in the advanced stage of DR, abnormal blood vessels appear and start growing, a process known as neovascularization [3]. As DR progresses, more pathological changes appear: macular edema caused by increased vascular permeability can damage central vision, while neovascularization and the accompanying contraction of fibrous tissue can cause tractional retinal detachment, leading to severe and often irreversible vision loss. The new vessels may also bleed, causing further retinal complications. These pathological changes can ultimately lead to blindness [4]. Therefore, analyzing the features of blood vessels can provide an important pathological basis for the early diagnosis of eye diseases.

Before performing such analysis, the ophthalmologist needs to separate the blood vessels from the retinal image, a task known as segmentation. Manual blood vessel segmentation is very tedious and requires retinal ophthalmologists to spend a great deal of time distinguishing blood vessels from other areas of the fundus, mainly because vessels vary in many features, including length, width, branch angle, and tortuosity. Inexperienced doctors can easily make mistakes in this step. In addition, segmentation results may differ across patients, and uneven skill levels among annotators can negatively affect the subsequent diagnosis. Given these factors, together with the large number of patients and limited medical resources, strictly manual segmentation of the vessels is impractical. Therefore, automatic vessel segmentation in general, and retinal fundus vessel segmentation in particular, should play a key role in ophthalmic disease diagnosis.

However, the structure of retinal vessels is extremely complicated [5], which makes automatic vessel segmentation challenging. This difficult problem has been widely studied in the literature, and machine learning methods are often used to address it. With the achievements of deep learning in recent years, research on retinal vessel segmentation has made important progress. In particular, methods based on U-Net often achieve new state-of-the-art performance, which mainly benefits from the special structural design of U-Net. U-Net [6] has a U-shape in which the down-sampling and up-sampling processes form a symmetrical structure. The network first extracts detailed feature information from the image through the down-sampling process and then restores the location information of these features through the up-sampling process, so contextual information is captured and then delivered to higher resolutions to acquire a better segmentation result. Therefore, U-Net usually performs very well on medical image segmentation.

In summary, even though existing vessel segmentation methods, especially those built on the U-Net architecture, usually achieve high performance, several challenges remain for small and very complex vessel structures: fine features are easily lost during repeated down-sampling and up-sampling.

To address the above problem, we propose a U-Net-based method named Local feature enhancement and Attention U-Net (LEA U-Net). The improvements of the proposed method over U-Net are as follows. First, a local feature enhancement module is designed and applied to the network. Dilated convolutions with different parameters are used to obtain larger local features while reducing the loss of small features caused by the down-sampling and up-sampling processes. This is mainly because dilated convolution can easily increase the receptive field and change the size of the output feature maps by modifying its parameters. The former enables convolutional layers to capture larger ranges of vessel features; the latter allows the network to avoid pooling when producing feature maps of different sizes, thus preserving more detailed information. This module has a multi-output structure, and each output is fused with the current feature maps at the corresponding up-sampling node in the network to supplement the up-sampled information. Second, an attention mechanism is integrated into the skip connections of the network to adjust the weight of each feature map and highlight the features related to blood vessel segmentation. This is equivalent to performing a denoising operation on the feature maps passed through the skip connections, further removing information unrelated to blood vessels. These two parts work together to improve the network's segmentation of tiny blood vessels.

The rest of this paper is organized as follows. The next section is for the related methods. The third section is the details of the proposed network architecture. The fourth section presents experimental results. Finally, the last section is for the conclusion.

Overview of related works

With the advances in medical imaging technology over the past few years, several models have been developed for retinal vessel segmentation. Among existing technologies, machine learning-based and deep learning-based methods are the two primary approaches to the vessel segmentation problem.

For the traditional machine learning approach, there are two internal stages: extracting features from images and mapping the extracted features to labels. Various feature extraction methods have been introduced, such as the Gabor filter [7] and the Gaussian filter [8]. Many classifiers have also been proposed to handle different tasks, for instance, support vector machines (SVM) [9, 10], artificial neural networks [11], the k-NN classifier [12], AdaBoost [13], etc. Algorithms composed of these two parts have been widely used in retinal vessel segmentation. Marin et al. computed a 7-D vector composed of gray-level and moment invariants-based features for pixel representation and used a neural network (NN) scheme for pixel classification [14]. Aslani et al. proposed a retinal vessel segmentation method that combines features extracted by different methods into a hybrid feature vector and trains a Random Forest (RF) classifier on it [15]; combining different types of features increased the available local information and improved discrimination between vessel and non-vessel pixels. Dash et al. presented a recursive method for retinal vessel segmentation [16], which first uses adaptive thresholding to iteratively extract blood vessels from the pre-processed image and then applies a morphological cleaning operation to generate the final vessel segmentation. For these traditional methods, the quality of the features extracted from input images has a great influence on the final result. However, features are often defined empirically by the designer, which can introduce bias.

Deep learning is an advanced technique based on backpropagation and multi-layer neural networks. It can automatically learn feature representations at deep levels, thereby avoiding human intervention in feature design [17]. Several deep learning architectures have been widely used and have achieved excellent results on various tasks, including medical segmentation [18, 19].

A number of deep learning studies have investigated the retinal vessel segmentation problem. Wang et al. [20] combined two superior classifiers to carry out the segmentation: Convolutional Neural Networks (CNNs) acted as a trainable feature extractor, and an ensemble of RFs worked as a trainable classifier. Fu et al. [21] formulated vessel segmentation as a boundary detection problem and utilized a fully convolutional network (FCN) to generate the segmentation result; a fully connected Conditional Random Field (CRF) was combined with the discriminative vessel probability map to model long-range pixel interactions. Maji et al. [22] presented a CNN-ensemble-based framework for detecting blood vessels in color fundus images. Ban et al. [23] presented a technique for multimode medical images based on spatial histograms. Qin et al. [24] presented a multi-focus image fusion method based on sparse decomposition. Liskowski et al. [25] proposed a segmentation method that classifies multiple pixels simultaneously, using a deep neural network trained on a large training dataset. Sappa et al. [26] presented an improved CNN-based architecture to segment fluid abnormalities; to obtain multi-scale contextual information, the authors integrated several skip connections with atrous spatial pyramid pooling (ASPP). Tan et al. [27] used a 10-layer CNN to simultaneously segment multiple pathological features in fundus images. Similarly, another study by Tan et al. used a 7-layer CNN to simultaneously segment the optic disk, fovea, and blood vessels [28]. Sathananthavathi et al. proposed a parallel FCN architecture for vessel segmentation [29] and also studied the impact of different levels of image pre-processing on the model. Wu et al. proposed a network architecture in which a front network converts input images into probabilistic retinal vascular maps and a subsequent network further refines these maps [30]; the model utilizes skip connections between two identical multi-scale backbones, allowing useful multi-scale features to be transmitted directly from shallow to deep layers, thereby improving segmentation performance. Sultana et al. designed an encoder-decoder model in an unconventional way [31]: the encoder first uses up-sampling to enlarge the image and extract more detailed features, and the decoder then uses down-sampling to restore the feature maps to their original resolution, achieving better segmentation results.

Fig. 1 Architecture of the proposed LEA U-Net

In 2015, a network architecture called U-Net was proposed for medical image segmentation and achieved good performance in many tasks [6]. Recently, researchers have begun to apply U-Net to retinal vessel segmentation and continue to achieve new state-of-the-art performance. Zhang et al. combined residual connections with a U-Net-based architecture to detect vessels [32], adding additional labels on boundary areas and using an edge-aware mechanism to convert the original task into a multi-class task. Alom et al. built on U-Net to propose a Recurrent Residual Convolutional Neural Network (RRCNN) [33], in which recurrent convolutional layers improve feature extraction and residual units help train the deep architecture. Jin et al. integrated deformable convolution into U-Net to extract contextual information and enable precise localization by combining low-level and high-level features [34]; by adaptively adjusting the receptive fields, it can recognize retinal vessels of different shapes and scales. Li et al. proposed Iter-Net [35], which is composed of U-Net-like components iterated multiple times, resulting in a network 4 times deeper than a standard U-Net; Iter-Net also employs weight sharing and skip connections to facilitate training. Zhou et al. proposed UNet++ [36], which integrates U-Net models of different depths and re-designs the skip connections to obtain a highly flexible feature fusion scheme and improve performance. Li et al. proposed Res2Unet [37], which uses a multi-scale strategy to extract vessels of different widths and a channel attention mechanism to facilitate communication between channels, recalibrating the relationships between channel features; the authors also proposed two post-processing methods, one for recovering disconnected vessels and the other for removing false positives and false negatives. Dong et al. proposed CRAUNet [38], a series of concatenated U-Net structures that obtain representations from coarse to fine; DropBlock is utilized in CRAUNet to reduce overfitting. Chen et al. proposed PCAT-UNet [39], a U-shaped Transformer-based network with a convolution branch; PCAT-UNet uses skip connections to fuse deep and shallow features from both sides, effectively capturing global dependencies and details in the low-level feature space.

In addition, due to the particularity of medical tasks, the models used must be highly reliable, which in turn requires model interpretability. However, deep learning models are black boxes, which makes them inferior in this respect. In recent years, many scholars have conducted related research, trying to link the internal processes of a network with its final results to enhance model interpretability [40,41,42,43].

Overall, existing deep learning methods, especially U-Net-based ones, have achieved good results in retinal blood vessel segmentation, but a research gap remains in improving the segmentation of small and tiny vessels.

Proposed method

In this study, LEA U-Net is proposed with the aim of segmenting retinal fundus vessels. LEA U-Net is a deep learning model developed on the basis of U-Net. We integrate U-Net with a local feature enhancement module and augment the skip connections with attention blocks to improve performance.

Overall structure of LEA U-Net

Figure 1 illustrates the architecture of the proposed LEA U-Net, which mainly consists of three parts: a U-shaped structure including down-sampling and up-sampling processes, a local feature enhancement module, and attention blocks. The convolutional layers in the network use ReLU as the activation function by default, except for those in the attention blocks.

To balance computational complexity and efficiency, a \(48\times 48\) patch of the pre-processed grayscale fundus image is used as the network input, following [33]. The input is sent to both the local feature enhancement module and the down-sampling part of the U-shaped structure. In the local feature enhancement module, multi-scale features are extracted to generate feature maps of different sizes; these contain many fine features that could be missed during the down-sampling process and are passed to the subsequent network. The down-sampling part contains three identical convolution-max-pooling stages, each composed of two \(3\times 3\) convolutional layers and a max-pooling layer. Each max-pooling layer halves the height and width of the feature maps, and the first convolutional layer in each stage doubles the number of channels. After the entire down-sampling process and two further \(3\times 3\) convolutional layers, a \(6\times 6\times 256\) output is generated and sent to the up-sampling part. In addition, the feature maps output by the convolutional layers before each max-pooling layer are sent to the corresponding attention block, where the importance of the features is redistributed to highlight those most relevant to the task, before being passed to the up-sampling part of the U-shaped structure. The structure of the up-sampling part is symmetrical to the down-sampling part. During the up-sampling process, the size of the feature maps is continuously enlarged and the number of channels continuously reduced. The output of each up-sampling layer is concatenated channel-wise with the corresponding outputs of the local feature enhancement module and the attention block to supplement more feature information, and then passed to the subsequent convolutional layer. The up-sampling part finally produces a \(48\times 48\times 32\) output. Subsequently, a \(1\times 1\) convolutional layer followed by softmax adjusts the number of channels and produces the segmentation result.

To facilitate understanding, we provide pseudo code to describe the internal operation process of the model, as shown in Algorithm 1.

Algorithm 1 The operation process of LEA U-Net
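Since Algorithm 1 describes the forward pass at a high level, the following Keras sketch gives one possible concrete reading of it. It is a minimal sketch, not the authors' actual implementation: it assumes the `lfe_module` and `attention_block` helpers sketched in the next two subsections, and any layer settings not stated in the text are our guesses.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_pair(x, ch):
    # Two 3x3 ReLU convolutions, the basic unit of the U-shaped backbone
    x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(ch, 3, padding="same", activation="relu")(x)

def build_lea_unet(input_shape=(48, 48, 1)):
    inp = layers.Input(input_shape)
    out1, out2, out3 = lfe_module(inp)  # multi-scale local features (Fig. 3)

    # Down-sampling path: three conv-conv-maxpool stages
    d1 = conv_pair(inp, 32); p1 = layers.MaxPooling2D()(d1)   # 48 -> 24
    d2 = conv_pair(p1, 64);  p2 = layers.MaxPooling2D()(d2)   # 24 -> 12
    d3 = conv_pair(p2, 128); p3 = layers.MaxPooling2D()(d3)   # 12 -> 6
    b = conv_pair(p3, 256)                                    # 6x6x256

    # Up-sampling path: each stage fuses the up-sampled maps with the
    # attention-weighted skip features and the matching LFE output
    u3 = layers.UpSampling2D()(b)
    u3 = conv_pair(layers.Concatenate()([u3, attention_block(d3), out3]), 128)
    u2 = layers.UpSampling2D()(u3)
    u2 = conv_pair(layers.Concatenate()([u2, attention_block(d2), out2]), 64)
    u1 = layers.UpSampling2D()(u2)
    u1 = conv_pair(layers.Concatenate()([u1, attention_block(d1), out1]), 32)

    # 1x1 convolution + softmax: per-pixel vessel/background probabilities
    out = layers.Conv2D(2, 1, activation="softmax")(u1)
    return Model(inp, out)
```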

In the following two parts, we elaborate on the local feature enhancement module and the attention block.

Local feature enhancement module

The U-shaped structure of U-Net can cause the loss of small blood vessel features. Furthermore, due to the uneven distribution of retinal blood vessels, the convolutional and pooling operations in U-Net are limited by their visual field when extracting features of continuous large-area vessels. Therefore, we introduce dilated convolution and use it to construct a multi-scale local feature enhancement module that delivers more useful information to the up-sampling part of the network.

Dilated convolution uses a dilation parameter to define the distance between two adjacent pixels involved in the convolution operation. By adjusting the dilation parameter, the receptive field can be increased at the same computational cost. Dilated convolutions with various dilation parameters are presented in Fig. 2. The colored area represents the receptive field of a filter on the input feature map: the red pixels are convolved, and the blue pixels are skipped. As can be seen, dilated convolution obtains a larger receptive field by ignoring some pixels, which also entails a certain loss of detail. Therefore, we did not directly replace the convolution and pooling in U-Net with dilated convolution, but instead added a local feature enhancement module to complement the information flow in the network.
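As a quick illustration of this trade-off, the Keras snippet below compares a standard and a dilated \(3\times 3\) convolution, and shows how a dilated kernel with "valid" padding can halve the feature-map size without pooling. The kernel-size/dilation combination is our illustrative choice, not necessarily the one used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 48, 48, 1))

# A 3x3 kernel with dilation 2 covers a 5x5 area (receptive field
# (k-1)*d + 1 = 5) at the same cost: still 9 taps per output pixel.
y_std = layers.Conv2D(1, 3, padding="same")(x)
y_dil = layers.Conv2D(1, 3, dilation_rate=2, padding="same")(x)
print(y_std.shape, y_dil.shape)   # both (1, 48, 48, 1)

# With "valid" padding, a 5x5 kernel dilated by 6 spans 25 pixels,
# so the 48-pixel sides shrink to 48 - 25 + 1 = 24: halved, no pooling.
y_half = layers.Conv2D(1, 5, dilation_rate=6, padding="valid")(x)
print(y_half.shape)               # (1, 24, 24, 1)
```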

Fig. 2 Illustration of dilated convolution with different parameters. a \(3\times 3\) 1-dilated convolution. b \(3\times 3\) 2-dilated convolution. c \(5\times 5\) 3-dilated convolution

Fig. 3 Schematic of the local feature enhancement module

Figure 3 shows the design of the local feature enhancement module, which contains three blocks with similar internal structures. In each block, the information passes successively through a dilated convolutional layer and a \(1\times 1\) convolutional layer before reaching the respective output. The input of Block 1 is the input image; for the next two blocks, the input is the output of the dilated convolutional layer in the previous block. Note that the dilated convolutional layers in the three blocks use different parameters: the filter sizes and dilation parameters in Blocks 2 and 3 are significantly larger than those of Block 1. The reasons for this design are as follows.

For Block 1, the size of the output feature maps needs to be consistent with the input image, so zero padding is added around the input. In this case, if larger filters and dilation parameters were used, the proportion of padded area in the input would increase correspondingly, making effective information sparse at the image edges. To avoid this, we chose smaller filters and dilation parameters in Block 1. For Blocks 2 and 3, the height and width of the feature maps output by the dilated convolutional layer are both half those of the input. Zero padding is not needed to maintain the feature-map size, so the sparse-information problem does not arise; we therefore use larger parameters here for better performance. The parameters of these two dilated convolutional layers are carefully chosen so that the height and width of the output feature maps are exactly half those of the input, trading off model performance against computational complexity.

To achieve cross-channel interaction and information integration, a \(1\times 1\) convolutional layer follows the dilated convolutional layer in each block. The numbers of channels of out1, out2, and out3 are 32, 64, and 128, respectively. The final output of each block is fused with the information on the other paths in the network to optimize the up-sampling process of the model.
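The following Keras sketch shows one way the module could be built. The exact kernel sizes and dilation rates are not stated numerically in the text, so the values below are assumptions chosen only to reproduce the described output sizes.

```python
from tensorflow.keras import layers

def lfe_module(x):
    """Sketch of the local feature enhancement module (Fig. 3)
    for a 48x48 input; kernel/dilation values are illustrative."""
    # Block 1: small kernel and dilation, zero padding keeps 48x48
    d1 = layers.Conv2D(32, 3, dilation_rate=2, padding="same",
                       activation="relu")(x)
    out1 = layers.Conv2D(32, 1, activation="relu")(d1)    # 48x48x32

    # Block 2: larger kernel/dilation, no padding; effective kernel
    # (5-1)*6 + 1 = 25 halves the spatial size: 48 - 25 + 1 = 24
    d2 = layers.Conv2D(64, 5, dilation_rate=6, padding="valid",
                       activation="relu")(d1)
    out2 = layers.Conv2D(64, 1, activation="relu")(d2)    # 24x24x64

    # Block 3: effective kernel (5-1)*3 + 1 = 13 halves again: 24 - 13 + 1 = 12
    d3 = layers.Conv2D(128, 5, dilation_rate=3, padding="valid",
                       activation="relu")(d2)
    out3 = layers.Conv2D(128, 1, activation="relu")(d3)   # 12x12x128

    return out1, out2, out3
```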

Attention block

Table 1 The parameter count and computational complexity of each part in LEA U-Net
Fig. 4 Schematic of the attention block. I and O represent the input and the output of this block, respectively. \(\bigoplus \) denotes element-wise sum, and \(\bigotimes \) denotes element-wise product

In U-Net, intermediate feature maps generated by the down-sampling process are passed to the up-sampling part through skip connections and merged with other information. We aim to use an adaptive weight redistribution module to influence the information transmitted via the skip connections, highlighting those features related to vessel segmentation. Therefore, an attention module is integrated into the original skip connection to form an attention block. The structure of this block follows the non-local block [44] with some modifications, as shown in Fig. 4.

In the attention block, the input first passes through three parallel \(1\times 1\) convolutional layers to generate three groups of feature maps A, B, and C. A and B are combined by element-wise addition, and the result is passed through a \(1\times 1\) convolutional layer to produce the attention weight matrix \(\Gamma \). The sigmoid function maps the unbounded output of this convolutional layer to weight coefficients between 0 and 1. The attention weight matrix \(\Gamma \) is then multiplied element-wise with the linearly transformed input C to obtain the final output O.

The attention module operates on each pixel across all channels: the attention weight matrix is first generated by integrating information from all channels, and is then used to assign an importance to each pixel. To keep the feature maps passed to the subsequent network consistent with U-Net, the attention block does not change the size of the feature maps.
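A minimal Keras sketch of this block, under our reading of Fig. 4, is as follows; activation choices other than the sigmoid on \(\Gamma \) are not specified in the text and are left out here.

```python
from tensorflow.keras import layers

def attention_block(x):
    """Sketch of the attention block (Fig. 4). Three parallel 1x1
    convolutions produce A, B, and C; sigmoid(conv1x1(A + B)) gives the
    weight matrix Gamma, which rescales C element-wise. The spatial size
    and channel count of the input are preserved."""
    ch = x.shape[-1]
    a = layers.Conv2D(ch, 1)(x)
    b = layers.Conv2D(ch, 1)(x)
    c = layers.Conv2D(ch, 1)(x)
    gamma = layers.Conv2D(ch, 1, activation="sigmoid")(layers.Add()([a, b]))
    return layers.Multiply()([gamma, c])
```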

Computational complexity analysis

Table 1 presents the parameter count and computational complexity of each part of LEA U-Net, measured in millions. A model's learning ability is to some extent related to its parameter count, while FLOPs (floating point operations) is a commonly used measure of computational complexity that directly reflects the computational resources required for one forward pass. The growth of LEA U-Net in these two metrics is mainly due to the local feature enhancement module, yet compared with the U-Net backbone, the module's share is modest. If regular convolutions with the same receptive field were used to compose this module, its FLOPs would be more than twice those of the U-Net backbone. This directly reflects the benefit of using dilated convolution in our model.
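To see why, consider the usual FLOPs estimate for a convolutional layer, \(2 \cdot H_{out} W_{out} C_{out} \cdot K^2 C_{in}\): a dilated kernel is charged only for its taps, not for its receptive field. The toy comparison below uses hypothetical layer sizes purely for illustration.

```python
def conv2d_flops(h_out, w_out, k, c_in, c_out):
    # Multiply-accumulate counted as 2 FLOPs, a common convention
    return 2 * h_out * w_out * c_out * (k * k * c_in)

# A 5x5 kernel dilated to span 25 pixels vs. a regular 25x25 kernel
# with the same receptive field, over the same 24x24x64 output:
print(conv2d_flops(24, 24, 5, 32, 64) / 1e6)    # ~58.98 MFLOPs
print(conv2d_flops(24, 24, 25, 32, 64) / 1e6)   # ~1474.56 MFLOPs, 25x more
```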

Fig. 5 Example images of DRIVE. a Color image; b corresponding ground truth; c corresponding mask

Experiments

To demonstrate the performance of the LEA U-Net model, we test it on the DRIVE dataset and compare the results with other recently published methods. Furthermore, to directly show the improvements that the local feature enhancement module and the attention block each bring to the model, i.e., to perform ablation experiments, we also include a variant of LEA U-Net without the attention mechanism, named LE U-Net. Our models are implemented in Python 3.6 using the Keras and TensorFlow frameworks. We use the cross-entropy loss and SGD with batch size 32 and an initial learning rate of 0.01; the learning rate is dropped to 0.001 after 80 epochs, and training runs for a total of 100 epochs.
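A minimal sketch of this training configuration in Keras is shown below; `build_lea_unet` is the model sketch from the previous section, and `train_patches`/`train_labels` stand for the preprocessed \(48\times 48\) patches and their one-hot ground truth.

```python
import tensorflow as tf

model = build_lea_unet()   # sketch from "Overall structure of LEA U-Net"

def lr_schedule(epoch, lr):
    # Initial learning rate 0.01, dropped to 0.001 after 80 epochs
    return 0.01 if epoch < 80 else 0.001

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_patches, train_labels, batch_size=32, epochs=100,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```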

Dataset and image preprocessing

DRIVE is a well-known retinal blood vessel database. It consists of forty color retinal fundus images (in RGB color space) showing the blood vessels. The database is randomly split into a training set and a validation set at a 50:50 ratio, i.e., each set contains 20 images. The validation set is used only for testing, not for model training. For each color image, the dataset provides two corresponding binary images, the ground truth and the binary field-of-view mask, as shown in Fig. 5. The former is obtained by experts' manual segmentation, and the latter indicates the extent of the fundus area in each color image. The resolution of all images in DRIVE is \(565 \times 584\).

For medical image segmentation, image preprocessing is very important whether using traditional methods or deep learning-based models [45, 46]. Proper preprocessing can improve the performance of the model. Here, we use grayscale image conversion, Contrast Limited Adaptive Histogram Equalization (CLAHE) [47], and Gamma correction sequentially to preprocess the color fundus images.

First, each RGB color image is transformed into a monochromatic grayscale image, which reduces, to a certain extent, the color variation across collected pictures caused by different equipment and lighting. For retinal vessel segmentation this step is particularly important and largely affects the final segmentation results [48]. We use Eq. (1) for the conversion:

$$\begin{aligned} I_{Gray} = 0.3I_{Red}+0.59I_{Green}+0.11I_{Blue} , \end{aligned}$$
(1)

where \(I_{Gray}, I_{Red}, I_{Green}, I_{Blue}\) are, respectively, pixel intensity in Grayscale mode, Red channel, Green channel, and Blue channel.
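As a sketch, Eq. (1) can be applied to an RGB image as a weighted channel sum:

```python
import numpy as np

def to_gray(img_rgb):
    # Eq. (1): I_Gray = 0.3 R + 0.59 G + 0.11 B; img_rgb is HxWx3 uint8
    weights = np.array([0.3, 0.59, 0.11])
    return (img_rgb.astype(np.float64) @ weights).astype(np.uint8)
```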

Then, CLAHE is utilized to increase the foreground-background contrast. This method improves image contrast while limiting the noise amplified in the process. A threshold is set first; histogram counts above the threshold are clipped and redistributed evenly over the other gray levels to form a new histogram, on which adaptive equalization is then performed. In addition, an equalization grid size must be set to divide the image into non-overlapping blocks, each of which is processed separately; this keeps the processing stable. We set the threshold to 2 and the equalization grid size to 8.
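With OpenCV, this step can be sketched as follows, using the clip limit of 2 and the \(8\times 8\) grid stated above:

```python
import cv2

# CLAHE with clip limit 2 and an 8x8 equalization grid
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
img_clahe = clahe.apply(img_gray)   # img_gray: HxW uint8 grayscale image
```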

Finally, we use Gamma correction to further adjust the contrast of the image, making the difference between light and dark around the blood vessels more obvious. Gamma correction changes the contrast between the low- and high-intensity areas of the image by adjusting the gamma curve, controlled by a parameter \(\gamma \). When \(\gamma \) is less than 1, the contrast of high-intensity areas is decreased and that of low-intensity areas is increased; when \(\gamma \) is greater than 1, the opposite occurs. Since the gray values of blood vessels in the image are generally low, a \(\gamma \) greater than 1 should be used. After comparison, we chose \(\gamma = 1.2\) to further process the images.
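A minimal lookup-table implementation of this correction with \(\gamma = 1.2\) might look like:

```python
import numpy as np

def gamma_correct(img, gamma=1.2):
    # out = in^gamma on [0, 1]; gamma > 1 compresses dark values,
    # increasing contrast in bright areas, as described above
    table = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    return table[img]   # img: uint8 image, corrected via the LUT
```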

The step-by-step preprocessing is shown in Fig. 6. After these preprocessing steps, the fundus blood vessels, originally of low contrast and blurry color, become clearer; the brightness difference between vessels and the surrounding non-vascular regions becomes larger; and the structure of small blood vessels can be better distinguished. The images are cropped into \(48\times 48\) patches before being fed to the models, as shown in Fig. 7. This step increases the number of training images by hundreds of times, which effectively alleviates overfitting.
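The exact patch-sampling scheme is not detailed in the text; the sketch below assumes simple random cropping of paired image/ground-truth patches.

```python
import numpy as np

def random_patches(img, gt, n, size=48, seed=0):
    # Sample n aligned 48x48 patches from an image and its ground truth;
    # dense cropping multiplies the effective training-set size
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    ys = rng.integers(0, h - size + 1, n)
    xs = rng.integers(0, w - size + 1, n)
    imgs = np.stack([img[y:y+size, x:x+size] for y, x in zip(ys, xs)])
    gts = np.stack([gt[y:y+size, x:x+size] for y, x in zip(ys, xs)])
    return imgs, gts
```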

Fig. 6 Example images after each preprocessing step. a Original images; b gray-scaled images; c images after CLAHE operation; d images after Gamma correction

Image segmentation performance evaluation metrics

To evaluate the segmentation performance, the following metrics are used: accuracy (ACC), F-measure (F1), true-positive rate (TPR), true-negative rate (TNR), area under the ROC curve (AUC), and area under the Precision-Recall curve (PRC):

$$\begin{aligned} \textrm{ACC} = \frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{FP}+\textrm{TN}+\textrm{FN}} \end{aligned}$$
(2)
$$\begin{aligned} \textrm{F1} = 2 \times \frac{\textrm{PPV} \times \textrm{TPR}}{\textrm{PPV}+\textrm{TPR}} \end{aligned}$$
(3)
$$\begin{aligned} \textrm{TPR} = \frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}} \end{aligned}$$
(4)
$$\begin{aligned} \textrm{TNR} = \frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}} \end{aligned}$$
(5)
$$\begin{aligned} \textrm{PPV} = \frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}} \end{aligned}$$
(6)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

As the main evaluation metric, ACC measures the proportion of correctly classified pixels in the image. TPR measures the proportion of blood vessel pixels that are correctly identified, while TNR measures the proportion of background pixels that are correctly identified. F1 is used to evaluate the comprehensive performance of the model.
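These pixel-wise metrics follow directly from the confusion counts; a straightforward NumPy sketch:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    # pred, gt: binary arrays with 1 = vessel pixel; Eqs. (2)-(6)
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    ppv = tp / (tp + fp)
    tpr = tp / (tp + fn)
    return {"ACC": (tp + tn) / (tp + tn + fp + fn),
            "F1": 2 * ppv * tpr / (ppv + tpr),
            "TPR": tpr,
            "TNR": tn / (tn + fp)}
```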

Results

First, we compare our method with several state-of-the-art methods, as shown in Table 2. Generally speaking, the performance of the methods in the STA family is worse than that of the DNN family: they rely primarily on hand-crafted features whose quality, given sufficient training data, cannot match that of features learned by DNN-based methods. The performance of the several U-Net-based methods is relatively high, and LEA U-Net achieves the best results on the two most important metrics, with the highest global accuracy of 0.9563 and the highest F1 of 0.8230. The Residual U-Net and the Recurrent U-Net have motivations similar to ours: the former uses residual structures in U-Net to optimize feature transfer within the network, and the latter replaces the convolutional layers before each down-sampling and up-sampling step in U-Net with recurrent convolutional layers to extract more complex features. The experimental results show that our method has an advantage in extracting and transmitting small blood vessel features. In addition, LE U-Net performs better than U-Net but worse than LEA U-Net, which demonstrates that both the local feature enhancement module and the attention block contribute to the performance improvement. We analyze this further by comparing other experimental results of the three models.

Fig. 7 Example patches (left) and the corresponding ground truth (right)

Table 2 Comparisons against existing approaches on DRIVE

Figures 8 and 9 show the evolution of the loss and ACC, respectively, during training of the three models, where the horizontal axis indicates the number of epochs. All the blue curves, corresponding to the training set, are very smooth, but the red curves, corresponding to the validation set, differ: the red curve of U-Net still fluctuates widely even after 70 epochs, while the curves of the other two models have essentially stabilized. Comparing U-Net with LE U-Net, at the beginning of training the loss of U-Net is lower than that of LE U-Net, yet the latter has higher accuracy, indicating that a more stable training process improves model performance. Compared with LE U-Net, the most obvious advantage of LEA U-Net is its convergence speed in the early stage. In short, the local feature enhancement module makes training more stable, while the attention block speeds up convergence, and together they lead to better performance.

Next, we assess model performance using ROC and PRC curves; the results are shown in Figs. 10 and 11. For both types of curves, a larger area under the curve (AUC) indicates a better model. Since LE U-Net and LEA U-Net achieve higher AUC values than U-Net for both curves, their performance is better than that of U-Net. In addition, LEA U-Net, with the integrated attention mechanism, also outperforms LE U-Net.

Fig. 8 The change of loss during the training process of different models. a U-Net; b LE U-Net; c LEA U-Net

Fig. 9 The change of ACC during the training process of different models. a U-Net; b LE U-Net; c LEA U-Net

Last but not least, the segmentation results of the three models are displayed with some details in Fig. 12. From the locally magnified view, it is obvious that the three methods segment tiny blood vessels differently. Since these small features are easily missed, precise segmentation is very difficult. Due to the limitations of its network, U-Net extracted only a few coarse pieces of information. With the help of the local feature enhancement module, LE U-Net extracted information better than U-Net. The attention mechanism further makes vessel-related features easier for the model to capture; as a result, LEA U-Net can identify tiny vessels and thus improves on the segmentation results of LE U-Net. More segmentation results of LEA U-Net are shown in Fig. 13.

Conclusions

Fig. 10 ROC curves of different models. a U-Net; b LE U-Net; c LEA U-Net

Fig. 11 PRC curves of different models. a U-Net; b LE U-Net; c LEA U-Net

Fig. 12 Magnified view of boxed patches predicted by different models

Fig. 13 Retinal images after preprocessing (upper row), ground truth (middle row), segmentation results (bottom row)

In this paper, we proposed a U-Net-based model, named LEA U-Net, to perform retinal blood vessel segmentation in a pixel-wise manner. LEA U-Net improves U-Net with two modules. (M1) A local feature enhancement module that strengthens the extraction of local features; it uses dilated convolutions with different parameters to extract multi-scale vessel features, supplementing the tiny features that are easily lost in the network's down-sampling and up-sampling process. (M2) An attention mechanism that focuses on highly relevant features. Experiments on the DRIVE database demonstrate the effectiveness of our model. Considering the timeliness requirements of practical applications, we plan to optimize the computational complexity of LEA U-Net in the future, reducing the time consumption of model training and segmentation while maintaining accuracy. We shall also extend our approach to other applications such as face recognition [58], fundus images [59], clinical datasets [60], and low-dose CT scan images [61].