1 Introduction

Image processing techniques have been applied to biomedical images for decades, and the design of computer-aided diagnosis (CAD) systems remains an active research area [1]. The purpose of CAD is to provide an accurate diagnosis of the underlying disease quickly, which can ultimately aid in the treatment of a large number of patients. Early diagnosis has been shown to considerably reduce mortality for diseases such as brain tumors, kidney stones, stomach cancer, lung cancer, and breast cancer [2]. In this regard, substantial research effort has been devoted to improving and supporting disease diagnosis from medical imagery.

The laborious nature of manual segmentation has increased the demand for automatic segmentation. Example images with segmentation masks are shown in Fig. 1. Traditional CAD methods, mostly based on hand-crafted features [3, 4], are now being replaced by variants of convolutional neural network (CNN) models such as AlexNet [5], VGGNet [6], and GoogLeNet [7]. The proven success of CNNs over traditional methods has led to new variants of these techniques, such as encoder-decoder architectures and deep generative models, for different medical imaging applications [8, 9].

Fig. 1

Images from the datasets used for the evaluation of R2U++. (Left to right) Electron microscopy image of a skin patch, CT image of COVID-19-affected lungs, retinal vessel image from the DRIVE dataset, and chest X-ray image from the JSRT dataset. Top row: original images. Bottom row: segmented images

From the architectural standpoint, models used for classification differ slightly from those used for segmentation. Classification models use an encoder and output class probabilities. In contrast, because segmentation demands capturing the context of an image alongside its location, a network must contain both encoding and decoding units. Segmentation tasks in medical imaging are generally more sensitive and require extra refinement compared to natural images because of the healthcare decisions that depend on them. For example, slight spiculation around a lung nodule in a CT image is an indication of malignancy, and its omission from the generated segmentation mask could lead to an incorrect clinical diagnosis. Therefore, there is a constant need to improve segmentation models so that they can correctly segment all the fine details of the object of interest.

The most widely adopted encoder-decoder structures in this regard are fully convolutional networks (FCN) [10] and U-Net [11]. These two architectures differ in how their skip connections recover lost fine details: FCN sums encoder features with up-sampled decoder feature maps, whereas U-Net concatenates them. U-Net, shown in Fig. 2a, was the first medical image segmentation model to outperform existing models on small medical imaging datasets. However, because of its simple architecture with plain convolutions in the encoder and decoder, U-Net becomes less effective for some complicated medical imaging tasks [12,13,14,15].

Fig. 2

Overview of the architectures. a U-Net with encoder and decoder blocks and simple skip connections, b R2U-Net with recurrent residual convolution blocks and simple skip connections, c U-Net++ with simple convolution blocks and dense skip connections

In U-Net, the skip connections between encoder and decoder require the concatenated feature maps to be at the same level. However, despite being at the same level, these feature maps are not semantically similar [13, 15]. Therefore, several variants of U-Net have been proposed, some attempting to change the backbone [16, 17] and others tweaking the skip connections between encoder and decoder [13, 15, 18]. The success of these variants in correctly classifying target objects in complex datasets can be attributed to two things: the encoder/decoder blocks and the skip connections [11, 13, 15]. Efficient encoder/decoder blocks enable the network to extract the features crucial for segmentation tasks, while the skip/shortcut connections between encoder and decoder help recover the lost fine details of foreground objects. Considering the importance of these two factors, we propose an architecture that enjoys the best of both worlds, i.e., an efficient backbone and improved skip pathways. First, to achieve better feature accumulation, we replace the plain convolution blocks of U-Net with recurrent residual convolution units adopted from [16], shown in Fig. 2b. These recurrent units unfold to a predefined time step t, making the network deeper at each layer. This increases the field of view in the lower layers of the network, enabling them to extract precise low-level features. Because low-level features, such as the boundaries of tumors or lungs and the size of an infection, are of utmost importance for the prognosis of the underlying disease, their accurate extraction boosts the network's performance. Second, the skip connections of vanilla U-Net are replaced by the dense skip connections adopted from U-Net++ [13, 15], shown in Fig. 2c. In vanilla U-Net, the feature maps coming from the encoder are at a lower semantic level than those of the decoder; this difference is called the semantic gap. The dense skip connections reduce this semantic gap between encoder and decoder features before concatenation. Moreover, they forward information at different scales to the decoder, which can then aggregate multiscale features to enhance segmentation accuracy. These architectural modifications introduce multiple embedded models of different depths that partially share a common encoder. In addition, training the network under deep supervision performs shared learning across all the embedded depths, which is highly beneficial for segmenting multiscale foreground objects. Our main contributions are:

  1. We introduce a new, deeper segmentation model, R2U++, for medical image segmentation. The model replaces the plain convolution blocks of U-Net with recurrent residual convolution units, which unfold to a predefined time step t, making the network deeper at each layer and providing a large field of view even in the lower layers so that features enriched with low-level information can be extracted.

  2. We use dense skip pathways. These pathways reduce the semantic gap between the encoder and decoder feature maps being concatenated and propagate multiscale information to the decoder. The dense skip connections also improve gradient flow.

  3. The dense skip pathways also enable us to merge multiple architectures of different depths into a single architecture. The ensemble of these depths can capture objects of varying sizes.

  4. Equipped with the above characteristics, our recurrent residual architecture with dense skip connections consistently outperforms existing models on medical images of different modalities, including electron microscopy (EM) images of skin lesions, computed tomography (CT) images of COVID-19-affected lungs, chest X-ray images, and fundoscopic images of retinal vessels.

The remainder of this paper is organized as follows. In Sect. 2, we discuss the related work. The proposed architecture is explained in Sect. 3. The datasets used in the study and the experimental details are presented in Sect. 4. Results are presented in Sect. 5. The paper is concluded in Sect. 6.

2 Related work

Semantic segmentation refers to the kind of labeling in which a label is assigned to each pixel of an image. In this domain, the work on fully convolutional networks (FCN) introduced the concept of combining "what" and "where" information to properly label the pixels of an image [10], achieved by adding links between the coarse and the fine layers. In [19], Chen et al. proposed DeepLab for semantic image segmentation using atrous convolution, which not only increased the field of view but, through atrous spatial pyramid pooling (ASPP), also enabled segmentation of objects at multiple scales. SegNet [20] is another encoder-decoder segmentation network, in which the encoder is similar to the VGG network [6] with no fully connected layers at the end; its major contribution was the reuse of max-pooling indices from the corresponding encoder layers in the decoder. Most of these architectures rely on large datasets and are designed specifically for computer vision applications. The major problem that initially hampered the success of convolutional neural networks in medical image segmentation was the unavailability of sufficient medical images for training deep models. This problem was first tackled by U-Net [11], a segmentation network specifically designed for medical image segmentation that works relatively well even on smaller datasets. Since then, U-Net has become a popular choice for medical image segmentation tasks.

U-Net is built upon FCN [10] and comprises two paths: a contracting path and an expanding path. The contracting path uses traditional convolutional encoding units that perform convolution operations followed by rectified linear unit (ReLU) activations, and feature maps are then down-sampled via 2 × 2 max pooling. The main modification of this architecture was a symmetric expanding path with a large number of feature channels obtained through up-convolution. In the expanding path, up-sampling is followed by an up-convolution that halves the number of feature maps, and these features are then concatenated with the feature maps from the corresponding encoding unit. The architecture was adopted quickly due to several advantages. Firstly, it captures context and location information simultaneously. Secondly, it meets the demand for a network that provides good results on small medical imaging datasets. Finally, it is trained end-to-end and produces a segmentation mask in a single forward pass. U-Net is not restricted to medical imaging and has also proven effective in many computer vision applications [21]. A minimal code sketch of this encoder-decoder pattern is given below. Several variants of U-Net have been proposed to adapt the simple U-Net architecture to complex datasets. These alterations can be broadly classified into two categories: changing the backbone and reforming the skip connections, as discussed below.
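To make the encoder-decoder pattern concrete, the following is a minimal sketch of one contracting and one expanding U-Net step in Keras/TensorFlow (the libraries used later in this paper); the filter counts, input size, and use of a transposed convolution for the up-convolution are illustrative choices, not the exact configuration of any model evaluated here.

```python
# Minimal sketch of one U-Net contracting step and one expanding step.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3 x 3 convolutions with ReLU, as in the original U-Net blocks.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

inputs = layers.Input(shape=(96, 96, 1))

# Contracting path: convolution block followed by 2 x 2 max pooling.
e0 = conv_block(inputs, 32)
p0 = layers.MaxPooling2D(pool_size=2)(e0)
e1 = conv_block(p0, 64)

# Expanding path: up-convolution halves the channels, then the encoder feature
# map at the same level is concatenated (the U-Net skip connection).
u0 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(e1)
d0 = conv_block(layers.concatenate([u0, e0]), 32)

# 1 x 1 convolution with sigmoid produces the per-pixel segmentation mask.
outputs = layers.Conv2D(1, 1, activation="sigmoid")(d0)
model = tf.keras.Model(inputs, outputs)
```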

2.1 Modified backbone

The U-Net model uses two convolution layers in each encoder-decoder block, which makes it too simple for complex datasets. One way researchers have dealt with this problem is to increase the depth of the network. However, increasing the depth is not as easy as stacking layers: networks with tens of layers initially faced the issue of vanishing gradients [22]. This issue has been addressed by using different activation functions such as ReLU and exponential linear units (ELU) [6, 7], and by applying normalization between the layers [23]. He et al. [24] pointed out the degradation problem: increasing the network's depth first saturates the performance and then promptly degrades it. To overcome this problem, they proposed identity mappings, or skip connections, in their Residual Network (ResNet). ResNet learns via a residual function, which makes the optimization task easier; this approach overcame the degradation problem and improved the network's performance. Ever since, deep models and skip connections have gone hand in hand. Such residual connections are quite popular in deep U-Net variants; for example, in [16] the authors devised the Recurrent Residual U-Net (R2U-Net). The model modifies U-Net [11] by replacing its simple convolutional units with recurrent residual convolutional layers (RRCL) [24, 25]. Each encoder-decoder unit has two recurrent convolution sub-blocks, each of which unfolds to a time step t, and the final output is an element-wise summation of the output of the second recurrent convolution block and the original input. The increased field of view, even in the lower layers, and the efficiency of feature summation aid in extracting very low-level features, which are crucial for medical image segmentation; this architecture outperformed models with far more parameters. In [26], however, this element-wise feature summation did not improve testing performance because the summation was performed outside the network. Similarly, in M-UNet [27], the authors made the network sufficiently deep by embedding DenseNet [28] in the architecture: the convolution blocks of the encoder are replaced by DenseNet, while plain convolutions are kept in the decoder. This arrangement makes the network deeper and improves performance while keeping a reasonable number of parameters. DIU-Net [29] is an attempt to make the U-Net model wider and deeper by fusing Inception-Res and dense inception blocks. Unlike the traditional Inception-Res block, each convolution layer is followed by a batch normalization layer to avoid vanishing gradients. The network uses three dense inception blocks, one in the encoder, one in the decoder, and one in the middle; the dense inception blocks of the analysis and synthesis paths contain 12 inception blocks each, whereas the middle one uses 24. Experiments showed improvement over state-of-the-art models, but the downside is that increasing the growth rate leads to too many network parameters, which makes training slower and more difficult. Likewise, in MultiResUNet [12], the encoder-decoder blocks are replaced by MultiRes blocks, which make use of residual connections. The motivation behind MultiRes blocks is to make the network capable of segmenting foreground objects that appear at various scales in medical images.
These blocks implement Inception-like blocks [7] that approximate 3 × 3, 5 × 5, and 7 × 7 filters with successive 3 × 3 filters, together with a 1 × 1 convolution added through a residual connection to preserve the dimensionality of the image. The architecture has shown significant improvement over U-Net across five medical image modalities. With a focus on extracting advanced segmentation features, probabilistic programming is combined with U-Net in [30] to enhance performance on ultrasound nerve segmentation. Similarly, in the residual attention U-Net model [31], the authors use aggregated residual transformations and soft attention in the decoder. The aggregated residual block makes the network efficiently deep, which is crucial for extracting good features for a complex multi-class problem; the network outperformed U-Net on segmentation of the COVID-19 dataset. Another encoder-decoder network, presented in [32], proposes a residual block and a feature variation (FV) unit, which are used in the first three layers of the encoder. In the fourth layer, progressive atrous spatial pyramid pooling is added to increase the receptive field, whereas the decoder comprises simple deconvolution blocks. The architecture demonstrates the importance of an increased receptive field for the performance of a model.

2.2 Modified skip connections

Most variants of U-Net, including those designed for 3D medical images [34, 35], use plain skip connections. The effectiveness of skip connections in recovering lost fine-grained details has been demonstrated in many other segmentation architectures [38,39,40,41] and by Drozdzal et al. [42].

Zhou et al. [13, 15] brought attention to redesigning the skip connections between the encoder and decoder networks. In U-Net [11], the features from the encoder are directly concatenated with those of the decoder, which requires them to be at the same scale. However, the authors in [13, 15] argued that even though these feature maps are at the same scale, they are not semantically similar, and there is no theory to back that this fusion is the best possible strategy. Therefore, they replaced these simple connections with dense convolutional blocks to enrich encoder features with semantic information and bring their semantic level closer to that of the awaiting decoder before merging, which makes the optimization task easier. Another contribution was to introduce an ensemble of U-Nets of different depths, making the model capable of segmenting objects of varying sizes with high accuracy. These dense skip connections were quickly adopted in models for various applications such as gallstone segmentation [36], pelvic organ segmentation [37], and brain tumor segmentation [14, 43]. The use of dense skip connections in [15] has also proven effective for Mask R-CNN segmentation. Likewise, Dense U-Net++ [14] uses Half Dense U-Net [33] with dense skip connections along the skip pathways; the dense block at each layer uses the aggregated features from all previous layers, highlighting the benefit of combining dense skip pathways with aggregated features from Half Dense U-Net. MDU-Net [18] redesigned the skip connections so that each decoder is connected with three encoders depending on its position. In addition, the network uses skip connections along each encoder/decoder path to connect every block with all previous blocks, which enables the use of features from different scales. The architecture demonstrates the importance of using features from various scales, together with feature concatenation from different encoders, for gland segmentation. Different medical imaging segmentation models and variants of U-Net are summarized in Table 1.

Table 1 Development of medical imaging segmentation models over the years

3 Proposed network architecture: R2U++

To overcome the challenges of U-Net [11] and its variants mentioned in Sect. 2, we propose the R2U++ model. The three main components of the proposed network, namely the skip pathways, the backbone, and deep supervision, are described below.

3.1 Skip pathways

The redesigned skip pathways modify the connection between encoder and decoder. Inspired by U-Net++ [13, 15], the feature map coming from the encoder passes through dense skip pathways before entering the decoder block. The dense skip pathways refer to the dense skip connections to the convolution blocks along the skip pathway. The number of convolution blocks along a skip pathway is determined by its pyramid level. As shown in Fig. 3d, for example, if the encoder and decoder are at level 4, with encoder block \(X^{(0,0)}\) and decoder block \(X^{(0,4)}\), there will be three convolution blocks \(X^{(0,1)}\), \(X^{(0,2)}\), and \(X^{(0,3)}\) in the dense skip pathway. Each convolution block along the skip pathway applies convolution to the concatenation of the feature maps coming from all the previous blocks at the same level and the corresponding up-sampled feature map from the block below. For example, \(X^{(0,2)}\) applies convolution to the concatenation of the feature maps from the same-level blocks \(X^{(0,0)}\) and \(X^{(0,1)}\) and the up-sampled feature map from the lower block \(X^{(1,1)}\). In this way, multiscale features with the same resolution are combined horizontally, whereas multiscale features of different resolutions are combined vertically. This not only reduces the feature gap between encoder and decoder but also captures multiscale context.

Fig. 3

R2U++ with evolving depths from a to d; the ensemble of these depths is shown in e. Each convolutional block performs recurrent convolution depending on the time step t, as shown for t = 2 in Fig. 4. Residual connections are added around the recurrent convolutions to avoid degradation problems (as shown in Fig. 5). a–d R2U++ with depths \(L^{1}\) to \(L^{4}\); every decoder at every depth receives similar-resolution multiscale features horizontally from its corresponding dense skip pathways, whereas varying-resolution multiscale features are aggregated vertically across the network. e Average ensemble. In the average ensemble network, all of these networks have their own decoders but partially share the same encoder, which introduces shared learning into the network. R2U++ can explicitly benefit from deep supervision, as depths \(L^{2}\), \(L^{3}\), and \(L^{4}\) are embedded with their corresponding lower-level networks

Mathematically, the skip pathways can be formulated by Eq. (1). Let \(m\) be the index of the down-sampling layer along the encoder and \(n\) the index of the convolution block along the skip pathway. The concatenated input to the convolutional block \(X^{(m,n)}\) can then be expressed as:

$$x_{i}^{m,n}=\left\{\begin{array}{ll} x^{m-1,n}, & n=0\\ \left[\left[x^{m,k}\right]_{k=0}^{n-1},\; u\left(x^{m+1,n-1}\right)\right], & n>0 \end{array}\right.$$
(1)

The feature map for the \({X}^{(m,n)}\) convolutional layer then becomes:

$${x}_{o}^{m,n}=H\left({x}_{i}^{m,n}\right)$$
(2)

where H(.) denotes the recurrent residual convolution explained in Sect. 3.2, u(.) denotes up-sampling from the lower level, and the large square brackets represent the concatenation operation. As Fig. 3 shows, the outermost nodes with n = 0 are fed with only one input, coming from the encoder block above. Nodes with n = 1 receive two inputs: one from the same level and one up-sampled input from the level below. Due to the dense skip connections, a node with n > 1 receives n inputs from the same level plus one input up-sampled from the corresponding lower level.
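For illustration, the following is a Keras-style sketch of how the concatenated input of Eq. (1) could be assembled for a node \(X^{(m,n)}\). The dictionary `x_out`, which holds the output tensor of every node already computed, the function name, and the placement of pooling/up-sampling are our own illustrative choices rather than the authors' implementation.

```python
# Sketch of Eq. (1): assembling the input of node X^(m,n) on a dense skip pathway.
from tensorflow.keras import layers

def skip_node_input(x_out, m, n):
    if n == 0:
        # Plain encoder node: its only input is the output of the node above
        # (down-sampled by max pooling before the convolution, as in the encoder).
        return x_out[(m - 1, n)]
    # Dense skip pathway node: all previous outputs at the same level, plus the
    # up-sampled output of the node one level below and one step behind.
    same_level = [x_out[(m, k)] for k in range(n)]
    below = layers.UpSampling2D(size=2)(x_out[(m + 1, n - 1)])
    return layers.concatenate(same_level + [below])
```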

3.2 Backbone

The U-Net model and its variants have been reporting leading results on several medical image segmentation datasets. Inspired by one of these variants, the Recurrent Residual U-Net [16], we use recurrent residual convolution layers (RRCL) in place of the simple convolutional layers of U-Net. A recurrent convolution layer (RCL) evolves over discrete time steps [25]. Let t denote the discrete time step, and let H(.) denote the RRCL operation at time step t. Following [16], the feature map of an RCL can be represented as:

$$\left(O^{m,n}\right)_{t}=\left(w^{m,n}\right)_{t}^{f}*\left(x_{i}^{m,n}\right)_{t}^{f}+\left(w^{m,n}\right)_{t-1}^{r}*\left(x_{i}^{m,n}\right)_{t-1}^{r}$$
(3)

Here, \(\left(x_{i}^{m,n}\right)_{t}^{f}\) and \(\left(x_{i}^{m,n}\right)_{t-1}^{r}\) are the feed-forward and recurrent inputs of the RCL, respectively. The term \(\left(w^{m,n}\right)_{t}^{f}\) represents the weights of the standard convolution operation, whereas \(\left(w^{m,n}\right)_{t-1}^{r}\) represents the weights of the recurrent convolution operation. The output \(\left(O^{m,n}\right)_{t}\) generated by the recurrent convolution block is fed to the ReLU activation function:

$$\left(F^{m,n}\right)_{t}=f\left(\left(O^{m,n}\right)_{t}\right)=\max\left(0,\left(O^{m,n}\right)_{t}\right)$$
(4)

This output of the RCL unit at time step t is then passed to the succeeding RCL unit of the RRCL. If \(\left(F^{m,n}\right)_{t}\) denotes the output of the second RCL unit of the RRCL, the final output of the RRCL is computed as:

$$(x_{o}^{m,n} )_{t} = (x_{i}^{m,n} )_{t} + (F^{m,n} )_{t}$$
(5)

Here, \(\left(x_{o}^{m,n}\right)_{t}\) is the output of the RRCL unit at time step t. This output is then fed to the down-sampling layer in the case of the encoder, to the up-sampling layer in the case of the decoder, and to the next recurrent residual convolution layer (RRCL) in the case of the skip pathways.

The unfolding of an RCL for t = 2 is shown in Fig. 4. For the convolution operation at t = 2, both the current input at t = 2 and the output from the previous time step t = 1 are convolved according to Eqs. 3 and 4. Each recurrent residual block, as shown in Fig. 5, further comprises two recurrent convolution blocks; the input sample fed into a recurrent residual block passes through two back-to-back recurrent convolution blocks. The final output of the recurrent residual block is the element-wise summation of the original input and the output of the second RCL block at time step t. All the convolutional blocks in R2U++ are recurrent residual convolution blocks.
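A minimal sketch of such a recurrent residual convolution layer with two RCL units and t = 2, following Eqs. (3)-(5), is given below. For brevity, a single shared convolution plays the role of both the feed-forward and recurrent kernels of Eq. (3), and a 1 × 1 projection aligns the channels for the residual sum; both are illustrative simplifications, not the authors' exact implementation.

```python
# Sketch of the RRCL backbone block of R2U++ (Eqs. 3-5), with t = 2.
from tensorflow.keras import layers

def rcl(x, filters, t=2):
    # Recurrent convolution layer: the response is refined by re-applying the
    # convolution to the sum of the block input and the previous state.
    # Assumes `x` already has `filters` channels.
    conv = layers.Conv2D(filters, 3, padding="same")
    o = layers.Activation("relu")(conv(x))                 # t = 0: plain convolution
    for _ in range(t):                                     # recurrent steps t = 1, 2, ...
        o = layers.Activation("relu")(conv(layers.add([x, o])))
    return o

def rrcl_block(x, filters, t=2):
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)  # match channels for Eq. (5)
    o = rcl(shortcut, filters, t)                          # first RCL unit
    o = rcl(o, filters, t)                                 # second RCL unit
    return layers.add([shortcut, o])                       # residual connection of Eq. (5)
```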

Fig. 4

(Figure adapted from R2U-Net [16])

The unfolding of the recurrent convolution block for time step t = 2. For t = 0, the output is generated through a convolution layer with no recurrent input. For t = 1, the output generated at t = 0 is combined with the original input and fed to the network; a similar process is repeated for t = 2. Conv represents convolution.

Fig. 5

(Figure adapted from R2U-Net [16])

Each recurrent residual block consists of two successive recurrent convolution blocks, which are explained in Fig. 4. The residual connection generates the final output by combining the original input with the output of the second recurrent unit. Conv represents convolution.

3.3 Deep supervision

The added dense skip connections enable the network to merge architectures of various depths into a single architecture, as shown in Fig. 3. The different depths are shown separately in Fig. 3a–d: Fig. 3a shows the architecture with only one decoder, making it a level-1 network, while the level-2 architecture in Fig. 3b has the level-1 blocks \(X^{(0,0)}\), \(X^{(0,1)}\), and \(X^{(1,0)}\) embedded in it. Similarly, level-3 and level-4 are shown in Fig. 3c and d. For Fig. 3a–d, the output is taken from \(L^{1}\), \(L^{2}\), \(L^{3}\), and \(L^{4}\), respectively; these networks are trained without deep supervision using Eq. 6. Figure 3e refers to the ensemble network, in which the final output is the average of the outputs from the different depths. The ensemble architecture shown in Fig. 3e is a level-4 network embedded with all lower depths, i.e., \(L^{1}\), \(L^{2}\), and \(L^{3}\). All four levels share the same encoders but have their own decoders. Each level is trained with its own loss function applied at the node \(X^{(0,q)}\), where \(q\in \{1,2,3,4\}\), and at inference the final output is calculated by averaging the outputs from all depths. When R2U++ is trained under the deep supervision scheme, the loss function is applied to the nodes \(X^{(0,q)}\) with \(q\in \{1,2,3,4\}\): a 1 × 1 convolution layer followed by an activation function is added at the output of nodes \(X^{(0,1)}\), \(X^{(0,2)}\), \(X^{(0,3)}\), and \(X^{(0,4)}\). This convolution layer has C filters, one for each of the C segmentation classes of a dataset. We use the loss function defined for U-Net++ in [13, 15], a hybrid loss comprising pixel-wise cross-entropy loss and soft dice coefficient loss, calculated at each semantic level, i.e., \(X^{(0,1)}\), \(X^{(0,2)}\), \(X^{(0,3)}\), and \(X^{(0,4)}\). The hybrid loss enjoys the benefits of both components: smooth gradients and robustness to class imbalance. Mathematically, it can be written as:

$$L\left( {Y,P} \right) = - \frac{1}{N}\mathop \sum \limits_{c = 1}^{C} \mathop \sum \limits_{n = 1}^{N} \left( { y_{n,c} \log p_{n,c} + \frac{{2y_{n,c} p_{n,c} }}{{y_{n,c}^{2} + p_{n,c}^{2} }}} \right)$$
(6)

where Y denotes the ground truth labels, P denotes the predicted probability values, and C represents the number of segmentation classes. Furthermore, \(y_{n,c}\in Y\) and \(p_{n,c}\in P\), where n indexes the pixels of a batch containing N pixels in total. Finally, the total loss is the weighted sum of the individual losses. Mathematically, it can be written as:

$$L = \sum\limits_{i = 1}^{d} \eta_{i} \cdot L\left( Y, P^{i} \right)$$
(7)

The summation runs over the number of supervised decoders, denoted by d. The value of \(\eta_{i}\) is set to one to assign the same weight to all decoder losses.
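The following is a sketch of the hybrid loss of Eq. (6) and the deeply supervised total loss of Eq. (7), assuming binary masks and per-pixel sigmoid outputs; the clipping epsilon and the function names are our own illustrative choices.

```python
# Hybrid cross-entropy + soft dice loss (Eq. 6) and deep supervision (Eq. 7).
import tensorflow as tf

def hybrid_loss(y_true, y_pred, eps=1e-7):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    ce = y_true * tf.math.log(y_pred)                                   # cross-entropy term
    dice = 2.0 * y_true * y_pred / (tf.square(y_true) + tf.square(y_pred) + eps)  # soft dice term
    return -tf.reduce_mean(ce + dice)                                   # Eq. (6)

def deeply_supervised_loss(y_true, decoder_outputs, weights=None):
    # Eq. (7): weighted sum of the hybrid loss over the supervised decoder
    # outputs X^(0,1)...X^(0,4); equal weights (eta_i = 1) are used in the paper.
    weights = weights or [1.0] * len(decoder_outputs)
    return tf.add_n([w * hybrid_loss(y_true, p)
                     for w, p in zip(weights, decoder_outputs)])
```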

To sum up the benefits of our architecture: the residual unit helps train a deeper architecture while avoiding degradation problems; the recurrent unit aids feature accumulation, enabling the accurate extraction of low-level features that are crucial for segmentation; the convolution layers on the skip pathways reduce the dissimilarity between encoder and decoder features; the dense skip/shortcut connections along the skip pathways improve gradient flow; and finally, ensembling the multi-depth outputs ensures better accuracy on multiscale foreground objects.

4 Experiments

The experimentation process involves two main steps, training and testing, as shown in Fig. 6. For training, pre-processed images are fed to R2U++ to train the model using cross-validation. Once training is complete, unseen test data are presented to the trained model to make predictions.

Fig. 6

Process flow diagram of training and testing. Training and validation images are pre-processed before being presented to R2U++ for learning. Once the model has learned, pre-processed test data are presented to the trained model to predict the segmentation masks. The predicted masks shown for each dataset are for illustration purposes only. All implementations use the TensorFlow and Keras libraries

4.1 Datasets

The proposed architecture has been evaluated on a range of biomedical image segmentation datasets, namely: (1) the Electron Microscopy (EM) dataset of skin lesions, (2) the COVID-19 dataset of lung CT images, (3) the DRIVE dataset of retinal fundoscopic images, and (4) the JSRT dataset of chest X-ray images. These datasets cover the segmentation of skin lesions, lungs, and retinal blood vessels, as shown in Fig. 1, and are generated from medical imaging modalities such as microscopy, CT, and X-ray.

  1. Electron Microscopy (EM): This publicly available dataset is part of the ISBI 2012 EM segmentation challenge [44]. It comprises 30 images of 512 × 512 pixels extracted from serial section transmission electron microscopy (ssTEM) of the Drosophila first instar larva ventral nerve cord (VNC), and fully annotated ground truth labels are provided for each image. Cells are labeled white, whereas membranes are represented by black pixels. For experimentation, we randomly split the dataset into 27 training images, of which 3 are used for validation, and test on the remaining 3 images. To overcome the small number of images, we use a patch-based strategy for both training and inference: all patches are generated with a sliding window of size 96 × 96 and a stride of 48.

  2. COVID-19 CT Images Dataset: This is the first publicly available dataset for COVID-19 segmentation [45]. It comprises 100 CT slices extracted from 19 COVID-19 patients and gathered by the Italian Society of Medical and Interventional Radiology, with ground truth available for only these 100 slices. To overcome the small amount of labeled data, another dataset was generated in [46] by extracting unlabeled images from the COVID-19 CT segmentation data: the unlabeled CT volumes of all 19 patients were extracted and pseudo labels were generated for 1600 2D slices from these volumes. We use these pseudo labels from [46] to pre-train our network, and the resulting weights are used for initialization. From the 100 labeled slices, 45 randomly selected images are used for training, 5 for validation, and 50 for evaluating the model's performance. As these images are not of uniform dimension, all images are resized to 256 × 256.

  3. JSRT dataset of chest X-ray images: The dataset used for lung segmentation is produced by the Japanese Society of Radiological Technology (JSRT) [47]. It contains 247 chest X-rays (154 nodule and 93 non-nodule images) with a resolution of 2048 × 2048. We split the dataset into 80% training and 20% testing, and 38 of the training images are used for validation. The images are resized to 256 × 256 to reduce the computational complexity.

  4. Blood Vessel Segmentation: In our experimentation, we use the DRIVE database [48] for retinal blood vessel segmentation. The dataset has a total of 40 retinal images of 565 × 584 pixels, split into 20 training and 20 testing images. To square the image dimensions, we crop each image to rows 19–554 and columns 29–564, resulting in images of size 535 × 535. As for EM, we use a patch-based technique for both training and inference, extracting patches with a sliding window of size 96 × 96 and a stride of 5. We generate 154,880 testing and 154,880 training patches, of which 30,976 are used for validation. A minimal sketch of this sliding-window patch extraction is given after this list.
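The sliding-window patching used for the EM and DRIVE datasets can be sketched as follows (e.g., 96 × 96 patches with a stride of 48 for EM and 5 for DRIVE). This is an illustrative sketch, not the authors' exact pre-processing code; handling of the image border remainder is omitted.

```python
# Minimal sliding-window patch extractor for 2-D images.
import numpy as np

def extract_patches(image, patch_size=96, stride=48):
    """Cut a 2-D image into square patches on a regular sliding-window grid."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return np.stack(patches)

# Example: a 512 x 512 EM slice with stride 48 yields a 9 x 9 grid of patches.
patches = extract_patches(np.zeros((512, 512)), patch_size=96, stride=48)
assert patches.shape == (81, 96, 96)
```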

4.2 Quantitative analysis approaches

For the analysis of the experimental results, the following evaluation metrics are used in this study.

  (1) Dice coefficient: The dice coefficient is a commonly used metric for image segmentation, computed as follows:

$${\text{DC}} = 2\frac{{\left| {{\text{GT}} \cap {\text{PR}}} \right|}}{{\left| {{\text{GT}}} \right| + \left| {{\text{PR}}} \right|}}$$
(8)

where GT represents the ground truth labels and PR represents the predicted labels.

  (2) Accuracy: Accuracy measures the fraction of pixels that are correctly classified by the network. It is calculated as:

    $${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
    (9)

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.

  (3) Intersection over union: Another commonly used metric for image segmentation is the intersection over union (IoU), computed as the ratio of the intersection of the ground truth and predicted labels to their union. The formula is given below (a short implementation sketch of all three metrics follows this list):

$${\text{IoU}} = \frac{{\left| {{\text{GT}} \cap {\text{PR}}} \right|}}{{\left| {{\text{GT}} \cup {\text{PR}}} \right|}}$$
(10)
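The following are illustrative NumPy implementations of Eqs. (8)-(10) for binary masks; the small epsilon guarding against empty masks is our own addition, not part of the paper's definitions.

```python
# Evaluation metrics for binary segmentation masks (Eqs. 8-10).
import numpy as np

def dice_coefficient(gt, pr, eps=1e-7):
    gt, pr = gt.astype(bool), pr.astype(bool)
    return 2.0 * np.logical_and(gt, pr).sum() / (gt.sum() + pr.sum() + eps)    # Eq. (8)

def pixel_accuracy(gt, pr):
    return (gt.astype(bool) == pr.astype(bool)).mean()                         # Eq. (9)

def iou(gt, pr, eps=1e-7):
    gt, pr = gt.astype(bool), pr.astype(bool)
    return np.logical_and(gt, pr).sum() / (np.logical_or(gt, pr).sum() + eps)  # Eq. (10)
```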

4.3 Baseline and implementation

We compare the performance of our proposed model with U-Net, R2U-Net, and U-Net++. The details of the architectures and the number of filters used in the study are shown in Table 2. The numbers of filters used in the proposed model are [32, 64, 128, 256, 512]; for each convolution block \(X^{m,n}\), the number of filters is given in Table 2, e.g., for m = 0 and n = 0 to 4, i.e., blocks \(X^{0,0}\) to \(X^{0,4}\), 32 filters are used. The filter size is kept at 3 × 3 in all layers with a stride of 2. Down-sampling is done by max pooling with a filter size of 2 × 2 and a stride of 2. Batch normalization is followed by the ReLU activation function, and a sigmoid activation is used in the final layer to generate the predicted probability values. We use the Adam optimizer with the learning rate set to 3e−4. All experiments are implemented using Keras and TensorFlow on an NVIDIA GeForce RTX 2060 with 6 GB of dedicated memory. For training, we use early stopping on the validation datasets.

Table 2 Details of the architectures and number of filters used in each convolution block \({X}^{m,n}\)
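For reference, the following sketch shows the training configuration just described (Adam with a learning rate of 3e−4, sigmoid output, early stopping on the validation set). The single-layer stand-in model, the batch size, the patience value, and the dummy arrays are illustrative placeholders only; in practice the R2U++ graph built from the Sect. 3 sketches and the hybrid loss of Eq. (6) would be used.

```python
# Training configuration sketch: Adam (lr = 3e-4) with early stopping.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in model (a single 1 x 1 convolution) used only to keep this snippet
# runnable; it would be replaced by the full R2U++ graph.
inputs = layers.Input(shape=(96, 96, 1))
outputs = layers.Conv2D(1, 1, activation="sigmoid")(inputs)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
              loss="binary_crossentropy",   # the paper's hybrid loss (Eq. 6) would go here
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=10,        # assumed, not reported
                                              restore_best_weights=True)

# Dummy arrays standing in for the pre-processed training/validation patches.
x_tr, y_tr = np.zeros((16, 96, 96, 1)), np.zeros((16, 96, 96, 1))
x_va, y_va = np.zeros((4, 96, 96, 1)), np.zeros((4, 96, 96, 1))
model.fit(x_tr, y_tr, validation_data=(x_va, y_va),
          epochs=5, batch_size=8, callbacks=[early_stop])
```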

5 Results and discussion

The results of R2U++ are compared with the U-Net, R2U-Net, and U-Net++ models in terms of IoU and dice coefficient for the EM, COVID-19, and JSRT datasets. These networks are trained for 20 independent trials, and the mean IoU and mean dice coefficient with standard deviation (sd) are reported over these trials. The performance on the DRIVE dataset is evaluated using the dice coefficient, sensitivity, specificity, and accuracy. The results reported in Tables 3 and 4 show that R2U++ consistently outperforms U-Net++. In summary, the IoU improvement achieved over U-Net++ is up to 3.58% without deep supervision and up to 1.87% with deep supervision; compared with R2U-Net, the IoU improvement is up to 7.11%. The details are as follows.

Table 3 Segmentation results on the EM, COVID-19, and JSRT datasets for U-Net, R2U-Net, U-Net++, and R2U++
Table 4 The recorded dice coefficient, sensitivity, specificity, and accuracy values for U-Net, R2U-Net, U-Net++, and R2U++

The improvements over U-Net++ in terms of mean IoU and dice coefficient without deep supervision are (2.03↑, 1.61↑) for COVID-19, (3.58↑, 2.11↑) for JSRT, and (0.91↑, 0.05↑) for EM. With deep supervision, the improvements in mean IoU and dice coefficient are (1.72↑, 1.31↑) for COVID-19, (1.87↑, 1.09↑) for JSRT, and (1.02↑, 0.52↑) for EM. Similarly, the improvement over R2U-Net with deep supervision is (4.98↑, 5.05↑) for COVID-19, (7.11↑, 4.55↑) for JSRT, and (0.56↑, 0.81↑) for EM. It is evident from the results that adding the recurrent residual connections yields a decent improvement in performance in both cases.

The nature of the complexity of the EM dataset is different from that of the others because a major part of each image consists of foreground pixels, while the very thin membranes belong to the background. R2U-Net achieves a higher IoU than U-Net++ on EM, which shows that recurrent residual connections help to draw clear boundaries between thin background classes and majority foreground classes. The dice coefficient achieved by our method for COVID-19 is higher than that reported by Inf-Net [46] by 3.25↑. The segmented images and difference images with respect to the ground truths for the EM, COVID-19, JSRT, and DRIVE datasets are shown in Figs. 7, 8, 9 and 10, respectively. In the case of EM segmentation in Fig. 7, the comparison of rows 2, 4, 6, and 8 shows that with R2U++ the contours of cells are segmented properly while preserving the thickness of the cell membranes with no breakage. Similarly, for COVID-19, the contours from R2U++ are better defined than those from U-Net++, which are more rounded, as shown in Fig. 8; in addition, U-Net++ produces more false positives than R2U++. Similar behavior can be observed for JSRT in Fig. 9.

Fig. 7

The semantic segmentation outputs and difference images with ground truth for EM dataset from R2U++ (Ours), U-Net++, R2U-Net, and U-Net. The first row has the input image, and the final row contains the ground truth image. Diff represents the difference

Fig. 8

The semantic segmentation outputs and difference images with ground truth for COVID-19 dataset from R2U++ (Ours), U-Net++, R2U-Net, and U-Net. The first row has the input image, and the final row contains the ground truth image. Diff represents the difference

Fig. 9

The semantic segmentation outputs and difference images with ground truth for JSRT dataset from R2U++ (Ours), U-Net++, R2U-Net, and U-Net. The first row has the input image, and the final row contains the ground truth image. Diff represents the difference

Fig. 10

The semantic segmentation outputs and difference images with ground truth for DRIVE dataset from R2U++ (Ours), U-Net++, R2U-Net, and U-Net. The first row has the input image, and the final row contains the ground truth image. Diff represents the difference

Experimental results for the DRIVE dataset are reported in Table 4. In comparison with U-Net++ without deep supervision, the changes in dice coefficient, sensitivity, specificity, and accuracy are 0.13↑, 0.02↑, − 0.04↓, and 0.1↑, respectively. With deep supervision, the improvements in dice coefficient, sensitivity, specificity, and accuracy are 0.59↑, 1.18↑, 0.22↑, and 0.09↑, respectively. It can be observed from the difference images shown in Fig. 10 that R2U++ is slightly better than U-Net++ at segmenting thin blood vessels. Similarly, the improvement over R2U-Net with deep supervision is 0.64↑ in dice coefficient, 0.82↑ in specificity, and 0.44↑ in accuracy.

The learning curves of each model for all datasets are shown in Figs. 11 and 12, using the loss function from Eq. 7 without and with deep supervision, respectively. R2U++ has the lowest validation error in all cases. The comparison of inference times of the models under study is shown in Fig. 13. The models were tested on 20,000 DRIVE patches of size 96 × 96. As expected, U-Net, having the fewest parameters, takes the least amount of time, while our model takes the most.

Fig. 11

The learning curves for U-Net, R2U-Net, U-Net++ and R2U++ (Ours) for EM and COVID-19 datasets

Fig. 12

The learning curves for U-Net, R2U-Net, U-Net++ and R2U++ (Ours) for Drive and JSRT datasets

Fig. 13

Comparison of inference times of U-Net, R2U-Net, U-Net++ and R2U++ (Ours). Training with deep supervision is denoted by the short notation 'DS'

While the proposed method consistently outperformed U-Net++ and U-Net on the segmentation tasks, we observed a significant increase in the number of trainable parameters and, thus, in the computational resources required to train the model. However, we believe that this requirement is alleviated by the larger memory and core counts of modern GPUs that are rapidly becoming available. Furthermore, by modern deep learning standards, the proposed model, with on the order of 18 M parameters, is small compared to more recent models such as vision transformers, which have on the order of 632 M parameters [49].

6 Conclusion

In this study, we introduced a U-Net architecture based on recurrent residual convolution blocks and dense skip connections for medical image segmentation. The proposed architecture extracts the features that best represent the "what" and "where" information, which is reflected in the performance of the model. The improvement in segmentation performance can be attributed to: (1) the use of recurrent residual units instead of plain convolutions, which enables the network to extract low-level features precisely without running into the degradation problem; (2) the dense skip pathways, which reduce the semantic gap between encoder and decoder so that more semantically similar feature maps are concatenated, resulting in improved performance; and (3) deep supervision, which enables the correct segmentation of multiscale foreground objects. The performance of R2U++ is evaluated on four distinct medical imaging modalities: electron microscopy (EM), X-ray, fundus, and computed tomography (CT). The average gain achieved over U-Net++ is 1.5 ± 0.37% in IoU score and 0.9 ± 0.33% in dice score, whereas the gain over R2U-Net is 4.21 ± 2.72 in IoU and 3.47 ± 1.89 in dice score across these medical image segmentation datasets. Our future work will focus on exploring the use of dense skip connections in deep generative models, particularly generative adversarial networks, for medical image segmentation.