Ψnet: a parallel network with deeply coupled spatial and squeezed features for segmentation of medical images

Elmeslimany, Eman M.; Kishk, Sherif S.; Altantawy, Doaa A.

doi:10.1007/s11042-023-16416-4

Ψnet: a parallel network with deeply coupled spatial and squeezed features for segmentation of medical images

Open access
Published: 08 August 2023

Volume 83, pages 24045–24082, (2024)
Cite this article

Download PDF

You have full access to this open access article

Multimedia Tools and Applications Aims and scope Submit manuscript

Ψnet: a parallel network with deeply coupled spatial and squeezed features for segmentation of medical images

Download PDF

Eman M. Elmeslimany¹,
Sherif S. Kishk¹ &
Doaa A. Altantawy ORCID: orcid.org/0000-0001-6476-2934¹

1884 Accesses
3 Citations
Explore all metrics

Abstract

The process of delineating a region of interest or an object in an image is called image segmentation. Efficient medical image segmentation can contribute to the early diagnosis of illnesses, and accordingly, patient survival possibilities can be enhanced. Recently, deep semantic segmentation methods demonstrate state-of-the-art (SOTA) performance. In this paper, we propose a generic novel deep medical segmentation framework, denoted as Ψnet. This model introduces a novel parallel encoder-decoder structure that draws up the power of triple U-Nets. In addition, a multi-stage squeezed-based encoder is employed to raise the network sensitivity to relevant features and suppress the unnecessary ones. Moreover, atrous spatial pyramid pooling (ASPP) is employed in the bottleneck of the network which helps in gathering more effective features during the training process, hence better performance can be achieved in segmentation tasks. We have evaluated the proposed Ψnet on a variety of challengeable segmentation tasks, including colonoscopy, microscopy, and dermoscopy images. The employed datasets include Data Science Bowl (DSB) 2018 challenge as a cell nuclei segmentation from microscopy images, International Skin Imaging Collaboration (ISIC) 2017 and 2018 as skin lesion segmentation from dermoscopy images, Kvasir-SEG, CVC-ClinicDB, ETIS-LaribDB, and CVC-ColonDB as polyp segmentation from colonoscopy images. Despite the variety in the employed datasets, the proposed model, with extensive experiments, demonstrates superior performance to advanced SOTA models, such as U-Net, ResUNet, Recurrent Residual U-Net, ResUNet++, UNet++, BCDU-Net, MultiResUNet, MCGU-Net, FRCU-Net, Attention Deeplabv3p, DDANet, ColonSegNet, and TMD-Unet.

FF-UNet: a U-Shaped Deep Convolutional Neural Network for Multimodal Biomedical Image Segmentation

Article 27 June 2022

Collaborative region-boundary interaction network for medical image segmentation

Article 07 September 2023

CMM-Net: Contextual multi-scale multi-level network for efficient biomedical image segmentation

Article Open access 13 May 2021

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In clinical diagnosis, the segmentation of medical images is a critical and essential procedure for later medical image analysis. Hence, several automatic segmentation techniques have been proposed to help radiologists with early manifestations of life-threatening diseases [26, 46]. These automatic segmentation techniques are roughly categorized into learning-based techniques [41, 53] and classical image processing-based techniques [48, 61]. However, lesions in medical images can vary in size, shape, location, color, texture, and contrast. Hence, the development of accurate and robust segmentation solutions is still a very challenging problem due to several complexities. On the other side, the conventional manual annotation of medical images is a costly time-consuming procedure. Moreover, there is a shortage of specific annotation protocols that suit different types of imaging modalities. Furthermore, low-quality images can potentially influence annotation quality. As a result, employing a computer-aided segmentation model can be an alternate efficient solution to manual image segmentation.

With their remarkable outstanding feature representation capability, convolutional networks have revolutionized different fields, including the computer vision field [5, 47], the industrial field [28], and the monitoring field [27, 29]. Recently, segmentation algorithms based on convolutional neural networks (CNNs) have demonstrated SOTA performance for automated biomedical image segmentation [23, 45, 62]. Most of these algorithms have been encoder-decoder-based networks which have shown prominence for many medical segmentation tasks [40, 43].

Deep encoder-decoder-based CNN has demonstrated high segmentation efficiency due to its skip connections, which permit semantic dense feature maps to propagate from the encoder network to the decoder sub-networks. FCN [42] is one of the earliest deep networks proposed for semantic segmentation that is trained end-to-end for pixel-wise prediction. In [42], the authors have proved that FCNs can significantly enhance accuracy by transferring pre-trained classifier weights, fusing various layers, and learning end-to-end, and pixels-to-pixels on whole images. For the process of transferring weights, they adopted contemporary classification networks, i.e., Alex-Net, VGG-Net, and Google-Net, and transferred their learned representations via fine-tuning to the segmentation task. Then, to produce detailed and accurate segmentations, they developed a skip architecture that combined semantic information from a deep coarse layer with appearance information from a shallow fine layer.

Later, FCN [42] was extended to the most common segmentation network, i.e., U-Net. U-Net [49] is a pixel-wise encoder-decoder architecture that has been trained in an end-to-end way. It has achieved good segmentation performance. It is commonly used for lesion segmentation, anatomical segmentation, and classification in the medical image analysis sector. The main benefit of the U-Net network is that it cannot only precisely segment the targeted object and objectively process and analyze medical images, but it also can aid to improve the accuracy of medical image diagnosis. In addition, with a few training samples, U-Net can perform effectively while still capable of employing global location and context information simultaneously. Moreover, U-Net architecture outperforms FCN in different challengeable segmentation tasks and gradually becomes the pioneering model in the field of medical image segmentation. However, due to the existence of many layers in the conventional U-Net version, a significant amount of time is needed for training. In addition, relatively high GPU memory for larger images is a necessity. Moreover, employing skip connections is a double-edged sword. Skipping allows a fewer layer-based network which reduces the complexity. In addition, it decays the influence of vanishing gradients effectively. Moreover, a speedy learning process can be achieved. However, on the other side, a semantic gap between low-and high-level features could occur, and some features may be lost between skip connections. Hence, different amendments have been made to the original U-Net architecture to support its weaknesses and add to its strengths [6, 7, 9, 14, 32, 34, 35, 56, 63, 65, 67,68,69].

In this paper, the main contributions can be summed up as follows.

1
We establish an effective novel framework for medical image segmentation, dubbed as Ψnet, which is a squeezed parallel multi-stage encoder-decoder network. Figure 1 indicates a summarized graphical abstract of the proposed segmentation network.
2
Due to the adopted parallel mechanism, the atrous spatial pyramid pooling (ASPP), and the squeeze-and-excitation behavior in the introduced encoder, semantically significant features are extracted to enhance segmentation performance. The adopted squeeze-and-excitation module boosts the weights of the most essential features. In addition, it improves the representational power of the proposed segmentation network by enabling dynamic channel-wise feature recalibration. Moreover, the parallel scheme helps to draw up the power of triple U-Nets.
3
In practice, multi-scale feature extraction is computationally costly and demands a lot of training data, that is not usually available. However, the proposed parallel Triple scheme with multi-stage encoder-decoder U-Net architecture extracts significant essential features which improve the training efficiency. The proposed segmentation model provides a lightweight and less complex network with a total number of parameters of around 33 M compared to the FRCU-Net of 68 M [9]. The larger the number of parameters, the longer the time needed for convergence.
4
With the multi-scale feature extraction property of the proposed Ψnet, efficient segmentation results of small datasets, like ColonDB and ETIS-Larib obtained in terms of dice coefficient and Jaccard index, while the traditional U-Net can’t perform effectively with small datasets.
5
To demonstrate the generalizability of our model, we have evaluated the proposed Ψnet via a variety of medical image segmentation tasks, such as Kvasir-SEG [36], CVC-ClinicDB [11], CVC-ColonDB [54], ETIS-Larib [50], 2018 Data Science Bowl (DSB) [12], ISIC-2017 [18], and ISIC-2018 [57]. Superior performance is achieved compared to most SOTA models. Figure 2 indicates some visual results of the proposed Ψnet on the employed challenging datasets.
6
In our experiments, we performed two main types of evaluations. The first one is traditional testing where the training and testing procedures are performed on the same single dataset. The other one is cross-testing in which the model is trained with a specific dataset and tested with another one. In both evaluations, the proposed model shows effectiveness compared to most of the SOTA models.

Our paper is organized in the following manner. Section 2 provides an overview of relevant work in medical image segmentation. The proposed Ψnet architecture is presented in detail in Section 3. In the same section, a background of the traditional U-Net is introduced. In Section 4, the employed datasets and metrics are indicated. In addition, the performed experiments are described, and their results are discussed quantitatively and qualitatively. Finally, we conclude our work in Section 5.

2 Related work

Traditional handcrafted features have been used in semantic segmentation before the uprising of deep learning in computer vision. In the last few years, a variety of deep learning-based approaches have developed rapidly and have achieved outstanding results in image segmentation. The main hurdle of deep architectures is their severe hunger for labeled training data. In addition, due to the limitations of manual annotation, providing large-annotated datasets in medical image segmentation is a challenging task [58].

A convolutional neural network (CNN or ConvNet) is a class of artificial neural network (ANN), which is mostly used to analyze visual images [4, 7, 9, 14, 27,28,29, 34, 49, 69]. There are several problems with employing such CNNs, such as losing the image spatial information when the convolutional features are fed into fully connected (FC) layers. In addition, training a CNN-based model may include different problems, such as exploding gradient, overfitting, and class imbalance. These challenges can diminish the model’s performance. To overcome these problems, FCN architecture was proposed in [42].

Ronneberger et al. [49] modified the conventional FCN, proposed by Long et al. [42], by propagating contextual information from the encoder to the decoder. This was done by connecting the encoder and the decoder networks through skip connections that created a U-shaped architecture. This U-shaped architecture becomes later a major innovation of FCN and was named “U-Net”. After that, Zhang et al. [65] presented a deep Residual U-Net (ResUNet) that incorporates the strength of both U-Net and residual neural network. Compared to the original U-Net, ResUNet utilized a better CNN backbone that extracted information at multiple scales which caused better performance.

For medical image segmentation, Chen et al. [14] introduced Dense-Res-Inception Net (DRINET) with a better performance when it is compared to FCN, U-Net, and ResUNet. However, employing a dense-inception block increases the growth rate, which may result in too many parameters, making the model more complex and duller to train. Zhou et al. [67] proposed UNet++ for the task of semantic and instance segmentation. The performance of their proposed network was enhanced by restructuring skip connections and developing a pruning strategy for their architecture. They solved the issue of losing edge information and small objects by down-sampling functions. They tested their model on a variety of medical image segmentation tasks.

Additionally, Jha et al. presented (ResUNet++) [34], which is an advanced form of the basic ResUNet. They employed residual blocks, besides integrating additional layers to their network, including squeeze-and-excitation blocks [31], attention blocks, and ASPP [16]. Compared to ResUNet and U-Net, ResUNet++ achieved higher scores in DSC, IoU, and recall. Reza Azad et al. [7] proposed another modification to the conventional U-Net, denoted as Bi-directional ConvLSTM U-Net (BCDU-Net). Besides the full advantages of U-Net, the performance was improved by capturing more discriminative data utilizing bi-directional ConvLSTM and dense convolutions.

Because of the main problem of skip connection, i.e., the issue of the great semantic gap between high- and low-resolution features which results in fuzzy feature maps, Ibtehaz et al. [32] introduced a model to enhance skip connection, titled as MultiResUNet. Their architecture modified the traditional U-Net with Residual Path (ResPath) wherein encoded features execute extra convolution operations before combining the process with equivalent features in the decoder. Asadi-Aghbolaghi et al. [6] presented another U-Net extension for medical image segmentation, named Multi-level Context Gating U-Net (MCGU-Net). In their architecture, they inserted a squeeze-and-excitation (SE) module in the decoder, besides employing BConvLSTM. They utilized a dense convolutions mechanism for extracting richer discriminative features, which led to more fine segmentation maps. Jha et al. [35] proposed the famous DoubleU-Net, which is a blend of two U-Net networks placed on top of one another. On all used datasets, DoubleU-Net has outperformed different baselines and the traditional U-Net. The evaluation was performed on a variety of medical image segmentation tasks.

Zunair et al. [69] introduced Sharp U-Net as an effective depthwise encoder-decoder fully convolutional network for biomedical image segmentation. In their model, they exclude employing skip connection and instead, they utilized sharpening kernel filter in the encoder path. A depthwise convolution of the encoder feature map with a sharpening kernel filter was performed before merging the encoder and decoder features which produced a sharpened intermediate feature map of the same size as the encoder map. A variety of experiments were performed on six datasets with efficient performance. Azad et al. [9] introduced a new extension to the traditional U-Net, titled a frequency re-calibration U-Net (FRCU-Net). In the skip connection, they employed multi-level BConvLSTM. In addition, they employed SE modules in the decoding path and used densely connected convolutional modules. Moreover, they introduced a frequency-level attention mechanism that utilized a weighted combination of multiple kinds of frequency information to manage and assemble the representation space. However, both FRCU-Net and MCGU-Net need much longer time for convergence because they have a greater number of training parameters. Tran et al. [56] introduced a structural network, titled TMD-Unet. Their model had three major contributions compared to the traditional U-Net. Firstly, they employed three sub-Unet models. Secondly, they utilized dilated convolution (DC) rather than normal convolution. Thirdly, rather than a standard skip connection, they adopted a dense skip connection. A variety of experiments have been performed in this work on different datasets.

3 The proposed Ψnet architecture

The main objective of a supervised network is to learn how to predict the targeted output y from a given input image x, i.e., mapping the input image into the labeled target (P : x → y) , where P is the employed network. The network can learn to extract texture and contextual similarity between the same labeled pixels and the difference between differently labeled neighboring pixels, thereby realistic segmentation can be produced.

Deep learning-based models can provide quick diagnosis and accordingly can support specialists throughout their treatments. Medical image segmentation tasks have mostly adopted U-Net [49] and its related segmentation models [6, 7, 9, 32, 34, 35, 56, 68] to gather both high- and low-level details. U-Net [49], shown in Fig. 3, is a contracting-expanding (encoder-decoder) model which is originally made of a stack of transformers acting as encoder and decoder linked via skip connections. The conventional U-Net is made up of the same number of down-, up-sampling, and convolution layers. Additionally, U-Net connects each pair of down- and up-sampling layers using skip connection operation allowing the spatial information to be straightly transferred to much deeper layers. Hence, highly precise segmentation results can be generated.

Here, we introduce Ψnet as a new deep learning-based segmentation framework. The main goal of the proposed model is to use fewer parameters and, at the same time, maintain high accuracy on a variety of medical image segmentation tasks. The overall view of the proposed Ψnet system is indicated in Fig. 1. The proposed architecture is based on an end-to-end deep learning approach comprised of three U-Net structures. These three U-Nets are connected parallel to each other in which three single U-Nets are fed with input image simultaneously, see Fig. 4. Employing multiple U-Nets helps in capturing more contextual and semantic features efficiently. The proposed segmentation model depends mainly on three parts, i.e., encoder-decoder backbone with squeeze-and-excitation (SE) block, atrous spatial pyramid pooling (ASPP), as well as output module, in a parallel scheme. The proposed segmentation framework has fewer trainable parameters (~33 M) in comparison to FRCU-Net [9] (~68 M), and FCN-8 s [2] (~134 M), which makes it more suitable for real-time performance. In the following subsections, details about the basic building blocks of the proposed segmentation model are demonstrated.

3.1 Encoders

Every encoder tries to encode the input data into representative features at various levels. The encoder increases the channels while decreasing the spatial dimensions in each layer. Every encoder in the model receives an input image and its ground truth mask as inputs. A pretrained VGG-19 [51], that has previously been learned on ImageNet features [20], is used in the first encoder of NET 1, while the second and third encoder are built from scratch in NET 2 and NET 3, respectively. The substantial reasons and advantages of adopting VGG-19 are as follows. (1) In comparison to other pre-trained models, it is a lightweight model. (2) VGG-19 and U-Net have similar architecture, which simplifies the integration between them. (3) It provides a suitable deeper network that guarantees a more accurate segmentation mask. VGG-19 has acquired robust feature representations for a diverse set of images.

For encoder 2 and encoder 3 in NET 2 and NET 3, respectively, each one contains four sub-encoder blocks connected serially. Every sub-encoder block includes two main sub-blocks: convolution block and max pooling. The convolution block performs two 3 × 3 convolution operations, batch normalization, Rectified Linear Unit as an activation function (ReLU), and finally squeeze-excite process, see Fig. 5. For the group of sub-encoders, we employed filter sizes of {32, 64, 128, 256}. Batch normalization speeds up convergence, decreases internal covariant shift, and regularizes the model. The model non-linearity is represented by a Rectified Linear Unit (ReLU) activation function. The employed squeeze-and-excitation (SE) module promotes feature map quality by increasing their sensitivity towards the main significant features. Finally, to minimize the spatial dimension of the feature maps in the sub-encoder, a max pooling with a 2 × 2 window and stride 2 is performed.

3.2 Squeeze-and-excitation block (SE)

SE is a representational unit employed to raise the network sensitivity to relevant features and suppress unnecessary ones. SE consists of a global 2D average pooling, two dense blocks, and an element-wise multiplication connected serially. The main target of this block is to weigh every feature map to enhance the representational power of salient features. This objective is accomplished in two phases. The squeezing process is the first phase. It is a global information embedding procedure where every channel is squeezed by utilizing global average pooling to generate channel-wise statistics. Excitation is the second phase in which adaptive recalibration is performed to fully capture channel-wise dependencies. SE block enhances the performance of the network with slight additional computational complexity. It exists inside the convolution block in the proposed network, see Fig. 5. Specifically, we added SE blocks into the intermediate stages inside the convolution block employed in encoder 2, encoder 3, decoder 1, decoder 2, and decoder 3. SE methodology, revise Fig. 6, can be summarized as follows. Firstly, it receives its input from the previous Conv2D layers. Secondly, average pooling is employed to squeeze every channel into a single numerical value. Thirdly, a dense layer followed up by a ReLU provides nonlinearity, while reducing the channel complexity with a ratio r. Then, to provide a smooth gating function to each channel, another dense layer is employed with sigmoid activation. Lastly, every feature map in the convolutional block is weighted according to the “excitation” network. The computational cost of SE modules in the model can be changed by adjusting the reduction ratio (r) as a hyperparameter.

3.3 Atrous spatial pyramidal pooling (ASPP)

ASPP is a resampling module that employs atrous convolutions with different sampling rates to extract multi-scale features [16]. This is performed by applying multiple filters with different fields of view to the targeted image. Hence, objects and valuable visual context could be captured at different scales. The notion of ASPP is derived from spatial pyramidal pooling [30], which is effective in resampling features at different scales. The ASPP mainly contains two components: atrous convolution and spatial pyramid pooling (SPP). In ASPP, the authors suggested using atrous convolution as an alternative to pooling operation to avoid salient information loss generated by the latter. The atrous convolution can effectively enlarge the field-of-view, i.e., receptive fields of filters, without adding more parameters and can control the resolution of features to extract high-level semantic data. See Fig. 7 for the difference between the regular convolution and the atrous one. Incorporating the advantages of atrous convolution with SPP is proposed by Chen et al. [13] as atrous spatial pyramid pooling (ASPP) module to further boost segmentation performance. ASPP demonstrates high recognition capability on similar objects at multiple levels, which results in a significant accuracy improvement. Hence, ASPP has become a common option in deep segmentation architectures [38], see Fig. 8 for the internal structure of ASPP module. The ASPP block acts as a bridge between the encoder and the decoder in each network branch because it is located in the middle of each branch.

3.4 Decoders

As shown in Fig. 4, we use three decoders in the proposed model corresponding to their three encoders. Each decoder in the proposed model contains four sub-decoder blocks. Each sub-decoder block doubles the input feature maps dimension by performing a 2 × 2 bi-linear up-sampling. Then, the group of sub-decoders in NET 1 concatenate the feature maps from the encoder of the same branch through the skip connections and the ASPP block, while the group of sub-decoders in NET 2 and NET 3 concatenate the feature maps from the encoder of the same and the previous branch through the skip connections and the ASPP block. Later, the concatenated maps are applied to the convolution block with filter sizes {256, 128, 64, 32}. Figure 9 represents the internal structure of only one sub-decoder block. All sub-decoders have the same internal layers. The decoders have a special feeding mechanism through skip connections, revise the dashed line in Fig. 4. Skip connections facilitate the gradient flow which leads to easier training and enhances the overall performance of the network. In addition, they help to recover the spatial data wasted due to pooling processes by extracting richer features.

3.5 Output blocks

Finally, as shown in Fig. 4, we use three identical output blocks connected after each decoder block. Each output block applies a convolution layer followed by a sigmoid function. Then, an element-wise addition is performed to the resultant three feature maps. Finally, a 1 × 1 convolution layer accompanied by a sigmoid activation function is applied to obtain the final segmentation mask.

4 Experimental results and discussion

Experimentally, the proposed segmentation network, i.e., Ψnet, is tested on seven datasets in total, which are Kvasir-SEG [36], CVC-ClinicDB [11], CVC-ColonDB [54], and ETIS-LaribDB [50] for polyp segmentation, 2018 Data Science Bowl challenge [12] for cell nuclei segmentation, ISIC-2017 and ISIC-2018 [18, 19, 57] for skin lesion segmentation. For details about the employed datasets, check Table 1. In addition, Fig. 10 indicates visual samples from the employed datasets.

Table 1 Details about the employed datasets and their characteristics

Ψnet: a parallel network with deeply coupled spatial and squeezed features for segmentation of medical images

Abstract

Similar content being viewed by others

FF-UNet: a U-Shaped Deep Convolutional Neural Network for Multimodal Biomedical Image Segmentation

Collaborative region-boundary interaction network for medical image segmentation

CMM-Net: Contextual multi-scale multi-level network for efficient biomedical image segmentation

1 Introduction

2 Related work

3 The proposed Ψnet architecture

3.1 Encoders

3.2 Squeeze-and-excitation block (SE)

3.3 Atrous spatial pyramidal pooling (ASPP)

3.4 Decoders

3.5 Output blocks

4 Experimental results and discussion

4.1 Testing the proposed Ψnet on each single/individual dataset

4.1.1 Skin lesions segmentation

A. ISIC-2017 Challenge

B. ISIC-2018 Challenge

4.1.2 Polyps segmentation

A. Kvasir-SEG Challenge

B. CVC-ClinicDB

C. CVC-ColonDB

D. ETIS-Larib Dataset

4.1.3 Nuclei segmentation

4.2 Cross-testing the employed four colonoscopy imaging datasets

4.2.1 CVC-ClinicDB-based cross-evaluation

4.2.2 Kvasir-SEG-based cross-evaluation

4.2.3 CVC-ColonDB-based cross-evaluation

4.2.4 ETIS-Larib-based cross-evaluation

4.3 Ablation study

Hyper-parameters

Loss function

The impact of pretrained network

5 Conclusion

Model limitations and future work

Data availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation