1 Introduction

In clinical diagnosis, the segmentation of medical images is a critical and essential procedure for subsequent medical image analysis. Hence, several automatic segmentation techniques have been proposed to help radiologists detect early manifestations of life-threatening diseases [26, 46]. These techniques are roughly categorized into learning-based techniques [41, 53] and classical image processing-based techniques [48, 61]. However, lesions in medical images can vary in size, shape, location, color, texture, and contrast, so developing accurate and robust segmentation solutions remains a very challenging problem. On the other hand, conventional manual annotation of medical images is a costly, time-consuming procedure. Moreover, there is a shortage of annotation protocols tailored to the different imaging modalities, and low-quality images can degrade annotation quality. As a result, a computer-aided segmentation model can be an efficient alternative to manual image segmentation.

With their outstanding feature representation capability, convolutional networks have revolutionized several fields, including computer vision [5, 47], industry [28], and monitoring [27, 29]. Recently, segmentation algorithms based on convolutional neural networks (CNNs) have demonstrated state-of-the-art (SOTA) performance in automated biomedical image segmentation [23, 45, 62]. Most of these algorithms are encoder-decoder networks, which have shown prominence in many medical segmentation tasks [40, 43].

Deep encoder-decoder CNNs have demonstrated high segmentation efficiency thanks to their skip connections, which allow semantically dense feature maps to propagate from the encoder network to the decoder sub-networks. FCN [42] is one of the earliest deep networks proposed for semantic segmentation; it is trained end-to-end for pixel-wise prediction. In [42], the authors showed that FCNs can significantly enhance accuracy by transferring pre-trained classifier weights, fusing various layers, and learning end-to-end, pixels-to-pixels, on whole images. To transfer weights, they adopted contemporary classification networks, i.e., AlexNet, VGG-Net, and GoogLeNet, and transferred their learned representations to the segmentation task via fine-tuning. Then, to produce detailed and accurate segmentations, they developed a skip architecture that combined semantic information from a deep, coarse layer with appearance information from a shallow, fine layer.

Later, FCN [42] was extended into the most common segmentation network, U-Net. U-Net [49] is a pixel-wise encoder-decoder architecture trained in an end-to-end fashion, and it has achieved good segmentation performance. It is commonly used for lesion segmentation, anatomical segmentation, and classification in medical image analysis. The main benefit of U-Net is that it can not only precisely segment the targeted object and objectively process and analyze medical images, but it can also help improve the accuracy of medical image diagnosis. In addition, U-Net performs effectively with few training samples while still exploiting global location and context information simultaneously. Moreover, the U-Net architecture outperforms FCN in various challenging segmentation tasks and has gradually become the pioneering model in medical image segmentation. However, due to the many layers in the conventional U-Net, a significant amount of training time is needed, and relatively high GPU memory is required for larger images. Moreover, skip connections are a double-edged sword. On the one hand, they allow a network with fewer layers, which reduces complexity, effectively mitigate the vanishing gradient problem, and speed up learning. On the other hand, a semantic gap between low- and high-level features can occur, and some features may be lost across skip connections. Hence, different amendments have been made to the original U-Net architecture to address its weaknesses and build on its strengths [6, 7, 9, 14, 32, 34, 35, 56, 63, 65, 67, 68, 69].

In this paper, the main contributions can be summed up as follows.

1. We establish an effective novel framework for medical image segmentation, dubbed Ψnet, which is a squeezed parallel multi-stage encoder-decoder network. Figure 1 shows a graphical abstract of the proposed segmentation network.

2. Due to the adopted parallel mechanism, the atrous spatial pyramid pooling (ASPP), and the squeeze-and-excitation behavior in the introduced encoder, semantically significant features are extracted to enhance segmentation performance. The adopted squeeze-and-excitation module boosts the weights of the most essential features and improves the representational power of the proposed network by enabling dynamic channel-wise feature recalibration. Moreover, the parallel scheme draws on the combined power of three U-Nets.

3. In practice, multi-scale feature extraction is computationally costly and demands a lot of training data, which is not usually available. However, the proposed parallel triple scheme with a multi-stage encoder-decoder U-Net architecture extracts significant features that improve training efficiency. The proposed model is a lightweight, less complex network with around 33 M parameters, compared to 68 M for FRCU-Net [9]; the larger the number of parameters, the longer the time needed for convergence.

4. With the multi-scale feature extraction property of the proposed Ψnet, efficient segmentation results, in terms of dice coefficient and Jaccard index, are obtained on small datasets, such as ColonDB and ETIS-Larib, whereas the traditional U-Net cannot perform effectively on small datasets.

5. To demonstrate the generalizability of our model, we have evaluated the proposed Ψnet on a variety of medical image segmentation benchmarks, i.e., Kvasir-SEG [36], CVC-ClinicDB [11], CVC-ColonDB [54], ETIS-Larib [50], the 2018 Data Science Bowl (DSB) [12], ISIC-2017 [18], and ISIC-2018 [57]. Superior performance is achieved compared to most SOTA models. Figure 2 shows visual results of the proposed Ψnet on these challenging datasets.

6. In our experiments, we performed two main types of evaluation. The first is traditional testing, where training and testing are performed on the same dataset. The other is cross-testing, in which the model is trained on one dataset and tested on another. In both evaluations, the proposed model is effective compared to most SOTA models.

Fig. 1 Overall view of the proposed Ψnet architecture

Fig. 2 Visual results of the proposed Ψnet on the employed datasets

Our paper is organized as follows. Section 2 provides an overview of related work in medical image segmentation. Section 3 presents the proposed Ψnet architecture in detail, along with background on the traditional U-Net. Section 4 describes the employed datasets and metrics, the performed experiments, and their quantitative and qualitative results. Finally, Section 5 concludes our work.

2 Related work

Before the rise of deep learning in computer vision, semantic segmentation relied on traditional handcrafted features. In the last few years, a variety of deep learning-based approaches have developed rapidly and achieved outstanding results in image segmentation. The main hurdle of deep architectures is their heavy demand for labeled training data. In addition, due to the limitations of manual annotation, providing large annotated datasets for medical image segmentation is challenging [58].

A convolutional neural network (CNN or ConvNet) is a class of artificial neural network (ANN) mostly used to analyze visual imagery [4, 7, 9, 14, 27, 28, 29, 34, 49, 69]. Employing such CNNs raises several problems, such as the loss of spatial information when convolutional features are fed into fully connected (FC) layers. In addition, training a CNN-based model can suffer from exploding gradients, overfitting, and class imbalance. These challenges can diminish the model's performance. To overcome these problems, the FCN architecture was proposed in [42].

Ronneberger et al. [49] modified the conventional FCN of Long et al. [42] by propagating contextual information from the encoder to the decoder. This was done by connecting the encoder and decoder networks through skip connections, creating a U-shaped architecture. This U-shaped architecture later became a major innovation over FCN and was named "U-Net". After that, Zhang et al. [65] presented the deep Residual U-Net (ResUNet), which incorporates the strengths of both U-Net and residual neural networks. Compared to the original U-Net, ResUNet uses a stronger CNN backbone that extracts information at multiple scales, yielding better performance.

For medical image segmentation, Chen et al. [14] introduced the Dense-Res-Inception Net (DRINET), which performs better than FCN, U-Net, and ResUNet. However, employing a dense-inception block increases the growth rate, which may result in too many parameters, making the model more complex and slower to train. Zhou et al. [67] proposed UNet++ for semantic and instance segmentation. They enhanced performance by restructuring the skip connections and developing a pruning strategy for their architecture, and they addressed the loss of edge information and small objects caused by down-sampling operations. They tested their model on a variety of medical image segmentation tasks.

Additionally, Jha et al. [34] presented ResUNet++, an advanced form of the basic ResUNet. They employed residual blocks and integrated additional layers into their network, including squeeze-and-excitation blocks [31], attention blocks, and ASPP [16]. Compared to ResUNet and U-Net, ResUNet++ achieved higher DSC, IoU, and recall scores. Azad et al. [7] proposed another modification of the conventional U-Net, denoted Bi-directional ConvLSTM U-Net (BCDU-Net). Beyond the full advantages of U-Net, performance was improved by capturing more discriminative information using bi-directional ConvLSTM and dense convolutions.

To address the main problem of skip connections, i.e., the large semantic gap between high- and low-resolution features that results in fuzzy feature maps, Ibtehaz et al. [32] introduced MultiResUNet, a model that enhances skip connections. Their architecture modifies the traditional U-Net with a Residual Path (ResPath), wherein encoded features undergo extra convolution operations before being combined with the corresponding decoder features. Asadi-Aghbolaghi et al. [6] presented another U-Net extension for medical image segmentation, named Multi-level Context Gating U-Net (MCGU-Net). In their architecture, they inserted a squeeze-and-excitation (SE) module in the decoder, besides employing BConvLSTM. They utilized dense convolutions to extract richer discriminative features, leading to finer segmentation maps. Jha et al. [35] proposed the well-known DoubleU-Net, a blend of two U-Net networks stacked on top of one another. On all used datasets, DoubleU-Net outperformed different baselines and the traditional U-Net. The evaluation was performed on a variety of medical image segmentation tasks.

Zunair et al. [69] introduced Sharp U-Net, an effective depthwise encoder-decoder fully convolutional network for biomedical image segmentation. In their model, instead of plain skip connections, they employed a sharpening kernel filter along the encoder path: a depthwise convolution of the encoder feature map with the sharpening kernel is performed before merging the encoder and decoder features, producing a sharpened intermediate feature map of the same size as the encoder map. A variety of experiments were performed on six datasets with efficient performance. Azad et al. [9] introduced another extension of the traditional U-Net, the frequency re-calibration U-Net (FRCU-Net). In the skip connections, they employed multi-level BConvLSTM. In addition, they employed SE modules in the decoding path and used densely connected convolutional modules. Moreover, they introduced a frequency-level attention mechanism that uses a weighted combination of multiple kinds of frequency information to manage and assemble the representation space. However, both FRCU-Net and MCGU-Net need much longer to converge because they have more training parameters. Tran et al. [56] introduced a structural network, titled TMD-Unet, with three major contributions compared to the traditional U-Net: they employed three sub-Unet models, utilized dilated convolution (DC) rather than normal convolution, and adopted dense skip connections rather than standard ones. A variety of experiments were performed on different datasets.

3 The proposed Ψnet architecture

The main objective of a supervised network is to learn how to predict the targeted output y from a given input image x, i.e., to map the input image to the labeled target (P : x → y), where P is the employed network. The network can learn to extract texture and contextual similarity between pixels sharing the same label and the differences between differently labeled neighboring pixels, thereby producing realistic segmentations.

Deep learning-based models can provide quick diagnosis and accordingly can support specialists throughout treatment. Medical image segmentation tasks have mostly adopted U-Net [49] and its related segmentation models [6, 7, 9, 32, 34, 35, 56, 68] to gather both high- and low-level details. U-Net [49], shown in Fig. 3, is a contracting-expanding (encoder-decoder) model originally made of stacks of convolutional layers acting as encoder and decoder linked via skip connections. The conventional U-Net contains the same number of down-sampling, up-sampling, and convolution layers. Additionally, U-Net connects each pair of down- and up-sampling layers via a skip connection, allowing spatial information to be transferred directly to much deeper layers. Hence, highly precise segmentation results can be generated.

Fig. 3 The schematic architecture of U-Net

Here, we introduce Ψnet as a new deep learning-based segmentation framework. The main goal of the proposed model is to use fewer parameters while maintaining high accuracy on a variety of medical image segmentation tasks. The overall view of the proposed Ψnet is given in Fig. 1. The architecture is an end-to-end deep learning approach comprising three U-Net structures connected in parallel, with all three fed the input image simultaneously, see Fig. 4. Employing multiple U-Nets helps capture more contextual and semantic features efficiently. The proposed model depends mainly on three parts in a parallel scheme: an encoder-decoder backbone with squeeze-and-excitation (SE) blocks, atrous spatial pyramid pooling (ASPP), and an output module. The proposed framework has fewer trainable parameters (~33 M) than FRCU-Net [9] (~68 M) and FCN-8s [2] (~134 M), which makes it more suitable for real-time use. The following subsections detail the basic building blocks of the proposed segmentation model; a high-level structural sketch of the parallel wiring is given below.
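To make the parallel wiring concrete, the following minimal Keras sketch assembles three branches fed by the same input and fuses their outputs. The function names are ours, and `build_branch` is only a stand-in for the full encoder, ASPP bridge, and decoder of each branch described in Sections 3.1-3.4; this is a structural sketch, not the full model.

```python
from tensorflow.keras import layers, Model

def build_psi_net(input_shape=(256, 256, 3)):
    inputs = layers.Input(shape=input_shape)

    def build_branch(x):
        # Stand-in for one full U-Net branch (encoder -> ASPP -> decoder)
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
        return layers.Conv2D(1, 1, activation="sigmoid")(x)

    # The same input image feeds all three branches simultaneously
    m1, m2, m3 = (build_branch(inputs) for _ in range(3))
    fused = layers.Add()([m1, m2, m3])                # element-wise fusion
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(fused)
    return Model(inputs, outputs)
```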

Fig. 4 Detailed schematic structure of the proposed Ψnet. The dashed lines denote the skip connections appended from encoders to decoders

3.1 Encoders

Every encoder encodes the input data into representative features at various levels, increasing the number of channels while decreasing the spatial dimensions at each layer. Every encoder in the model receives the input image, with its ground truth mask serving as the training target. A VGG-19 [51] pretrained on ImageNet [20] is used as the first encoder in NET 1, while the second and third encoders, in NET 2 and NET 3 respectively, are built from scratch. The substantial reasons for adopting VGG-19 are as follows. (1) Compared to other pre-trained models, it is lightweight. (2) VGG-19 and U-Net have similar architectures, which simplifies their integration. (3) It provides a suitably deep network that supports a more accurate segmentation mask. VGG-19 has acquired robust feature representations for a diverse set of images. A sketch of tapping VGG-19 features for the skip connections follows.
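A minimal sketch of using the pretrained VGG-19 as encoder 1. The choice of skip layers is our assumption, since the text does not list which feature maps are tapped; the layer names are the standard ones in the Keras VGG-19 implementation.

```python
from tensorflow.keras.applications import VGG19

def vgg19_encoder(input_tensor):
    # VGG-19 backbone with ImageNet weights, classification head removed
    backbone = VGG19(include_top=False, weights="imagenet",
                     input_tensor=input_tensor)
    # Feature maps tapped at successive resolutions for the skip
    # connections (assumed layers, one per down-sampling stage)
    skip_names = ["block1_conv2", "block2_conv2",
                  "block3_conv4", "block4_conv4"]
    skips = [backbone.get_layer(n).output for n in skip_names]
    return backbone.output, skips
```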

Encoder 2 and encoder 3, in NET 2 and NET 3 respectively, each contain four sub-encoder blocks connected serially. Every sub-encoder block includes two main sub-blocks: a convolution block and a max pooling operation. The convolution block performs two 3 × 3 convolutions, each followed by batch normalization and a rectified linear unit (ReLU) activation, and ends with a squeeze-and-excitation step, see Fig. 5. Across the four sub-encoders, we employed filter sizes of {32, 64, 128, 256}. Batch normalization speeds up convergence, reduces internal covariate shift, and regularizes the model, while ReLU supplies the model's non-linearity. The employed squeeze-and-excitation (SE) module promotes feature map quality by increasing sensitivity to the most significant features. Finally, to reduce the spatial dimensions of the feature maps, each sub-encoder ends with 2 × 2 max pooling with stride 2. A sketch of this block follows.
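A minimal Keras sketch of one sub-encoder block, reusing the `squeeze_excite` helper sketched in Section 3.2 below. The layer arrangement follows the description above; the exact configuration is our reading of Fig. 5.

```python
from tensorflow.keras import layers

def sub_encoder_block(x, filters):
    # Convolution block: two 3x3 convolutions, each with BN and ReLU
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    x = squeeze_excite(x)    # channel-wise recalibration (Section 3.2)
    skip = x                 # feature map routed to the decoders (Fig. 4)
    # 2x2 max pooling with stride 2 halves the spatial dimensions
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    return x, skip

# The four sub-encoders follow the filter schedule {32, 64, 128, 256}:
# for f in (32, 64, 128, 256): x, skip = sub_encoder_block(x, f)
```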

Fig. 5 The internal structure of each employed sub-encoder block ij inside the encoders of NET 2 and NET 3, where i ∈ {2, 3} denotes the main encoder number and j ∈ {1, 2, 3, 4} denotes the sub-encoder number

3.2 Squeeze-and-excitation block (SE)

SE is a representational unit employed to raise the network's sensitivity to relevant features and suppress unnecessary ones. It consists of a global 2D average pooling layer, two dense blocks, and an element-wise multiplication connected serially. The goal of this block is to weigh every feature map so as to enhance the representational power of salient features. This is accomplished in two phases. The first phase is squeezing: a global information embedding procedure in which every channel is squeezed via global average pooling to generate channel-wise statistics. The second phase is excitation, in which adaptive recalibration fully captures channel-wise dependencies. The SE block enhances network performance at slight additional computational cost. It sits inside the convolution block of the proposed network, see Fig. 5; specifically, we added SE blocks at the intermediate stages of the convolution blocks in encoder 2, encoder 3, decoder 1, decoder 2, and decoder 3. The SE methodology, see Fig. 6, can be summarized as follows. Firstly, it receives its input from the preceding Conv2D layers. Secondly, average pooling squeezes every channel into a single numerical value. Thirdly, a dense layer followed by a ReLU provides non-linearity while reducing the channel complexity by a ratio r. Then, another dense layer with sigmoid activation provides a smooth gating function for each channel. Lastly, every feature map in the convolution block is weighted according to this "excitation" output. The computational cost of the SE modules can be adjusted through the reduction ratio (r) hyperparameter. A sketch follows.
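A minimal sketch of the SE module of Fig. 6 in Keras; the function and variable names are ours.

```python
from tensorflow.keras import layers

def squeeze_excite(x, ratio=8):
    channels = x.shape[-1]
    # Squeeze: global average pooling collapses each HxWxC map to C values
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: bottleneck dense layer with reduction ratio r, then ReLU
    s = layers.Dense(channels // ratio, activation="relu")(s)
    # Smooth per-channel gating via a sigmoid-activated dense layer
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Recalibrate: weight every feature map channel-wise
    return layers.Multiply()([x, s])
```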

Fig. 6 A scheme of the employed squeeze-and-excitation (SE) module, where H stands for height, W for width, C for channels, and r denotes the reduction ratio

3.3 Atrous spatial pyramid pooling (ASPP)

ASPP is a resampling module that employs atrous convolutions with different sampling rates to extract multi-scale features [16]. Multiple filters with different fields of view are applied to the targeted image, so objects and valuable visual context can be captured at different scales. The notion of ASPP derives from spatial pyramid pooling [30], which is effective for resampling features at different scales. ASPP mainly contains two components: atrous convolution and spatial pyramid pooling (SPP). In ASPP, atrous convolution was suggested as an alternative to the pooling operation to avoid the loss of salient information the latter causes. Atrous convolution can effectively enlarge the field of view, i.e., the receptive fields of the filters, without adding parameters, and can control the feature resolution to extract high-level semantic information; see Fig. 7 for the difference between regular and atrous convolution. Chen et al. [13] combined the advantages of atrous convolution with SPP in the atrous spatial pyramid pooling (ASPP) module to further boost segmentation performance. ASPP demonstrates high recognition capability for similar objects at multiple scales, yielding a significant accuracy improvement; hence, it has become a common choice in deep segmentation architectures [38], see Fig. 8 for its internal structure. In each branch of our network, the ASPP block sits in the middle, acting as a bridge between the encoder and the decoder. A sketch of a typical ASPP layout follows.
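A minimal sketch of a typical ASPP layout in Keras. The dilation rates (1, 6, 12, 18) follow common practice rather than values stated in the text, and the names are ours.

```python
from tensorflow.keras import layers

def aspp_block(x, filters=256, rates=(1, 6, 12, 18)):
    branches = []
    for r in rates:
        # Parallel atrous convolutions with increasing dilation rates
        # capture context at progressively larger receptive fields
        b = layers.Conv2D(filters, 3, dilation_rate=r, padding="same")(x)
        b = layers.BatchNormalization()(b)
        b = layers.Activation("relu")(b)
        branches.append(b)
    # Fuse the multi-scale branches, then project with a 1x1 convolution
    y = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding="same")(y)
```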

Fig. 7 Illustration of the difference between (a) regular convolution (top) and (b) atrous convolution (bottom). The number of holes/zeroes inserted between the filter parameters is called the dilation rate

Fig. 8 The internal design of the atrous spatial pyramid pooling (ASPP) module

3.4 Decoders

As shown in Fig. 4, the proposed model uses three decoders, one per encoder. Each decoder contains four sub-decoder blocks. Each sub-decoder block doubles the spatial dimensions of the input feature maps through 2 × 2 bi-linear up-sampling. Then, the sub-decoders in NET 1 concatenate the feature maps from the encoder of the same branch via the skip connections and the ASPP block, while the sub-decoders in NET 2 and NET 3 concatenate the feature maps from the encoders of both the same and the previous branch via the skip connections and the ASPP block. The concatenated maps are then passed through the convolution block with filter sizes {256, 128, 64, 32}. Figure 9 shows the internal structure of a single sub-decoder block; all sub-decoders share the same internal layers. The decoders have a special feeding mechanism through skip connections, see the dashed lines in Fig. 4. Skip connections facilitate gradient flow, which eases training and enhances the overall performance of the network. In addition, they help recover the spatial information lost to pooling by contributing richer features. A sketch of one sub-decoder block follows.
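A minimal sketch of one sub-decoder block, again reusing the `squeeze_excite` helper of Section 3.2. The `skips` argument stands for the encoder and ASPP feature maps routed to this block via the dashed connections in Fig. 4; the exact wiring is our reading of the description above.

```python
from tensorflow.keras import layers

def sub_decoder_block(x, skips, filters):
    # 2x2 bi-linear up-sampling doubles the spatial dimensions
    x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    # Merge with the skip-connection features of matching resolution
    x = layers.Concatenate()([x] + skips)
    # Convolution block identical to the encoder one (Conv-BN-ReLU x2 + SE)
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return squeeze_excite(x)

# The four sub-decoders follow the filter schedule {256, 128, 64, 32}.
```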

Fig. 9 The internal structure of the sub-decoder block ij, where i ∈ {1, 2, 3} denotes the main decoder number and j ∈ {1, 2, 3, 4} denotes the sub-decoder number, inside the decoders of NET 1, NET 2, and NET 3

3.5 Output blocks

Finally, as shown in Fig. 4, we use three identical output blocks, one after each decoder. Each output block applies a convolution layer followed by a sigmoid function. An element-wise addition then fuses the resulting three feature maps, and a final 1 × 1 convolution layer with sigmoid activation produces the segmentation mask. A sketch follows.
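A minimal sketch of the output stage, assuming binary masks. The kernel size of the per-branch convolution is not stated in the text, so the 1 × 1 choice there is our assumption; only the final convolution is explicitly 1 × 1.

```python
from tensorflow.keras import layers

def output_head(d1, d2, d3):
    # One convolution + sigmoid per decoder branch (kernel size assumed 1x1)
    masks = [layers.Conv2D(1, 1, activation="sigmoid")(d)
             for d in (d1, d2, d3)]
    # Element-wise addition fuses the three branch predictions
    fused = layers.Add()(masks)
    # Final 1x1 convolution + sigmoid yields the segmentation mask
    return layers.Conv2D(1, 1, activation="sigmoid")(fused)
```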

4 Experimental results and discussion

Experimentally, the proposed segmentation network, Ψnet, is tested on seven datasets in total: Kvasir-SEG [36], CVC-ClinicDB [11], CVC-ColonDB [54], and ETIS-LaribDB [50] for polyp segmentation; the 2018 Data Science Bowl challenge [12] for cell nuclei segmentation; and ISIC-2017 and ISIC-2018 [18, 19, 57] for skin lesion segmentation. Details about the employed datasets are given in Table 1, and Fig. 10 shows visual samples from them.

Table 1 Details about the employed datasets and their characteristics
Fig. 10 Samples from the employed datasets with their corresponding ground truth segmentation masks. As indicated, they show variations in shape, size, color, boundary irregularity, and appearance

All experiments were implemented using the Keras framework [17] with TensorFlow [1] as the backend, on a high-RAM GPU from Google Colab Pro. A batch size of 16 was adopted. We used dice loss, a widely used loss function for image segmentation tasks based on the dice coefficient [52]. The adaptive moment estimation (Adam) optimizer was used with an initial learning rate of 1e−4 in all experiments. A ReduceLROnPlateau callback, provided by Keras, monitored performance: when the validation loss did not improve for 10 epochs, the learning rate was reduced by a factor of 0.1. In addition, early stopping was applied after 20 consecutive epochs without improvement. For every dataset, we used 80% of the data for training, 10% for validation, and the remaining 10% for testing, and we trained the model for 300 epochs. These hyperparameters were chosen empirically. The configuration is sketched below.
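A minimal sketch of the training configuration described above, using the Keras callbacks named in the text. The `model` and the `x_train`/`y_train`/`x_val`/`y_val` arrays are assumed to be defined elsewhere, and the dice loss formulation (smoothing constant included) is a common one rather than the paper's exact code.

```python
import tensorflow as tf
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

def dice_loss(y_true, y_pred, smooth=1.0):
    # Dice loss = 1 - dice coefficient, computed over flattened masks
    y_true_f = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)
    return 1.0 - dice

callbacks = [
    # Reduce the learning rate by a factor of 0.1 after 10 stagnant epochs
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=10),
    # Stop training after 20 consecutive epochs without improvement
    EarlyStopping(monitor="val_loss", patience=20),
]

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss=dice_loss)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=300, batch_size=16, callbacks=callbacks)
```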

Evaluation metrics play a pivotal role in assessing the efficiency of segmentation models; see Table 2 for details of those employed. In this work, we analyze the results using the dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC), accuracy (ACC), sensitivity (SE), precision (PREC), and specificity (SPEC). The empirical results of the proposed methodology on each dataset are reported visually and quantitatively against SOTA techniques. The comparisons include the following segmentation techniques: U-Net [49], ResUNet [65], Recurrent Residual U-Net [4], ResUNet++ [34], UNet++ [68], BCDU-Net [7], MultiResUNet [32], MCGU-Net [6], DoubleU-Net [35], FRCU-Net [9], TMD-Unet [56], Attention Deeplabv3+ [8], DDANet [55], and ColonSegNet [37]. A sketch of the two overlap metrics follows Table 2.

Table 2 The employed evaluation metrics, where GT denotes Ground Truth, SR Segmentation Result, TP True Positive, TN True Negative, FP False Positive, and FN False Negative
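As a concrete companion to Table 2, the following sketch computes DSC and JSC from binary masks; the function and variable names are ours, and the small epsilon only guards against division by zero.

```python
import numpy as np

def dice_and_jaccard(gt, sr, eps=1e-7):
    # gt, sr: numpy arrays of {0, 1} for Ground Truth / Segmentation Result
    tp = np.sum((gt == 1) & (sr == 1))  # True Positives
    fp = np.sum((gt == 0) & (sr == 1))  # False Positives
    fn = np.sum((gt == 1) & (sr == 0))  # False Negatives
    dsc = 2 * tp / (2 * tp + fp + fn + eps)   # dice similarity coefficient
    jsc = tp / (tp + fp + fn + eps)           # Jaccard index (IoU)
    return dsc, jsc
```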

4.1 Testing the proposed Ψnet on each individual dataset

In this subsection, we report the segmentation results obtained when the proposed network is trained and tested on a single dataset at a time.

4.1.1 Skin lesion segmentation

For the task of segmenting skin lesions, two well-known dermoscopic imaging datasets are used: the ISIC (International Skin Imaging Collaboration) 2017 and 2018 challenges. The ISIC database comprises RGB dermoscopy images and their corresponding ground truth binary masks delineating lesion boundaries. Due to the lesions' diverse shapes, colors, and textures, this segmentation task is extremely challenging.

A. ISIC-2017 Challenge

The ISIC 2017 competition [18] consists of three challenges: lesion segmentation, dermoscopic feature detection, and disease classification. Here, we target the segmentation task. The proposed Ψnet is compared to the traditional U-Net [49] and the most recent studies in skin lesion segmentation. Table 3 reports the quantitative comparison. As indicated, the proposed Ψnet achieves the top F1-score, with an approximate improvement margin of 1.25% over MCGU-Net [6], which shares the idea of inserting the SE module in the decoder. It also outperforms the commonly used BCDU-Net [7], which employs a bidirectional enhancement scheme, by 2.42% in F1-score, and the traditional U-Net [49] by 3.7%. In terms of specificity, the Ψnet model outperforms MCGU-Net [6]. In terms of accuracy, the proposed Ψnet comes second after MCGU-Net [6] by a slight margin, but it still outperforms the traditional U-Net [49] by approximately 2.24%. However, the traditional U-Net [49] remains the top performer in recall, and MCGU-Net [6] in IoU. Supporting visual results are given in Fig. 11, and Fig. 12 shows output segmentation masks of the proposed methodology against other SOTA methods.

Table 3 Quantitative comparison between the proposed model and the most common segmentation models on ISIC-2017. The best results are indicated in bold font, while the second place is underlined
Fig. 11 Visual segmentation results of the proposed model on ISIC-2017

Fig. 12 Visual outputs using the proposed methodology compared to various SOTA models on ISIC-2017

B. ISIC-2018 Challenge

This dataset [19] is a large-scale dermoscopy image dataset published by the International Skin Imaging Collaboration (ISIC) in 2018. Table 4 gives a quantitative comparison between the proposed network and common alternatives. The proposed network achieves better performance than the SOTA alternatives in terms of F1-score, sensitivity, accuracy, and precision. Ψnet shows superior performance compared to FRCU-Net [9], which includes multi-level BConvLSTM and SE, achieving an F1-score improvement of 1.8%. In addition, it surpasses the well-known BCDU-Net [7], which utilizes BConvLSTM and dense convolutions, by 8% in F1-score. Moreover, large improvements of 4.7% and 28.4% in F1-score and IoU are achieved over TMD-Unet [56], and of 9.4% and 32.1% over the traditional U-Net [49]. Furthermore, the proposed Ψnet, together with Attention Deeplabv3+ [8], shows the highest accuracy of 0.964 among the alternatives, an approximate 7.4% improvement over U-Net [49]. In terms of specificity, however, the proposed Ψnet, TMD-Unet [56], and BCDU-Net [7] share the third-best score of 0.982, just behind Attention Deeplabv3+ [8] and MCGU-Net [6] by a very slight margin. Figure 13 shows qualitative segmentation masks of the proposed network on ISIC-2018; as indicated, the network performs efficiently on lesions of all sizes, from small to large. In addition, Fig. 14 shows output segmentation results of the proposed methodology against other SOTA methods.

Table 4 Quantitative comparison on skin lesion segmentation challenge ISIC-2018 between the proposed Ψnet and the most common segmentation models. The best results are bolded, and the second place is underlined
Fig. 13 Some visual segmentation results using the proposed Ψnet on ISIC-2018

Fig. 14 Visual results of the proposed model compared to other SOTA models on the ISIC-2018 dataset

4.1.2 Polyp segmentation

Colonoscopy is an efficient mechanism for exposing colorectal polyps, which are closely associated with colorectal cancer. Segmenting polyps from colonoscopy images is crucial in clinical practice because it gives important information for diagnosis and surgery. However, the appearance, size, color, texture, and aspect ratio of polyps in colonoscopy images vary, even within the same type. In addition, there is no sharp boundary between a polyp and the surrounding mucosa. Hence, precise polyp segmentation is a difficult task.

For the evaluation of Ψnet, we conducted experiments on four common colonoscopy polyp segmentation benchmarks: Kvasir-SEG, CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. The achieved performance is better than most SOTA methods. Despite the aforementioned challenges of polyp segmentation, the proposed Ψnet segments effectively; see Table 5 for the results on all the colonoscopy datasets. The next subsections give more detailed quantitative and qualitative results for each polyp dataset, demonstrating the accuracy and generalizability of the proposed Ψnet.

Table 5 Quantitative comparison of our model and the SOTA polyp segmentation methods on Kvasir, ClinicDB, ColonDB, and ETIS

A. Kvasir-SEG Challenge

The first colonoscopy dataset used in our experiments is Kvasir-SEG [36]. It is publicly available for polyp detection, localization, and segmentation. Sample images from Kvasir-SEG and their corresponding masks are displayed in Fig. 10. The quantitative evaluation of Ψnet is reported in Table 6, with supporting qualitative results in Figs. 15 and 16. As indicated, the proposed model achieves superior scores in all metrics compared to ResUNet, ResUNet++, NanoNet, and DDANet. Compared to DDANet, the proposed methodology achieves improvements of 4.69%, 4.7%, and 7.68% in DSC, IoU, and precision, respectively. In addition, Ψnet achieves an approximate 8.09% increase in recall over ResUNet++, which is based on an encoder-decoder architecture with residual and SE blocks. The proposed framework's ability to segment polyps can be seen by comparing the ground truth to the predicted masks in Fig. 15. Furthermore, Fig. 16 depicts the segmentation outputs of the proposed methodology against other SOTA techniques. From the visual results, we can deduce that the proposed model generates remarkable results for both small- and large-sized polyps.

Table 6 Quantitative comparison of the Kvasir-SEG dataset between the proposed Ψnet and the most common SOTA polyp segmentation models. The best scores are bolded, while the second place is underlined
Fig. 15 Visual qualitative segmentation results of Ψnet on medium, flat, and large polyps from the Kvasir-SEG dataset for automatic polyp detection

Fig. 16 Visual comparison of the proposed model on the Kvasir-SEG dataset against various SOTA models

B. CVC-ClinicDB

The second employed colonoscopy dataset is CVC-ClinicDB [11], also known as CVC-612. Based on the quantitative results in Table 7, the proposed model achieves the highest DSC, IoU, and precision compared to other SOTA methods, such as U-Net, Deeplabv3+ (Xception), Deeplabv3+ (Mobilenet), HRNetV2-W18-Smallv2, HRNetV2-W48, ResUNet++, ResUNet++ + CRF, DoubleU-Net, and TMD-Unet. DSC and IoU are the key metrics in segmentation tasks. ResUNet++ + CRF and ResUNet++ achieve the highest recall scores, with a very small difference between them; our recall, in turn, exceeds that of DoubleU-Net by approximately 6.97%. In addition, the proposed Ψnet surpasses the baseline architectures U-Net and Deeplabv3+ (Xception) by significant margins: improvements of 10.77% and 2.52% in IoU, and of 6.68% and 5.52% in DSC, respectively. Moreover, it outperforms existing SOTA techniques such as DoubleU-Net and TMD-Unet, achieving a higher DSC of 0.9449. Compared to the ground truth masks, the predicted masks have substantially identical polyp boundaries and shapes, as shown in Fig. 17. In addition, Fig. 18 shows output segmentation masks of the proposed model against some SOTA methods.

Table 7 Quantitative comparison between the proposed Ψnet and the most common SOTA segmentation models on CVC-ClinicDB. The best scores are bolded, and the second place is underlined
Fig. 17 Visual segmentation results of the proposed Ψnet on CVC-ClinicDB

Fig. 18 Visual comparison of the proposed Ψnet and various SOTA models on CVC-ClinicDB

C. CVC-ColonDB

CVC-ColonDB [54] is the third polyp dataset, employed for a more comprehensive analysis of automatic polyp segmentation. The quantitative results in Table 8 demonstrate that the proposed Ψnet outperforms other SOTA techniques with a superior DSC of 0.9269, an approximate 7.95% improvement over ResUNet++ + TTA. There is only a slight difference in DSC between ResUNet++ and ResUNet++ + CRF. Superior performance is also reported in IoU, at 0.8641, an approximate 1.75% improvement over ResUNet++ + TTA. In recall, the proposed network improves on ResUNet++ and ResUNet++ + CRF by 9.12% and 9.26%, respectively. Moreover, it achieves a higher precision score, an approximate 11.75% increase over ResUNet++ + TTA + CRF. From the qualitative outputs in Fig. 19, we can conclude that the developed network produces accurate segmentation masks when compared to the ground truth. More visual results against other SOTA methods are given in Fig. 20.

Table 8 Quantitative comparison between the proposed Ψnet and the most common SOTA segmentation models on CVC-ColonDB. The best scores are bolded, and the second place is underlined
Fig. 19 Visual segmentation results of Ψnet on challenging images from CVC-ColonDB

Fig. 20 Visual qualitative comparison on CVC-ColonDB of the proposed model and various SOTA models

D. ETIS-Larib Dataset

ETIS-LaribPolypDB [50] is the fourth employed polyp dataset and the most challenging one, because the majority of its polyps are small and hard to identify. The proposed network achieved the best DSC, IoU, precision, and recall compared to SOTA methods such as PraNet, ResUNet++, ResUNet++ + CRF, ResUNet++ + TTA, and ResUNet++ + TTA + CRF, as indicated in Table 9. It achieves the highest DSC and IoU of 0.8888 and 0.8000, improvements of 25.24% and 4.66% over ResUNet++, respectively. Moreover, it provides the highest precision, at 0.9767. Furthermore, it achieves improvements of 26.4% and 33% in recall and precision over ResUNet++, and of 29.9% and 32.02% over ResUNet++ + TTA.

Table 9 Quantitative comparison between the proposed Ψnet and the most common SOTA polyp segmentation models on ETIS-LaribPolypDB. The best scores are bolded, and the second place is underlined

ETIS-LaribPolypDB is the dataset most affected by changes in the reduction ratio r inside the SE block. Changing r from 8 to 16 yields a large enhancement in all metrics: a 4.55% increase in DSC, 7.68% in IoU, 1.32% in recall, and 0.04% in precision. In addition, the large gap between the performance of the proposed Ψnet and the other listed techniques makes our model a new strong baseline for medical image segmentation. For supporting visual results, see Figs. 21 and 22. Overall, the visual and computational results reveal the significance of the proposed network for automated polyp detection and delineation with lower miss-detection rates.

Fig. 21 Some visual segmentation results of Ψnet on ETIS-LaribPolypDB

Fig. 22 Visual comparison of the proposed model and several other SOTA models on ETIS-LaribPolypDB

4.1.3 Nuclei segmentation

Nuclei segmentation is a key technique for automatic pathological screening. Precisely segmented nuclei are necessary not only for cancer detection but also for defining the proper treatment effectively. Nevertheless, the variety of cell types and sizes, as well as extrinsic influences and illumination circumstances, make nucleus segmentation a difficult task. For nucleus segmentation, the 2018 Data Science Bowl (DSB) dataset [12] is employed, which is publicly available from the Broad Bioimage Benchmark Collection (https://data.broadinstitute.org/bbbc/BBBC038/).

The goal of the DSB 2018 challenge is to find nuclei in divergent images. The quantitative results in Table 10 demonstrate that the proposed Ψnet outperforms networks such as U-Net, UNet++, ResUNet++, Deeplabv3+ (Xception), Deeplabv3+ (Mobilenet), HRNetV2-W18-Smallv2, HRNetV2-W48, ResUNet++ + CRF, PraNet, and TMD-Unet in terms of DSC, IoU, and precision. Specifically, the proposed approach scores 0.9243, 0.8632, and 0.9252 in DSC, IoU, and precision, improvements of 0.87%, 1.51%, and 2.83% over TMD-Unet, respectively. Among all compared models, ColonSegNet attains the top precision and TMD-Unet the top recall. The developed model surpasses ResUNet++ in DSC and IoU by 1.45% and 2.62%, respectively, and achieves a 1.1% improvement in recall over PraNet. Supporting visual results appear in Figs. 23 and 24, which show output segmentation masks of the proposed methodology against various SOTA methods.

Table 10 Quantitative comparison between the proposed Ψnet and the most common SOTA segmentation models on 2018 Data Science Bowl dataset. The highest results are bolded, while the second place is underlined
Fig. 23 Visual segmentation results of Ψnet on the 2018 Data Science Bowl

Fig. 24 Visual qualitative results of the proposed model on the 2018 Data Science Bowl compared to various SOTA models

4.2 Cross-testing on the four employed colonoscopy imaging datasets

Usually, the generalization capability of a model is evaluated by a blind test on a held-out part of the employed dataset that the model has not seen before. Broader generalization can be tested by evaluating its applicability across datasets from multiple sources; such cross-data evaluation is important for validating the model on unseen polyps. Hence, in this subsection, we report the segmentation results obtained when the proposed network is trained on a single polyp dataset and tested both on that dataset and on the others. Detailed results are given in Tables 11, 12, 13 and 14 and discussed in the following subsections.

Table 11 Cross-testing results, where Ψnet is trained on CVC-ClinicDB and tested on Kvasir-SEG, CVC-ColonDB, ETIS-Larib, and CVC-ClinicDB. The highest scores are bolded, and the second place is underlined
Table 12 Cross-testing results, where Ψnet is trained on Kvasir-SEG and tested on CVC-ClinicDB, CVC-ColonDB, ETIS-Larib, and Kvasir-SEG. The highest scores are bolded, and the second place is underlined
Table 13 Cross-testing results, where Ψnet is trained on CVC-ColonDB and tested on CVC-ClinicDB, CVC-ColonDB, ETIS-Larib, and Kvasir-SEG. The highest scores are bolded, and the second place is underlined
Table 14 Cross-testing results, where Ψnet is trained on ETIS-Larib and tested on CVC-ClinicDB, CVC-ColonDB, ETIS-Larib, and Kvasir-SEG. The highest scores are bolded, and the second place is underlined

4.2.1 CVC-ClinicDB-based cross-evaluation

Here, the model is trained on CVC-ClinicDB, and testing is performed on CVC-ClinicDB as well as the other three polyp datasets, i.e., CVC-ColonDB, ETIS-Larib, and Kvasir-SEG. Table 11 shows the cross-testing results of the proposed Ψnet. As indicated, superior performance is achieved in all metrics when Ψnet is trained and tested on CVC-ClinicDB, with the top DSC of 0.9449. Kvasir-SEG occupies second place in DSC with 0.7688, while ETIS-Larib and CVC-ColonDB come third and fourth with DSC of 0.7373 and 0.6494, respectively. CVC-ColonDB shows the worst performance in this cross-evaluation, mainly because of the high dissimilarity between the CVC-ClinicDB training set and the CVC-ColonDB test set. Figure 25 shows visual samples of this cross-evaluation.

Fig. 25 Visual results of cross-testing the proposed model. Training is performed on CVC-ClinicDB, while testing is performed on Kvasir-SEG, ETIS-Larib, and CVC-ColonDB, besides the original training dataset

4.2.2 Kvasir-SEG-based cross-evaluation

Here, the model is trained on Kvasir-SEG, and testing is performed on Kvasir-SEG as well as the other three independent polyp datasets, i.e., CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. The cross-testing results of the proposed Ψnet are shown in Table 12. As indicated, the proposed methodology achieves the highest scores in all metrics when Ψnet is trained and tested on Kvasir-SEG, attaining the highest DSC of 0.9045. CVC-ClinicDB occupies second place in DSC with 0.8120, while CVC-ColonDB and ETIS-Larib come third and fourth with DSC of 0.7215 and 0.6433, respectively. ETIS-Larib shows the lowest performance in this cross-evaluation in terms of DSC, precision, and IoU, owing to the varying nature of its polyps in shape, appearance, color, size, and structure. Figure 26 shows visual samples of this cross-evaluation.

Fig. 26 Visual results of cross-testing the proposed model. Training is performed on Kvasir-SEG, while testing is performed on CVC-ClinicDB, CVC-ColonDB, and ETIS-Larib, besides the original training dataset

4.2.3 CVC-ColonDB-based cross-evaluation

Here, the model is trained on CVC-ColonDB, and testing is performed on CVC-ColonDB as well as the other three independent polyp datasets, i.e., CVC-ClinicDB, ETIS-Larib, and Kvasir-SEG. Table 13 shows the cross-testing results of the proposed Ψnet. The best scores are achieved when Ψnet is trained and tested on CVC-ColonDB, with the best DSC of 0.9269. CVC-ClinicDB takes second place in DSC with 0.6075, while Kvasir-SEG and ETIS-Larib come third and fourth with DSC of 0.4909 and 0.4020, respectively. ETIS-Larib shows the lowest performance in this cross-evaluation due to the differing nature of CVC-ColonDB and ETIS-Larib polyps, such as variations in pixel intensity distribution generated by different colonoscopes. Figure 27 shows visual samples of this cross-evaluation.

Fig. 27 Visual results of cross-testing the proposed model. Training is performed on CVC-ColonDB, while testing is performed on Kvasir-SEG, ETIS-Larib, and CVC-ClinicDB, besides the original training dataset

4.2.4 ETIS-Larib-based cross-evaluation

Here, the model is trained on ETIS-Larib, and testing is performed on ETIS-Larib as well as the other three independent polyp datasets, i.e., CVC-ColonDB, Kvasir-SEG, and CVC-ClinicDB. Table 14 shows the cross-testing results of the proposed Ψnet. As indicated, Ψnet performs best when trained and tested on ETIS-Larib, achieving the best DSC of 0.8888. Kvasir-SEG occupies second place with a DSC of 0.5841, while CVC-ClinicDB and CVC-ColonDB come third and fourth with DSC of 0.5655 and 0.4327, respectively. CVC-ColonDB shows the worst performance in this cross-evaluation in terms of DSC, IoU, and recall, mainly because of the high dissimilarity between ETIS-Larib and CVC-ColonDB, as each dataset has its own nature and characteristics. Figure 28 shows visual samples of this cross-evaluation.

Fig. 28 Visual results of cross-testing the proposed model. Training is performed on ETIS-Larib, while testing is performed on Kvasir-SEG, CVC-ColonDB, and CVC-ClinicDB, besides the original training dataset

4.3 Ablation study

To verify the effectiveness of Ψnet, ablation studies are conducted to analyze various elements and settings, including hyper-parameter tuning, loss function, image resolution, and pretrained network.

Hyper-parameters

Hyper-parameter tuning is an essential task in deep learning, as it determines model accuracy. Hyper-parameters control the learning process and consequently how well the model performs: an optimal configuration gives the best results, whereas suboptimal choices can yield poor accuracy. Our hyper-parameters were generally chosen through multiple trial-and-error evaluations; however, testing many combinations without principled guidance is time-consuming.

Batch size is the number of examples drawn from the training dataset to estimate the gradient error, which influences the dynamics of the learning algorithm. Large batch sizes slow down the learning process but produce more stable models than smaller batches; in our experiments, a batch size of 16 was chosen. Activation functions such as ReLU in the different modules of the proposed system helped resolve the vanishing gradient problem. Adaptive learning rate optimizers such as Adam performed efficiently and achieved higher accuracy with a learning rate of 0.0001.

The image size greatly affects the accuracy, training time, and memory footprint of the model. Hence, a key objective is to set the proper image size for the training datasets so that essential information such as size, shape, and texture can be extracted and contribute to segmentation accuracy. An image size of 256 × 256 achieves a good balance between computation time and performance; it is also the standard size used by the alternative models in our experiments.

The performance of the SE module is controlled by its reduction ratio (r). Increasing the reduction ratio reduces the total number of parameters of Ψnet; for example, changing r from 8 to 16 reduces the parameter count from 33,903,845 to 33,849,445. Setting r = 8 usually assures a good balance between complexity and accuracy; hence, it is the default reduction ratio in Ψnet, as in Tables 3, 4 and 5, while Tables 6, 7, 8, 9 and 10 show the effect of increasing the reduction ratio on performance. As indicated, a larger r can sometimes yield better performance, as in the results on ETIS-LaribPolypDB.

Loss function

In highly imbalanced segmentation tasks, small foreground classes are ignored during training, resulting in low segmentation accuracy. This class imbalance problem can be alleviated by up-weighting the loss of the small foreground classes, so the loss function is an important factor in handling it. There are two types of class imbalance: at the sample level and at the pixel level. Pixel-level class imbalance occurs when only a few pixels of a sample belong to a particular class; this is harder to address at the data collection stage. Sample-level class imbalance, on the other hand, describes the imbalance of classes across a dataset; as in classification tasks, it can be addressed during data collection by including class representatives uniformly.

Existing loss functions for segmentation tasks can be divided into four categories [21, 33]: distribution-based, region-based, boundary-based, and compounded. Distribution-based loss functions, such as cross-entropy and focal loss, measure the dissimilarity between two distributions. Region-based loss functions quantify the mismatch or overlap between two regions; examples are dice loss, Tversky loss, focal Tversky loss, and log-cosh dice loss. Boundary-based loss functions measure the distance between two boundaries, such as the Euclidean or Hausdorff distance. Compounded loss functions combine distribution-, region-, and boundary-based terms (combo loss). Most loss functions build on cross-entropy and dice loss. However, objects in medical images, such as nuclei and polyps, often occupy only a small region of the image, so cross-entropy and similar losses are not optimal for such tasks. Most SOTA methods employ the dice coefficient loss function in their experiments; hence, we follow suit for the sake of fair comparison. However, Table 15 shows the effect of different loss functions on ISIC-2018. The scores are very close, but focal Tversky loss performs best; ISIC-2018 is a heavily imbalanced dataset and is therefore the one most affected by changing the loss function. A sketch of the focal Tversky loss is given after Table 15.

Table 15 Performance evaluation of different types of segmentation loss functions on ISIC-2018. The highest scores are bolded, and the second place is underlined
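A minimal sketch of the focal Tversky loss compared in Table 15. The values of alpha, beta, and gamma follow common defaults from the literature rather than values reported here, and the names are ours.

```python
import tensorflow as tf

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3,
                       gamma=0.75, smooth=1.0):
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    tp = tf.reduce_sum(y_true * y_pred)            # true positives
    fn = tf.reduce_sum(y_true * (1.0 - y_pred))    # false negatives
    fp = tf.reduce_sum((1.0 - y_true) * y_pred)    # false positives
    # alpha > beta penalizes false negatives more, helping small foregrounds
    tversky = (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)
    # The focal exponent gamma emphasizes harder, low-Tversky examples
    return tf.pow(1.0 - tversky, gamma)
```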

The impact of the pretrained network

Here, we replaced the adopted VGG-19 in the first encoder of NET 1 with other common pretrained networks, namely ResNet50 and DenseNet121. Both can provide a lower number of parameters, but at the expense of performance; see Table 16 for a quantitative comparison on CVC-ColonDB and ETIS-LaribPolypDB. These two datasets are the most challenging: their polyps vary widely in size, shape, texture, and characteristics, as depicted in Figs. 19, 20, 21 and 22. Some of them have the same texture as the colon and grow horizontally, leading to polyp misdetections. Hence, extracting the salient features with the pretrained network in the first encoder is a challenging task.

Table 16 Ablation study on the influence of pretrained networks in Ψnet on CVC-ColonDB and ETIS-LaribPolypDB

5 Conclusion

This paper introduces a novel encoder-decoder architecture, dubbed Ψnet, for semantic segmentation of medical images. A pre-trained network is employed in the proposed encoder to increase the model's capability to learn long-range dependencies and capture global contextual representations effectively. In addition, we increase the representation of salient features by weighing every feature map with a squeeze-and-excitation block. Moreover, we add ASPP for dense large-scale feature extraction by capturing global multi-scale contextual information. Hence, the proposed model combines the targeted semantic information at different levels. We validated the effectiveness of the proposed Ψnet via extensive experiments on segmentation tasks across different modalities, including colonoscopy, dermoscopy, and microscopy. In these experiments, the proposed Ψnet achieves superior performance compared to SOTA models such as U-Net, ResUNet, ResUNet++, UNet++, BCDU-Net, MCGU-Net, FRCU-Net, Attention Deeplabv3+, DDANet, ColonSegNet, and TMD-Unet. On all employed datasets, our model produced the best DSC results despite the challenges of polyps, lesions, and nuclei of different shapes, types, and sizes, ranging from tiny to enormous. In addition, to test the generalizability of the proposed model, a cross-evaluation was performed across different datasets, demonstrating good performance, especially when the training and testing datasets share the same nature. Hence, the proposed Ψnet could serve as a new baseline in medical image segmentation.

Model limitations and future work

Despite its effectiveness, the proposed Ψnet comprises three parallel multi-scale branches, yielding a network of around 33 M parameters compared to the traditional U-Net's 8 M. This number of parameters makes the model more complex and slower to train. Hence, in the future, we intend to employ attentive and residual mechanisms that may help reduce the branches' complexity. In addition, we believe that increasing the dataset size and applying additional augmentation approaches would boost the model's performance even further. Moreover, the application of Ψnet need not be confined to medical segmentation; it might be extended to natural image segmentation and other pixel-wise classification tasks. Furthermore, we will seek to extend the proposed Ψnet to volumetric segmentation tasks.