Semantic segmentation has emerged as a critical tool in computerized medical image and surgical video analysis, empowering numerous applications in various domains. In surgical videos, semantic segmentation is a prerequisite for applications including phase and action recognition, irregularity detection, surgical training, objective skill assessment, relevance-based compression, surgical planning, and operation room organization [1,2,3,4]. In the case of volumetric medical images, semantic segmentation can considerably aid diagnosis, treatment planning, and monitoring [5]. Automatic segmentation of medical images and videos can also reduce subjective errors caused by time constraints and workloads while enhancing treatment and surgical efficiency.

Designing a neural network architecture for medical image and surgical video segmentation is challenging because the relevant labels exhibit diverse features. Specifically, many classes of objects relevant to medical image and surgical video analysis are heterogeneous, featuring deformable or amorphous instances, as well as color, texture, and scale variation. Besides, in surgical videos, motion blur degradation becomes more critical due to the camera’s proximity to the surgical scene. Unlike general images, medical images and surgical videos may contain transparent relevant content (such as an intraocular lens) or exhibit blunt boundaries, further complicating the task of semantic segmentation. Accordingly, an effective network for medical image and surgical video segmentation should be able to simultaneously deal with (I) heterogeneity and deformability in relevant objects, and (II) transparency, blunt edges, and distortions such as motion and defocus blur.

This paper introduces a U-Net-based CNN for semantic segmentation, which addresses the challenges associated with segmenting relevant content in medical images and surgical videos by adaptively capturing semantic information. The proposed network, called DeepPyramid+, comprises two key modules: (i) the Pyramid View Fusion (PVF) module, which offers a narrow-to-wide-angle global view of the feature map centered at each pixel position, and (ii) the Deformable Pyramid Reception (DPR) module, responsible for performing shape-adaptive feature extraction on the input convolutional feature map. We provide comprehensive experiments to compare the performance of DeepPyramid+ with state-of-the-art baselines on five intra-domain and two cross-domain datasets. Experimental results reveal the superiority of DeepPyramid+ compared to the baselines. Ablation studies confirm the effectiveness of each proposed module in boosting semantic segmentation performance. To support reproducibility and further investigations, we will release the PyTorch implementation of DeepPyramid+ and all dataset splits upon acceptance of this paper.

Related work

U-Net [7] was initially proposed for medical image segmentation and achieved strong performance, which is attributed to its skip connections. Many U-Net-based architectures have been proposed over the past years to improve segmentation accuracy and to address flaws and restrictions in the previous architectures [8,9,10,11,12,13,14].

Attention modules

Attention mechanisms can be broadly described as techniques that guide the network’s computational resources (i.e., the convolutional operations) toward the most determinative features in the input feature map [9, 15, 16]. Such mechanisms have proven especially beneficial for semantic segmentation. The scSE blocks [15] aim to recalibrate the feature maps based on pixel-wise and channel-wise global features. BARNet [12] adopts a bilinear-attention module to extract the cross-dependencies between the different channels of a convolutional feature map. PAANET [11] uses a double-attention module to model semantic dependencies between channels and spatial positions in the convolutional feature map.

Fusion modules

Fusion modules can be characterized as modules designed to improve semantic representation via combining several feature maps. The input feature maps could range from varying-level semantic features to the features coming from parallel operations. PSPNet [17] adopts a pyramid pooling module (PPM) containing parallel sub-region average pooling layers followed by upsampling to fuse the multi-scale sub-region representations. Atrous spatial pyramid pooling (ASPP) [18, 19] was proposed to deal with objects’ scale variance by aggregating multi-scale features extracted using parallel varying-rate dilated convolutions. CPFNet [13] uses another fusion approach for scale-aware feature extraction.

Fig. 1 Overall architecture of DeepPyramid+, consisting of encoder blocks of the VGG16 network and the proposed PVF and DPR modules. The numbers in each block correspond to the output feature map’s dimensions


We present a segmentation network that focuses on (I) modeling heterogeneous classes featuring deformations, shape, scale, color, and context variation, (II) dealing with content distortion due to motion blur and reflection, and (III) handling objects’ transparency and blunt boundaries (Fig. 1). At its core, our network adopts the U-Net architecture, with VGG16 as the encoder. We develop two decoder modules specifically tailored to tackle the mentioned challenges: (1) Pyramid View Fusion (PVF), which aims to replicate a deduction process within the neural network analogous to the functioning of the human visual system by enhancing the representation of relative information at each individual pixel position; and (2) Deformable Pyramid Reception (DPR), which addresses the limitations of regular convolutional layers by introducing deformable dilated convolutions and shape- and scale-adaptive feature extraction techniques. The DPR module allows for handling the complexities of heterogeneous classes and deformable shapes, resulting in improved segmentation accuracy and robustness.

We specify the functionality of each module in the following subsections. Additional discussions regarding the effectiveness of each module and an analysis of the complexity for each module are available in the supplementary material.

Notations. Throughout this paper, we represent convolutional layers with a kernel size of \((k\times k)\), dilation of d, m output channels, and g groups as \(\circledast _{k,d}^{m,g}\). For deformable convolutions, we use the symbol \({\tilde{\circledast }}_{k,d}^{m,g}\). Additionally, we illustrate the average-pooling layer with a kernel size of \((k\times k)\) and a stride of s pixels as and global average pooling as . The symbol \(+\!\!\!\!+\,_{D}\) denotes feature map concatenation over dimension D. Furthermore, we employ \(\Uparrow ^{(W_{out}, H_{out})}\) and \(\Downarrow ^{(W_{out}, H_{out})}\) for upsampling and downsampling operations with a scale factor of \((W_{out}, H_{out})\), respectively. We use \(\sigma (\cdot )\) to represent the Softmax operation, \(\Vert \cdot \Vert _{n}\) for layer normalization over the last n dimensions, \(\mathcal {R}(\cdot )\) for the ReLU nonlinearity function, and \(\tau (\cdot )\) for the hard tangent hyperbolic function.

Pyramid View Fusion (PVF)

To optimize computational complexity, the initial step involves creating a bottleneck by employing a convolutional layer with a kernel size of one, as illustrated in Fig. 2. Following this dimensionality reduction stage, the resulting convolutional feature map is fed into four parallel branches. The first branch features a global average pooling layer, which is subsequently followed by upsampling. The other three branches employ average pooling layers with progressively increasing filter sizes while maintaining a stride of one pixel. The use of a one-pixel stride is specifically important to achieve a pixel-wise centralized pyramid view, as opposed to the region-wise pyramid attention approach employed in PSPNet [17]. The output feature maps from all branches are then concatenated and fed into a convolutional layer with four groups, for extracting inter-channel dependencies during dimensionality reduction. Subsequently, a regular convolutional layer is applied to extract joint intra-channel and inter-channel dependencies. The resulting feature map is then passed through a layer-normalization function, which helps normalize the activations for improved stability and performance.
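The difference between a pixel-wise centralized view and a region-wise view can be illustrated on a single channel. The following is a minimal pure-Python sketch, not the authors' PyTorch implementation: the kernel sizes (3, 5, 7), the border handling (averaging over in-bounds pixels only), and all function names are assumptions made for illustration, and the bottleneck, grouped convolution, and normalization steps are omitted.

```python
from typing import List

Grid = List[List[float]]


def avg_pool_stride1(fmap: Grid, k: int) -> Grid:
    """Average-pool with a k x k window and a stride of one pixel.

    The output keeps the input resolution, so every pixel receives the mean
    of the window centered on itself. Border windows are averaged over
    in-bounds pixels only (an assumed padding policy).
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [fmap[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def global_view(fmap: Grid) -> Grid:
    """Global average pooling followed by upsampling to the input size."""
    h, w = len(fmap), len(fmap[0])
    mean = sum(sum(row) for row in fmap) / (h * w)
    return [[mean] * w for _ in range(h)]


def pyramid_views(fmap: Grid, kernel_sizes=(3, 5, 7)) -> List[Grid]:
    """Return the global view plus three local views for one channel."""
    return [global_view(fmap)] + [avg_pool_stride1(fmap, k) for k in kernel_sizes]


# Toy 4 x 4 feature map with values 0..15.
fmap = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
views = pyramid_views(fmap)
```

In the module itself, the four views would be concatenated along the channel dimension before the grouped convolution; here they are simply returned as a list.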

Fig. 2 The detailed architecture of the PVF and DPR modules

Deformable Pyramid Reception (DPR)

The architecture of the Deformable Pyramid Reception (DPR) module, as depicted in Fig. 2, can be described as follows. Initially, the upsampled coarse-grained semantic feature map from the preceding layer is concatenated with its symmetric fine-grained feature map from the encoder. Subsequently, these concatenated features are passed through three parallel branches. The first branch employs a regular convolution operation, while the other two branches utilize deformable convolutions with dilation rates of three and six. The regular (structured) convolution covers the immediate neighbors, up to one pixel away from the central pixel. The deformable convolutions with dilation rates of three and six cover areas two to four pixels and five to seven pixels away from each central pixel, respectively. Accordingly, by combining these layers, the DPR module forms a learnable sparse receptive field of size \(15\times 15\) pixels. These layers share their weights to avoid introducing a large number of trainable parameters.
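The coverage figures above follow from simple arithmetic on a \(3\times 3\) kernel: the outer taps sit at a distance equal to the dilation rate, and a bounded offset shifts each tap by at most one pixel. A short sketch of that calculation, with illustrative function names that are not taken from the paper:

```python
def reach(dilation: int, max_offset: int = 1, kernel_size: int = 3):
    """Distance range (in pixels) from the center covered by the outer taps.

    A 3 x 3 kernel with dilation d places its outer taps d pixels from the
    center; an offset bounded by max_offset moves each tap by at most that
    many pixels in either direction.
    """
    half = kernel_size // 2
    nominal = dilation * half
    return nominal - max_offset, nominal + max_offset


def receptive_field_size(dilations, max_offset: int = 1, kernel_size: int = 3) -> int:
    """Side length of the square region reachable by the widest branch."""
    farthest = max(reach(d, max_offset, kernel_size)[1] for d in dilations)
    return 2 * farthest + 1


print(reach(3))                      # (2, 4)
print(reach(6))                      # (5, 7)
print(receptive_field_size([3, 6]))  # 15
```

The regular convolution fills the innermost ring (distance one), so the three branches together tile distances one through seven around each pixel.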

To compute the feature-map-adaptive offset field for each deformable convolution, a regular convolution operation is employed. Considering the target areas of the two deformable convolutions, the offset fields should be computed based on the content within four and seven pixels of each central pixel, respectively (\(k=9\), \(k=15\)). The computed offset values are then passed through a hard hyperbolic tangent function, which clips them to the range \([-1, 1]\), to ensure that each deformable convolution adaptively covers an area within the range of \([k-1, k+1]\). The offset field provides two values per element in the deformable convolutional kernel (horizontal and vertical offsets). Accordingly, the number of output channels of the offset field for a deformable convolution with a kernel of size \(3\times 3\) is equal to 18. This enables the deformable convolution to spatially adjust its receptive field based on the learned offset values, improving its ability to capture contextually relevant information.
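The numbers in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that the values \(k=9\) and \(k=15\) are the kernel sizes of the offset-predicting convolutions, obtained as \(2r+1\) for a radius of \(r\) pixels; the function names are illustrative only.

```python
def offset_kernel_size(radius: int) -> int:
    """Kernel size needed to see everything within `radius` pixels of the
    center (assumed reading of k = 9 and k = 15)."""
    return 2 * radius + 1


def offset_channels(kernel_size: int) -> int:
    """Two offsets (horizontal and vertical) per kernel element."""
    return 2 * kernel_size * kernel_size


def hard_tanh(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Clip a raw offset value to [lo, hi]."""
    return max(lo, min(hi, x))


print(offset_kernel_size(4), offset_kernel_size(7))  # 9 15
print(offset_channels(3))                            # 18
print([hard_tanh(v) for v in (-2.3, -0.4, 0.0, 1.7)])  # [-1.0, -0.4, 0.0, 1.0]
```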

The output feature maps of the parallel structured and deformable convolutions are then passed through a feature fusion decision (FFD) module [4]. This module determines the significance of each input feature map based on the spatial descriptors using pixel-wise convolutions. These descriptors are concatenated and subjected to a Softmax operation, resulting in normalized descriptors. The normalized descriptors determine the pixel-wise contribution or weight of each input convolutional feature map in the final fused feature map. The output feature map of the FFD module is obtained as a weighted sum of the input feature maps, where the normalized descriptors serve as pixel-wise weights. The resulting feature map from the FFD module goes through a series of additional operations for deeper feature extraction and normalization.
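The weighting step of the FFD module can be sketched for single-channel maps as follows. This is a simplified illustration rather than the module from [4]: the descriptor maps are taken as given inputs (in the network they come from pixel-wise convolutions), and the function names are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_pixel(features: List[float], descriptors: List[float]) -> float:
    """Weighted sum of branch values at one pixel; weights are the softmax
    of the per-branch descriptors at that pixel."""
    weights = softmax(descriptors)
    return sum(w * f for w, f in zip(weights, features))


def fuse_maps(branch_maps: List[Grid], descriptor_maps: List[Grid]) -> Grid:
    """Apply the pixel-wise fusion at every spatial position."""
    h, w = len(branch_maps[0]), len(branch_maps[0][0])
    return [[fuse_pixel([b[y][x] for b in branch_maps],
                        [d[y][x] for d in descriptor_maps])
             for x in range(w)]
            for y in range(h)]


# Three branches on a 1 x 2 map: equal descriptors at the first pixel,
# a dominant third branch at the second pixel.
branches = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]
descriptors = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 50.0]]]
fused = fuse_maps(branches, descriptors)
```

With equal descriptors the fusion reduces to a plain average of the branches, while a dominant descriptor lets a single branch determine the output at that pixel.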

Table 1 Specifications of the single-domain and cross-domain datasets
Fig. 3 Exemplary images from the different datasets along with their corresponding overlaid masks

Experimental settings


We evaluate the performance of our proposed network on five intra-domain datasets from three different modalities (video, MRI, and OCT) and two cross-domain datasets from two different modalities. Table 1 details the specifications of the adopted datasets, and Fig. 3 presents exemplary images together with the ground-truth segmentations from each dataset. These datasets cover a wide range of object classes with distinct characteristics. For example, endometriosis videos contain amorphous endometrial implants with color and texture variations. OCT scans involve amorphous intraretinal fluid, while prostate MRI images include deformations and variations in scale, contrast, and brightness. In addition, instrument segmentation in cataract and laparoscopy surgeries presents various challenges, such as scale variation, reflection, motion blur, and defocus blur degradation. The diversity of the datasets ensures realistic conditions for evaluating the proposed network’s effectiveness in addressing challenges in medical image and surgical video segmentation. For result reproducibility, we provide all train/test sets as CSV files in the paper’s GitHub repository.

Alternative methods

We compare the effectiveness of our proposed network architecture with eleven state-of-the-art neural networks using different backbones. Table 2 lists the specifications of the baselines and the proposed network. Note that UNet+ is an improved version of UNet, where we use VGG16 as the backbone network and double convolutional blocks (two consecutive convolutions followed by batch normalization and ReLU layers) as decoder modules. To have fair comparisons with alternative methods, we report the performance of DeepPyramid+ with three different backbones (VGG16, ResNet34, and ResNet50).

Table 2 Specifications of the proposed and alternative approaches

Training settings

All backbones are initialized with the ImageNet pre-trained parameters. We use a batch size of four for all datasets, set the initial learning rate to 0.001, and decrease it during training using polynomial decay \(lr = lr_{\textrm{init}}\times (1-\frac{\hbox {iter}}{\hbox {total-iter}})^{0.9}\). The input size of the networks is \(512\times 512\) for all datasets. As augmentations during training, we apply cropping and random rotation (up to \(30^{\circ }\)), color jittering (brightness = 0.7, contrast = 0.7, saturation = 0.7), Gaussian blurring, and random sharpening. We train with the cross-entropy log dice loss [6]. All experiments are conducted using NVIDIA RTX 3090 GPUs.
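The polynomial decay schedule can be written directly from the formula above. The sketch below uses the stated initial learning rate and exponent; the function name and the total iteration count in the usage lines are placeholders, since the number of training iterations is not given here.

```python
def poly_lr(lr_init: float, iteration: int, total_iters: int,
            power: float = 0.9) -> float:
    """Polynomial decay: lr = lr_init * (1 - iter / total_iter) ** power."""
    if not 0 <= iteration <= total_iters:
        raise ValueError("iteration must lie in [0, total_iters]")
    return lr_init * (1.0 - iteration / total_iters) ** power


# Hypothetical schedule length, for illustration only.
total = 1000
schedule = [poly_lr(0.001, it, total) for it in range(0, total + 1, 250)]
```

The schedule starts at the initial rate, decreases monotonically, and reaches zero at the final iteration.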

Ablation study settings

To evaluate the effectiveness of the different modules, we use the improved version of UNet (UNet+) with the same backbone (VGG16) as our baseline. This network does not include any PVF modules. Besides, the DPR module is replaced with a sequence of two convolutional layers, each followed by a batch normalization layer and a ReLU activation.

Experimental results

Table 3 reports the segmentation performance of the proposed and state-of-the-art networks across three different modalities. DeepPyramid+ consistently demonstrates the highest average performance across all datasets with various backbones, while other methods, such as CPFNet, exhibit varying performance with different backbones and 2.22% compared to DeepPyramid+, respectively. Besides, DeepPyramid+ achieves the best results with all three backbones for endometrial implants and prostate segmentation and the best results with ResNet34 and ResNet50 backbones for IRF segmentation in OCT. Considering instrument segmentation performance (Table 4), DeepPyramid+ with VGG16 backbone shows more than 5.6% gain in segmentation compared to CPFNet as its main alternative (58.93% vs. 53.29%). Across all backbones, DeepPyramid+ with VGG16 backbone shows more than 2.7% higher performance compared to other methods. Besides, the best results for both datasets correspond to DeepPyramid+ with VGG16 backbone. Overall, DeepPyramid+ with our suggested backbone (VGG16) achieves the best segmentation performance in instrument and organ/disease segmentation.

Table 3 Quantitative comparisons among the performance of DeepPyramid+ and alternative methods in organ and disease segmentation, with top two results shown in italic and bold, respectively
Table 4 Quantitative comparisons among the performance of DeepPyramid+ and alternative methods in instrument segmentation, with top two results shown in italic and bold, respectively

Table 5 compares the cross-domain segmentation performance of DeepPyramid+ and its best two alternatives for three backbones (considering single-domain results in Table 3 and Table 4). Overall, DeepPyramid+ consistently outperforms other methods across all backbones. Considering the MRI dataset, DeepPyramid+ with VGG16 backbone shows more than 4.8% gain in Dice compared to alternatives. For instrument segmentation in cataract surgery, DeepPyramid+ with the VGG16 backbone exhibits an impressive improvement of approximately 19.5% in Dice score compared to CPFNet with the same backbone (55.10% vs. 35.59%), and a 17% improvement compared to the best alternative across all backbones (55.10% vs. 38.10% achieved by UPerNet). This exceptional performance in dealing with cross-domain distribution gaps [28] can be attributed to the effectiveness of the proposed modules in incorporating multi-scale local and global features.
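For reference, the Dice score used in these comparisons is the standard overlap measure \(2|A\cap B|/(|A|+|B|)\) between a predicted mask \(A\) and a ground-truth mask \(B\). The following minimal sketch computes it on flattened binary masks; the smoothing term `eps` and the function name are assumptions, and the exact evaluation protocol of the paper (for example, how scores are averaged over images) is not specified here.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int], eps: float = 1e-7) -> float:
    """Dice coefficient between two flattened binary masks.

    eps keeps the ratio defined when both masks are empty (assumed
    convention: two empty masks count as a perfect match).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


score = dice([1, 1, 0, 0], [1, 0, 0, 0])  # one shared pixel, 2 vs. 1 foreground pixels
```

In this toy case the score is approximately 2/3, since one pixel is shared and the two masks contain three foreground pixels in total.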

Table 5 Quantitative comparisons of cross-domain performance among DeepPyramid+ and state-of-the-art methods, with top two results shown in italic and bold, respectively

Table 6 provides an ablation study of the DeepPyramid+ components. The results suggest that both the PVF and DPR modules contribute significantly to improvements in segmentation performance across all datasets. This impact is more prominent in the case of cataract surgery, where the addition of the PVF and DPR modules leads to a 4.95% and 4.72% increase in the Dice coefficient, respectively.

Table 6 Ablation study of DeepPyramid+ components across different datasets


In recent years, considerable attention has been devoted to computerized medical image and surgical video analysis. A reliable relevant-instance-segmentation approach is a prerequisite for a majority of these applications. In this paper, we introduce a novel network architecture for semantic segmentation that addresses the challenges encountered in medical image and surgical video segmentation. Our proposed architecture, DeepPyramid+, incorporates two innovative modules, namely “Pyramid View Fusion” and “Deformable Pyramid Reception.” Experimental results demonstrate the effectiveness of DeepPyramid+ in capturing object features in challenging scenarios, including shape and scale variation, reflection and blur degradation, blunt edges, and deformability, resulting in competitive performance in cross-domain segmentation compared to state-of-the-art networks. The ablation study validates the efficacy of the proposed modules in DeepPyramid+, showcasing their performance across diverse datasets. The obtained promising results indicate the potential of DeepPyramid+ to enhance the precision in various computerized medical imaging and surgical video analysis applications.