1 Introduction

Semantic segmentation is an important aspect of image comprehension in the field of image processing. Unlike object detection, semantic segmentation assigns a class label to each pixel in a given image. With the advancement of semantic segmentation techniques and the breakthroughs achieved through weakly supervised learning [1, 2], the application of semantic segmentation has expanded across various fields, such as autonomous driving, remote sensing imagery, medical imaging, augmented reality, and scene segmentation. Recently, with the advancement of convolutional neural networks, pixel-based semantic segmentation frameworks [3, 4] have achieved significant improvements in recognition and segmentation accuracy. However, most previous network models, while downsampling to capture image features, fail to handle global contextual information effectively and tend to lose fine details when segmenting small-scale objects, which significantly impacts segmentation accuracy.

Various improvements have been proposed to address these issues. For instance, subsequent methods such as [5, 6] incorporate global contextual information aggregation into the Fully Convolutional Network (FCN) [3] model. The encoder-decoder model [7], proposed later, uses an FCN-style encoder with repeated downsampling to acquire high-level semantic information and then recovers the original spatial information through hierarchical upsampling. [6] introduced skip connections to compensate for the loss of feature information during downsampling, and this approach has been adopted in many current models [8, 9]. Besides the U-shaped skip-connection approach to compensating for the loss of feature information during downsampling, [10] proposes replacing the original convolutional layers with atrous convolution to preserve high resolution while enlarging the receptive field.

Additionally, objects in an image can vary considerably in scale, which poses significant challenges for their recognition and localization. To address this, subsequent works [11, 12] have made significant progress by incorporating a pyramid pooling module for multi-scale feature fusion. However, traditional pyramid pooling modules simply perform pooling operations on the incoming high-level semantic information, which cannot prevent the loss of resolution before multi-scale fusion. Even with the dilated convolutions introduced in [10, 13], the problems of sparse pixel sampling and discarded information cannot be avoided, and these methods lack the ability to extract discriminative features for the targets. Such shallow multi-scale feature concatenation fails to efficiently aggregate global and local features, leading to pixel classification errors. [14, 15] employ the Swin Transformer [16] to construct hierarchical feature maps and perform self-attention computation for semantic segmentation. However, partitioning the feature maps into windows limits the establishment of inter-window feature connections, which hinders the model's capacity to comprehensively capture contextual information.

To address the problems of easily lost detail information and the inability to establish pixel-level correlations, the Feature Complementation Network with Pyramid Fusion (FCPFNet) retains the advantages of the encoder-decoder structure for capturing spatial and detail information. Moreover, by incorporating a Deep Feature Aggregation Module (DFAM) to capture global contextual information and an Efficient Pyramid Pooling Module (EPPM) to extract discriminative multi-scale feature information, the segmentation performance of FCPFNet is enhanced. Specifically, FCPFNet introduces the DFAM, which employs a multi-layer fusion strategy to jointly model and complement features with different characteristics and to expand the receptive field. Furthermore, the EPPM enables the proposed approach to improve the stability and accuracy of target segmentation by compensating for the loss of spatial and fine-grained information during downsampling and by effectively capturing correlations between long-range pixels. [17] proposed a multimodal hypergraph learning-based sparse coding method that extracts semantic similarities between images to improve the performance of image click prediction. Our notion of complementarity differs slightly from that in [18]: our method uses the EPPM to compensate for the feature loss caused by downsampling in the DFAM, whereas [18] captures the complementary relationship between top-level objects and bottom-level parts to enhance the stability of image click prediction. FCPFNet has been extensively evaluated on two challenging datasets, Pascal VOC and Cityscapes, and it improves MIoU on both datasets compared with other advanced semantic segmentation algorithms.

The main contributions are summarized as follows:

  • We propose a novel Deep Feature Aggregation Module, which combines feature aggregation and pyramid pooling to aggregate global and local feature information layer by layer, capturing features under receptive fields of different sizes to improve the accuracy of segmenting objects at multiple scales.

  • We introduce the Efficient Pyramid Pooling Module, which captures spatial attention and channel attention simultaneously while establishing long-range dependencies between pixels through channel shuffling operations to extract differentiated multi-level features, providing richer contextual information for small targets at low resolution and increasing recognition accuracy.

  • FCPFNet achieves excellent results on both Pascal VOC 2012 and Cityscapes, which shows that the method is robust for semantic segmentation across different scenes.

2 Related Work

In recent years, with the increasing recognition of the crucial role of contextual information in semantic segmentation tasks, various methods [19,20,21,22] have been explored based on this foundation. In this section, we will categorize the related work into three parts: semantic segmentation, multi-scale and context aggregation, and attention mechanisms.

2.1 Semantic Segmentation

Semantic segmentation has undergone rapid development and extensive research in recent years. The initial Fully Convolutional Network (FCN) [3] extracts low-resolution feature maps with high-level semantic information by applying consecutive convolution and pooling for downsampling. It then uses deconvolution [23, 24] to upsample the feature maps, completely discarding the fully connected layers commonly used in image classification. Another pioneering network, U-Net [6], connects the features in the contraction path on one side with the corresponding upsampling layers on the other side through skip connections in a U-shaped structure to fuse features at different scales. Subsequently, variants of U-Net, such as Unet++ [9], Swin-Unet [25], and Dense-Unet [26], were proposed to address various problems.

2.2 Attention Mechanism

The effectiveness of the attention mechanism has long been demonstrated in the research literature, and it has been applied to many tasks [27,28,29]. It aims to extract more useful features by assigning larger weights to feature representations that carry more information while suppressing the weights of those that carry less. SENet [30] models channel relationships through two fully connected (FC) layers to automatically learn the importance of each channel. ECANet [31] introduces an extremely lightweight channel attention module to generate channel weights, replacing the dimensionality reduction in SENet, which degrades prediction accuracy, with convolutional features aggregated by non-reduced global average pooling (GAP). The combination of channel attention and spatial attention is also implemented in CBAM [32], GCNet [33], and SGE [34]. Unlike previous sequential connections, DANet [21] sums two attention modules to obtain a better feature representation. Self-attention in image segmentation computes the similarity between pairs of pixels and recalculates the feature representation of each pixel based on its similarity with other pixels, but it brings significant computational costs to the network.

2.3 Context Aggregation and Multi-scales

Considering the significant scale variations of objects in scene semantic segmentation, single-scale predictions are insufficient for achieving robust segmentation across different scenes and for determining whether pixels at multiple scales belong to the same object. Therefore, aggregating contextual information improves the model's ability to accurately localize and detect objects. Contextual information has been shown to greatly enhance network performance and has received attention in subsequent studies [35,36,37,38]. In contrast to SegNet [7] and U-Net [6], which directly concatenate high-level and low-level features, DeepLab [4] incorporates ASPP into the network, consisting of parallel dilated convolutions with different dilation rates to enlarge the receptive field and capture multi-scale contextual information. However, the sparse sampling of dilated convolutions can lead to a lack of feature dependencies and interactions between pixels, as well as reduced correlation between long-range features. OCNet [22] combines attention mechanisms with ASPP to extract contextual dependencies. PSPNet [11] integrates feature extraction and contextual information at four different scales through its pyramid pooling module. HRNet [39] adopts a novel multi-scale fusion approach by merging, at each stage, with branches that have larger receptive fields.

3 Method

In this section, we will first introduce the overall pipeline. Then, we will provide a detailed explanation of the different modules used for constructing the network.

Fig. 1

An overview of the proposed FCPFNet. “DFAM” denotes deep feature aggregation module. “EPPM” denotes efficient pyramid pooling module. “Upsample” denotes bilinear interpolation upsampling

3.1 Overview

As shown in Fig. 1, FCPFNet is built on a general encoder-decoder framework. The encoder downsamples the input images to learn rich semantic information, and the learned high-level features are decoded and reconstructed by the decoder for pixel-level semantic prediction. FCPFNet is inspired by the classic scene semantic segmentation model PSPNet [11], in which the multi-scale information extracted by the pyramid pooling module is concatenated with the feature information that entered the module. However, PSPNet cannot capture salient features and lacks the ability to aggregate deep features. Motivated by these issues, FCPFNet proposes a Deep Feature Aggregation Module (DFAM) to extract contextual information and expand the receptive field, while incorporating multi-scale information based on low-resolution feature maps. In addition, we introduce an Efficient Pyramid Pooling Module (EPPM) to extract informative feature representations. The EPPM enhances the original pyramid pooling module by incorporating a shuffle attention module and a channel shuffle operation, which increases cross-channel information interaction and enables the acquisition of discriminative feature information; the resulting feature representations are then fused efficiently by the pyramid pooling module for comprehensive multi-scale fusion. Finally, the feature maps are upsampled by bilinear interpolation to match their size before entering the branch structure and are concatenated channel-wise, after which the decoder maps the features to classes and rescales the class maps to the input resolution.
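The data flow described above can be summarized by the following minimal PyTorch-style sketch; the module interfaces, channel counts, and the exact concatenation order are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCPFNetSketch(nn.Module):
    """Hypothetical skeleton of the encoder -> DFAM/EPPM -> decoder pipeline."""
    def __init__(self, backbone, dfam, eppm, fused_channels, num_classes):
        super().__init__()
        self.backbone = backbone    # encoder producing high-level features (e.g. at 1/8 resolution)
        self.dfam = dfam            # deep feature aggregation module (context branch)
        self.eppm = eppm            # efficient pyramid pooling module (multi-scale branch)
        # fused_channels is the assumed channel count after concatenating all branches
        self.classifier = nn.Conv2d(fused_channels, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = self.backbone(x)
        ctx = self.dfam(feats)
        ms = self.eppm(feats)
        # upsample branch outputs to the pre-branch spatial size, then concatenate channel-wise
        size = feats.shape[2:]
        ctx = F.interpolate(ctx, size=size, mode="bilinear", align_corners=False)
        ms = F.interpolate(ms, size=size, mode="bilinear", align_corners=False)
        fused = torch.cat([feats, ctx, ms], dim=1)
        logits = self.classifier(fused)                     # map features to classes
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```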

In the following section, we will elaborate on each module of the network design and its rationale. Figures 2 and 3 illustrate the precise structure of our network modules.

Fig. 2

An overview of the proposed Deep Feature Aggregation Module. The size of the feature maps can be adjusted by employing “avgpool,” which refers to global average pooling, followed by the fusion of multi-scale feature information through residual connections

3.2 Deep Feature Aggregation Module

In previous literature on semantic segmentation [40, 41], it has been demonstrated that expanding the receptive field contributes to improving the performance of semantic segmentation models. Additionally, local and global contextual information has been used effectively in many models [4, 11] with excellent performance. Therefore, this paper proposes a novel deep feature aggregation module, as illustrated in Fig. 2. This module first receives feature maps at 1/8 of the image resolution and then applies multi-branch average pooling to obtain feature maps at 1/16, 1/32, and 1/64 of the image resolution, as well as global average pooling to integrate spatial information and acquire image-level context. Inspired by Res2Net, the feature maps are uniformly divided, after dimension reduction with \(1\times 1\) convolutions, into subsets denoted \(x_i\), where \(i\in \{1,2,\ldots ,s\}\). Then, after upsampling, and unlike the PPM, a layered residual fusion approach is employed by introducing layer-wise \(3\times 3\) convolutions after the \(1\times 1\) convolutions to incorporate contextual information from various scales. The output at each scale can be represented by the following equation:

$$\begin{aligned} y_i=\left\{ \begin{array}{ll} x_1, & i=1\\ C_{3\times 3}\big (B(x_i)+y_{i-1}\big ), & 1<i\le s \end{array} \right. \end{aligned}$$
(1)

Where \(C_{3\times 3}()\) represents a \(3\times 3\) convolution and B() represents bilinear upsampling; \(x_1\) is taken as \(y_1\) without any operation. Each remaining \(x_i\) is upsampled, added to the corresponding \(y_{i-1}\), and passed through a \(3\times 3\) convolution, thereby obtaining multi-scale feature information while expanding the receptive field. Finally, all the resulting feature maps are concatenated, compressed through a \(1\times 1\) convolution, and then concatenated with the feature maps from the projection shortcut. In DFAM, multi-scale information is extracted from the input feature maps by combining different depths with pooling kernels of varying sizes and integrating richer low-level information using larger pooling kernels. This split-and-concatenate strategy greatly enhances the extraction and processing of global and local information. Additionally, He et al. [39] extensively discuss the traditional Conv-BN-ReLU arrangement and propose a full pre-activation design. In the construction of DFAM, all input feature maps are normalized through pre-activation, which, particularly in the shortcut structures, effectively reduces model overfitting and improves generalization compared with the traditional Conv-BN-ReLU arrangement.
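As a concrete illustration of Eq. (1), the following PyTorch sketch implements the layered residual fusion; the pooling factors, channel widths, and the handling of the global branch are assumptions made for this example, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFAMSketch(nn.Module):
    """Illustrative sketch of the layered residual fusion in Eq. (1)."""
    def __init__(self, in_ch=512, mid_ch=128, pool_scales=(1, 2, 4, 8, 0)):
        # pool_scales: 1 keeps the input resolution, 0 means global average pooling,
        # other values are assumed downsampling factors relative to the input feature map
        super().__init__()
        self.pool_scales = pool_scales
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False) for _ in pool_scales]
        )
        self.fuse = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False)
             for _ in pool_scales[1:]]
        )
        self.compress = nn.Conv2d(mid_ch * len(pool_scales), in_ch, kernel_size=1, bias=False)
        self.shortcut = nn.Conv2d(in_ch, in_ch, kernel_size=1, bias=False)  # projection shortcut

    def forward(self, x):
        h, w = x.shape[2:]
        xs = []
        for scale, reduce in zip(self.pool_scales, self.reduce):
            if scale == 0:
                p = F.adaptive_avg_pool2d(x, 1)                           # image-level context
            elif scale == 1:
                p = x                                                     # keep input resolution
            else:
                p = F.adaptive_avg_pool2d(x, (max(h // scale, 1), max(w // scale, 1)))
            xs.append(reduce(p))
        # Eq. (1): y_1 = x_1;  y_i = Conv3x3(Upsample(x_i) + y_{i-1}) for i > 1
        ys = [xs[0]]
        for xi, conv in zip(xs[1:], self.fuse):
            up = F.interpolate(xi, size=(h, w), mode="bilinear", align_corners=False)
            ys.append(conv(up + ys[-1]))
        out = self.compress(torch.cat(ys, dim=1))
        return torch.cat([out, self.shortcut(x)], dim=1)
```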

Fig. 3

An overview of the proposed Efficient Pyramid Pooling Module (left) and specific module parts (right). The input feature maps are split into two groups through “channel split,” followed by sequential application of depthwise convolution (DWConv) and pointwise convolution (PWConv). The channel shuffle module (CSM) is then employed to establish long-range dependencies between pixels. Subsequently, pyramid pooling is utilized to capture multi-scale information from the feature maps. Finally, the output feature maps are integrated with the feature maps processed through shuffle attention (SA) to enhance the feature representation

3.3 Efficient Pyramid Pooling Module

Multi-scale feature fusion enhances the model's capability to detect objects of different sizes and has led to significant performance improvements in many deep learning models [11, 42]. Therefore, we propose the Efficient Pyramid Pooling Module (EPPM), an efficient multi-scale fusion module. Compared with the original pyramid pooling module, EPPM integrates ShuffleUnit operations to increase interaction between sub-features, and we incorporate the shuffle attention module to capture the spatial and channel attention of the feature maps, aiming to extract more discriminative feature information. As shown in Fig. 3, the input feature information is denoted as \(I^{w\times h\times c}\), where w and h represent the spatial dimensions and c represents the number of feature channels. It then passes through the efficient pyramid pooling module, which concatenates \(I^t\) with \(I^s\) in the channel dimension to obtain the final output \(O^{eppm}\). The formula for \(I^t\) is given as follows:

$$\begin{aligned} I^t=I^a\oplus I^b \end{aligned}$$
(2)

Where \(I^a\) represents the output feature after the aggregation of contextual information, while \(I^b\) denotes the output feature after convolution. \(\oplus \) represents the element-wise sum of these two features.

Specifically, the input \(I^{w\times h\times c}\) is processed through two branches, a and b, to generate the features \(I^a\) and \(I^b\), respectively. In branch a, as illustrated in Fig. 3, the input feature information \(I^{w\times h\times c}\) is split along the channel dimension into two groups, \(I^a_1\in \mathbb {R}^{w\times h\times c/2}\) and \(I^a_2\in \mathbb {R}^{w\times h\times c/2}\). Each group then undergoes depth-wise separable convolution operations, which can be represented by the following formulas:

$$\begin{aligned} Q^a_1=C_{pw}(P_{max}(C_{dw}(I^a_1))) \end{aligned}$$
(3)
$$\begin{aligned} Q^a_2=C_{pw}(P_{max}(C_{dw}(I^a_2))) \end{aligned}$$
(4)

In the end, we obtain \(Q^a_1\in \mathbb {R}^{w\times h\times c}\) and \(Q^a_2\in \mathbb {R}^{w\times h\times c}\), where the operations for the two groups are identical: depth-wise convolution, followed by a max pooling layer and point-wise convolution. Performing the convolution on two separate parts and then combining them to generate attention feature maps is beneficial for capturing long-range dependencies in subsequent processing. The combined output is then normalized through softmax using Eq. (5):

$$\begin{aligned} Q=Softmax(Q^a_1+Q^a_2) \end{aligned}$$
(5)

After normalization, branch a sequentially performs element-wise multiplication and addition with \(I^{w\times h\times c}\) through residual connections. Furthermore, we incorporate a channel shuffle module (CSM), similar to ShuffleNet [43], to suppress interference from redundant information while promoting semantic consistency. By enabling mutual influence among feature channels with similar semantics, it mitigates classification errors during segmentation and increases the diversity of features, ultimately improving the robustness of the segmentation process. The output of this branch is computed by Eq. (6):

$$\begin{aligned} I^a=Cs((Q\odot I^{w\times h\times c})\oplus I^{w\times h\times c}) \end{aligned}$$
(6)

Where Cs() represents the channel shuffle operation, \(\odot \) denotes element-wise multiplication. Finally, the output obtained from Eq. (6) is element-wise added to the output of branch b by Eq. (2). In this equation, \(I^b\) is the output of branch b, which is obtained through depth-wise separable convolution, as shown in Eq. (7):

$$\begin{aligned} I^b=C_{pw}(C^{3\times 3}_{dw}(I^{w\times h\times c})) \end{aligned}$$
(7)

In Eq. (7), \(C^{3\times 3}_{dw}()\) represents the depth-wise convolution with a \(3\times 3\) kernel, and \(C_{pw}()\) represents the point-wise convolution.
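To make Eqs. (2)-(7) concrete, the sketch below implements the two branches and the channel shuffle; the kernel sizes, the stride-1 max pooling, the channel-restoring point-wise convolutions, and the softmax over the channel dimension are assumptions, and the subsequent shuffle attention and pyramid pooling stages from Fig. 3 are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    """Cs(.) in Eq. (6): interleave channels across groups, as in ShuffleNet."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class EPPMCoreSketch(nn.Module):
    """Illustrative sketch of branches a and b in Eqs. (2)-(7); channels must be even."""
    def __init__(self, channels=512, shuffle_groups=4):
        super().__init__()
        half = channels // 2
        # branch a: per-group depth-wise conv -> max pool -> point-wise conv (Eqs. 3-4)
        self.dw1 = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        self.pw1 = nn.Conv2d(half, channels, 1, bias=False)
        self.dw2 = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        self.pw2 = nn.Conv2d(half, channels, 1, bias=False)
        # branch b: depth-wise separable convolution (Eq. 7)
        self.dw_b = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.pw_b = nn.Conv2d(channels, channels, 1, bias=False)
        self.groups = shuffle_groups

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)                                   # channel split
        q1 = self.pw1(F.max_pool2d(self.dw1(x1), 3, stride=1, padding=1))   # Eq. (3)
        q2 = self.pw2(F.max_pool2d(self.dw2(x2), 3, stride=1, padding=1))   # Eq. (4)
        q = torch.softmax(q1 + q2, dim=1)                                   # Eq. (5)
        i_a = channel_shuffle(q * x + x, self.groups)                       # Eq. (6)
        i_b = self.pw_b(self.dw_b(x))                                       # Eq. (7)
        return i_a + i_b                                                    # I^t in Eq. (2)
```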

4 Experiments

In this section, we conduct a series of experiments to assess the effectiveness of our method. The experiments were conducted on three widely used datasets, namely Pascal VOC 2012 [44], Cityscapes [45], and Coco-Stuff [46]. Our results indicate that FCPFNet delivers superior performance on the VOC dataset and notably improves accuracy on the Cityscapes and Coco-Stuff datasets. We provide a detailed description of the experimental procedure, including the datasets used and the specific implementation details. Additionally, we perform ablation experiments on the VOC dataset to assess the contribution of the individual modules of FCPFNet to performance. Finally, we present a comparative analysis of the proposed method against other advanced methods, including the corresponding accuracies and visualization results.

4.1 Datasets

The PASCAL VOC 2012 benchmark contains 20 foreground object classes and 1 background class. The original dataset consists of 1464 pixel-level annotated images for training, 1449 for validation, and 1456 for testing. In addition, the dataset was augmented with the extra annotations provided by [47], resulting in 10,582 augmented training images divided into 21 classes. In this experiment, the model's performance is validated on the Pascal VOC 2012 val set, and the evaluation metric used in this paper is the mean intersection-over-union (MIoU) computed over the 21 categories.
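For reference, MIoU can be computed from a class confusion matrix as in the standard definition below; this is a generic sketch, not code taken from the paper, and the ignore label 255 is an assumption.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21, ignore_index=255):
    """Standard mean intersection-over-union from integer label maps."""
    mask = gt != ignore_index
    hist = np.bincount(
        num_classes * gt[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)        # rows: ground truth, columns: prediction
    inter = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()               # average over classes present in the data
```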

Cityscapes is one of the well-known scene semantic segmentation datasets, focusing on the analysis of urban street scenes. It consists of 5000 high-quality pixel-level finely annotated images collected from 50 cities, which are divided into 2975 images for training, 500 images for validation, and 1525 images for testing, with a total of 19 classes. In addition, we do not use its extra 20,000 coarse labeled images during training.

Coco-Stuff consists of 10k annotated images in total, with 9k allocated for training and 1k for testing. Compared with Cityscapes and Pascal VOC 2012, the Coco-Stuff dataset is more challenging due to its more complex label set, which includes 80 thing classes, 91 stuff classes, and 1 'unlabeled' class.

4.1.1 Train Setting

The pre-trained model utilized in this experiment is PSPNet [11], which was trained on the ImageNet dataset. The implementation adopts the poly learning rate strategy of Eq. (8), as in [4, 11], as well as the SGD optimizer with weight decay, with momentum set to 0.9 and power set to 0.9. The initial learning rates were set to \(1\times 10^{-2}\) for PASCAL VOC 2012 and Coco-Stuff and \(1\times 10^{-3}\) for Cityscapes.

$$\begin{aligned} l=l_{init}\times \left( 1-\frac{iterations}{iter_{max}}\right) ^{power} \end{aligned}$$
(8)
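Eq. (8) translates directly into the following helper; the iteration counts in the comment are made-up values purely for illustration.

```python
def poly_lr(init_lr, iteration, max_iterations, power=0.9):
    """Poly learning-rate schedule of Eq. (8)."""
    return init_lr * (1 - iteration / max_iterations) ** power

# e.g. halfway through training with the VOC setting init_lr = 1e-2:
# poly_lr(1e-2, 5000, 10000) -> approximately 5.36e-3
```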

All experiments were conducted on a single GeForce RTX 3090 using the PyTorch framework. The same data augmentation was applied throughout training, which involved randomly flipping and cropping the images fed into the network. Specifically, the PASCAL VOC 2012 dataset was cropped to \(440\times 440\) resolution, the Cityscapes dataset to \(657\times 657\) resolution, and the Coco-Stuff dataset to \(380\times 380\) resolution. The models were trained using the cross-entropy loss and compared with currently advanced semantic segmentation models. The formula for the cross-entropy loss function is as follows:

$$\begin{aligned} L(x,y) = -\sum _{i=1}^{C}x_i\log (y_i) \end{aligned}$$
(9)

where C represents the number of classes, \(x_i\) represents the ground truth label for class i, and \(y_i\) represents the predicted probability for class i. The cross-entropy loss quantifies the dissimilarity between the predicted probabilities \(y_i\) and the ground truth labels \(x_i\) for each class: the smaller the loss, the more accurate the model's predictions.
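In PyTorch, the per-pixel form of Eq. (9) is provided by the standard cross-entropy loss; the snippet below is a minimal usage example, and the ignore index for void pixels is an assumption.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=255)     # ignore label for void pixels (assumed)

logits = torch.randn(2, 21, 440, 440)                 # N x C x H x W class scores
labels = torch.randint(0, 21, (2, 440, 440))          # N x H x W ground-truth class indices
loss = criterion(logits, labels)                      # averaged per-pixel cross-entropy
```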

4.1.2 Ablation Study

To validate the feasibility and effectiveness of our method, we will evaluate the improvement brought by each module in the model through a series of experiments. As our method is inspired by PSPNet [11] and improves the traditional PPM by incorporating EPPM, we will compare it with PPM in the experiments.

As depicted in Table 1, each incorporated module substantially enhances the model's performance. Compared with PPM, adopting EPPM yields 78.7% Mean IoU, a 1.4% improvement. Meanwhile, adding SA and DFAM outperforms the baseline by 1.7% and 2.1%, respectively. Moreover, when the three modules are integrated together, the performance further improves to 79.9%, which suggests that the inclusion of these modules contributes to the overall effectiveness of the model.

Table 1 Ablation study on Pascal VOC 2012 val set
Fig. 4

Visualization of segmentation results using different modules on the Pascal VOC 2012 val set

As shown in Fig. 4, we visualize the segmentation results of three different models: PPM, EPPM, and FCPFNet. The first row indicates that EPPM outperforms PPM in segmentation accuracy by enhancing feature correlation and providing clearer boundary localization; additionally, incorporating SA and DFAM on top of EPPM leads to complete segmentation of the aircraft tail. The second row demonstrates that both PPM and EPPM achieve high recognition accuracy, with EPPM outperforming PPM. The third column shows that FCPFNet can accurately identify and segment objects by effectively integrating multi-scale information while avoiding loss of resolution. Although the result of FCPFNet in the third row is not entirely satisfactory, it still achieves much higher segmentation accuracy than PPM. The visualization results confirm that our proposed modules are effective.

4.1.3 Comparison to State-of-the-Arts

The results in this section are reported on the validation set of each dataset. Moreover, a comparative analysis of MIoU was conducted with ResNet50 and ResNet101 backbones against other advanced models. We refrained from using multi-scale testing and flip testing, which could further enhance the precision of the results. The following section elaborates on the outcomes in more detail.

Pascal VOC 2012 As shown in Table 2, FCPFNet achieves MIoU of 78.8% and 81.0% with ResNet50 and ResNet101, respectively. Additionally, the investigation in [48] revealed that utilizing larger image sizes during training and validation leads to higher segmentation accuracy, consistent with previous studies. It is worth noting that the image sizes employed in our proposed method are relatively small compared with those used by other state-of-the-art models. Despite this, FCPFNet outperforms these advanced methods and achieves the highest MIoU, which further substantiates the superiority and effectiveness of our approach. Although the MIoU of FCPFNet is only 0.8% higher than that of the second-best model, WASPNet [49], it is approximately 2-3% higher than the results of the other models, demonstrating superior performance even when using the shallower ResNet50 network structure.

Table 2 Segmentation results on Pascal VOC 2012 val set
Table 3 Segmentation results on Cityscapes val set

Cityscapes In Table 3, a comparative analysis between our model and the latest methods on the Cityscapes dataset shows that our model remains competitive, with MIoU of 78.1% and 78.8% using the ResNet50 and ResNet101 backbones, respectively. Despite the marginal difference in segmentation accuracy compared with the other models, our model outperforms the second-best model by 0.3%, achieving the best results. Moreover, FCPFNet exhibits outstanding performance even when ResNet50 is used as the backbone. Visualization results of FCPFNet on the Cityscapes dataset can be found in Fig. 5.

Coco-Stuff Comparing the MIoU of FCPFNet with other methods on the Coco-Stuff dataset, as presented in Table 4, we observe that FCPFNet still achieves competitive performance, with an MIoU 1-3% higher than the other methods.

Table 4 Segmentation results on Coco-Stuff test set

The experimental results on the three datasets demonstrate that our network performs best and is well suited to scene semantic segmentation. As shown in Tables 2, 3 and 4, although our method is only 0.3% higher than DeepLabv3 on the Cityscapes dataset, it is 4.5% and 2.8% higher on the Pascal VOC 2012 and Coco-Stuff datasets, respectively. Furthermore, our method's MIoU is only 0.8% higher than the second-place WASPNet on the Pascal VOC 2012 dataset but 4.8% higher on the Cityscapes dataset, showing that our model generalizes better. Since our network pipeline is an improvement inspired by PSPNet, our model's performance is also significantly better than PSPNet, with increases of 2.5%, 0.5%, and 2.9% on the Pascal VOC 2012, Cityscapes, and Coco-Stuff datasets, respectively.

Fig. 5

Visualization of segmentation results on Cityscapes val set

5 Conclusion

In this paper, we have proposed a Feature Complementation Network with Pyramid Fusion (FCPFNet) for scene segmentation. By incorporating EPPM and DFAM into our model, we effectively address the problems of insufficient contextual information, weak feature correlations, and indistinct features during feature extraction. Specifically, the Deep Feature Aggregation Module (DFAM) we designed aggregates local and global information in depth by combining different depths with pooling kernels of different sizes. Furthermore, we have introduced the Efficient Pyramid Pooling Module (EPPM) to capture discriminative features through feature enhancement and to facilitate cross-channel information interaction between feature maps. The ablation studies demonstrate that the addition of DFAM and EPPM yields more precise segmentation results. FCPFNet shows robustness while consistently achieving outstanding performance on three scene segmentation datasets, i.e., Pascal VOC 2012, Cityscapes, and Coco-Stuff.