1 Introduction

Semantic segmentation is an important aspect of image comprehension in the field of image processing. Unlike object detection, semantic segmentation assigns a class label to each pixel in a given image. With the advancement of semantic segmentation techniques and the breakthroughs achieved through weakly supervised learning [1, 2], the application of semantic segmentation has expanded across various fields, such as autonomous driving, remote sensing imagery, medical imaging, augmented reality, and scene segmentation. Recently, with the advancement of convolutional neural networks, pixel-based semantic segmentation frameworks [3, 4] have achieved significant improvements in recognition and segmentation accuracy. However, most previous network models, while downsampling to capture image features, fail to handle global contextual information effectively and tend to lose fine details when segmenting small-scale objects, which significantly impacts segmentation accuracy.

Various improvements have been proposed to address these issues. For instance, subsequent methods such as [5, 6] incorporate global contextual information aggregation into the Fully Convolutional Network (FCN) [3] model. The encoder-decoder model [7], proposed later, uses an FCN-style encoder with repeated downsampling to acquire high-level semantic information and then recovers the original spatial information through hierarchical upsampling. [6] introduced skip connections to compensate for the loss of feature information during downsampling, and this approach has been adopted in many current models [8, 9]. Besides the U-shaped skip-connection approach to compensating for the loss of feature information during downsampling, [10] proposes replacing the original convolutional layers with atrous convolution to preserve high resolution while enlarging the receptive field.

Additionally, objects in an image can vary considerably in scale, which poses significant challenges for their recognition and localization. To address this, subsequent works [11, 12] have made significant progress by incorporating a pyramid pooling module for multi-scale feature fusion. However, traditional pyramid pooling modules simply perform pooling operations on the incoming high-level semantic information, which cannot prevent the loss of resolution before multi-scale fusion. Even with the dilated convolutions introduced in [10, 13], the problems of sparse pixel sampling and discarded information cannot be avoided, and these methods lack the ability to extract discriminative features for the targets. Such shallow multi-scale feature concatenation fails to efficiently aggregate global and local features, leading to pixel classification errors. [14, 15] employ the Swin Transformer [16] to construct hierarchical feature maps and perform self-attention computation for semantic segmentation. However, partitioning the feature maps into windows limits the establishment of inter-window feature connections, which hinders the model's capacity to comprehensively capture contextual information.

To address the problems of easily lost detail information and the inability to establish pixel-level correlations, the Feature Complementation Network with Pyramid Fusion (FCPFNet) retains the advantages of the encoder-decoder structure for capturing spatial and detail information. Moreover, by incorporating a Deep Feature Aggregation Module (DFAM) to capture global contextual information and an Efficient Pyramid Pooling Module (EPPM) to extract discriminative multi-scale feature information, the segmentation performance of FCPFNet is enhanced. Specifically, FCPFNet introduces the DFAM, which employs a multi-layer fusion strategy to jointly model and complement features with different characteristics and to expand the receptive field. Furthermore, the EPPM enables the proposed approach to improve the stability and accuracy of target segmentation by compensating for the loss of spatial and fine-grained information during downsampling and by effectively capturing correlations between long-range pixels. [17] proposed a multimodal hypergraph learning-based sparse coding method that extracts semantic similarities between images to improve the performance of image click prediction. Our notion of complementarity differs slightly from that in [18]: our method uses the EPPM to compensate for the feature loss caused by downsampling in the DFAM, whereas [18] captures the complementary relationship between top-level objects and bottom-level parts to enhance the stability of image click prediction. FCPFNet has been extensively evaluated on two challenging datasets, Pascal VOC and Cityscapes, and it improves MIoU on both datasets compared with other advanced semantic segmentation algorithms.

The main contributions are summarized as follows:

  • We propose a novel Deep Feature Aggregation Module, which combines feature aggregation and pyramid pooling to aggregate global and local feature information layer by layer, capturing features under receptive fields of different sizes to improve the accuracy of segmenting objects at multiple scales.

  • We introduce the Efficient Pyramid Pooling Module, which captures spatial attention and channel attention simultaneously while establishing long-range dependencies between pixels through channel shuffling operations to extract differentiated multi-level features, providing richer contextual information for small targets at low resolution and increasing recognition accuracy.

  • FCPFNet achieves excellent results on both Pascal VOC 2012 and Cityscapes, which shows that the method is robust for semantic segmentation across different scenes.

2 Related Work

In recent years, with the increasing recognition of the crucial role of contextual information in semantic segmentation tasks, various methods [19,20,21,22] have been explored based on this foundation. In this section, we will categorize the related work into three parts: semantic segmentation, multi-scale and context aggregation, and attention mechanisms.

2.1 Semantic Segmentation

Semantic segmentation has undergone rapid development and extensive research in recent years. The initial Fully Convolutional Network (FCN) [3] extracts low-resolution feature maps with high-level semantic information by applying consecutive convolution and pooling for downsampling. It then uses deconvolution [23, 24] to upsample the feature maps, completely discarding the fully connected layers commonly used in image classification. Another pioneering network, U-Net [6], connects the features in the contraction path on one side with the corresponding upsampling layers on the other side through skip connections in a U-shaped structure to fuse features at different scales. Subsequently, variants of U-Net, such as Unet++ [9], Swin-Unet [25], and Dense-Unet [26], were proposed to address various problems.

2.2 Attention Mechanism

The effectiveness of the attention mechanism has long been demonstrated in the research literature, and it has been applied to many tasks [27,28,29]. It aims to extract more useful features by assigning larger weights to feature representations that carry more information while suppressing the weights of those that carry less. SENet [30] models channel relationships through two fully connected (FC) layers to automatically learn the importance of each channel. ECANet [31] introduces an extremely lightweight channel attention module to generate channel weights, replacing the dimensionality reduction in SENet, which degrades prediction accuracy, with convolutional features aggregated by non-reduced global average pooling (GAP). The combination of channel attention and spatial attention is also implemented in CBAM [32], GCNet [33], and SGE [34]. Unlike previous sequential connections, DANet [21] sums two attention modules to obtain a better feature representation. Self-attention in image segmentation computes the similarity between pairs of pixels and recalculates the feature representation of each pixel based on its similarity with other pixels, but it brings significant computational costs to the network.

2.3 Context Aggregation and Multi-scales

Considering the significant scale variations of objects in scene semantic segmentation, single-scale predictions are insufficient for achieving robust segmentation across different scenes and for determining whether pixels at multiple scales belong to the same object. Therefore, aggregating contextual information improves the model's ability to accurately localize and detect objects. Contextual information has been shown to greatly enhance network performance and has received attention in subsequent studies [35,36,37,38]. In contrast to SegNet [7] and U-Net [6], which directly concatenate high-level and low-level features, DeepLab [4] incorporates ASPP into the network, consisting of parallel dilated convolutions with different dilation rates to enlarge the receptive field and capture multi-scale contextual information. However, the sparse sampling of dilated convolutions can lead to a lack of feature dependencies and interactions between pixels, as well as reduced correlation between long-range features. OCNet [22] combines attention mechanisms with ASPP to extract contextual dependencies. PSPNet [11] integrates feature extraction and contextual information at four different scales through its pyramid pooling module. HRNet [39] adopts a novel multi-scale fusion approach by merging, at each stage, with branches that have larger receptive fields.

3 Method

In this section, we will first introduce the overall pipeline. Then, we will provide a detailed explanation of the different modules used for constructing the network.

Fig. 1

An overview of the proposed FCPFNet. “DFAM” denotes deep feature aggregation module. “EPPM” denotes efficient pyramid pooling module. “Upsample” denotes bilinear interpolation upsampling

3.1 Overview

As shown in Fig. 1, FCPFNet is built on a general encoder-decoder framework. The encoder downsamples the input images to learn rich semantic information, and the learned high-level features are decoded and reconstructed by the decoder for pixel-level semantic prediction. FCPFNet is inspired by the classic scene semantic segmentation model PSPNet [11], in which the multi-scale information extracted by the pyramid pooling module is concatenated with the feature information that entered the module. However, PSPNet cannot capture salient features and lacks the ability to aggregate deep features. Motivated by these issues, FCPFNet proposes a Deep Feature Aggregation Module (DFAM) to extract contextual information and expand the receptive field, while incorporating multi-scale information based on low-resolution feature maps. In addition, we introduce an Efficient Pyramid Pooling Module (EPPM) to extract informative feature representations. The EPPM enhances the original pyramid pooling module by incorporating a shuffle attention module and a channel shuffle operation, which increases cross-channel information interaction and enables the acquisition of discriminative feature information; the resulting feature representations are then fused efficiently by the pyramid pooling module for comprehensive multi-scale fusion. Finally, the feature maps are upsampled by bilinear interpolation to match their size before entering the branch structure and are concatenated channel-wise, after which the decoder maps the features to classes and rescales the class maps to the input resolution.
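The data flow described above can be summarized by the following minimal PyTorch-style sketch; the module interfaces, channel counts, and the exact concatenation order are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCPFNetSketch(nn.Module):
    """Hypothetical skeleton of the encoder -> DFAM/EPPM -> decoder pipeline."""
    def __init__(self, backbone, dfam, eppm, fused_channels, num_classes):
        super().__init__()
        self.backbone = backbone    # encoder producing high-level features (e.g. at 1/8 resolution)
        self.dfam = dfam            # deep feature aggregation module (context branch)
        self.eppm = eppm            # efficient pyramid pooling module (multi-scale branch)
        # fused_channels is the assumed channel count after concatenating all branches
        self.classifier = nn.Conv2d(fused_channels, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = self.backbone(x)
        ctx = self.dfam(feats)
        ms = self.eppm(feats)
        # upsample branch outputs to the pre-branch spatial size, then concatenate channel-wise
        size = feats.shape[2:]
        ctx = F.interpolate(ctx, size=size, mode="bilinear", align_corners=False)
        ms = F.interpolate(ms, size=size, mode="bilinear", align_corners=False)
        fused = torch.cat([feats, ctx, ms], dim=1)
        logits = self.classifier(fused)                     # map features to classes
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```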

In the following section, we will elaborate on each module of the network design and its rationale. Figures 2 and 3 illustrate the precise structure of our network modules.

Fig. 2

An overview of the proposed Deep Feature Aggregation Module. The size of the feature maps can be adjusted by employing “avgpool,” which refers to global average pooling, followed by the fusion of multi-scale feature information through residual connections

3.2 Deep Feature Aggregation Module

In previous literature on semantic segmentation [40, 41], it has been demonstrated that expanding the receptive field contributes to improving the performance of semantic segmentation models. Additionally, local and global contextual information has been used effectively in many models [4, 11] with excellent performance. Therefore, this paper proposes a novel deep feature aggregation module, as illustrated in Fig. 2. This module first receives feature maps at 1/8 of the image resolution and then applies multi-branch average pooling to obtain feature maps at 1/16, 1/32, and 1/64 of the image resolution, as well as global average pooling to integrate spatial information and acquire image-level context. Inspired by Res2Net, the feature maps are uniformly divided, after dimension reduction with \(1\times 1\) convolutions, into subsets denoted \(x_i\), where \(i\in \{1,2,\ldots ,s\}\). Then, after upsampling, and unlike the PPM, a layered residual fusion approach is employed by introducing layer-wise \(3\times 3\) convolutions after the \(1\times 1\) convolutions to incorporate contextual information from various scales. The output at each scale can be represented by the following equation:

$$\begin{aligned} y_i=\left\{ \begin{array}{ll} x_1, & i=1\\ C_{3\times 3}\big (B(x_i)+y_{i-1}\big ), & 1<i\le s \end{array} \right. \end{aligned}$$
(1)

Where \(C_{3\times 3}()\) represents a \(3\times 3\) convolution and B() represents bilinear upsampling; \(x_1\) is taken as \(y_1\) without any operation. Each remaining \(x_i\) is upsampled, added to the corresponding \(y_{i-1}\), and passed through a \(3\times 3\) convolution, thereby obtaining multi-scale feature information while expanding the receptive field. Finally, all the resulting feature maps are concatenated, compressed through a \(1\times 1\) convolution, and then concatenated with the feature maps from the projection shortcut. In DFAM, multi-scale information is extracted from the input feature maps by combining different depths with pooling kernels of varying sizes and integrating richer low-level information using larger pooling kernels. This split-and-concatenate strategy greatly enhances the extraction and processing of global and local information. Additionally, He et al. [39] extensively discuss the traditional Conv-BN-ReLU arrangement and propose a full pre-activation design. In the construction of DFAM, all input feature maps are normalized through pre-activation, which, particularly in the shortcut structures, effectively reduces model overfitting and improves generalization compared with the traditional Conv-BN-ReLU arrangement.
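As a concrete illustration of Eq. (1), the following PyTorch sketch implements the layered residual fusion; the pooling factors, channel widths, and the handling of the global branch are assumptions made for this example, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFAMSketch(nn.Module):
    """Illustrative sketch of the layered residual fusion in Eq. (1)."""
    def __init__(self, in_ch=512, mid_ch=128, pool_scales=(1, 2, 4, 8, 0)):
        # pool_scales: 1 keeps the input resolution, 0 means global average pooling,
        # other values are assumed downsampling factors relative to the input feature map
        super().__init__()
        self.pool_scales = pool_scales
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False) for _ in pool_scales]
        )
        self.fuse = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False)
             for _ in pool_scales[1:]]
        )
        self.compress = nn.Conv2d(mid_ch * len(pool_scales), in_ch, kernel_size=1, bias=False)
        self.shortcut = nn.Conv2d(in_ch, in_ch, kernel_size=1, bias=False)  # projection shortcut

    def forward(self, x):
        h, w = x.shape[2:]
        xs = []
        for scale, reduce in zip(self.pool_scales, self.reduce):
            if scale == 0:
                p = F.adaptive_avg_pool2d(x, 1)                           # image-level context
            elif scale == 1:
                p = x                                                     # keep input resolution
            else:
                p = F.adaptive_avg_pool2d(x, (max(h // scale, 1), max(w // scale, 1)))
            xs.append(reduce(p))
        # Eq. (1): y_1 = x_1;  y_i = Conv3x3(Upsample(x_i) + y_{i-1}) for i > 1
        ys = [xs[0]]
        for xi, conv in zip(xs[1:], self.fuse):
            up = F.interpolate(xi, size=(h, w), mode="bilinear", align_corners=False)
            ys.append(conv(up + ys[-1]))
        out = self.compress(torch.cat(ys, dim=1))
        return torch.cat([out, self.shortcut(x)], dim=1)
```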

Fig. 3

An overview of the proposed Efficient Pyramid Pooling Module (left) and specific module parts (right). The input feature maps are split into two groups through “channel split,” followed by sequential application of depthwise convolution (DWConv) and pointwise convolution (PWConv). The channel shuffle module (CSM) is then employed to establish long-range dependencies between pixels. Subsequently, pyramid pooling is utilized to capture multi-scale information from the feature maps. Finally, the output feature maps are integrated with the feature maps processed through shuffle attention (SA) to enhance the feature representation

3.3 Efficient Pyramid Pooling Module

Multi-scale feature fusion enhances the model's capability to detect objects of different sizes and has led to significant performance improvements in many deep learning models [11, 42]. Therefore, we propose the Efficient Pyramid Pooling Module (EPPM), an efficient multi-scale fusion module. Compared with the original pyramid pooling module, EPPM integrates ShuffleUnit operations to increase interaction between sub-features, and we incorporate the shuffle attention module to capture the spatial and channel attention of the feature maps, aiming to extract more discriminative feature information. As shown in Fig. 3, the input feature information is denoted as \(I^{w\times h\times c}\), where w and h represent the spatial dimensions and c represents the number of feature channels. It then passes through the efficient pyramid pooling module, which concatenates \(I^t\) with \(I^s\) in the channel dimension to obtain the final output \(O^{eppm}\). The formula for \(I^t\) is given as follows:

$$\begin{aligned} I^t=I^a\oplus I^b \end{aligned}$$
(2)

Where \(I^a\) represents the output feature after the aggregation of contextual information, while \(I^b\) denotes the output feature after convolution. \(\oplus \) represents the element-wise sum of these two features.

Specifically, the input \(I^{w\times h\times c}\) is processed through two branches, a and b, to generate the features \(I^a\) and \(I^b\), respectively. In branch a, as illustrated in Fig. 3, the input feature information \(I^{w\times h\times c}\) is split along the channel dimension into two groups, \(I^a_1\in \mathbb {R}^{w\times h\times c/2}\) and \(I^a_2\in \mathbb {R}^{w\times h\times c/2}\). Each group then undergoes depth-wise separable convolution operations, which can be represented by the following formulas:

$$\begin{aligned} Q^a_1=C_{pw}(P_{max}(C_{dw}(I^a_1))) \end{aligned}$$
(3)
$$\begin{aligned} Q^a_2=C_{pw}(P_{max}(C_{dw}(I^a_2))) \end{aligned}$$
(4)

In the end, we obtain \(Q^a_1\in \mathbb {R}^{w\times h\times c}\) and \(Q^a_2\in \mathbb {R}^{w\times h\times c}\), where the operations for the two groups are identical: depth-wise convolution, followed by a max pooling layer and point-wise convolution. Performing the convolution on two separate parts and then combining them to generate attention feature maps is beneficial for capturing long-range dependencies in subsequent processing. The combined output is then normalized through softmax using Eq. (5):

$$\begin{aligned} Q=Softmax(Q^a_1+Q^a_2) \end{aligned}$$
(5)

After normalization, branch a sequentially performs element-wise multiplication and addition with \(I^{w\times h\times c}\) through residual connections. Furthermore, we incorporate a channel shuffle module (CSM), similar to ShuffleNet [43], to suppress interference from redundant information while promoting semantic consistency. By enabling mutual influence among feature channels with similar semantics, it mitigates classification errors during segmentation and increases the diversity of features, ultimately improving the robustness of the segmentation process. The output of this branch is computed by Eq. (6):

$$\begin{aligned} I^a=Cs((Q\odot I^{w\times h\times c})\oplus I^{w\times h\times c}) \end{aligned}$$
(6)

Where Cs() represents the channel shuffle operation, \(\odot \) denotes element-wise multiplication. Finally, the output obtained from Eq. (6) is element-wise added to the output of branch b by Eq. (2). In this equation, \(I^b\) is the output of branch b, which is obtained through depth-wise separable convolution, as shown in Eq. (7):

$$\begin{aligned} I^b=C_{pw}(C^{3\times 3}_{dw}(I^{w\times h\times c})) \end{aligned}$$
(7)

In Eq. (7), \(C^{3\times 3}_{dw}()\) represents the depth-wise convolution with a \(3\times 3\) kernel, and \(C_{pw}()\) represents the point-wise convolution.
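To make Eqs. (2)-(7) concrete, the sketch below implements the two branches and the channel shuffle; the kernel sizes, the stride-1 max pooling, the channel-restoring point-wise convolutions, and the softmax over the channel dimension are assumptions, and the subsequent shuffle attention and pyramid pooling stages from Fig. 3 are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    """Cs(.) in Eq. (6): interleave channels across groups, as in ShuffleNet."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class EPPMCoreSketch(nn.Module):
    """Illustrative sketch of branches a and b in Eqs. (2)-(7); channels must be even."""
    def __init__(self, channels=512, shuffle_groups=4):
        super().__init__()
        half = channels // 2
        # branch a: per-group depth-wise conv -> max pool -> point-wise conv (Eqs. 3-4)
        self.dw1 = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        self.pw1 = nn.Conv2d(half, channels, 1, bias=False)
        self.dw2 = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        self.pw2 = nn.Conv2d(half, channels, 1, bias=False)
        # branch b: depth-wise separable convolution (Eq. 7)
        self.dw_b = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.pw_b = nn.Conv2d(channels, channels, 1, bias=False)
        self.groups = shuffle_groups

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)                                   # channel split
        q1 = self.pw1(F.max_pool2d(self.dw1(x1), 3, stride=1, padding=1))   # Eq. (3)
        q2 = self.pw2(F.max_pool2d(self.dw2(x2), 3, stride=1, padding=1))   # Eq. (4)
        q = torch.softmax(q1 + q2, dim=1)                                   # Eq. (5)
        i_a = channel_shuffle(q * x + x, self.groups)                       # Eq. (6)
        i_b = self.pw_b(self.dw_b(x))                                       # Eq. (7)
        return i_a + i_b                                                    # I^t in Eq. (2)
```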

4 Experiments

In this section, we conduct a series of experiments to assess the effectiveness of our method. The experiments were conducted on three widely used datasets, namely Pascal VOC 2012 [44], Cityscapes [45], and Coco-Stuff [46]. Our results indicate that FCPFNet delivers superior performance on the VOC dataset and notably improves accuracy on the Cityscapes and Coco-Stuff datasets. We provide a detailed description of the experimental procedure, including the datasets used and the specific implementation details. Additionally, we perform ablation experiments on the VOC dataset to assess the contribution of the individual modules of FCPFNet to performance. Finally, we present a comparative analysis of the proposed method against other advanced methods, including the corresponding accuracies and visualization results.

4.1 Datasets

The PASCAL VOC 2012 benchmark contains 20 foreground object classes and 1 background class. The original dataset consists of 1464 pixel-level annotated images for training, 1449 for validation, and 1456 for testing. In addition, the dataset was augmented with the extra annotations provided by [47], resulting in 10,582 augmented training images divided into 21 classes. In this experiment, the model's performance is validated on the Pascal VOC 2012 val set, and the evaluation metric used in this paper is the mean intersection-over-union (MIoU) computed over the 21 categories.
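For reference, MIoU can be computed from a class confusion matrix as in the standard definition below; this is a generic sketch, not code taken from the paper, and the ignore label 255 is an assumption.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21, ignore_index=255):
    """Standard mean intersection-over-union from integer label maps."""
    mask = gt != ignore_index
    hist = np.bincount(
        num_classes * gt[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)        # rows: ground truth, columns: prediction
    inter = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()               # average over classes present in the data
```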

Cityscapes is one of the well-known scene semantic segmentation datasets, focusing on the analysis of urban street scenes. It consists of 5000 high-quality pixel-level finely annotated images collected from 50 cities, which are divided into 2975 images for training, 500 images for validation, and 1525 images for testing, with a total of 19 classes. In addition, we do not use its extra 20,000 coarse labeled images during training.

Coco-Stuff consists of 10k annotated images in total, with 9k allocated for training and 1k for testing. Compared with Cityscapes and Pascal VOC 2012, the Coco-Stuff dataset is more challenging due to its more complex label set, which includes 80 thing classes, 91 stuff classes, and 1 'unlabeled' class.

4.1.1 Train Setting

The pre-trained model utilized in this experiment is PSPNet [11], which was trained on the ImageNet dataset. The implementation adopts the poly learning rate strategy of Eq. (8), as in [4, 11], as well as the SGD optimizer with weight decay, with momentum set to 0.9 and power set to 0.9. The initial learning rates were set to \(1\times 10^{-2}\) for PASCAL VOC 2012 and Coco-Stuff and \(1\times 10^{-3}\) for Cityscapes.

$$\begin{aligned} l=l_{init}\times \left( 1-\frac{iterations}{iter_{max}}\right) ^{power} \end{aligned}$$
(8)
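Eq. (8) translates directly into the following helper; the iteration counts in the comment are made-up values purely for illustration.

```python
def poly_lr(init_lr, iteration, max_iterations, power=0.9):
    """Poly learning-rate schedule of Eq. (8)."""
    return init_lr * (1 - iteration / max_iterations) ** power

# e.g. halfway through training with the VOC setting init_lr = 1e-2:
# poly_lr(1e-2, 5000, 10000) -> approximately 5.36e-3
```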

All experiments were conducted on a single GeForce RTX 3090 using the PyTorch framework. The same data augmentation was applied throughout training, which involved randomly flipping and cropping the images fed into the network. Specifically, the PASCAL VOC 2012 dataset was cropped to \(440\times 440\) resolution, the Cityscapes dataset to \(657\times 657\) resolution, and the Coco-Stuff dataset to \(380\times 380\) resolution. The models were trained using the cross-entropy loss and compared with currently advanced semantic segmentation models. The formula for the cross-entropy loss function is as follows:

$$\begin{aligned} L(x,y) = -\sum _{i=1}^{C}x_i\log (y_i) \end{aligned}$$
(9)

where C represents the number of classes, \(x_i\) represents the ground truth label for class i, and \(y_i\) represents the predicted probability for class i. The cross-entropy loss quantifies the dissimilarity between the predicted probabilities \(y_i\) and the ground truth labels \(x_i\) for each class: the smaller the loss, the more accurate the model's predictions.
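In PyTorch, the per-pixel form of Eq. (9) is provided by the standard cross-entropy loss; the snippet below is a minimal usage example, and the ignore index for void pixels is an assumption.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=255)     # ignore label for void pixels (assumed)

logits = torch.randn(2, 21, 440, 440)                 # N x C x H x W class scores
labels = torch.randint(0, 21, (2, 440, 440))          # N x H x W ground-truth class indices
loss = criterion(logits, labels)                      # averaged per-pixel cross-entropy
```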

4.1.2 Ablation Study

To validate the feasibility and effectiveness of our method, we will evaluate the improvement brought by each module in the model through a series of experiments. As our method is inspired by PSPNet [11] and improves the traditional PPM by incorporating EPPM, we will compare it with PPM in the experiments.

As depicted in Table 1, each incorporated module substantially enhances the model's performance. Compared with PPM, adopting EPPM yields 78.7% Mean IoU, a 1.4% improvement. Meanwhile, adding SA and DFAM outperforms the baseline by 1.7% and 2.1%, respectively. Moreover, when the three modules are integrated together, the performance further improves to 79.9%, which suggests that the inclusion of these modules contributes to the overall effectiveness of the model.

Table 1 Ablation study on Pascal VOC 2012 val set
Fig. 4

Visualization of segmentation results using different modules on the Pascal VOC 2012 val set

As shown in Fig. 4, we visualize the segmentation results of three different models: PPM, EPPM, and FCPFNet. The first row indicates that EPPM outperforms PPM in segmentation accuracy by enhancing feature correlation and providing clearer boundary localization; additionally, incorporating SA and DFAM on top of EPPM leads to complete segmentation of the aircraft tail. The second row demonstrates that both PPM and EPPM achieve high recognition accuracy, with EPPM outperforming PPM. The third column shows that FCPFNet can accurately identify and segment objects by effectively integrating multi-scale information while avoiding loss of resolution. Although the result of FCPFNet in the third row is not entirely satisfactory, it still achieves much higher segmentation accuracy than PPM. The visualization results confirm that our proposed modules are effective.

4.1.3 Comparison to State-of-the-Arts

The results in this section are reported on the validation set of each dataset. Moreover, a comparative analysis of MIoU was conducted with ResNet50 and ResNet101 backbones against other advanced models. We refrained from using multi-scale testing and flip testing, which could further enhance the precision of the results. The following section elaborates on the outcomes in more detail.

Pascal VOC 2012 As shown in Table 2, FCPFNet achieves MIoU of 78.8% and 81.0% with ResNet50 and ResNet101, respectively. Additionally, the investigation in [48] revealed that utilizing larger image sizes during training and validation leads to higher segmentation accuracy, consistent with previous studies. It is worth noting that the image sizes employed in our proposed method are relatively small compared with those used by other state-of-the-art models. Despite this, FCPFNet outperforms these advanced methods and achieves the highest MIoU, which further substantiates the superiority and effectiveness of our approach. Although the MIoU of FCPFNet is only 0.8% higher than that of the second-best model, WASPNet [49], it is approximately 2-3% higher than the results of the other models, demonstrating superior performance even when using the shallower ResNet50 network structure.

Table 2 Segmentation results on Pascal VOC 2012 val set
Table 3 Segmentation results on Cityscapes val set

Cityscapes In Table 3, a comparative analysis between our model and the latest methods on the Cityscapes dataset shows that our model remains competitive, with MIoU of 78.1% and 78.8% using the ResNet50 and ResNet101 backbones, respectively. Despite the marginal difference in segmentation accuracy compared with the other models, our model outperforms the second-best model by 0.3%, achieving the best results. Moreover, FCPFNet exhibits outstanding performance even when ResNet50 is used as the backbone. Visualization results of FCPFNet on the Cityscapes dataset can be found in Fig. 5.

Coco-Stuff Comparing the MIoU of FCPFNet with other methods on the Coco-Stuff dataset, as presented in Table 4, we observe that FCPFNet still achieves competitive performance, with an MIoU 1-3% higher than the other methods.

Table 4 Segmentation results on Coco-Stuff test set

The experimental results on the three datasets demonstrate that our network performs best and is well suited to scene semantic segmentation. As shown in Tables 2, 3 and 4, although our method is only 0.3% higher than DeepLabv3 on the Cityscapes dataset, it is 4.5% and 2.8% higher on the Pascal VOC 2012 and Coco-Stuff datasets, respectively. Furthermore, our method's MIoU is only 0.8% higher than the second-place WASPNet on the Pascal VOC 2012 dataset but 4.8% higher on the Cityscapes dataset, showing that our model generalizes better. Since our network pipeline is an improvement inspired by PSPNet, our model's performance is also significantly better than PSPNet, with increases of 2.5%, 0.5%, and 2.9% on the Pascal VOC 2012, Cityscapes, and Coco-Stuff datasets, respectively.

Fig. 5

Visualization of segmentation results on Cityscapes val set

5 Conclusion

In this paper, we have proposed a Feature Complementation Network with Pyramid Fusion (FCPFNet) for scene segmentation. By incorporating EPPM and DFAM into our model, we effectively address the problems of insufficient contextual information, weak feature correlations, and indistinct features during feature extraction. Specifically, the Deep Feature Aggregation Module (DFAM) we designed aggregates local and global information in depth by combining different depths with pooling kernels of different sizes. Furthermore, we have introduced the Efficient Pyramid Pooling Module (EPPM) to capture discriminative features through feature enhancement and to facilitate cross-channel information interaction between feature maps. The ablation studies demonstrate that the addition of DFAM and EPPM yields more precise segmentation results. FCPFNet shows robustness while consistently achieving outstanding performance on three scene segmentation datasets, i.e., Pascal VOC 2012, Cityscapes, and Coco-Stuff.