Introduction

Semantic segmentation aims to assign a label to each pixel in an image [1,2,3] and is a fundamental task in computer vision. It is used extensively in autonomous driving [4, 5], human–computer interaction [6], augmented reality [7], and medical imaging [8]. Because it is a dense prediction task, it inevitably increases the number of model parameters. Early deep convolutional neural networks were mainly applied to character recognition, such as MNIST [9]. In recent years, DCNNs have become the mainstream of computer vision, including image classification networks [10, 11] and object detection networks [12, 13]. Deep convolutional neural networks have superior feature extraction capabilities compared to conventional machine learning methods. Specifically, Long et al. [14] applied a deep neural network directly to the entire image and replaced the fully connected layers with convolutional layers, enabling FCN to be trained end to end and improving semantic segmentation efficiency. Recent advances in neural networks [11, 15] have promoted the emergence of a series of networks [16, 17], all of which build on FCN: their encoders increase the number of feature channels while compressing the spatial dimensions to extract features, and their decoders use upsampling to restore the size of the feature map. Building on this design, later research has focused on two issues to improve semantic segmentation performance. The first is enhancing the encoder so that it extracts more robust feature information. Specifically, DeepLabv2 [18] and Chen et al. [19] improved the ASPP module in the encoder, which consists of atrous convolutions with varying sampling rates and BN layers, and arranged the modules serially or in parallel. The second is modeling the multi-scale context and encoding contextual information into the original feature map to improve the representation of each pixel, as in DeepLabv3+ [17], SegNet [20], and OCRNet [13]. DeepLabv2 introduces atrous spatial pyramid pooling (ASPP) for multi-scale context modeling, capturing multi-scale context information with convolutions of various dilation rates. Zhao et al. [21] proposed PSPNet, which aggregates multi-scale contextual information via spatial pyramid pooling. However, these strategies require a large number of parameters and substantial computation, so Wang et al. [22] designed a non-local block that aggregates contextual information over the whole image by weighting; this approach captures long-range interdependencies between pixels with a simple self-attention design. Since the matrix multiplications in the non-local block are computationally intensive and burden GPU memory, Zhu et al. [23] proposed ANNet, based on an asymmetric pyramidal non-local block, to reduce the standard non-local module's computation and GPU memory consumption. Although ANNet can capture long-range interdependencies, it only considers the relationships between pixels along the spatial dimensions of the feature map, not along the channel dimension. As a result, it cannot establish the global relationships between targets or objects that are essential for semantic segmentation.
In addition, ANNet focuses primarily on the relationships between pixels in the spatial direction, so it cannot distinguish between classes with similar appearances that do not share the same label. In this paper, we design the channel enhancement regions module (CERM) to accentuate the distinctions between categories and thereby address class inconsistency. Semantic segmentation is a pixel-level classification task that typically requires multi-scale context features to encode both local and global information. Due to a lack of global contextual semantic information, small objects such as traffic lights and fences are often predicted incorrectly.

ANNet incorporated multi-scale pyramid pooling modules [21, 24, 25] into its APNB and AFNB modules, thereby decreasing the amount of computation. AFNB first calculates the similarity between the fourth- and fifth-stage feature maps of ResNet101 [26] and then weights the feature map of the V (Value) branch with the similarity matrix, without exploiting the channel dimension at all. Given a feature map of size C × H × W, the cost of the two matrix multiplications in the AFNB module is \(\mathcal{O}({C}^{*}NS)\), where N = H × W and S is the number of spatial positions sampled from the feature map. ANNet further aggregates positional context information at an additional cost of \(\mathcal{O}({C}^{*}NS)\). Since ANNet does not exploit pixel similarity along the channel direction of the feature map, we compute pixel relationships along the channel direction to improve the representation of category regions with reduced computational effort: CERM uses a single matrix multiplication costing \(\mathcal{O}({C}_{\mathrm{class}}NC)\), where \({C}_{\mathrm{class}}\) is the number of category channels, N = H × W, and C is the number of feature map channels after convolutional compression.
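To make the gap concrete, the following back-of-envelope sketch (our illustration, not code from the paper) plugs in a 1024 × 512 Cityscapes crop at 1/8 resolution, the S = 110 sampling points reported for ANNet later in this paper, the 19 Cityscapes categories, and an assumed compressed channel count of C = 256:

```python
# Illustrative cost comparison of the matrix multiplications discussed above.
# Values: 1024x512 Cityscapes crop at 1/8 resolution, S = 110 sampled points
# (ANNet's setting reported below), C_class = 19 Cityscapes categories,
# and C = 256 compressed channels (an assumption for C*).
H, W = 512 // 8, 1024 // 8    # 64, 128
N = H * W                     # spatial positions (8192)
S = 110                       # ANNet's sampled anchor points
C = 256                       # compressed channel count (assumed)
C_class = 19                  # number of categories

annet_cost = C * N * S        # one of ANNet's O(C*NS) multiplications
cerm_cost = C_class * N * C   # CERM's single O(C_class N C) multiplication

print(f"ANNet: {annet_cost / 1e6:.1f}M multiply-adds")  # ~230.7M
print(f"CERM : {cerm_cost / 1e6:.1f}M multiply-adds")   # ~39.8M
print(f"ratio: {annet_cost / cerm_cost:.1f}x")          # ~5.8x
```

Under these assumed values, the single CERM multiplication is roughly S/C_class ≈ 5.8 times cheaper than one of ANNet's multiplications, before even counting ANNet's second multiplication.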

In this paper, we propose a simple and effective Mutually Reinforcing Non-local Neural Network for Semantic Segmentation (MRNNet) that improves on the ANNet semantic segmentation network. There are two improvements: the first is a reduction in the number of ANNet parameters, and the second is an enhancement of the AFNB module in ANNet that exploits similarity along the channel direction of the feature map to improve the representation of category regions. Inspired by DANet [27] and DFNet [28], we can effectively resolve intra-class inconsistency by weighting and summing the feature maps along the channel direction; without this, a network may, for example, split the left and right sides of a cow along its spine into distinct categories. We therefore need the network to distinguish each category more reliably. We observed that the auxiliary loss function of ANNet is attached to the fourth stage of ResNet101 to aid training, and that the number of feature map channels input to this loss function equals the number of categories. Inspired by this, we use the auxiliary loss function to supervise the formation of category regions in the feature map. We then normalize this feature map and use it to compute a weighted sum with the fifth-stage feature map of ResNet101. Through this operation, the feature maps entering the auxiliary loss function become distinguishable along the channel direction, with each channel representing one category.

In addition, we discovered that ANNet uses multi-scale spatial pyramid pooling to reduce the cost of matrix multiplication, yet AFNB does not account for similarity along the channel direction of the feature map. Moreover, two matrix multiplications cost considerably more than one, and multi-scale spatial pyramid pooling inevitably loses detailed information. We therefore reduce the two matrix multiplications to a single one and use it to perform a weighted summation of the feature maps along the channel direction. Unlike ANNet, we do not use multi-scale spatial pyramid pooling; eliminating one matrix multiplication already reduces the computational effort. To demonstrate the efficiency of our model, we compare it to ANNet in terms of model parameters, inference time, and segmentation accuracy. Our model has 51.82 MB of parameters, 11.35 MB fewer than ANNet's 63.17 MB. Our inference time is 0.508 s, 0.103 s faster than ANNet's 0.611 s. Our model's segmentation accuracy is 81.9%, 0.6% higher than ANNet's 81.3% [23].

Pixel information in the spatial direction and pixel information in the channel direction of the feature map are equally important, as shown by DANet [27], DFNet [28], CBAM [29], and CoordAttention [30]. The channel attention module of DANet determines the channel dependency between any two channel maps and updates each channel map with the weighted sum of all channel maps. However, with too many channels, some feature maps contain less contextual information than others, which hinders the segmentation of small objects and edge objects. To include as much contextual information as possible in each channel, we use cross-entropy loss to supervise the formation of feature map channel categories and compress the feature map so that the number of channels equals the number of categories. This ensures that each channel of the feature map represents one category and contains rich contextual information. Through this operation (sketched below), we treat multi-scale objects equally and improve segmentation accuracy by assigning greater weight to the feature maps representing hard-to-segment objects and less weight to those representing easy ones.
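As a minimal illustration of this idea, the following sketch (our own naming, not the paper's exact implementation) compresses a feature map to one channel per category with a 1 × 1 convolution and supervises it with a cross-entropy loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: compress a backbone feature map so the channel count
# equals the number of categories, then supervise it with cross-entropy so
# each channel learns to represent one category.
class CategoryChannelHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.compress = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats, labels=None):
        class_maps = self.compress(feats)        # (B, num_classes, H/8, W/8)
        loss = None
        if labels is not None:                   # labels: (B, H, W) class ids
            logits = F.interpolate(class_maps, size=labels.shape[-2:],
                                   mode="bilinear", align_corners=False)
            loss = F.cross_entropy(logits, labels, ignore_index=255)
        return class_maps, loss

# Example: a stage-4 feature map with 1024 channels, 19 Cityscapes classes.
head = CategoryChannelHead(1024, 19)
maps, loss = head(torch.randn(2, 1024, 64, 128),
                  torch.randint(0, 19, (2, 512, 1024)))
```

With 19 categories, each of the 19 channels is pushed to respond to exactly one class, which is the property CERM exploits in the Methods section.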

In summary, our main contributions are as follows:

  1.

    We propose a simple and effective Mutually Reinforcing Non-local Neural Networks for Semantic Segmentation (MRNNet) to reduce the number of parameters and the mean inference time of ANNet [23].

  2.

    Motivated by DANet and CBAM, we improve the AFNB module in ANNet into the channel enhancement regions module (CERM). The specific implementation reduces AFNB's two matrix multiplications to a single matrix multiplication. AFNB focuses on capturing the spatial dependence between any two feature map locations, whereas CERM identifies channel dependencies between any two channel maps and updates each channel map with the weighted sum of all channel maps. Compared with AFNB, CERM has fewer parameters and less computation.

  3.

    We design the position-enhanced pixels module (PEPM) to further improve segmentation accuracy. PEPM not only establishes long-distance interdependence between pixels, but also creates a spatial attention matrix that strengthens spatial feature dependence. By establishing rich contextual dependence for pixels in the spatial direction of the feature map, PEPM greatly improves semantic segmentation performance.

Related work

In the following section, we briefly review the development of semantic segmentation. FCN performed exceptionally well in semantic segmentation tasks [31, 32] and subsequently inspired a number of additional works, including SegNet, UNet [33], DeepLabv3+, and DeconvNet [34].

Enhancing contextual aggregation

Semantic segmentation has made significant progress with FCN-based methods. DeepLabv2 makes full use of atrous convolution, which effectively expands the receptive field and merges more context information without increasing the number of parameters; DeepLabv3 extends the use of atrous convolution and expands the ASPP module, obtaining a larger receptive field and richer multi-scale information. PSPNet uses spatial pyramid pooling to aggregate multi-scale context information, collecting features at different scales. Moreover, models built from encoder–decoder structures [35, 36] can further improve context aggregation.

Semantic segmentation

FCN first applied a fully convolutional network to the entire image; semantic segmentation is a classification task at the pixel level. There are usually two approaches to pixel-level semantic prediction. The first is to design a new backbone network [37,38,39,40] to extract more robust feature maps for pixels. Because maintaining high resolution in the backbone is crucial for extracting spatial location information, Zhang et al. [41] applied channel attention to different network branches to capture cross-feature interactions and learn diverse representations. The other is to design different decoders, which typically work in tandem with encoders to achieve optimal results. Decoders of different designs serve different purposes, for example obtaining different receptive fields [18, 19, 21], enhancing edge features [42,43,44], or obtaining global context information [27, 45].

Attention mechanisms

The attention mechanism is an automatic selection process that allows the model to concentrate on essential components. Attention mechanisms are usually divided into channel attention and spatial attention, and the two play different roles. Spatial attention focuses primarily on spatial regions [27, 46,47,48], whereas channel attention enables the model to select essential categories. Channel attention is as important as spatial attention [27, 49, 50]. However, some recent works [23, 51] ignore channel attention along the feature map channel direction when enhancing semantic segmentation networks. DANet applies the spatial attention module and the channel attention module in parallel to improve the semantic segmentation network. Unlike DANet, our MRNNet applies the channel attention module and the spatial attention module in series.

Methods

In this section, we will first introduce the overall framework of our network, followed by an explanation of how the CERM and PEPM modules enhance the performance of semantic segmentation in the channel dimension and spatial dimension.

Network architecture

As shown in Fig. 1, we propose MRNNet by introducing CERM and PEPM. Following ANNet, we select ResNet101 as the network's backbone, remove its last two downsampling operations, and use atrous convolution to keep the feature maps of ResNet101 stages 4 and 5 at \(\frac{1}{8}\) of the input image size. In other words, the fourth and fifth stages of ResNet101 output feature maps of the same size.

Fig. 1 Overview of the proposed MRNNet for semantic segmentation
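The dilated backbone described above can be obtained from torchvision's ResNet-101 by replacing the strides of the last two stages with dilation; the following is our own minimal sketch (weight-loading syntax depends on the torchvision version):

```python
import torch
import torchvision

# Sketch of the dilated backbone: torchvision's ResNet-101 with the last two
# downsampling operations replaced by atrous (dilated) convolutions, so that
# stages 4 and 5 stay at 1/8 of the input resolution.
backbone = torchvision.models.resnet101(
    replace_stride_with_dilation=[False, True, True]  # dilate layer3/layer4
)

x = torch.randn(1, 3, 512, 1024)       # a Cityscapes-sized crop
x = backbone.conv1(x)
x = backbone.bn1(x)
x = backbone.relu(x)
x = backbone.maxpool(x)
x = backbone.layer1(x)                 # 1/4 resolution
x = backbone.layer2(x)                 # 1/8 resolution
s4 = backbone.layer3(x)                # stage 4, kept at 1/8 (dilated)
s5 = backbone.layer4(s4)               # stage 5, kept at 1/8 (dilated)
print(s4.shape, s5.shape)  # (1, 1024, 64, 128) and (1, 2048, 64, 128)
```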

In the fourth and fifth stages, CERM captures the long-distance pixel relationships along the channel direction of the feature map. Specifically, the channel attention matrix generated in the fourth stage is multiplied by the feature map generated in the fifth stage. A cross-entropy loss function is attached to the output of the fourth stage to supervise the learning of category regions in its output feature map, whose number of channels equals the number of categories to be segmented. This operation enhances the fourth-stage output feature map so that it contains more contextual information. To make full use of the feature map output by CERM, we feed it, together with the fifth-stage feature map, into PEPM. On the V (Value) branch, PEPM generates a spatial attention matrix and computes a weighted sum with the feature map, capturing the long-distance interdependence between pixels in the spatial direction of the feature map. To enhance the contextual information of the PEPM output feature maps, we parallelize the feature maps on the \(Q_{2}\) branch. For the PEPM module, we follow the design approach of OCNet [52]: sharing the parameters of \(T_{\theta }\) and \(T_{\gamma }\) reduces the model parameters and computational effort. Unlike ANNet, our model does not use multi-scale pyramid pooling [21,22,23], which allows it to reduce model parameters while improving inference speed and accuracy. Notably, ANNet did not use a loss function to supervise the learned feature map, whereas we add a loss function to CERM and obtain outstanding results.

Channel enhancement regions module

Effectively distinguishing the channels of a feature map can improve class consistency, which is crucial for semantic segmentation. However, ANNet did not consider differentiating the channel categories of the feature map, relying instead on attention over pixels in the spatial direction to capture long-distance dependencies between pixels. ANNet also employed multi-scale pyramid pooling to reduce model parameters and computational overhead, yet its segmentation results remain unsatisfactory. As shown in Fig. 2, one matrix multiplication in ANNet costs \({\mathcal{O}}\left( {C^{*} NS} \right)\), while the matrix multiplication in CERM costs \({\mathcal{O}}\left( {C^{*} NC_{{{\text{class}}}} } \right)\), where \(C_{{{\text{class}}}}\) is the number of feature map channels, that is, the number of categories to be segmented; crucially, \(C_{{{\text{class}}}}\) is much smaller than \(S\). In semantic segmentation, maintaining a high-resolution feature map is essential, but ANNet's multi-scale pyramid pooling sacrifices feature map resolution, which is not conducive to improving segmentation performance. CERM instead avoids multi-scale pyramid pooling and uses a loss function to supervise the formation of a category-region channel feature map; the channel attention generated from this category-region channel feature map weights the fifth-stage feature map to improve the representation of category channels. The detailed structures of CERM and AFNB are shown in Fig. 2, and the operation flow of AFNB is given in Eq. 1

$$ \begin{array}{*{20}c} {{\mathcal{R}}^{{N \times C^{*} }} \times {\mathcal{R}}^{{C^{*} \times S}} \to {\mathcal{R}}^{{N \times S}} \times {\mathcal{R}}^{{S \times C^{*} }} \to {\mathcal{R}}^{{N \times C^{*} }} .} \\ \end{array} $$
(1)
Fig. 2 From left to right: the frameworks of the CERM and AFNB

We reduce the two matrix multiplications of AFNB to a single matrix multiplication, thereby drastically reducing the amount of calculation required. The feature map of \({\text{Stage}}_{4}\), supervised by the loss function, has size \(C_{{{\text{class}}}} \times H \times W\), where \(C_{{{\text{class}}}}\), \(H\), and \(W\) are the channel number, height, and width of the feature map, respectively, and \(C_{{{\text{class}}}}\) equals the number of categories to be segmented. The channel attention matrix generated from the \({\text{Stage}}_{4}\) feature map is used to weight the feature map of \({\text{Stage}}_{5}\). Following Figs. 2 and 3, the CERM operation flow is:

$$ \begin{array}{*{20}c} {{\mathcal{R}}^{{C_{{{\text{class}}}} \times N}} \to {\mathcal{R}}^{{C_{{{\text{class}}}} \times N}} \times {\mathcal{R}}^{{N \times C^{*} }} \to {\mathcal{R}}^{{C_{{{\text{class}}}} \times C^{*} }} .} \\ \end{array} $$
(2)
Fig. 3 The details of the CERM

ANNet embeds the pyramid pooling module to select a representative set of pixels S, but since it does not use all pixels, resolution is inevitably lost. In light of this issue, we eliminate pyramid pooling and switch from analyzing relationships between spatial pixels to analyzing relationships between pixels along the channel direction of the feature map. As depicted in Fig. 3, we propose CERM based on this idea; the calculation flow of CERM's feature maps is as follows:

$$ \begin{array}{*{20}c} {\ell_{l} = T_{1}^{l} \left( {{\mathcal{F}}_{l} } \right),\hbar_{h} = T_{3}^{h} \left( {{\mathcal{F}}_{h} } \right).} \\ \end{array} $$
(3)

Here, \({\mathcal{F}}_{l}\) and \({\mathcal{F}}_{h}\) denote the feature maps output at the low-level and high-level stages, with dimensions \(C_{l} \times H \times W\) and \(C_{h} \times H \times W\), respectively. \(T_{1}^{l}\) and \(T_{3}^{h}\) denote a \(1 \times 1\) convolution at the low-level stage and a \(3 \times 3\) convolution at the high-level stage. \(\ell_{l}\) and \(\hbar_{h}\) denote the resulting feature maps with dimensions \(C_{{{\text{class}}}} \times H \times W\) and \(C^{*} \times H \times W\). We normalize \(\ell_{l}\) into the attention matrix \(\widetilde{{\ell_{l} }} \in {\mathcal{R}}^{{C_{{{\text{class}}}} \times N}}\) and then matrix-multiply \(\widetilde{{\ell_{l} }}\) with \(\hbar_{h}\), as shown in Eq. 4

$$ \begin{array}{*{20}c} {O = \widetilde{{\ell_{l} }}\hbar_{h}^{T} ,} \\ \end{array} $$
(4)

where \(O\) stands for the feature map of CERM output with dimensions \(C_{{{\text{class}}}} \times C^{*}\).
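A minimal PyTorch sketch of this computation, written from Eqs. (3)–(4) under our own naming (the exact convolution configurations are assumptions), is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal CERM sketch from Eqs. (3)-(4); names and layer details are ours.
# The stage-4 map is compressed to C_class channels (where the auxiliary
# cross-entropy loss attaches), softmax-normalized over the spatial axis,
# and used to pool the stage-5 map into one C*-dim descriptor per category
# with a single matrix multiplication of cost O(C_class * N * C*).
class CERM(nn.Module):
    def __init__(self, c_low, c_high, c_star, num_classes):
        super().__init__()
        self.t1 = nn.Conv2d(c_low, num_classes, kernel_size=1)         # T_1^l
        self.t3 = nn.Conv2d(c_high, c_star, kernel_size=3, padding=1)  # T_3^h

    def forward(self, f_low, f_high):
        logits = self.t1(f_low)                      # (B, C_class, H, W)
        h = self.t3(f_high)                          # (B, C*, H, W)
        attn = F.softmax(logits.flatten(2), dim=-1)  # l-tilde: (B, C_class, N)
        v = h.flatten(2).transpose(1, 2)             # (B, N, C*)
        out = torch.bmm(attn, v)                     # (B, C_class, C*), Eq. (4)
        return out, logits                           # logits feed the aux loss

cerm = CERM(c_low=1024, c_high=2048, c_star=512, num_classes=19)
out, aux = cerm(torch.randn(2, 1024, 64, 128), torch.randn(2, 2048, 64, 128))
print(out.shape)  # torch.Size([2, 19, 512])
```

Note that the output is a set of \(C_{\text{class}}\) region descriptors rather than a spatial map; transposed to (B, C*, C_class), it becomes the \(\beta\) input of PEPM described below.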

The cost of ANNet's matrix multiplication, \({\mathcal{O}}\left( {C^{*} NS} \right)\), is much larger than that of CERM's, \({\mathcal{O}}\left( {C^{*} NC_{{{\text{class}}}} } \right)\), because \(S \gg C_{{{\text{class}}}}\). In addition, the multi-scale pyramid pooling embedded in ANNet inevitably causes a loss of feature map resolution, so our CERM does not use it. Following DANet and DFNet, we also reduce the matrix multiplication cost by converting the pixel relationships in the spatial direction into relationships between pixels along the channel direction of the feature map. ANNet adjusts the number of sampling points S to compensate for the resolution loss caused by multi-scale pyramid pooling; its ablation experiments show that accuracy rises with the number of sampling points, at the cost of model efficiency. In ANNet's experiments, S is usually set to 110, which is much larger than \(C_{{{\text{class}}}}\), so the corresponding matrix multiplication cost is also much larger than CERM's.

Position-enhanced pixels module

The structure of the position-enhanced pixels module (PEPM) is identical to that of APNB in ANNet. The only difference is that the input feature map of APNB is the output feature map of AFNB, whereas our PEPM makes full use of the output feature maps of CERM and \(S_{5}\); in other words, APNB has one input while PEPM has two. PEPM captures the context information of an entire image from a global perspective. Its function is similar to that of DANet's position attention module: it obtains discriminative features, which is essential for capturing long-distance interdependent context information. As shown in Fig. 4, we stack the feature map output by PEPM with the feature map \(C_{6}\) of the fifth stage to strengthen feature representation. Fu et al. [25] observed that the features generated by conventional FCNs are prone to incorrect object classification. To generate rich context information for local features, PEPM integrates the rich context information into the fifth-stage feature maps, producing a more robust feature map with stronger representation capability; this feature map is finally input to the classifier. Next, we describe the detailed structure of PEPM. As shown in Fig. 4, the two input feature maps are \(\alpha \in {\mathbb{R}}^{{C_{6} \times H \times W}}\) and \(\beta \in {\mathbb{R}}^{{C_{6} \times C_{{{\text{class}}}} \times 1}}\). We reshape them into \(\alpha \in {\mathbb{R}}^{{C_{6} \times N_{K} }}\) and \(\beta \in {\mathbb{R}}^{{C_{6} \times N_{Q} }}\), where \(N_{K} = H \times W\) and \(N_{Q} = 1 \times C_{{{\text{class}}}}\). Finally, we perform a matrix multiplication between the transpose of \(\alpha\) and \(\beta\) and apply a softmax layer to compute the spatial attention map \(S \in {\mathbb{R}}^{{N_{K} \times N_{Q} }}\)

$$ \begin{array}{*{20}c} {s_{ji} = \frac{{\exp \left( {\alpha_{i} \cdot \beta_{j} } \right)}}{{\mathop \sum \nolimits_{i}^{{\text{N}}} \exp \left( {\alpha_{i} \cdot \beta_{j} } \right)}},} \\ \end{array} $$
(5)

where \(s_{ji}\) represents the impact of the \(i\)th position on the \(j\)th position; the more similar the feature representations of two positions, the greater the correlation between them. We reshape the feature map \(\beta \in {\mathbb{R}}^{{C_{6} \times C_{{{\text{class}}}} \times 1}}\) output by CERM into \(\beta \in {\mathbb{R}}^{{C_{6} \times C_{{{\text{class}}}} }}\), perform a matrix multiplication between the transpose of \(\beta\) and \(S\), and reshape the result into \(\gamma \in {\mathbb{R}}^{{C^{*} \times H \times W}}\). Finally, we multiply it by a scaling parameter \(\rho\) and concatenate it with the feature \(\alpha\) to obtain the final output \(\delta \in {\mathbb{R}}^{{C_{{{\text{Stage}}_{4} }} \times H \times W}}\) as follows:

$$ \begin{array}{*{20}c} {\delta_{j} = {\text{concat}}\left( {\rho \mathop \sum \limits_{i = 1}^{N} \left( {s_{ji} \beta_{i} } \right),\alpha_{j} } \right),} \\ \end{array} $$
(6)

where \(\rho\) is initialized to 0 and gradually learns to assign greater weight [53]. Equation 6 shows that the resulting feature \(\delta\) at each position is a weighted sum of the features of all positions plus the original features. Features in the spatial direction of the feature map therefore have a global contextual view and aggregate context selectively according to the spatial attention map. Similar semantic features reinforce each other, enhancing intra-class compactness and semantic consistency.
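Reading Eqs. (5) and (6) this way, a minimal sketch of PEPM follows (our own naming; the softmax axis and the orientation of \(\beta\) reflect our reading of the equations, with \(\beta\) being CERM's output transposed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal PEPM sketch from Eqs. (5)-(6); names and details are ours.
# Each spatial position of alpha attends over the C_class region
# descriptors in beta produced by CERM; the gathered context is scaled by
# a learnable rho (initialized to 0) and concatenated onto alpha.
class PEPM(nn.Module):
    def __init__(self):
        super().__init__()
        self.rho = nn.Parameter(torch.zeros(1))        # scaling parameter rho

    def forward(self, alpha, beta):
        # alpha: (B, C, H, W) stage-5 features
        # beta : (B, C, C_class), CERM's (B, C_class, C*) output transposed
        B, C, H, W = alpha.shape
        a = alpha.flatten(2)                           # (B, C, N_K)
        s = torch.bmm(a.transpose(1, 2), beta)         # (B, N_K, C_class)
        s = F.softmax(s, dim=-1)                       # attention map, Eq. (5)
        ctx = torch.bmm(s, beta.transpose(1, 2))       # (B, N_K, C)
        ctx = ctx.transpose(1, 2).reshape(B, C, H, W)  # gamma
        return torch.cat([self.rho * ctx, alpha], 1)   # delta, Eq. (6)

pepm = PEPM()
delta = pepm(torch.randn(2, 512, 64, 128), torch.randn(2, 512, 19))
print(delta.shape)  # torch.Size([2, 1024, 64, 128])
```

Because the attention runs over only \(C_{\text{class}}\) region descriptors rather than all spatial positions, this step stays cheap even at 1/8 resolution.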

Fig. 4 The details of the PEPM

Experiments

Datasets

Our method is evaluated on three main datasets: PASCAL VOC 2012 [54], Cityscapes [55], and ADE20K [56]. PASCAL VOC 2012 is a comprehensive scene dataset with 2913 images and 20 categories; of the 2913 images, 1464 are used for training, 1449 for validation, and 1456 for testing. The Cityscapes dataset contains 5000 high-quality pixel-level annotated images of urban driving scenes, divided into 30 categories; of the 5000 images, taken in 50 different cities, 2975 are used for training, 500 for validation, and 1525 for testing. The dataset also contains 19,998 coarsely annotated images; here, we use only the finely annotated images and 19 categories. ADE20K has more than 25,000 images (20 k train, 2 k val, and 3 k test), densely annotated with an open-dictionary tag set; for the 2017 Places Challenge, 100 thing and 50 stuff categories covering 89% of all pixels were selected.

Implementation details

To train the model on these three datasets, we use stochastic gradient descent (SGD) [57] with the poly learning rate decay strategy, in which the initial learning rate is multiplied by \(\left( {1 - \frac{{{\text{iter}}}}{{{\text{max\_iter}}}}} \right)^{{{\text{power}}}}\). For training and validation on the Cityscapes dataset, we use a learning rate of 0.0025, momentum of 0.9, and weight decay of 0.0005. In the training and validation stages, we crop the original images to 1024 × 512 (Cityscapes), 512 × 512 (PASCAL VOC 2012), and 520 × 520 (ADE20K). During training, the input image is randomly scaled by a factor of 0.5 to 2 and flipped horizontally for data augmentation. Our backbone network is ResNet101, pre-trained on the ImageNet dataset [58]. For the Cityscapes dataset, the batch size is 4 and the training period is 160 K iterations; for PASCAL VOC 2012, the batch size is 8 and the training period is 100 K iterations; for ADE20K, the batch size is 8 and the training period is 200 K iterations. All experiments were performed on a single V100 GPU. Following ANNet, our model is optimized using two cross-entropy losses: the first is applied to the output of ResNet101's fourth stage, and the second to the model's output. Thus, the total loss function is

$$ \begin{array}{*{20}c} {l = \lambda l_{{{\text{model}}}} + l_{{{\text{backbone}}_{{{\text{stage4}}}} }} } \\ \end{array} , $$
(7)

where \(l_{{{\text{backbone}}_{{{\text{stage4}}}} }}\) denotes the loss function at the output of backbone \({\text{Stage}}_{4}\), \(l_{{{\text{model}}}}\) denotes the loss function at the output of the model, and \(\lambda\) is set to 0.4.
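For illustration, a literal sketch of Eq. (7) and the poly learning-rate schedule follows (our own code and names; power = 0.9 is an assumed common setting, and both logits are assumed to be at label resolution; note that many implementations instead down-weight the auxiliary term, whereas this follows Eq. (7) as written):

```python
import torch.nn.functional as F

LAMBDA = 0.4  # lambda in Eq. (7)

def total_loss(model_logits, stage4_logits, labels):
    """Two cross-entropy losses combined literally as in Eq. (7)."""
    l_model = F.cross_entropy(model_logits, labels, ignore_index=255)
    l_stage4 = F.cross_entropy(stage4_logits, labels, ignore_index=255)
    return LAMBDA * l_model + l_stage4

def poly_lr(base_lr, it, max_it, power=0.9):
    """Poly decay: base_lr * (1 - iter/max_iter)^power (power assumed 0.9)."""
    return base_lr * (1 - it / max_it) ** power
```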


Evaluation metrics In this paper, we use pixel accuracy (PA), intersection over union (IoU), and mean IoU (mIoU) as our evaluation metrics, calculated as follows:

$$ \begin{array}{*{20}c} {{\text{PA}} = \frac{{\mathop \sum \nolimits_{i = 0}^{k} p_{ii} }}{{\mathop \sum \nolimits_{i = 0}^{k} \mathop \sum \nolimits_{j = 0}^{k} p_{ij} }},} \\ \end{array} $$
(8)
$$ \begin{array}{*{20}c} {{\text{IoU}} = \frac{{p_{ii} }}{{\mathop \sum \nolimits_{j = 0}^{k} p_{ij} + \mathop \sum \nolimits_{j = 0}^{k} p_{ji} - p_{ii} }} \quad i = 0,1,2 \ldots k,} \\ \end{array} $$
(9)
$$ \begin{array}{*{20}c} {{\text{mIoU}} = \frac{1}{k + 1}\mathop \sum \limits_{i = 0}^{k} \frac{{p_{ii} }}{{\mathop \sum \nolimits_{j = 0}^{k} p_{ij} + \mathop \sum \nolimits_{j = 0}^{k} p_{ji} - p_{ii} }}.} \\ \end{array} $$
(10)

PA: the ratio of correctly classified pixels to the total number of pixels. IoU: for each category, the ratio of the intersection to the union of the true and predicted regions. mIoU: the IoU is first calculated for each category and then averaged.
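These metrics are straightforward to compute from a confusion matrix; a short sketch (our own helper names) is:

```python
import numpy as np

# Sketch of the evaluation metrics in Eqs. (8)-(10) computed from a
# confusion matrix p, where p[i, j] counts pixels of class i predicted
# as class j (k+1 classes in total).
def confusion_matrix(pred, gt, num_classes):
    mask = gt < num_classes                    # drop ignore labels (e.g. 255)
    idx = num_classes * gt[mask] + pred[mask]  # flattened (gt, pred) index
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def pixel_accuracy(p):
    return np.diag(p).sum() / p.sum()          # Eq. (8)

def iou_per_class(p):
    inter = np.diag(p)
    union = p.sum(axis=1) + p.sum(axis=0) - inter
    return inter / np.maximum(union, 1)        # Eq. (9); clamp avoids 0/0

def mean_iou(p):
    return iou_per_class(p).mean()             # Eq. (10)

p = confusion_matrix(np.array([0, 1, 1]), np.array([0, 1, 0]), num_classes=2)
print(pixel_accuracy(p), mean_iou(p))          # 0.666..., 0.5
```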

Ablation study

In this section, we present ablation experiments to validate our method. We select various CERM and PEPM components and demonstrate their effect on the model. All subsequent experiments use ResNet101 as the network's backbone and train for 160 K iterations on the finely annotated Cityscapes dataset; we additionally train for 200 K iterations on the ADE20K dataset and 100 K iterations on the PASCAL VOC 2012 dataset.

Efficacy of the CERM and PEPM

Our model has two main components: CERM and PEPM; we evaluate the role of each in turn. As shown in Table 1, we add a cross-entropy loss after the \(1 \times 1\) convolution of the fourth stage of ResNet101; compared with training without this loss, mIoU increases from 80.64 to 82.67%. The experiments demonstrate that fully exploiting the features learned under loss supervision improves semantic segmentation performance. As shown in Table 2, when the fifth-stage feature map of ResNet101 is merged with the feature map output by PEPM, performance improves by 0.33% compared to not merging them, demonstrating that combining high-stage and low-stage feature maps benefits semantic segmentation. To investigate the impact of ANNet's multi-scale pyramid pooling on MRNNet, we insert multi-scale pyramid pooling at the two CERM inputs, with pool sizes (1, 2, 3, 6), (1, 3, 6, 8), and (1, 4, 8, 12). As shown in Table 3, the segmentation quality improves as the number of pooled sampling points increases. As shown in Table 4, we remove the multi-scale pyramid pooling module from CERM and add it to PEPM to examine the network's response to different pool sizes; the mIoU value increases with pool size, peaking at (1, 4, 8, 12).

Table 1 Ablation experiments for \(+\mathrm{loss}\) and \(-\mathrm{loss}\) on the Cityscapes validation set
Table 2 Ablation experiments for \(-\mathrm{fusion}\) and \(+\mathrm{fusion}\) on the Cityscapes validation set
Table 3 Ablation experiments on various pyramid pooling scales for CERM
Table 4 Ablation experiments with PEPM using pyramid pooling of different scales on the K (Key) and V (Value) branches

Efficiency comparison with asymmetric non-local neural networks for semantic segmentation

This section compares MRNNet and ANNet in terms of inference time, model parameters, and memory consumption. As shown in Table 5, the inference time and parameter count of our model are lower than those of ANNet; although our model uses more GPU memory than ANNet, it has the advantage overall. For a more intuitive comparison, we visualize the segmentation results of ANNet and MRNNet. As shown in Fig. 5, the first, second, third, and fourth columns show the image, the ground truth, ANNet's results, and ours, respectively; it is evident that our model segments better.

Table 5 Comparison of inference time (s), GPU memory usage, and parameters across networks
Fig. 5 Visual comparison on the PASCAL VOC 2012 dataset: a Image, b Ground Truth, c ANNet, and d Ours

Comparisons with other methods on Cityscapes and ADE20K datasets

Cityscapes To demonstrate the superiority of our method, we compare it with other techniques on the test set. Note that we do not include the validation set when training: the model is trained for 160 K iterations directly on the finely annotated images. As shown in Table 6, our method outperforms the listed methods, achieving 81.9% mIoU.

Table 6 On the Cityscapes dataset, the Val column indicates whether the finely annotated Cityscapes validation set was used to train the model

In addition, we provide a qualitative comparison with other methods on the Cityscapes dataset. As depicted in Fig. 6, CERM obtains rich information about class regions and produces consistent segmentation results within classes; in other words, CERM groups pixels with the same label into the same class. For example, in the first row of Fig. 6 (third column), other methods split the motorcycle into different classes, and in the fourth row the truck's rear end and the car's front end are not segmented cleanly, whereas our model does not have these issues. Our model is also capable of segmenting elongated objects, such as the sidewalk in the second row of Fig. 6, which the other models fail to capture. As shown in Table 7, even though ADE20K contains images of varying sizes, there are many semantic category gaps between the training and validation sets, and PSPNet uses the deepest backbone network, our model achieves 45.26%, outperforming the other methods.

Fig. 6 Visual comparison on the Cityscapes dataset: a Image, b Ground Truth, c ANNet, d DANet, and e Ours

Table 7 Comparisons with existing methods on the ADE20K validation set

Conclusion

This paper proposes Mutually Reinforcing Non-local Neural Networks for Semantic Segmentation (MRNNet), whose core contribution is the channel enhancement regions module (CERM) and the position-enhanced pixels module (PEPM). CERM employs a loss function at the fourth stage of ResNet101 so that each channel of the fourth-stage feature map accurately represents one category. PEPM makes effective use of the feature map generated by CERM, computes the relationships between its pixels and the fifth-stage feature map, and establishes long-distance interdependence between pixels. Under the interaction of CERM and PEPM, our model effectively captures long-distance context information, improving category representation and achieving more accurate segmentation. Experiments demonstrate that our model achieves remarkable performance on three scene segmentation datasets: Cityscapes, PASCAL VOC 2012, and ADE20K.