Introduction

Semantic segmentation is a critical task in computer vision that recognizes the content of an image by classifying and labeling each pixel. As a core technology for autonomous driving [1, 2], agriculture [3, 4] and remote sensing image analysis [5, 6, 7], semantic segmentation has attracted much attention since FCN [8] was proposed.

Fig. 1

Architecture of a OCR block, b our block. \(f_{pix}\) pixel feature representation, \(f_{cate}\) category feature representation, B intra-class context representation, FCA feature categorization attention, CRA channel relation attention

In recent years, significant advancements have been made in semantic segmentation by improving FCNs with the encoder-decoder architecture. This line of research primarily focuses on the crucial task of capturing contextual information. For instance, UNet [9] captures contexts by concatenating the upsampled feature maps in the decoder with the corresponding feature maps in the encoder, maximizing the use of intermediate-level features. In contrast, SegNet [10] uses the max-pooling indices recorded by the encoder to upsample the feature maps in the decoder. However, such convolution-plus-pooling operations lack access to global contexts, resulting in incomplete segmentation of large objects. To overcome this challenge, DeeplabV2 [11] and PSPNet [12] employ ASPP and PPM, respectively, to expand the receptive field and capture multi-scale contexts, thereby improving segmentation accuracy. Furthermore, other notable studies use a convolutional backbone to extract features and introduce attention mechanisms to capture global contexts. For instance, DANet [13] applies self-attention in parallel along the spatial and channel dimensions, capturing long-distance dependencies and channel connections and thus extracting rich contexts. Similarly, CBAM [14] applies channel attention and spatial attention sequentially: it first computes channel attention through max pooling and average pooling, and then computes spatial attention on the result. These approaches improve the overall performance of semantic segmentation.

Unlike these approaches, as shown in Fig. 1a, OCR [15] aggregates same-category features to capture precise contexts before applying spatial attention, which not only improves segmentation performance but also reduces the quadratic complexity of the attention mechanism. However, OCR neglects channel relationships and does not capture the contexts between related categories. As shown in Fig. 1b, our model adds a Pyramid Pooling Module (PPM) to capture important and precise contexts before aggregating same-category features. Then, in the spatial dimension, the FCA captures the long-range dependencies between categories and pixel locations. Next, in the channel dimension, the output of the FCA serves as an input branch of the CRA, which captures the connections between categories. By adopting this Nested Attention framework, NANet captures long-distance dependencies between spatial positions and correlations among channels while efficiently addressing the challenge of high computational complexity.

In summary, this paper makes the following three contributions:

  • The proposed NANet uses the PPM sampling module to capture accurate same-category contexts with low computational complexity; the resulting representation serves as the input to the FCA and CRA.

  • The output of the FCA is used as an input branch of the CRA, nesting attention across two different dimensions: after the long-distance dependencies between locations are captured, the inter-channel relationships are computed to capture the contexts of associated categories.

  • We evaluated the performance of NANet on three datasets, Cityscapes, PASCAL VOC 2012 and ADE20K, achieving promising experimental results.

Related work

Semantic segmentation

Variant models based on Fully Convolutional Networks have made significant advances in semantic segmentation. UNet [9], utilizing an encoder-decoder structure, effectively combines the decoder features with their corresponding encoder features to enhance contextual understanding. PSPNet [12] incorporates the Pyramid Pooling Module (PPM), while Deeplabv2 [11] and Deeplabv3 [16] utilize Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context. DANet [13] employs spatial attention and channel attention mechanisms to capture long-range dependencies across different dimensions. OCR [15] designs category priors to capture more precise context.

Table 1 The comparison of the effectiveness of channel attention and spatial attention in different tasks

Attention mechanism

In many state-of-the-art works [17, 18], attention mechanisms are applied to capture long-range dependencies in different dimensions. For example, ANNet [19] captures global dependencies between all pixels in the spatial dimension using two self-attention modules. SENet [20] uses channel attention to assign different weights to different channel maps. However, ANNet fails to capture inter-channel correlations, while SENet overlooks long-distance dependencies among spatial pixels. To overcome these shortcomings, some works connect spatial attention and channel attention in series or in parallel to capture rich contexts. For example, DANet [13] uses spatial attention and channel attention in parallel, while CBAM [14] first uses channel attention to extract channel features and then uses spatial attention to extract important location information. Channel attention and spatial attention are thus combined in different ways in these works; please refer to Table 1 for details. However, employing both spatial attention and channel attention introduces a notable increase in computational complexity. To address this issue, we propose a novel approach called Nested Attention.

Context

Context is critical for dense prediction tasks. Many works [21, 22] have been devoted to capturing rich contexts to study semantic segmentation tasks. Some state-of-the-art methods capture multi-scale contexts to improve semantic segmentation performance. For example, PSPNet [12] uses different scales of pooling to obtain different sizes of receptive fields and extract features at different scales. Deeplabv2 [11] varies the size of the receptive field with different dilated rates to extract multi-scale contexts. There are also methods that introduce attention mechanisms to capture the global contexts. For example, DANet [13] captures the global contexts of spaces and channels. Unlike these approaches, OCR [15] aggregates same-category features to form object regions that capture the contexts within the object. Inspired by OCR [15], after utilizing pyramid pooling to capture multi-scale context information, our method also aggregates features from the same category to capture precise intra-category contexts.

Method

This section first describes the overall architecture of Nested Attention Network (NANet), and then introduces the Category Context Ensemble (CCE) module and the Nested Attention (NA) module, respectively.

Overview

We propose NANet, which captures rich contexts to address the problems of unclear object boundaries and pixel misclassification in semantic segmentation. NANet uses HRNet-W48 as the convolutional backbone to down-sample the input image to one-fourth of its original size, preserving the spatial detail of the feature map. Then, as shown in Fig. 2, the output feature maps of the backbone first pass through the CCE module to compute the intra-class contexts, and then through the NA module, which categorizes the spatial pixel features, captures the channel connections of the intra-class contexts, and enhances the spatial long-range dependencies with the captured channel relationships.

Fig. 2

The diagram of the overall architecture of NANet

Category context ensemble module

The feature \(X\in {\mathbb {R}}^{C^{\prime } \times H\times W}\) extracted by the convolutional backbone network is fed to the pixel representation and category representation branches. Here, \(C^{\prime },\) H and W denote the number of channels, height and width. We employ a point convolution module (PC) independently in each branch to adjust the number of channels of the feature X. This process, which reduces the computational cost while maintaining segmentation performance, can be formulated as,

$$\begin{aligned} f_{pix}= & {} {{\textbf {PC}}}_{pix} \left( X \right) , \nonumber \\ f_{cate}= & {} {{\textbf {PC}}}_{cate} \left( X \right) , \end{aligned}$$
(1)

where pixel feature representation \(f_{pix}\in {\mathbb {R}}^{C \times H\times W}\) is for capturing intra-class context and category feature representation \(f_{cate} \in {\mathbb {R}}^{N \times H\times W}\) is for modeling spatial context. C denotes the number of channels after \({{\textbf {PC}}}_{pix},\) while N is the number of categories contained in the dataset. In addition, the point convolution module consists of a convolution layer with kernel size of \(1 \times 1,\) a batch normalization layer, and a ReLU activation layer.
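For concreteness, a minimal sketch of such a point convolution module is given below. It is written in PyTorch purely for illustration (the paper's implementation is based on PaddleSeg), and the backbone width of 720, C = 512 and N = 19 (Cityscapes) are assumptions chosen only to make the shapes concrete.

```python
import torch
import torch.nn as nn

class PointConv(nn.Module):
    """1x1 convolution followed by batch normalization and ReLU, as described above."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Illustrative shapes: C' = 720 (assumed backbone width), C = 512, N = 19 (Cityscapes)
X = torch.randn(2, 720, 128, 256)     # (batch, C', H, W)
f_pix = PointConv(720, 512)(X)        # pixel feature representation, (2, 512, 128, 256)
f_cate = PointConv(720, 19)(X)        # category feature representation, (2, 19, 128, 256)
```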

Specifically, the \({{\textbf {PPM}}}\) samples the pixel feature representation \(f_{pix}\) and the category feature representation \(f_{cate}\) separately to generate the intermediate multiscale features \(f_{pix}^{\prime }\) and \(f_{cate}^{\prime }.\) The intermediate features are then flattened and concatenated to form the final representations \(f_{pix}^{ms} \in {\mathbb {R}}^{C \times S}\) and \(f_{cate}^{ms} \in {\mathbb {R}}^{N\times S}.\) This process can be formulated as,

$$\begin{aligned} f_{pix}^{ms}= & {} \left[ flatten \left( f_{pix}^{\prime } \right) \right] ,\quad f_{pix}^{\prime } = {{\textbf {PPM}}} \left( f_{pix} \right) , \nonumber \\ f_{cate}^{ms}= & {} \left[ flatten \left( f_{cate}^{\prime } \right) \right] ,\quad f_{cate}^{\prime } = {{\textbf {PPM}}} \left( f_{cate} \right) , \end{aligned}$$
(2)

where the output resolution of the pyramid pooling module is set to \(\left( 1,3,6,8 \right) \) in order, and \(S = 1^2 + 3^2 + 6^2 + 8^2 = 110.\) \([\cdot ]\) denotes concatenate operator.
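The sampling and flattening step might be implemented as in the following sketch. We assume the PPM uses adaptive average pooling at the stated output sizes; the batch size and channel counts are illustrative.

```python
import torch
import torch.nn.functional as F

def ppm_flatten(feat, bins=(1, 3, 6, 8)):
    """Pool feat (B, C, H, W) to each bin size, flatten, and concatenate along the
    spatial axis, giving (B, C, S) with S = 1 + 9 + 36 + 64 = 110."""
    pooled = [F.adaptive_avg_pool2d(feat, b).flatten(2) for b in bins]  # each (B, C, b*b)
    return torch.cat(pooled, dim=2)

f_pix = torch.randn(2, 512, 128, 256)     # illustrative sizes
f_cate = torch.randn(2, 19, 128, 256)
f_pix_ms = ppm_flatten(f_pix)             # (2, 512, 110)
f_cate_ms = ppm_flatten(f_cate)           # (2, 19, 110)
```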

Next, the category information contained in \(f_{cate}^{ms}\) is used to guide \(f_{pix}^{ms}\) to focus on the features that benefit segmentation, yielding the final intra-class context representation \(f_{cnt}.\) This process can be formulated as,

$$\begin{aligned} f_{cnt} = Softmax \left( f_{cate}^{ms} \right) \times {\left( f_{pix}^{ms} \right) }^{T}, \end{aligned}$$
(3)

where the dimension of \(f_{cnt}\) is \({N \times C}.\) The Softmax operator is computed over the last dimension of the tensor.

To retain rich spatial detail, the feature map \(X \in {\mathbb {R}} ^{C^{\prime } \times H\times W}\) is kept relatively large. Assuming H=128 and W=256, the computational complexity of computing the intra-class context representation \(f_{cnt}\) without PPM is \(\mathcal {O} \left( C\times 32768\times N \right) ,\) whereas with PPM it is reduced to \(\mathcal {O} \left( C \times 110 \times N \right) .\) Thus, adding PPM reduces the computation by a factor of \(\frac{32768}{110} \approx 298.\)
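The aggregation in Eq. (3) and the quoted saving can be sketched as follows; the shapes are illustrative and batched matrix multiplication is assumed.

```python
import torch

f_pix_ms = torch.randn(2, 512, 110)       # (B, C, S), illustrative sizes
f_cate_ms = torch.randn(2, 19, 110)       # (B, N, S)

attn = torch.softmax(f_cate_ms, dim=-1)               # softmax over the spatial axis S
f_cnt = torch.bmm(attn, f_pix_ms.transpose(1, 2))     # Eq. (3): (B, N, C)

# The matrix product costs O(C x S x N) per image; without PPM, S would be H x W.
print(128 * 256 / 110)                                # ~297.9, the ~298x saving quoted above
```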

Nested attention module

Nested Attention Module is composed of Feature Categorization Attention (FCA) and Channel Relation Attention (CRA). Many previous works [13, 14, 19, 23] have utilized spatial attention to capture the inter-dependencies between all locations. For example, the PAM module proposed by DANet [13] uses self-attention on the feature map with size of \({C\times H\times W}\) to obtain an attention map of size \(\left( HW \right) \times \left( HW \right) ,\) and generates a weight for each pixel.

Unlike the PAM of DANet [13], the Query and Key tensors of the FCA in the Nested Attention Module come from two different branches: the pixel feature representation \(f_{pix}\) and the intra-class context representation \(f_{cnt}.\) FCA employs three point convolution modules to transform the features \(f_{pix}\) and \(f_{cnt}\) from the CCE module, respectively. The formula is as follows:

$$\begin{aligned} f_{0}= & {} {{\textbf {PC}}}_{0} \left( f_{pix} \right) ,\nonumber \\ f_{1}= & {} {{\textbf {PC}}}_{1}\left( f_{cnt} \right) ,\nonumber \\ f_{2}= & {} {{\textbf {PC}}}_{2}\left( f_{cnt} \right) , \end{aligned}$$
(4)

where \( {f_0}\in {\mathbb {R}}^{C\times H\times W},\) \(\left\{ f_{1}, f_{2} \right\} \in {\mathbb {R}}^{C\times N\times 1}.\)

Then, the reshape operator is performed to transform the pixel feature representation \({f_0}\in {\mathbb {R}}^{C\times H\times W}\) and the intra-class context representation \(\left\{ f_{1}, f_{2} \right\} \in {\mathbb {R}}^{C\times N\times 1}\) into \(\bar{f_0}\in {\mathbb {R}} ^{C\times L}\) and \(\left\{ f_{1}, f_{2} \right\} \in {\mathbb {R}}^{C\times N},\) where \(L=H\times W.\) Next, the spatial attention map \(f^{sa} \in {\mathbb {R}}^{L\times N}\) is given by

$$\begin{aligned} f^{sa} = {\bar{f_0}}^{T} \times f_{1}. \end{aligned}$$
(5)

Next, normalize \(f^{sa}_{ij}\) according to Eq. (6),

$$\begin{aligned} {f^{sa}_{ij}}^{\prime } = \frac{exp\left( f^{sa}_{ij} \right) }{ {\sum _{k=1}^{L}}exp\left( f^{sa}_{kj} \right) }, \end{aligned}$$
(6)

where, \(f^{sa}_{ij}\) is the pixel value of \(\left( i, j\right) \) in the feature map \(f^{sa},\) and \({f^{sa}_{ij}}^{\prime }\) is the pixel value of \(\left( i, j\right) \) in the feature map \({f^{sa}}^{\prime }.\)

Finally, the spatial attention map \({f^{sa}}^{\prime } \in {\mathbb {R}}^{L\times N}\) is used to weight and aggregate the intra-class context representation \(f_{2} \in {\mathbb {R}}^{C\times N}.\) The output of the FCA is

$$\begin{aligned} O_{fca} = {f^{sa}}^{\prime } \times {f_2}^{T}, \end{aligned}$$
(7)

where \(O_{fca} \in {\mathbb {R}} ^{L\times C} .\)
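A minimal sketch of the FCA computation in Eqs. (4)-(7) follows. The reshaping of \(f_{cnt}\) to a \(C\times N\times 1\) tensor before the point convolutions is our assumption about the layout, the sizes are illustrative, and the point convolutions are shown without their BN/ReLU for brevity.

```python
import torch
import torch.nn as nn

B, C, N, H, W = 2, 512, 19, 128, 256          # illustrative sizes
L = H * W

f_pix = torch.randn(B, C, H, W)               # pixel feature representation
f_cnt = torch.randn(B, N, C)                  # intra-class context from the CCE module
f_cnt_4d = f_cnt.transpose(1, 2).unsqueeze(-1)            # viewed as (B, C, N, 1)

pc0, pc1, pc2 = (nn.Conv2d(C, C, kernel_size=1) for _ in range(3))
f0 = pc0(f_pix).flatten(2)                                # (B, C, L)
f1 = pc1(f_cnt_4d).squeeze(-1)                            # (B, C, N)
f2 = pc2(f_cnt_4d).squeeze(-1)                            # (B, C, N)

f_sa = torch.bmm(f0.transpose(1, 2), f1)                  # Eq. (5): (B, L, N)
f_sa = torch.softmax(f_sa, dim=1)                         # Eq. (6): normalize over L
O_fca = torch.bmm(f_sa, f2.transpose(1, 2))               # Eq. (7): (B, L, C)
```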

Unlike other approaches that use spatial and channel attentions, the output of the FCA in our model serves as an input branch to the CRA. The CRA computes the connections between channels and captures the long-range dependencies between categories, which are used to augment the output of the FCA. In the CRA, two \(1 \times 1\) convolutional layers \({{\textbf {PC}}}_{3}\) and \({{\textbf {PC}}}_{4}\) transform the intra-class context representation \(f_{cnt} \in {\mathbb {R}}^{C \times N \times 1}\) into \(\left\{ f_{3}, f_{4} \right\} \in {\mathbb {R}}^{C\times N\times 1}.\) After flattening, the relationship between channels is computed by

$$\begin{aligned} f^{ca} = f_3 \times {f_4}^{T}, \end{aligned}$$
(8)

where \(f^{ca} \in {\mathbb {R}} ^{C\times C}\) represents the channel attention map. Then, normalize \(f^{ca}\) according to Eq. (9),

$$\begin{aligned} {f^{ca}_{ij}}^{\prime } = \frac{exp\left( f^{ca}_{ij} \right) }{ {\sum _{k=1}^{C}} exp\left( f^{ca}_{ik} \right) } , \end{aligned}$$
(9)

where, \(f^{ca}_{ij}\) is the pixel value of \(\left( i, j\right) \) in the feature map \(f^{ca},\) and \({f^{ca}_{ij}}^{\prime }\) is the pixel value of \(\left( i, j\right) \) in the feature map \({f^{ca}}^{\prime }.\)

Finally, we augment the output of the FCA by applying the captured long-distance dependencies between categories to it; the output \(O_{cra}\) is given by

$$\begin{aligned} O_{cra} = O_{fca} \times {f^{ca}}^{\prime }, \end{aligned}$$
(10)

where \(O_{cra}\in {\mathbb {R}} ^{L\times C}.\) We then fuse \(O_{cra}\) with the reshaped pixel feature representation \(f_{pix} \in {\mathbb {R}} ^{L\times C},\)

$$\begin{aligned} Y = Concate\left( O_{cra}, f_{pix} \right) . \end{aligned}$$
(11)

Finally, Y is fed into the classifier to obtain the output segmentation map.
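The CRA and the final fusion in Eqs. (8)-(11) can be sketched analogously. This is again an illustrative PyTorch version with assumed shapes and dummy input tensors, so it can be read independently of the FCA sketch above.

```python
import torch
import torch.nn as nn

B, C, N, L = 2, 512, 19, 128 * 256            # illustrative sizes
f_cnt_4d = torch.randn(B, C, N, 1)            # intra-class context, as in the FCA sketch
O_fca = torch.randn(B, L, C)                  # output of the FCA
f_pix_flat = torch.randn(B, C, L)             # pixel features, flattened spatially

pc3, pc4 = (nn.Conv2d(C, C, kernel_size=1) for _ in range(2))
f3 = pc3(f_cnt_4d).squeeze(-1)                                    # (B, C, N)
f4 = pc4(f_cnt_4d).squeeze(-1)                                    # (B, C, N)

f_ca = torch.softmax(torch.bmm(f3, f4.transpose(1, 2)), dim=2)    # Eqs. (8)-(9): (B, C, C)
O_cra = torch.bmm(O_fca, f_ca)                                    # Eq. (10): (B, L, C)
Y = torch.cat([O_cra, f_pix_flat.transpose(1, 2)], dim=2)         # Eq. (11): (B, L, 2C)
```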

Experiment

We conducted extensive experiments on the Cityscapes [24], PASCAL VOC 2012 [25] and ADE20K [26] datasets to evaluate the performance of the proposed Nested Attention Network.

Datasets

Cityscapes. The Cityscapes dataset contains 5000 finely labeled images of town streets, of which 2975 images were used for training, 500 for validation, and 1525 for testing. These images are categorized into 19 classes and each image is \(1024\times 2048\) in size.

PASCAL VOC 2012. The PASCAL VOC 2012 dataset contains 13,487 images that are categorized into 21 categories, one background category and 20 object categories. Our model is trained using augmented data containing 10,582 images. The remaining 1449 images are used for validation and 1456 images are used for testing. These images are cropped to \(512\times 512.\)

ADE20K. The ADE20K dataset contains 25,562 images covering 150 categories. The training set contains 20,210 images, validation set contains 2000 images and test set contains 3352 images. The image cropping size is \(512\times 512.\)

Implementation details

Training Objectives. Two cross-entropy loss functions, \(L_{A}\) and \(L_{O},\) are used to supervise the category feature representation of the CCE module and the final output of the model, respectively. We use the weighted sum of these two cross-entropy losses as the objective function L,

$$\begin{aligned}{} & {} min L = \mu L_{A } + L_{O} ,\end{aligned}$$
(12)
$$\begin{aligned}{} & {} L_{A} = \frac{\gamma }{2} \Vert \varphi \Vert _2^2 - \sum _{i=1}^L \sum _{n=1}^N Y_i^n \log P_n \left( X_i,\varphi \right) \end{aligned}$$
(13)
$$\begin{aligned}{} & {} L_{O}=\frac{\beta }{2} \Vert \theta \Vert _2^2 - \sum _{i=1}^L \sum _{n=1}^N y_i^n \log p_n \left( x_i,\theta \right) , \end{aligned}$$
(14)

where \(\theta \) and \(\varphi \) are network parameters with regularization added. \(p_{n}\) is the predicted probability of pixel \(x_{i}\) in the final output feature map with respect to the nth category, and \(P_{n}\) is the predicted probability of pixel \(X_{i}\) in the category feature representation with respect to the nth category. \(Y_{i}^{n}\) is the ground-truth semantic label of the nth category for pixel \(X_{i},\) and \(y_{i}^{n}\) is the ground-truth semantic label of the nth category for pixel \(x_{i}.\) \(\gamma \) and \(\beta \) are weight decay parameters. \(\mu \) indicates the weight on the loss \(L_{A}\) and is set to 0.4. The training goal is to find the model parameters that minimize L.
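A minimal sketch of this objective is shown below. It assumes standard pixel-wise cross-entropy terms with the regularization (weight decay) delegated to the optimizer, and the ignore label of 255 is an assumption.

```python
import torch.nn as nn

mu = 0.4                                       # weight on the auxiliary loss L_A
ce = nn.CrossEntropyLoss(ignore_index=255)     # ignore label 255 is an assumption

def total_loss(aux_logits, out_logits, target):
    """L = mu * L_A + L_O; both terms are pixel-wise cross-entropy losses, and the
    weight-decay terms are handled by the optimizer in practice."""
    return mu * ce(aux_logits, target) + ce(out_logits, target)
```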

Training Settings. The entire NANet is implemented on the PaddleSeg framework. Stochastic gradient descent (SGD) is used to optimize the network. At each iteration, the base learning rate is multiplied by the poly factor \(\left( 1- \frac{iter}{max\_iter} \right) ^{power},\) where power is set to 0.9.
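As a sketch, assuming the standard poly formulation, the schedule can be written as a small helper:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly schedule: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

print(poly_lr(0.01, 75_000, 150_000))   # ~0.0054 halfway through Cityscapes training
```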

Cityscapes. When training our model on Cityscapes dataset, the initial learning rate is set to 0.01, batch size is set to 8, weight decay is set to 0.0005, the number of training iterations is 150K, and the cropping size of the image is \(1024\times 512.\)

PASCAL VOC 2012. When training the model on the PASCAL VOC 2012 dataset, the initial learning rate is 0.01, weight decay is 0.0001, batch size is 16, and the number of training iterations is 40K.

ADE20K. When training our model on the ADE20K dataset, the initial learning rate is set to 0.01, weight decay is set to 0.0001, batch size is set to 16, and the number of training iterations is 150K.

Evaluation metric

The mean Intersection over Union (mIoU) and Accuracy (Acc) were employed to evaluate the model performance. For a dataset with K classes, the mIoU is defined as follows:

$$\begin{aligned} mIoU = \frac{1}{K}\sum _{i=1}^{K} {\frac{|A_i \cap B_i|}{|A_i \cup B_i|}}, \end{aligned}$$
(15)

where \(A_i\) is the set of pixels predicted as class i by the segmentation model and \(B_i\) is the corresponding ground-truth set.

Accuracy (Acc) is a widely used evaluation metric in semantic segmentation to quantify the pixel-level classification accuracy of a model. The Acc can be expressed as follows:

$$\begin{aligned} Acc= \frac{P_{T}+N_{T}}{P_{T}+N_{T}+P_{F}+N_{F}} \end{aligned}$$
(16)

where \(P_{T}\) refers to the number of pixels correctly predicted as positive by the model, \(N_{T}\) represents the number of pixels correctly predicted as negative, \(P_{F}\) indicates the number of pixels incorrectly predicted as positive when they are actually negative, and \(N_{F}\) signifies the number of pixels incorrectly predicted as negative when they are actually positive.
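For reference, both metrics can be computed from flattened label arrays as in the following sketch; the ignore label of 255 is an assumption.

```python
import numpy as np

def miou_and_acc(pred, gt, num_classes, ignore_index=255):
    """Mean IoU (Eq. 15) and pixel accuracy (Eq. 16) from flat label arrays."""
    mask = gt != ignore_index
    pred, gt = pred[mask], gt[mask]
    acc = float((pred == gt).mean())
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)), acc
```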

Comparisons with other methods

Efficiency comparisons

To assess the efficiency of our model, we compare NANet with several state-of-the-art methods in terms of total parameters and computation. The data in the table were obtained in the same test environment. To avoid the effect of different backbone networks, we count only the parameters and floating-point operations of each method after removing its backbone. As shown in Table 2, DANet [13] has 23.94M parameters, 4.45 times that of NANet, and its 196.09G FLOPs exceed NANet's by 41.42G. OCR [15] has 5.99M parameters, 0.61M more than NANet, and its computation of 169.88G FLOPs is 15.21G higher than NANet's. PSPNet [12] has 8.24 times the parameters and 2.13 times the computation of NANet. This analysis shows that NANet has the fewest parameters and lower computation than the other models, indicating that NANet is a lightweight network with low model complexity.

Table 2 Comparison of the number of parameters (Params) and the total number of floating-point operations (FLOPs) for different models (excluding backbone)
Table 3 Performance comparison of our method with several typical methods on the Cityscapes test set

Performance comparisons

Cityscapes. To evaluate the performance of the proposed NANet, we submitted its predictions to the official Cityscapes server; the comparison with other methods on the Cityscapes test set is shown in Table 3. Note that our model does not use the coarsely annotated data. As can be seen in Table 3, DANet [13] and ANNet [19] achieve 81.5% and 81.3% mIoU, respectively, while our model achieves 82.6%. OCR obtains 82.4%, which is 0.2% lower than our model. In addition, our model outperforms recent models such as FRM and GC-MobileSeg-Tiny.

Table 4 Performance comparison of our method with several typical methods on the Cityscapes validation set
Fig. 3

Qualitative comparison of our method with several state-of-the-art methods on the Cityscapes validation set. Row 1: Input images. Rows 2–5: The segmentation results of FCN, PSPNet, OCR and DANet. Row 6: The segmentation results of our model. Row 7: Ground truth. The yellow boxes mark where our model is particularly superior to other methods

In addition, we train all the models in Table 4 in the same environment using only fine-grained training set data, and evaluate the performance of the models on the Cityscapes validation set. Table 4 shows the results of comparing our model with state-of-the-art models on the Cityscapes validation set. Our model achieves 82.47% mIoU, a 0.32% improvement over the performance of OCR, and a 2.27% mIoU increase over the performance of DANet, a network that uses spatial attention and channel attention in parallel. Furthermore, our model outperforms other models with a higher accuracy (Acc) of 96.82%.

Figure 3 presents a qualitative comparison of several typical methods. As seen in the first, third and fourth columns of Fig. 3, our model segments thin poles better than the other models, indicating a stronger ability to capture slender targets. As can be seen from the second column, our model handles large objects without intra-class confusion and labels intra-class pixels more consistently than the other methods.

Table 5 Performance comparison of our method with several typical methods on PASCAL VOC2012 validation set
Fig. 4

Qualitative comparison of our method with several state-of-the-art methods on the PASCAL VOC2012 validation set. Row 1: Input images. Rows 2–4: The segmentation results of DANet, PSPNet and OCR. Row 5: The segmentation results of our model. Row 6: Ground truth. The yellow boxes mark where our model is particularly superior to other methods

PASCAL VOC 2012. We report the results of comparing NANet with state-of-the-art methods on the PASCAL VOC 2012 validation set in Table 5. As can be seen, NANet performs better than the other models, obtaining 81.37% mIoU, 1.42% higher than FCN with the same backbone network. OCR achieves 80.94%, which is 0.43% lower than our model. Furthermore, our model achieves higher accuracy (Acc) than the other models.

Figure 4 shows the segmentation results of several typical methods on the PASCAL VOC 2012 validation set. The first and third columns of Fig. 4 show that our model handles the shape of the bottle better. The fourth and fifth columns show that our model has a more detailed treatment of the human legs compared to the other models, without the occurrence of missing parts of the target. The second column shows that for different objects of the same category, our model guarantees the independence of a single object from other objects labeled with the same category.

Table 6 Performance comparison of our method with several typical methods on ADE20K validation set
Table 7 Ablation study in terms of NANet without PPM and with PPM of different output sizes in the CCE module

ADE20K. We performed comparative experiments on the challenging ADE20K dataset. Table 6 reports the performance comparison between NANet and other models on the ADE20K validation set. Our model outperforms most models, obtaining 45.46% mIoU, while OCR obtains 45.28%, i.e., a 0.18% improvement for our model. Compared to PSANet, the performance of NANet improves by 1.35%.

Ablation study

In this section, we do extensive experiments on the Cityscapes validation set to explore the validity of NANet. In NANet, there are two important modules: the Category Context Ensemble Module and the Nested Attention Module. In the CCE Module, we explore the effectiveness of the Pyramid Pooling Module (PPM). For the Nested Attention Module, we explore the role of the Feature Categorization Attention Module (FCA) and the Channel Relationship Attention Module (CRA).

Ablation study for PPM

In “Category context ensemble module”, we add PPM to the CCE module to reduce the computational complexity while maintaining performance, and we also explore the effect of the PPM output size on model performance. Table 7 shows the experimental results for several options. As can be seen in the first row, NANet achieves 82.52% mIoU when the PPM module is not added. When the PPM is added, there are three choices for its output sizes: (1, 2, 3, 6), (1, 3, 6, 8) and (1, 4, 8, 12), which after flattening and concatenation yield 50, 110 and 225 sampling points, respectively. The second row shows that the model reaches 82.41% mIoU with 50 sampling points, the third row 82.47% with 110 sampling points, and the fourth row 82.49% with 225 sampling points. The results show that adding PPM slightly decreases performance, and that performance improves as the number of sampling points increases. Balancing computational complexity and performance, we choose the PPM configuration with 110 sampling points.

In addition, we conducted an experiment to demonstrate the advantage of the pyramid pooling module (PPM) over pyramid convolution (PyConv). In this experiment, we replaced pyramid pooling with pyramid convolution; the results are shown in Table 8. They indicate that pyramid pooling performs comparably to pyramid convolution in our model, while pyramid convolution introduces more parameters (0.5M) and higher FLOPs (14.2G) than pyramid pooling.

Table 8 Ablation study on the Cityscapes validation set comparing pyramid pooling (PPM) and pyramid convolution (PyConv)
Table 9 Ablation study on the Cityscapes validation set of FCA and CRA, both with and without PPM

Ablation study for nested attention module

In order to investigate the effectiveness of CRA and FCA, we conduct experiments for both cases of using only FCA and using FCA + CRA module in Nested Attention Module. The results of the experiments for several different cases are shown in Table 9. Rows 4 and 5 in Table 9 are the results when the CCE Module does not use PPM. Row 5 shows that when the Nested Attention Module uses FCA and CRA, NANet has the best performance of 82.52%, and at this time, the computational complexity is also the highest. Row 4 shows that when Nested Attention Module uses only FCA, at this time, 82.15% mIoU is obtained, which is 0.37% lower than the performance when CRA is used. This shows that CRA plays an important role in model performance improvement. Considering the reduction of computational complexity, rows 2 and 3 in Table 9 show the results for the case where PPM is used in the CCE module. Row 2 shows that the model performance is 81.92% when no CRA is added to the Nested Attention Module and only FCA is used. Row 3 shows that when CRA is added to the Nested Attention Module, the model performance achieves 82.47% mIoU, which is a 0.55% improvement over the performance when CRA is not added. The experimental results in Table 9 show that CRA can capture the contextual information ignored by FCA and plays a crucial role in capturing rich contexts.

Conclusion

In this paper, we propose the Nested Attention Network (NANet) for semantic segmentation. NANet is a lightweight model that captures rich and efficient contexts. It aggregates category contexts with low computational complexity and uses the obtained category context representation to nest the Feature Categorization Attention (FCA) with the Channel Relation Attention (CRA). After FCA aggregates features of the same category in the spatial dimension, CRA captures the inter-channel dependencies of the category context representation and is used to enhance the output of FCA. We evaluated NANet on three datasets: Cityscapes, PASCAL VOC 2012 and ADE20K. Extensive experiments show that NANet performs better than state-of-the-art methods. In future work, we will further enhance the context aggregation capability of the category feature representation to further improve the performance of NANet.