Introduction

Semantic segmentation is a critical task in computer vision that recognizes the content of an image by classifying and labeling each pixel. As a core technology for autonomous driving [1, 2], agriculture [3, 4] and remote sensing image analysis [5, 6, 7], semantic segmentation has attracted much attention since FCN [8] was proposed.

Fig. 1

Architecture of a OCR block, b our block. \(f_{pix}\) pixel feature representation, \(f_{cate}\) category feature representation, B intra-class context representation, FCA feature categorization attention, CRA channel relation attention

In recent years, significant advancements have been made in semantic segmentation by improving FCNs with the encoder-decoder architecture. This line of research primarily focuses on the crucial task of capturing contextual information. For instance, UNet [9] captures contexts by concatenating the upsampled feature maps in the decoder with the corresponding feature maps in the encoder, maximizing the use of intermediate-level features. In contrast, SegNet [10] uses the max-pooling indices recorded by the encoder to upsample the feature maps in the decoder. However, such convolution-plus-pooling operations lack access to global contexts, resulting in incomplete segmentation of large objects. To overcome this challenge, DeeplabV2 [11] and PSPNet [12] employ ASPP and PPM, respectively, to expand the receptive field and capture multi-scale contexts, thereby improving segmentation accuracy. Furthermore, other notable studies use a convolutional backbone to extract features and introduce attention mechanisms to capture global contexts. For instance, DANet [13] applies self-attention in parallel along the spatial and channel dimensions, capturing long-distance dependencies and channel connections and thus extracting rich contexts. Similarly, CBAM [14] applies channel attention and spatial attention sequentially: it first computes channel attention through max pooling and average pooling, and then computes spatial attention on the result. These approaches improve the overall performance of semantic segmentation.

Unlike these approaches, as shown in Fig. 1a, OCR [15] aggregates same-category features to capture precise contexts before applying spatial attention, which not only improves segmentation performance but also reduces the quadratic complexity of the attention mechanism. However, OCR neglects channel relationships and does not capture the contexts between related categories. As shown in Fig. 1b, our model adds a Pyramid Pooling Module (PPM) to capture important and precise contexts before aggregating same-category features. Then, in the spatial dimension, the FCA captures the long-range dependencies between categories and pixel locations. Next, in the channel dimension, the output of the FCA serves as an input branch of the CRA, which captures the connections between categories. By adopting this Nested Attention framework, NANet captures long-distance dependencies between spatial positions and correlations among channels while efficiently addressing the challenge of high computational complexity.

In summary, this paper makes the following three contributions:

  • The proposed NANet uses the PPM sampling module to capture accurate same-category contexts with low computational complexity; the resulting representation serves as the input to the FCA and CRA.

  • The output of the FCA is used as an input branch of the CRA, nesting attention across two different dimensions: after the long-distance dependencies between locations are captured, the inter-channel relationships are computed to capture the contexts of associated categories.

  • We evaluated the performance of NANet on three datasets, Cityscapes, PASCAL VOC 2012 and ADE20K, achieving promising experimental results.

Related work

Semantic segmentation

Variant models based on Fully Convolutional Networks have made significant advances in semantic segmentation. UNet [9], utilizing an encoder-decoder structure, effectively combines the decoder features with their corresponding encoder features to enhance contextual understanding. PSPNet [12] incorporates the Pyramid Pooling Module (PPM), while Deeplabv2 [11] and Deeplabv3 [16] utilize Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context. DANet [13] employs spatial attention and channel attention mechanisms to capture long-range dependencies across different dimensions. OCR [15] designs category priors to capture more precise context.

Table 1 The comparison of the effectiveness of channel attention and spatial attention in different tasks

Attention mechanism

In many state-of-the-art works [17, 18], attention mechanisms are applied to capture long-range dependencies in different dimensions. For example, ANNet [19] captures global dependencies between all pixels in the spatial dimension using two self-attention modules. SENet [20] uses channel attention to assign different weights to different channel maps. However, ANNet fails to capture inter-channel correlations, while SENet overlooks long-distance dependencies among spatial pixels. To overcome these shortcomings, some works connect spatial attention and channel attention in series or in parallel to capture rich contexts. For example, DANet [13] uses spatial attention and channel attention in parallel, while CBAM [14] first uses channel attention to extract channel features and then uses spatial attention to extract important location information. Channel attention and spatial attention are thus combined in different ways in these works; please refer to Table 1 for details. However, employing both spatial attention and channel attention introduces a notable increase in computational complexity. To address this issue, we propose a novel approach called Nested Attention.

Context

Context is critical for dense prediction tasks. Many works [21, 22] have been devoted to capturing rich contexts to study semantic segmentation tasks. Some state-of-the-art methods capture multi-scale contexts to improve semantic segmentation performance. For example, PSPNet [12] uses different scales of pooling to obtain different sizes of receptive fields and extract features at different scales. Deeplabv2 [11] varies the size of the receptive field with different dilated rates to extract multi-scale contexts. There are also methods that introduce attention mechanisms to capture the global contexts. For example, DANet [13] captures the global contexts of spaces and channels. Unlike these approaches, OCR [15] aggregates same-category features to form object regions that capture the contexts within the object. Inspired by OCR [15], after utilizing pyramid pooling to capture multi-scale context information, our method also aggregates features from the same category to capture precise intra-category contexts.

Method

This section first describes the overall architecture of Nested Attention Network (NANet), and then introduces the Category Context Ensemble (CCE) module and the Nested Attention (NA) module, respectively.

Overview

We propose NANet, which captures rich contexts to address the problems of unclear object boundaries and pixel misclassification in semantic segmentation. NANet uses HRNet-W48 as the convolutional backbone to down-sample the input image to one-fourth of its original size, preserving the spatial detail of the feature map. Then, as shown in Fig. 2, the output feature maps of the backbone first pass through the CCE module to compute the intra-class contexts, and then through the NA module, which categorizes the spatial pixel features, captures the channel connections of the intra-class contexts, and enhances the spatial long-range dependencies with the captured channel relationships.

Fig. 2

The diagram of the overall architecture of NANet

Category context ensemble module

The feature \(X\in {\mathbb {R}}^{C^{\prime } \times H\times W}\) extracted by the convolutional backbone network is fed to the pixel representation and category representation branches. Here, \(C^{\prime },\) H and W denote the number of channels, height and width. We employ a point convolution module (PC) independently in each branch to adjust the number of channels of the feature X. This process, which reduces the computational cost while maintaining segmentation performance, can be formulated as,

$$\begin{aligned} f_{pix}= & {} {{\textbf {PC}}}_{pix} \left( X \right) , \nonumber \\ f_{cate}= & {} {{\textbf {PC}}}_{cate} \left( X \right) , \end{aligned}$$
(1)

where pixel feature representation \(f_{pix}\in {\mathbb {R}}^{C \times H\times W}\) is for capturing intra-class context and category feature representation \(f_{cate} \in {\mathbb {R}}^{N \times H\times W}\) is for modeling spatial context. C denotes the number of channels after \({{\textbf {PC}}}_{pix},\) while N is the number of categories contained in the dataset. In addition, the point convolution module consists of a convolution layer with kernel size of \(1 \times 1,\) a batch normalization layer, and a ReLU activation layer.
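For concreteness, a minimal sketch of such a point convolution module is given below. It is written in PyTorch purely for illustration (the paper's implementation is based on PaddleSeg), and the backbone width of 720, C = 512 and N = 19 (Cityscapes) are assumptions chosen only to make the shapes concrete.

```python
import torch
import torch.nn as nn

class PointConv(nn.Module):
    """1x1 convolution followed by batch normalization and ReLU, as described above."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Illustrative shapes: C' = 720 (assumed backbone width), C = 512, N = 19 (Cityscapes)
X = torch.randn(2, 720, 128, 256)     # (batch, C', H, W)
f_pix = PointConv(720, 512)(X)        # pixel feature representation, (2, 512, 128, 256)
f_cate = PointConv(720, 19)(X)        # category feature representation, (2, 19, 128, 256)
```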

Specifically, the \({{\textbf {PPM}}}\) samples the pixel feature representation \(f_{pix}\) and the category feature representation \(f_{cate}\) separately to generate the intermediate multiscale features \(f_{pix}^{\prime }\) and \(f_{cate}^{\prime }.\) The intermediate features are then flattened and concatenated to form the final representations \(f_{pix}^{ms} \in {\mathbb {R}}^{C \times S}\) and \(f_{cate}^{ms} \in {\mathbb {R}}^{N\times S}.\) This process can be formulated as,

$$\begin{aligned} f_{pix}^{ms}= & {} \left[ flatten \left( f_{pix}^{\prime } \right) \right] ,\quad f_{pix}^{\prime } = {{\textbf {PPM}}} \left( f_{pix} \right) , \nonumber \\ f_{cate}^{ms}= & {} \left[ flatten \left( f_{cate}^{\prime } \right) \right] ,\quad f_{cate}^{\prime } = {{\textbf {PPM}}} \left( f_{cate} \right) , \end{aligned}$$
(2)

where the output resolution of the pyramid pooling module is set to \(\left( 1,3,6,8 \right) \) in order, and \(S = 1^2 + 3^2 + 6^2 + 8^2 = 110.\) \([\cdot ]\) denotes concatenate operator.
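The sampling and flattening step might be implemented as in the following sketch. We assume the PPM uses adaptive average pooling at the stated output sizes; the batch size and channel counts are illustrative.

```python
import torch
import torch.nn.functional as F

def ppm_flatten(feat, bins=(1, 3, 6, 8)):
    """Pool feat (B, C, H, W) to each bin size, flatten, and concatenate along the
    spatial axis, giving (B, C, S) with S = 1 + 9 + 36 + 64 = 110."""
    pooled = [F.adaptive_avg_pool2d(feat, b).flatten(2) for b in bins]  # each (B, C, b*b)
    return torch.cat(pooled, dim=2)

f_pix = torch.randn(2, 512, 128, 256)     # illustrative sizes
f_cate = torch.randn(2, 19, 128, 256)
f_pix_ms = ppm_flatten(f_pix)             # (2, 512, 110)
f_cate_ms = ppm_flatten(f_cate)           # (2, 19, 110)
```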

Next, the category information contained in \(f_{cate}^{ms}\) is used to guide \(f_{pix}^{ms}\) to focus on the features that benefit segmentation, yielding the final intra-class context representation \(f_{cnt}.\) This process can be formulated as,

$$\begin{aligned} f_{cnt} = Softmax \left( f_{cate}^{ms} \right) \times {\left( f_{pix}^{ms} \right) }^{T}, \end{aligned}$$
(3)

where the dimension of \(f_{cnt}\) is \({N \times C}.\) The Softmax operator is computed over the last dimension of the tensor.

To retain rich spatial detail, the feature map \(X \in {\mathbb {R}} ^{C^{\prime } \times H\times W}\) is kept relatively large. Assuming H=128 and W=256, the computational complexity of computing the intra-class context representation \(f_{cnt}\) without PPM is \(\mathcal {O} \left( C\times 32768\times N \right) ,\) whereas with PPM it is reduced to \(\mathcal {O} \left( C \times 110 \times N \right) .\) Thus, adding PPM reduces the computation by a factor of \(\frac{32768}{110} \approx 298.\)
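The aggregation in Eq. (3) and the quoted saving can be sketched as follows; the shapes are illustrative and batched matrix multiplication is assumed.

```python
import torch

f_pix_ms = torch.randn(2, 512, 110)       # (B, C, S), illustrative sizes
f_cate_ms = torch.randn(2, 19, 110)       # (B, N, S)

attn = torch.softmax(f_cate_ms, dim=-1)               # softmax over the spatial axis S
f_cnt = torch.bmm(attn, f_pix_ms.transpose(1, 2))     # Eq. (3): (B, N, C)

# The matrix product costs O(C x S x N) per image; without PPM, S would be H x W.
print(128 * 256 / 110)                                # ~297.9, the ~298x saving quoted above
```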

Nested attention module

Nested Attention Module is composed of Feature Categorization Attention (FCA) and Channel Relation Attention (CRA). Many previous works [13, 14, 19, 23] have utilized spatial attention to capture the inter-dependencies between all locations. For example, the PAM module proposed by DANet [13] uses self-attention on the feature map with size of \({C\times H\times W}\) to obtain an attention map of size \(\left( HW \right) \times \left( HW \right) ,\) and generates a weight for each pixel.

Unlike the PAM of DANet [13], the Query and Key tensors of the FCA in the Nested Attention Module come from two different branches: the pixel feature representation \(f_{pix}\) and the intra-class context representation \(f_{cnt}.\) FCA employs three point convolution modules to transform the features \(f_{pix}\) and \(f_{cnt}\) from the CCE module, respectively. The formula is as follows:

$$\begin{aligned} f_{0}= & {} {{\textbf {PC}}}_{0} \left( f_{pix} \right) ,\nonumber \\ f_{1}= & {} {{\textbf {PC}}}_{1}\left( f_{cnt} \right) ,\nonumber \\ f_{2}= & {} {{\textbf {PC}}}_{2}\left( f_{cnt} \right) , \end{aligned}$$
(4)

where \( {f_0}\in {\mathbb {R}}^{C\times H\times W},\) \(\left\{ f_{1}, f_{2} \right\} \in {\mathbb {R}}^{C\times N\times 1}.\)

Then, the reshape operator is performed to transform the pixel feature representation \({f_0}\in {\mathbb {R}}^{C\times H\times W}\) and the intra-class context representation \(\left\{ f_{1}, f_{2} \right\} \in {\mathbb {R}}^{C\times N\times 1}\) into \(\bar{f_0}\in {\mathbb {R}} ^{C\times L}\) and \(\left\{ f_{1}, f_{2} \right\} \in {\mathbb {R}}^{C\times N},\) where \(L=H\times W.\) Next, the spatial attention map \(f^{sa} \in {\mathbb {R}}^{L\times N}\) is given by

$$\begin{aligned} f^{sa} = {\bar{f_0}}^{T} \times f_{1}. \end{aligned}$$
(5)

Next, normalize \(f^{sa}_{ij}\) according to Eq. (6),

$$\begin{aligned} {f^{sa}_{ij}}^{\prime } = \frac{exp\left( f^{sa}_{ij} \right) }{ {\sum _{k=1}^{L}}exp\left( f^{sa}_{kj} \right) }, \end{aligned}$$
(6)

where, \(f^{sa}_{ij}\) is the pixel value of \(\left( i, j\right) \) in the feature map \(f^{sa},\) and \({f^{sa}_{ij}}^{\prime }\) is the pixel value of \(\left( i, j\right) \) in the feature map \({f^{sa}}^{\prime }.\)

Finally, the spatial attention map \({f^{sa}}^{\prime } \in {\mathbb {R}}^{L\times N}\) is used to weight and aggregate the intra-class context representation \(f_{2} \in {\mathbb {R}}^{C\times N}.\) The output of the FCA is

$$\begin{aligned} O_{fca} = {f^{sa}}^{\prime } \times {f_2}^{T}, \end{aligned}$$
(7)

where \(O_{fca} \in {\mathbb {R}} ^{L\times C} .\)
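A minimal sketch of the FCA computation in Eqs. (4)-(7) follows. The reshaping of \(f_{cnt}\) to a \(C\times N\times 1\) tensor before the point convolutions is our assumption about the layout, the sizes are illustrative, and the point convolutions are shown without their BN/ReLU for brevity.

```python
import torch
import torch.nn as nn

B, C, N, H, W = 2, 512, 19, 128, 256          # illustrative sizes
L = H * W

f_pix = torch.randn(B, C, H, W)               # pixel feature representation
f_cnt = torch.randn(B, N, C)                  # intra-class context from the CCE module
f_cnt_4d = f_cnt.transpose(1, 2).unsqueeze(-1)            # viewed as (B, C, N, 1)

pc0, pc1, pc2 = (nn.Conv2d(C, C, kernel_size=1) for _ in range(3))
f0 = pc0(f_pix).flatten(2)                                # (B, C, L)
f1 = pc1(f_cnt_4d).squeeze(-1)                            # (B, C, N)
f2 = pc2(f_cnt_4d).squeeze(-1)                            # (B, C, N)

f_sa = torch.bmm(f0.transpose(1, 2), f1)                  # Eq. (5): (B, L, N)
f_sa = torch.softmax(f_sa, dim=1)                         # Eq. (6): normalize over L
O_fca = torch.bmm(f_sa, f2.transpose(1, 2))               # Eq. (7): (B, L, C)
```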

Unlike other approaches that use spatial and channel attentions, the output of the FCA in our model serves as an input branch to the CRA. The CRA computes the connections between channels and captures the long-range dependencies between categories, which are used to augment the output of the FCA. In the CRA, two \(1 \times 1\) convolutional layers \({{\textbf {PC}}}_{3}\) and \({{\textbf {PC}}}_{4}\) transform the intra-class context representation \(f_{cnt} \in {\mathbb {R}}^{C \times N \times 1}\) into \(\left\{ f_{3}, f_{4} \right\} \in {\mathbb {R}}^{C\times N\times 1}.\) After flattening, the relationship between channels is computed by

$$\begin{aligned} f^{ca} = f_3 \times {f_4}^{T}, \end{aligned}$$
(8)

where \(f^{ca} \in {\mathbb {R}} ^{C\times C}\) represents the channel attention map. Then, normalize \(f^{ca}\) according to Eq. (9),

$$\begin{aligned} {f^{ca}_{ij}}^{\prime } = \frac{exp\left( f^{ca}_{ij} \right) }{ {\sum _{k=1}^{C}} exp\left( f^{ca}_{ik} \right) } , \end{aligned}$$
(9)

where, \(f^{ca}_{ij}\) is the pixel value of \(\left( i, j\right) \) in the feature map \(f^{ca},\) and \({f^{ca}_{ij}}^{\prime }\) is the pixel value of \(\left( i, j\right) \) in the feature map \({f^{ca}}^{\prime }.\)

Finally, we augment the output of the FCA by applying the captured long-distance dependencies between categories to it; the output \(O_{cra}\) is given by

$$\begin{aligned} O_{cra} = O_{fca} \times {f^{ca}}^{\prime }, \end{aligned}$$
(10)

where \(O_{cra}\in {\mathbb {R}} ^{L\times C}.\) We then fuse \(O_{cra}\) with the reshaped pixel feature representation \(f_{pix} \in {\mathbb {R}} ^{L\times C},\)

$$\begin{aligned} Y = Concate\left( O_{cra}, f_{pix} \right) . \end{aligned}$$
(11)

Finally, Y is fed into the classifier to obtain the output segmentation map.
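The CRA and the final fusion in Eqs. (8)-(11) can be sketched analogously. This is again an illustrative PyTorch version with assumed shapes and dummy input tensors, so it can be read independently of the FCA sketch above.

```python
import torch
import torch.nn as nn

B, C, N, L = 2, 512, 19, 128 * 256            # illustrative sizes
f_cnt_4d = torch.randn(B, C, N, 1)            # intra-class context, as in the FCA sketch
O_fca = torch.randn(B, L, C)                  # output of the FCA
f_pix_flat = torch.randn(B, C, L)             # pixel features, flattened spatially

pc3, pc4 = (nn.Conv2d(C, C, kernel_size=1) for _ in range(2))
f3 = pc3(f_cnt_4d).squeeze(-1)                                    # (B, C, N)
f4 = pc4(f_cnt_4d).squeeze(-1)                                    # (B, C, N)

f_ca = torch.softmax(torch.bmm(f3, f4.transpose(1, 2)), dim=2)    # Eqs. (8)-(9): (B, C, C)
O_cra = torch.bmm(O_fca, f_ca)                                    # Eq. (10): (B, L, C)
Y = torch.cat([O_cra, f_pix_flat.transpose(1, 2)], dim=2)         # Eq. (11): (B, L, 2C)
```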

Experiment

We conducted extensive experiments on the Cityscapes [24], PASCAL VOC 2012 [25] and ADE20K [26] datasets to evaluate the performance of the proposed Nested Attention Network.

Datasets

Cityscapes. The Cityscapes dataset contains 5000 finely labeled images of town streets, of which 2975 images were used for training, 500 for validation, and 1525 for testing. These images are categorized into 19 classes and each image is \(1024\times 2048\) in size.

PASCAL VOC 2012. The PASCAL VOC 2012 dataset contains 13,487 images that are categorized into 21 categories, one background category and 20 object categories. Our model is trained using augmented data containing 10,582 images. The remaining 1449 images are used for validation and 1456 images are used for testing. These images are cropped to \(512\times 512.\)

ADE20K. The ADE20K dataset contains 25,562 images covering 150 categories. The training set contains 20,210 images, validation set contains 2000 images and test set contains 3352 images. The image cropping size is \(512\times 512.\)

Implementation details

Training Objectives. Two cross-entropy loss functions, \(L_{A}\) and \(L_{O},\) are used to supervise the category feature representation of the CCE module and the final output of the model, respectively. We use the weighted sum of these two cross-entropy losses as the objective function L,

$$\begin{aligned}{} & {} min L = \mu L_{A } + L_{O} ,\end{aligned}$$
(12)
$$\begin{aligned}{} & {} L_{A} = \frac{\gamma }{2} \Vert \varphi \Vert _2^2 - \sum _{i=1}^L \sum _{n=1}^N Y_i^n \log P_n \left( X_i,\varphi \right) \end{aligned}$$
(13)
$$\begin{aligned}{} & {} L_{O}=\frac{\beta }{2} \Vert \theta \Vert _2^2 - \sum _{i=1}^L \sum _{n=1}^N y_i^n \log p_n \left( x_i,\theta \right) , \end{aligned}$$
(14)

where \(\theta \) and \(\varphi \) are network parameters with regularization added. \(p_{n}\) is the predicted probability of pixel \(x_{i}\) in the final output feature map with respect to the nth category, and \(P_{n}\) is the predicted probability of pixel \(X_{i}\) in the category feature representation with respect to the nth category. \(Y_{i}^{n}\) is the ground-truth semantic label of the nth category for pixel \(X_{i},\) and \(y_{i}^{n}\) is the ground-truth semantic label of the nth category for pixel \(x_{i}.\) \(\gamma \) and \(\beta \) are weight decay parameters. \(\mu \) indicates the weight on the loss \(L_{A}\) and is set to 0.4. The training goal is to find the model parameters that minimize L.
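A minimal sketch of this objective is shown below. It assumes standard pixel-wise cross-entropy terms with the regularization (weight decay) delegated to the optimizer, and the ignore label of 255 is an assumption.

```python
import torch.nn as nn

mu = 0.4                                       # weight on the auxiliary loss L_A
ce = nn.CrossEntropyLoss(ignore_index=255)     # ignore label 255 is an assumption

def total_loss(aux_logits, out_logits, target):
    """L = mu * L_A + L_O; both terms are pixel-wise cross-entropy losses, and the
    weight-decay terms are handled by the optimizer in practice."""
    return mu * ce(aux_logits, target) + ce(out_logits, target)
```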

Training Settings. The entire NANet is implemented on the PaddleSeg framework. Stochastic gradient descent (SGD) is used to optimize the network. At each iteration, the base learning rate is multiplied by the poly factor \(\left( 1- \frac{iter}{max\_iter} \right) ^{power},\) where power is set to 0.9.
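As a sketch, assuming the standard poly formulation, the schedule can be written as a small helper:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly schedule: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

print(poly_lr(0.01, 75_000, 150_000))   # ~0.0054 halfway through Cityscapes training
```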

Cityscapes. When training our model on Cityscapes dataset, the initial learning rate is set to 0.01, batch size is set to 8, weight decay is set to 0.0005, the number of training iterations is 150K, and the cropping size of the image is \(1024\times 512.\)

PASCAL VOC 2012. When training the model on the PASCAL VOC 2012 dataset, the initial learning rate is 0.01, weight decay is 0.0001, batch size is 16, and the number of training iterations is 40K.

ADE20K. When training our model on the ADE20K dataset, the initial learning rate is set to 0.01, weight decay is set to 0.0001, batch size is set to 16, and the number of training iterations is 150K.

Evaluation metric

The mean Intersection over Union (mIoU) and Accuracy (Acc) were employed to evaluate the model performance. For a dataset with K classes, the mIoU is defined as follows:

$$\begin{aligned} mIoU = \frac{1}{K}\sum _{i=1}^{K} {\frac{|A_i \cap B_i|}{|A_i \cup B_i|}}, \end{aligned}$$
(15)

where \(A_i\) is the set of pixels predicted as class i by the segmentation model and \(B_i\) is the corresponding ground-truth set.

Accuracy (Acc) is a widely used evaluation metric in semantic segmentation to quantify the pixel-level classification accuracy of a model. The Acc can be expressed as follows:

$$\begin{aligned} Acc= \frac{P_{T}+N_{T}}{P_{T}+N_{T}+P_{F}+N_{F}} \end{aligned}$$
(16)

where \(P_{T}\) refers to the number of pixels correctly predicted as positive by the model, \(N_{T}\) represents the number of pixels correctly predicted as negative, \(P_{F}\) indicates the number of pixels incorrectly predicted as positive when they are actually negative, and \(N_{F}\) signifies the number of pixels incorrectly predicted as negative when they are actually positive.
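For reference, both metrics can be computed from flattened label arrays as in the following sketch; the ignore label of 255 is an assumption.

```python
import numpy as np

def miou_and_acc(pred, gt, num_classes, ignore_index=255):
    """Mean IoU (Eq. 15) and pixel accuracy (Eq. 16) from flat label arrays."""
    mask = gt != ignore_index
    pred, gt = pred[mask], gt[mask]
    acc = float((pred == gt).mean())
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)), acc
```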

Comparisons with other methods

Efficiency comparisons

To assess the efficiency of our model, we compare NANet with several state-of-the-art methods in terms of total parameters and computation. The data in the table were obtained in the same test environment. To avoid the effect of different backbone networks, we count only the parameters and floating-point operations of each method after removing its backbone. As shown in Table 2, DANet [13] has 23.94M parameters, 4.45 times that of NANet, and its 196.09G FLOPs exceed NANet's by 41.42G. OCR [15] has 5.99M parameters, 0.61M more than NANet, and its computation of 169.88G FLOPs is 15.21G higher than NANet's. PSPNet [12] has 8.24 times the parameters and 2.13 times the computation of NANet. This analysis shows that NANet has the fewest parameters and lower computation than the other models, indicating that NANet is a lightweight network with low model complexity.

Table 2 Comparison of the number of parameters (Params) and the total number of floating-point operations (FLOPs) for different models (excluding backbone)
Table 3 Performance comparison of our method with several typical methods on the Cityscapes test set

Performance comparisons

Cityscapes. To evaluate the performance of the proposed NANet, we submitted its predictions to the official Cityscapes server; the comparison with other methods on the Cityscapes test set is shown in Table 3. Note that our model does not use the coarsely annotated data. As can be seen in Table 3, DANet [13] and ANNet [19] achieve 81.5% and 81.3% mIoU, respectively, while our model achieves 82.6%. OCR obtains 82.4%, which is 0.2% lower than our model. In addition, our model outperforms recent models such as FRM and GC-MobileSeg-Tiny.

Table 4 Performance comparison of our method with several typical methods on the Cityscapes validation set
Fig. 3

Qualitative comparison of our method with several state-of-the-art methods on the Cityscapes validation set. Row 1: Input images. Rows 2–5: The segmentation results of FCN, PSPNet, OCR and DANet. Row 6: The segmentation results of our model. Row 7: Ground truth. The yellow boxes mark where our model is particularly superior to other methods

In addition, we train all the models in Table 4 in the same environment using only fine-grained training set data, and evaluate the performance of the models on the Cityscapes validation set. Table 4 shows the results of comparing our model with state-of-the-art models on the Cityscapes validation set. Our model achieves 82.47% mIoU, a 0.32% improvement over the performance of OCR, and a 2.27% mIoU increase over the performance of DANet, a network that uses spatial attention and channel attention in parallel. Furthermore, our model outperforms other models with a higher accuracy (Acc) of 96.82%.

Figure 3 presents a qualitative comparison of several typical methods. As seen in the first, third and fourth columns of Fig. 3, our model segments thin poles better than the other models, indicating a stronger ability to capture slender targets. As can be seen from the second column, our model handles large objects without intra-class confusion and labels intra-class pixels more consistently than the other methods.

Table 5 Performance comparison of our method with several typical methods on PASCAL VOC2012 validation set
Fig. 4

Qualitative comparison of our method with several state-of-the-art methods on the PASCAL VOC2012 validation set. Row 1: Input images. Rows 2–4: The segmentation results of DANet, PSPNet and OCR. Row 5: The segmentation results of our model. Row 6: Ground truth. The yellow boxes mark where our model is particularly superior to other methods

PASCAL VOC 2012. We report the results of comparing NANet with state-of-the-art methods on the PASCAL VOC 2012 validation set in Table 5. As can be seen, NANet performs better than the other models, obtaining 81.37% mIoU, 1.42% higher than FCN with the same backbone network. OCR achieves 80.94%, which is 0.43% lower than our model. Furthermore, our model achieves higher accuracy (Acc) than the other models.

Figure 4 shows the segmentation results of several typical methods on the PASCAL VOC 2012 validation set. The first and third columns of Fig. 4 show that our model handles the shape of the bottle better. The fourth and fifth columns show that our model has a more detailed treatment of the human legs compared to the other models, without the occurrence of missing parts of the target. The second column shows that for different objects of the same category, our model guarantees the independence of a single object from other objects labeled with the same category.

Table 6 Performance comparison of our method with several typical methods on ADE20K validation set
Table 7 Ablation study in terms of NANet without PPM and with PPM of different output sizes in the CCE module

ADE20K. We performed comparative experiments on the challenging ADE20K dataset. Table 6 reports the performance comparison between NANet and other models on the ADE20K validation set. Our model outperforms most models, obtaining 45.46% mIoU, while OCR obtains 45.28%, i.e., a 0.18% improvement for our model. Compared to PSANet, the performance of NANet improves by 1.35%.

Ablation study

In this section, we do extensive experiments on the Cityscapes validation set to explore the validity of NANet. In NANet, there are two important modules: the Category Context Ensemble Module and the Nested Attention Module. In the CCE Module, we explore the effectiveness of the Pyramid Pooling Module (PPM). For the Nested Attention Module, we explore the role of the Feature Categorization Attention Module (FCA) and the Channel Relationship Attention Module (CRA).

Ablation study for PPM

In “Category context ensemble module”, we add PPM to the CCE module to reduce the computational complexity while maintaining performance, and we also explore the effect of the PPM output size on model performance. Table 7 shows the experimental results for several options. As can be seen in the first row, NANet achieves 82.52% mIoU when the PPM module is not added. When the PPM is added, there are three choices for its output sizes: (1, 2, 3, 6), (1, 3, 6, 8) and (1, 4, 8, 12), which after flattening and concatenation yield 50, 110 and 225 sampling points, respectively. The second row shows that the model reaches 82.41% mIoU with 50 sampling points, the third row 82.47% with 110 sampling points, and the fourth row 82.49% with 225 sampling points. The results show that adding PPM slightly decreases performance, and that performance improves as the number of sampling points increases. Balancing computational complexity and performance, we choose the PPM configuration with 110 sampling points.

In addition, we conducted an experiment to demonstrate the advantage of the pyramid pooling module (PPM) over pyramid convolution (PyConv). In this experiment, we replaced pyramid pooling with pyramid convolution; the results are shown in Table 8. They indicate that pyramid pooling performs comparably to pyramid convolution in our model, while pyramid convolution introduces more parameters (0.5M) and higher FLOPs (14.2G) than pyramid pooling.

Table 8 Ablation study on the Cityscapes validation set comparing pyramid pooling (PPM) and pyramid convolution (PyConv)
Table 9 Ablation study on the Cityscapes validation set of FCA and CRA, both with and without PPM

Ablation study for nested attention module

In order to investigate the effectiveness of CRA and FCA, we conduct experiments for both cases of using only FCA and using FCA + CRA module in Nested Attention Module. The results of the experiments for several different cases are shown in Table 9. Rows 4 and 5 in Table 9 are the results when the CCE Module does not use PPM. Row 5 shows that when the Nested Attention Module uses FCA and CRA, NANet has the best performance of 82.52%, and at this time, the computational complexity is also the highest. Row 4 shows that when Nested Attention Module uses only FCA, at this time, 82.15% mIoU is obtained, which is 0.37% lower than the performance when CRA is used. This shows that CRA plays an important role in model performance improvement. Considering the reduction of computational complexity, rows 2 and 3 in Table 9 show the results for the case where PPM is used in the CCE module. Row 2 shows that the model performance is 81.92% when no CRA is added to the Nested Attention Module and only FCA is used. Row 3 shows that when CRA is added to the Nested Attention Module, the model performance achieves 82.47% mIoU, which is a 0.55% improvement over the performance when CRA is not added. The experimental results in Table 9 show that CRA can capture the contextual information ignored by FCA and plays a crucial role in capturing rich contexts.

Conclusion

In this paper, we propose the Nested Attention Network (NANet) for semantic segmentation. NANet is a lightweight model that captures rich and efficient contexts. It aggregates category contexts with low computational complexity and uses the obtained category context representation to nest the Feature Categorization Attention (FCA) with the Channel Relation Attention (CRA). After FCA aggregates features of the same category in the spatial dimension, CRA captures the inter-channel dependencies of the category context representation and is used to enhance the output of FCA. We evaluated NANet on three datasets: Cityscapes, PASCAL VOC 2012 and ADE20K. Extensive experiments show that NANet performs better than state-of-the-art methods. In future work, we will further enhance the context aggregation capability of the category feature representation to further improve the performance of NANet.