Introduction

Semantic segmentation is a fundamental task in computer vision. It aims to classify each pixel in an image into a category and thereby divide the image into regions. Image segmentation has been applied to a variety of fields, such as medical image diagnosis [1,2,3,4], human posture estimation, autonomous driving [5,6,7,8] and remote sensing satellite surveys [9,10,11].

It is well known that pixels within a local neighborhood of an image are closely connected, and are also indirectly connected to the remaining pixels of the image. It is therefore important for semantic segmentation to capture both the relationships among pixels within a local neighborhood and the relationships between those pixels and pixels outside the neighborhood. The importance of context information extends beyond segmentation to other fields as well [12, 13]. In [14,15,16], the relationship between a pixel and other pixels is called relevance context information. In this paper, we call the relationships among all pixels within a local neighborhood short-range relevance information, and the relationships between pixels in the local neighborhood and pixels outside it long-range relevance information. However, most semantic segmentation algorithms based on deep convolutional neural networks fail to capture both kinds of relevance information, which can result in inaccurate segmentation of small or large targets and incomplete edges. For example, Fig. 1 shows segmentation results of FCN on the PASCAL VOC dataset. In the first row, the spatial details and semantic information of the horse's legs are missed; in the second row, when a large target overlaps a small target, the small target is ignored; in the fourth row, the overall contour of the sofa is confused and incomplete. It is evident that insufficient use of relevant pixel information seriously degrades segmentation quality.

Fig. 1
figure 1

Visualization examples of the segmentation results on PASCAL VOC dataset [17]. From left to right: a image; b FCN [18]; c ground truth

To address the above problems, one strategy is to use an encoder-decoder structure to capture rich context and improve the utilization of relevance semantic information. For example, U-Net [19] forms a U-shaped network that encodes and decodes semantic information through skip connections; SegNet [20] uses an encoder to downsample image features and a decoder to restore image information; the Second-Order Coding Network [21] builds its encoder from different pooling layers to collect semantic information and recovers part of the missing information by up-sampling in the decoder. The methods in [22,23,24] fuse information from different stages through an encoder-decoder structure for image semantic segmentation. AGLNNet [25] integrates encoder information through global enhancement and local refinement to improve segmentation quality. CGBNet [26] enhances the pixel pairing result through context encoding and multi-path decoding, so that it can adaptively select an appropriate score map from a rich set of feature scales. Built-in Depth-Semantic Coupled Encoding [27] selectively highlights discriminative depth features by coupling depth and semantic information during encoding; Attribute-Aware Feature Encoding [28] uses feature encoding regularization and auxiliary attribute learning to improve semantic segmentation. In Compensating for Local Ambiguity With Encoder-Decoder [29], a context aggregation module (CAM) built from an encoder-decoder with appropriate sampling scales captures semantic information. Although these encoder-decoder structures achieve good results through different designs, they still do not make full use of the spatial information in different feature maps, and therefore still face the problem of insufficient relevance semantic information across feature maps.

Other methods improve the capture of semantic information through specific semantic context modules. For example, the DeepLab series [30] uses atrous convolutions with different atrous rates to form an atrous spatial pyramid pooling (ASPP) module to obtain semantic information. The Pyramid Scene Parsing Network [31] improves the utilization of semantic information at different scales through a pyramid pooling module (PPM) with pooling of different sizes; DANet [32] combines spatial and channel attention modules for semantic segmentation; the Criss-Cross Attention Network [33] collects context information from all pixels of a given image by stacking two criss-cross attention modules; the Adaptive Pyramid Context Network [34] constructs multiscale context information for different regions by designing adaptive context modules; Co-occurrent Features [35] explores semantic context representation using co-occurrent features through an Aggregated Co-occurrent Feature (ACF) module; the Gated Path Selection Network [36] further introduces a prediction module to dynamically select the required semantic context; the Context-based Tandem Network [37] collects semantic feature maps and category-level context through spatial context (SCM) and channel context (CCM) modules. The Attention-guided Chained Context Aggregation Network [38] designs a context aggregation module (CAM) to propagate diverse feature contexts in a series-parallel hybrid manner. Designing a dedicated context module is an effective way to obtain semantic information, but such a module can only capture certain semantic information, and its use of the relevance spatial information contained in different stages is limited.

Failure to make full use of image pixel relevance information amounts to failing to capture rich semantic information and effective spatial details. These problems lead to unsatisfactory segmentation results, including misclassification of large objects, rough segmentation boundaries and neglect of small objects. To alleviate these problems, we propose a new network, the Long and Short-Range Relevance Context Network (LSRRCNet). Specifically, we adopt an encoder-decoder segmentation network to address the insufficient utilization of relevance information across different feature maps during segmentation. First, we design a Long-Range Relevance Context (LRRC) module that combines global and local contexts in the high-level feature map of the encoder to obtain the relevance context between pixels in a local neighborhood and pixels outside it, thereby improving the utilization of relevance information in the high-level feature map. In addition, a Short-Range Relevance Context (SRRC) module is proposed to capture the spatial details of pixels in the local neighborhood at each low-level encoding stage and provide them to each decoding stage through skip connections, so as to better recover the spatial locations of semantic categories. Our proposed LSRRCNet captures semantic and spatial information from feature maps at different encoding stages and transmits the relevant context to each decoding stage via skip connections, which improves information utilization, better restores image pixel information, and thus improves segmentation quality.

Our main contributions are as follows:

  1.

    We propose a Long and Short-Range Relevance Context Network (LSRRCNet). It aggregates long-range global and local relevance contexts through the encoder, and uses skip connections to pass spatial contexts to the decoder. The whole network uses an encoder-decoder structure to better alleviate the insufficient use of relevance context in image segmentation.

  2.

    We propose a Long-Range Relevance Context Module (LRRC), which consists of two parts: long-range global context and long-range local context. It combines global and local feature information from the high-level feature map of the encoder to provide rich semantic information for pixel classification.

  3.

    We design a Short-Range Relevance Context Module (SRRC), which is composed of different types of convolutions and pooling operations. It transfers the spatial information captured in the encoding stage to each decoding stage through skip connections, improves the utilization of spatial relevance information, and thus improves the spatial localization of semantic categories.

Related work

In this section, we introduce popular methods for capturing relevance global and local context.

Long-range relevance context information in semantic segmentation

Long-range relevance context information contains both global and local contexts. The effectiveness of long-range relevance context in semantic segmentation has been well demonstrated in recent years. For example, PSPNet [31] collects global context information through global average pooling, and this capture of global context improves the representation of multiscale features. GSCNet [39] designs a globally-guided global module and a globally-guided local module that flexibly select different global and local context information for each pixel. ParseNet [40] adds global context to a fully convolutional network, capturing global context at arbitrary sizes. ENCNet [41] captures the global context using an encoding module, improving the use of context information. These methods capture the global context well on different benchmarks and improve segmentation performance, but they ignore local context information while capturing the global context. In the same vein, our network also focuses on capturing global context information, but our emphasis is on combining global and local contexts to provide richer semantic information.

Short-range relevance context information in semantic segmentation

Short-range relevance context is mainly the local spatial context of a pixel, which is crucial for class localization and boundary delineation. Many methods use low-level feature maps to capture short-range relevance context. For example, RefineNet [42] obtains spatial detail by refining feature maps at different stages; this spatial context is derived from each low-level feature map and plays an important role in refining pixel-level spatial detail. DeepLabv3+ [30] fuses the low-level local context with the semantic context through a dedicated decoder design, thereby better recovering the spatial detail of images. FBSNet [43] directly designs a spatial detail branch that establishes short-range dependencies for each pixel to preserve spatial detail information. DLA-Net [44] proposes a dual local attention feature module to fully exploit the spatial information of the local neighborhood; such dedicated module design is also an important way to capture spatial detail. RandLA-Net [45] introduces a local feature aggregation module to enlarge the receptive field, thus effectively preserving spatial detail information. All of these methods rely on low-level local context to provide spatially detailed information about pixels, but little work has explored the same role for high-level local context. In this paper, we make use of not only the low-level local spatial context but also the high-level short-range local information, enhancing the spatial detail of image pixels with local spatial context from multiple stages to improve the spatial localization of pixels.

Fig. 2
figure 2

Overview of LSRRCNet. ResNet is the backbone of the encoding phase. The Long-Range Relevance Context Module obtains the long-range relevance context. The Short-Range Relevance Context Module extracts the spatial context at each encoding stage. The whole network adopts an encoder-decoder structure for semantic segmentation

Our proposed method

In this section, we first introduce the architecture of our Long and Short-Range Relevance Context Network (LSRRCNet), then introduce our proposed Long-Range Relevance Context Module (LRRC), and finally describe the proposed Short-Range Relevance Context Module (SRRC).

Overview

The structure of our proposed LSRRCNet is shown in Fig. 2. We use an encoder-decoder structure as the main architecture of the whole network. First, a pretrained residual network, ResNet [46], is used as the encoder to extract feature information at each stage. We use dilated convolution so that the feature resolutions of the ResNet blocks are 1/4, 1/8, 1/16 and 1/16 of the input. Because the last residual block is rich in semantic information, we apply the Long-Range Relevance Context Module (LRRC) to the last residual block to obtain rich long-range relevance semantic information. At the same time, to improve the spatial information utilization at each stage, we attach the Short-Range Relevance Context Module (SRRC) to the first three residual blocks and transmit the spatial information from each encoding stage to the corresponding decoder through skip connections. Finally, the semantic information captured by LRRC serves as the first decoder stage, and the spatial details transmitted by each skip connection are then fused stage by stage through Flow Fusion [47] to form the decoder. The whole encoder-decoder network is used for the dense pixel prediction segmentation task.

Our proposed LSRRCNet aims to improve information utilization and thereby perform better pixel segmentation. LRRC captures the long-range global and local relevance semantic information of the high-level feature map in the encoder for image pixel classification. SRRC captures the spatial information of the feature maps at each encoder stage and transmits it to the corresponding decoder stage through skip connections to recover spatial details. Our whole model adopts an encoder-decoder structure, with the backbone network used as the encoder to reduce the resolution and extract image features. The structure of each module is clear, simple and easy to implement, and can be plugged into any network end to end.
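To make the data flow above concrete, the following is a minimal PyTorch-style sketch of how the encoder, LRRC, the SRRC skip connections and the stage-wise decoder fusion could be wired together. It is our illustration, not the authors' released implementation; the backbone is assumed to return the four stage feature maps, and the LRRC, SRRC and fusion modules are passed in as components whose internals are described in the following subsections.

```python
import torch.nn as nn
import torch.nn.functional as F

class LSRRCNetSketch(nn.Module):
    """Rough wiring of the encoder-decoder flow described above; not the authors' code."""

    def __init__(self, backbone, lrrc, srrc_blocks, fuse_blocks, decoder_channels, num_classes):
        super().__init__()
        self.backbone = backbone                    # ResNet stages -> features at 1/4, 1/8, 1/16, 1/16
        self.lrrc = lrrc                            # long-range relevance context on the last stage
        self.srrcs = nn.ModuleList(srrc_blocks)     # one SRRC per earlier stage (skip connections)
        self.fuse = nn.ModuleList(fuse_blocks)      # stage-wise decoder fusion (e.g. Flow Fusion [47])
        self.classifier = nn.Conv2d(decoder_channels, num_classes, kernel_size=1)

    def forward(self, x):
        c1, c2, c3, c4 = self.backbone(x)           # encoder features, shallow -> deep
        out = self.lrrc(c4)                         # semantic context forms the first decoder stage
        for skip, srrc, fuse in zip((c3, c2, c1), self.srrcs, self.fuse):
            out = fuse(out, srrc(skip))             # inject short-range spatial context stage by stage
        logits = self.classifier(out)
        return F.interpolate(logits, size=x.shape[2:], mode='bilinear', align_corners=False)
```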

Fig. 3
figure 3

Overview of Long-Range Relevance Context Module (LRRC). It is mainly composed of two branches, the long-range global context (upper) and the long-range local context (lower), which are fused through matrix multiplication

Fig. 4
figure 4

Overview of long-range global context feature information. It first captures adjacent information by strip pooling, and then applies self-attention to obtain long-range global information

Fig. 5
figure 5

Overview of long-range local context feature information. It captures the surrounding information by two convolutions, and then obtains the long-range local information by channel weight reconstruction

Long-range relevance context

Semantic information is enriched as the resolution of the image decreases. Global context information is typically obtained from the relatively low-resolution feature stage, but the local context information at this stage is often ignored. The low-resolution feature map contains both global and local relevance context information [48]. If only the global semantic context of the low-resolution feature map is obtained while the local context is ignored, small target objects may be misclassified or the segmentation edges between objects may become blurred. To obtain the complete global and local relevance context information of the low-resolution feature map, we design a Long-Range Relevance Context Module (LRRC), whose structure is shown in Fig. 3. As can be seen from Fig. 3, the module processes the input in two different branches. Next, we introduce LRRC in detail.

LRRC is composed of two branches of feature information. The first branch captures relevant information through strip pooling and processes global image features with a self-attention mechanism, forming a complete long-range global map \(A_\textrm{feature}\). The second branch captures the local information around image pixels using standard convolution and dilated convolution to obtain the long-range local map \(B_\textrm{feature}\). Finally, the captured long-range global and local relevance context information is fused by matrix multiplication. To better propagate gradients, we also add a residual connection with the initial feature map. Formally, for the high-level feature map \(X_\textrm{input}\), the output \(O_\textrm{output}\) can be written as:

$$\begin{aligned} O_\textrm{output}=A_\textrm{feature}\otimes B_\textrm{feature}\oplus x, \end{aligned}$$
(1)

where, \(A_\textrm{feature}\) represents the long-range global context feature information, \(B_\textrm{feature}\) represents the long-range local context feature information, \(\otimes \) represents matrix multiplication, \(\oplus \) represents element-wise summation, and x is the input feature map \(X_\textrm{input}\).

Figure 4 shows an overview of the long-range global context feature information \(A_\textrm{feature}\). As can be seen from Fig. 4, for the input feature map \(X_\textrm{input}\), we first use horizontal and vertical strip pooling to capture strip information \(S_\textrm{h}\in R^{1\times H}\) and \(S_\textrm{v}\in R^{1\times W}\). We then expand these strips along their adjacent positions to obtain \(S_{i}^\textrm{h}\in R^{C\times H}\) and \(S_{j}^\textrm{v}\in R^{C\times W}\). To obtain more global information, we combine \(S_{i}^\textrm{h}\) and \(S_{j}^\textrm{v}\) to form \(S_\textrm{in}\in R^{C\times H\times W}\). The formula of \(S_\textrm{in}\) can be expressed as:

$$\begin{aligned} S_\textrm{in}=S_{i}^\textrm{h}+S_{j}^\textrm{v}, \end{aligned}$$
(2)

where, \(S_{i}^\textrm{h}\) and \(S_{j}^\textrm{v}\) are the horizontal and vertical adjacent features, respectively, which can be expressed as:

$$\begin{aligned} S_{i}^\textrm{h}= & {} \frac{1}{W}\sum _{0\le j<W}{x_{i,j}}, \end{aligned}$$
(3)
$$\begin{aligned} S_{j}^\textrm{v}= & {} \frac{1}{H}\sum _{0\le i<H}{x_{i,j}}, \end{aligned}$$
(4)

Then we reshape \(S_\textrm{in}\) to \( R^{C\times N}\), where \(N = H \times W\) is the number of pixels. We perform matrix multiplication between \(S_\textrm{in}\) and its transpose and apply a softmax layer to calculate the global attention map \(S_\textrm{mid}\in R^{N\times N}\):

$$\begin{aligned} S_{\textrm{mid}-ji}=\frac{\exp \left( S_{\textrm{in}-i}\cdot S_{{\text {in}}-j} \right) }{\sum _{i=1}^N{\exp \left( S_{\textrm{in}-i}\cdot S_{{\text {in}}-j} \right) }}, \end{aligned}$$
(5)

where, \(S_{\textrm{mid}-ji}\) measures the impact of the ith pixel on the jth pixel. The more similar the feature representations of the two pixels are, the greater the relevance between them. We then perform a matrix multiplication between the transpose of \(S_\textrm{in}\) and the transpose of \(S_\textrm{mid}\) and reshape the result to \(R^{C\times H\times W}\). Finally, we perform an element-wise sum with the input feature map \(X_\textrm{input}\) to obtain the final output \(A_\textrm{feature}\in R^{C\times H\times W}\) as follows:

$$\begin{aligned} A_\textrm{feature}=\sum _{i=1}^N{\left( S_{\textrm{mid}-ji}\cdot S_\textrm{in} \right) +X_\textrm{input}}. \end{aligned}$$
(6)
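The following is a minimal sketch of this global branch as we read Eqs. (2)-(6), assuming a PyTorch implementation: strip pooling is realized as mean pooling over one spatial dimension, and the attention follows a standard non-local formulation. The exact reshaping and normalization in the authors' code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LongRangeGlobalContext(nn.Module):
    """Sketch of the global branch A_feature (Eqs. 2-6); our reading, not the authors' code."""

    def forward(self, x):
        b, c, h, w = x.shape
        s_h = x.mean(dim=3, keepdim=True)                        # horizontal strip pooling, Eq. (3)
        s_v = x.mean(dim=2, keepdim=True)                        # vertical strip pooling,   Eq. (4)
        s_in = s_h.expand(b, c, h, w) + s_v.expand(b, c, h, w)   # Eq. (2), broadcast to B x C x H x W
        s_flat = s_in.view(b, c, -1)                             # B x C x N, with N = H * W
        energy = torch.bmm(s_flat.transpose(1, 2), s_flat)       # pairwise similarities, B x N x N
        attn = F.softmax(energy, dim=-1)                         # Eq. (5): global attention map
        out = torch.bmm(s_flat, attn.transpose(1, 2))            # aggregate features, Eq. (6) first term
        return out.view(b, c, h, w) + x                          # residual connection with the input
```

Since the N x N attention map grows quadratically with the number of pixels, this branch is only applied at the low-resolution (last) encoder stage, as described above.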

The high-level feature map contains rich long-range global semantic information, but the local feature information around a target pixel is under-utilized. To capture more long-range local feature information, we first capture the surrounding information of long-range pixels through standard convolution and dilated convolution, then obtain their channel weights through average pooling, and finally select channel weights through a residual connection to capture the long-range local correlation information. Figure 5 shows an overview of the long-range local context feature information \(B_\textrm{feature}\). The long-range local map \(B_\textrm{feature}\) can be written as:

$$\begin{aligned} B_\textrm{feature}={\text {Avg}}(C_3\left( x \right) \oplus C_{3}^{k}\left( x \right) )\otimes x, \end{aligned}$$
(7)

where, Avg represents average pooling, \(C_3\) represents the standard \(3 \times 3\) convolution, and \(C_{3}^{k}\) represents the \(3 \times 3\) dilated convolution with dilation rate k.
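The local branch of Eq. (7) and the fusion of Eq. (1) might be sketched as follows. This is our reading: average pooling yields per-channel weights that reweight the input, and the fusion is shown as an element-wise product because the exact reshaping behind the stated matrix multiplication is not given. The channel count and the dilation rate k = 2 are assumptions, not values fixed by the paper for this branch.

```python
import torch.nn as nn
import torch.nn.functional as F

class LongRangeLocalContext(nn.Module):
    """Sketch of the local branch B_feature (Eq. 7); channel count and k are assumptions."""

    def __init__(self, channels, k=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.dconv = nn.Conv2d(channels, channels, kernel_size=3, padding=k, dilation=k)

    def forward(self, x):
        ctx = self.conv(x) + self.dconv(x)          # surrounding information, C_3(x) + C_3^k(x)
        weights = F.adaptive_avg_pool2d(ctx, 1)     # per-channel weights via average pooling
        return weights * x                          # channel weight reconstruction of the input


def lrrc_fuse(a_feature, b_feature, x_input):
    """Eq. (1). An element-wise reading of the fusion; the paper states matrix
    multiplication, whose exact reshaping is not specified."""
    return a_feature * b_feature + x_input
```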

Note that our LRRC aims to acquire the long-range global and local relevance feature information of the high-level stage of the encoder. Strip Pooling Net [49] captures the correlation information of different regions through narrow kernel shapes, which helps capture the global context and prevents irrelevant regions from interfering with label prediction. Inspired by it, our global branch uses strip pooling to ensure sufficient pixel semantic information. Strip Pooling Net captures horizontal and vertical information with strip pooling and sums them as long-range relevance information. In contrast, we use strip pooling to capture adjacent pixel information and then apply a self-attention mechanism to capture long-range global relevance semantic information. In this way, even when the resolution is reduced, relevance is preserved between the pixels in a local neighborhood and pixels outside it; this relevance ensures the smoothness and completeness of the segmentation edges between categories. We not only capture the global context information at this stage but also focus on its local correlation information. In addition, we increase the utilization of adjacent pixels of the strip pooling and thus gain more global information. For the local relevance context at the high-level stage, we obtain local context information by enlarging the receptive field around each pixel. In this way, we capture not only rich semantic information but also some spatial information at the high-level feature stage. Capturing these two kinds of information ensures correct semantic classification of target objects and improves the accuracy of target edge segmentation.

Fig. 6
figure 6

Overview of Short-Range Relevance Context Module (SRRC). It captures spatial information through different types of convolutions and pooling to ensure the spatial position of pixels. \(X_\textrm{input}\) represents the input of each residual block, DConv represents \(3 \times 3\) dilated convolution, Conv represents standard convolution, Avg and Max represent average pooling and max pooling. \(O_ \textrm{Output}\) represents the processed feature map output

Short-range relevance context

In the process of network segmentation, as the image resolution decreases, some of the relevant spatial information among pixels in the local neighborhood is often lost, resulting in blurred category locations. To reduce the loss of spatial details, we construct a Short-Range Relevance Context Module (SRRC). Figure 6 shows an overview of the SRRC. As can be seen from Fig. 6, SRRC is an integrated structure and can be flexibly applied to any network. Next, we introduce SRRC in detail.

First, we take the stage features of each residual block as input. Since the number of channels differs across residual blocks, we use a standard 1 \(\times \) 1 convolution to unify the channel number. Second, to enlarge the receptive field, we process the input features with both standard convolution and dilated convolution. Because the number of channels is limited, we set the dilation rate of the dilated convolution to 2; a large dilation rate would lose part of the information and cause a gridding effect [50]. Finally, on the captured surrounding information, we use average pooling and max pooling to compress the channel dimension and retain more spatial details. At the end of the module, we restore the feature map to \(C \times H \times W\) through a connection with the initial features to ensure complete spatial information in the output. The whole SRRC is attached to the skip connection to transfer the captured spatial information to the decoder, which starts from the low-resolution stage. In this way, the skip connections at different stages provide not only the original information but also the spatial details of each stage. The output \(O_\textrm{output}\) of SRRC can be formally expressed as:

$$\begin{aligned} O_\textrm{output}=\textrm{Sig}\left( {\text {Max}}\left( X \right) \otimes \textrm{Avg}\left( X \right) \right) \otimes X, \end{aligned}$$
(8)

where, Max represents max pooling, Avg represents average pooling, Sig represents the sigmoid function, and the input X represents the fused surrounding pixel information, which is formally expressed as:

$$\begin{aligned} X=C_3\left( x \right) \oplus C_{3}^{k}(x), \end{aligned}$$
(9)

where, x represents the input feature \(X_ \textrm{input}\), \(C_3\) represents the standard \(3 \times 3\) convolution, and \(C_{3}^{k}\) represents a \(3 \times 3 \) dilated convolution with dilation rate k.
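Below is a sketch of SRRC under our reading of Eqs. (8)-(9) and Fig. 6: a 1 x 1 convolution unifies channels, standard and dilated 3 x 3 convolutions are summed (Eq. 9), channel-wise max and average pooling form a sigmoid gate (Eq. 8), and the result is combined with the initial features. Channel sizes and the exact form of the final restoration are assumptions, not specified by the paper.

```python
import torch
import torch.nn as nn

class SRRC(nn.Module):
    """Sketch of the Short-Range Relevance Context module (Eqs. 8-9, Fig. 6);
    channel sizes and the final restoration with the initial features are assumptions."""

    def __init__(self, in_channels, channels, k=2):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, channels, kernel_size=1)               # unify channel number
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)          # standard 3x3 conv
        self.dconv = nn.Conv2d(channels, channels, kernel_size=3, padding=k, dilation=k)  # dilated conv

    def forward(self, x_input):
        x0 = self.reduce(x_input)
        x = self.conv(x0) + self.dconv(x0)          # Eq. (9): fused surrounding information
        max_map, _ = x.max(dim=1, keepdim=True)     # max pooling over channels: B x 1 x H x W
        avg_map = x.mean(dim=1, keepdim=True)       # average pooling over channels
        gate = torch.sigmoid(max_map * avg_map)     # Eq. (8): spatial gate from the two pooled maps
        return gate * x + x0                        # gated features plus the initial connection
```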

Note that our Short-Range Relevance Context module (SRRC) aims to obtain effective spatial information from the residual blocks and pass it to the decoder through skip connections. Low-level feature maps require capturing more spatial information, while semantic information is more concentrated in high-level feature maps [10]. Therefore, SRRC only uses small convolutions to obtain short-range spatial information around the target, rather than large convolutions to capture long-range context. To obtain as much information as possible, we use standard convolution and dilated convolution to enlarge the receptive field. At the same time, to reduce information redundancy and extract effective spatial information, we use max pooling and average pooling to compress the channels, preserve more spatial location information, and thereby better provide spatial context for the network.

Experiments

In this section, we compare our LSRRCNet with ten other semantic segmentation methods on three standard segmentation datasets, namely PASCAL VOC2012 [17], Cityscapes [51] and ADE20K [52]. We first introduce the three experimental datasets and the experimental parameter settings, then conduct ablation experiments on the PASCAL VOC2012 dataset for our proposed modules. Finally, we compare specific segmentation results on the three datasets mentioned above.

Datasets

PASCAL VOC2012

This dataset is a well-known standard semantic segmentation benchmark and one of the most widely used datasets. It contains 20 foreground object classes and 1 background class. The whole dataset includes 1464 training images, 1449 validation images and 1456 test images.

Cityscapes

Cityscapes is a large high-resolution dataset for semantic understanding of driving and street scenes; all of its images are street views of real scenes. The dataset covers 50 European cities and contains 24,998 images of urban streets, urban backgrounds and different scenes. The images are annotated with 30 categories, including people, vehicles, traffic signs, buildings, ground, nature and sky. The dataset is divided into two groups according to annotation quality: 19,998 images with coarse annotations and 5000 images with fine annotations. For training, the finely annotated images are further divided into training, validation and test sets, with 2975, 500 and 1525 images, respectively.

ADE20K MIT

This dataset is a standard benchmark for visual scene parsing. It covers annotations of scenes, objects, and object parts. It contains 22K densely annotated images and 150 fine-grained semantic concepts. The training set and validation set contain 20K and 2K images, respectively.

Experimental settings

Following most previous work [32, 53,54,55], we use horizontal flipping and random scaling to augment the datasets, and use stochastic gradient descent (SGD) with cross-entropy loss to optimize the model [46]. We also adopt the poly learning rate strategy, in which the initial learning rate is multiplied by \( \left( 1-\frac{{\text {iter}}}{{\text {total}}\_{\text {iter}}} \right) ^\textrm{power} \) after each iteration, with the power set to 0.9. We use pixel accuracy (PA) and mean intersection over union (mIOU) as evaluation metrics [56].
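As a concrete illustration of the schedule and metrics above, the snippet below computes the poly learning rate and evaluates PA and mIOU from a confusion matrix; it is a standard formulation and not necessarily identical to the evaluation code used in the experiments.

```python
import numpy as np

def poly_lr(base_lr, cur_iter, total_iter, power=0.9):
    """Poly schedule used above: base_lr * (1 - iter / total_iter) ** power."""
    return base_lr * (1.0 - cur_iter / total_iter) ** power

def pa_and_miou(conf_matrix):
    """Pixel accuracy and mean IoU from a num_classes x num_classes confusion matrix
    (rows: ground truth, columns: prediction); a standard formulation."""
    conf = conf_matrix.astype(np.float64)
    tp = np.diag(conf)
    pa = tp.sum() / conf.sum()
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1e-12)
    return pa, float(np.mean(iou))
```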

We use an initial learning rate of 0.01 and a weight decay of 0.0005 on both the PASCAL VOC2012 and ADE20K datasets, with the batch size set to 16 and the number of training iterations to 160K. For the Cityscapes dataset, we set the weight decay to 0.0001 and the batch size to 8, and conduct experiments on the validation set. The results of ANN [16], DANet [32] and OCRNet [57] are taken from SA-FFNet [58]. To ensure a fair comparison, we implement the following methods with the same training settings on each dataset: FCN [18], DeepLab [59], PSPNet [31], DeepLabv3+ [30], OCNet [60], DenseASPP [61].

Table 1 The PA and mIOU of the LRRC module based on two backbone networks on the PASCAL VOC2012 dataset (LRG and LRL represent the long-range global and long-range local relevance contexts of our LRRC module)
Table 2 Experimental comparison of PA and mIOU results between the LRRC module we propose and other modules (PPM, ASPP, MPM) on the PASCAL VOC2012 dataset

Ablation analysis

In this subsection, we perform ablation experiments on the PASCAL VOC2012 dataset for the two proposed modules, namely the Long-Range Relevance Context module (LRRC) and the Short-Range Relevance Context module (SRRC). In the ablation study, we set the number of training iterations to 100K.

Ablation studies for LRRC

To verify the performance of the LRRC module in our proposed LSRRCNet, we conduct ablation experiments on it. Table 1 shows the PA and mIOU of the LRRC ablation experiments. We split LRRC into two parts for the ablation: one contains only the long-range global context (LRG), and the other only the long-range local context (LRL). We analyze both on the ResNet50 and ResNet101 backbones. As seen in Table 1, LRRC with ResNet50 as the base network (LRG only) reaches \(94.32\%\) PA and \(75.20\%\) mIOU. Compared to the baseline network, PA increases by \(2.79\%\) and mIOU by \(5.02\%\). In addition, LRRC with ResNet50 as the base network (LRL only) reaches a PA of \(94.09\%\) and a mIOU of \(74.38\%\). Compared to the baseline network, PA increases by \(2.56\%\) and mIOU by \(4.2\%\). Whether it includes only the long-range global context or only the long-range local context, its performance is better than the baseline network, which demonstrates the effectiveness of both parts. To further improve the utilization of relevance information, we fuse the two parts into LRRC to provide better image segmentation quality.

To further demonstrate the effectiveness of our proposed LRRC module, we conduct numerical experiments comparing LRRC with three modules of similar functionality, namely PPM, ASPP, and MPM. From their respective papers, PPM [31], ASPP [30], and MPM [49] are all effective at capturing rich semantic information at the low-resolution stage. To ensure fairness, we keep the parameter settings consistent across all comparative experiments. Table 2 shows the PA and mIOU of the LRRC comparison. As can be seen from Table 2, LSRRCNet based on ResNet50 reaches \(94.65\%\) PA and \(76.84\%\) mIOU, and the mIOU of our proposed LRRC is \(3.14\%\) higher than PPM. LSRRCNet based on ResNet101 reaches \(96.24\%\) PA and \(78.18\%\) mIOU, and the mIOU of LRRC is \(1.14\%\) higher than MPM. Both PA and mIOU of our proposed LRRC, based on ResNet50 and ResNet101, are superior to PPM, ASPP and MPM. The main reason is that the LRRC module captures not only the long-range global context information but also the long-range local relevance context information; the combination of these two types of information improves information utilization and provides richer context.

Table 3 Experimental comparison of PA and mIOU results between our proposed SRRC module and the SPM module on the PASCAL VOC2012 dataset

Ablation studies for SRRC

To demonstrate the advantages of the SRRC module, we conduct ablation experiments on different backbone networks (ResNet50 and ResNet101). Since SPM [49] can be easily embedded in any building block to capture spatial context information, we conduct a comparative experiment with our proposed SRRC module. Table 3 shows the PA and mIOU of the SRRC comparison. As can be seen from Table 3, the SRRC module is superior to SPM on both baseline networks (ResNet50 and ResNet101). Based on ResNet50, our SRRC module achieves a pixel accuracy of \(94.65\%\) and a mIOU of \(76.84\%\), which are \(0.74\%\) and \(3.03\%\) higher than the PA and mIOU of SPM, respectively. Based on ResNet101, our SRRC module achieves a pixel accuracy of \(96.24\%\) and a mIOU of \(78.18\%\). Our proposed SRRC is superior to SPM regardless of the baseline network. The main reason is that our SRRC module enlarges the receptive field of pixels to capture more spatial information, filters the channel information through different pooling operations, preserves the spatial location information of related pixels, and better transmits the spatial details to the decoder through skip connections to restore pixel location information, thereby maintaining the consistency of semantic and spatial information.

Table 4 Comparison of the PA and mIOU of our proposed LSRRCNet and ten other methods on the PASCAL VOC2012 dataset
Table 5 Comparison of the PA and mIOU of our proposed LSRRCNet and ten other methods on the ADE20K dataset
Table 6 Comparison of the PA and mIOU of our proposed LSRRCNet and ten other methods on the Cityscapes dataset
Table 7 Comparison of numerical results of IOU and mIOU (%) on PASCAL VOC2012 dataset between LSRRCNet and other ten methods

Segmentation performances and comparisons

In this section, we test the segmentation and visualization results of the proposed LSRRCNet on three standard semantic segmentation datasets, and compare it with ten semantic segmentation methods to demonstrate the effectiveness of LSRRCNet in semantic segmentation applications.

PASCAL VOC2012

We verify the effectiveness of our proposed LSRRCNet through comparative experiments on the PASCAL VOC2012 dataset. Table 4 compares the PA and mIOU of our proposed LSRRCNet with ten other methods on this dataset. As can be seen from Table 4, our numerical segmentation results are superior to the other methods. Using the same ResNet101 backbone, our LSRRCNet achieves a PA of \(96.24\%\) and a mIOU of \(78.18\%\), which is \(15.98\%\) higher than the classic FCN [18] network. Our LSRRCNet is also superior to three representative networks of recent years, namely DenseASPP [61], OCRNet [57] and OCNet [60]. Different from the context captured by other methods, our method gathers long-range context information in the low-resolution feature map, including both global and local context; it can model not only the relationships among global pixels but also the local semantic relationships between pixels. To further demonstrate the advantages of our method, Table 7 in the appendix compares the per-class IoU and mIOU (%) of LSRRCNet and the other ten methods on the PASCAL VOC2012 dataset. As can be seen from Table 7, our proposed LSRRCNet outperforms the other methods on 14 classes.

Cityscapes

In this section, we present the results of our proposed LSRRCNet on the Cityscapes dataset. Table 6 compares the PA and mIOU of our proposed LSRRCNet with ten other methods on the Cityscapes dataset. As can be seen from Table 6, our proposed LSRRCNet has clear advantages. With the ResNet101 backbone, our LSRRCNet reaches \(78.65\%\) mIOU, which is \(8.24\%\) higher than DeepLab [59]. At the same time, the PA of our LSRRCNet is \(97.41\%\), which is \(1.63\%\) higher than the other methods. Therefore, our LSRRCNet retains its advantage in both PA and mIOU. Unlike other methods, ours mainly provides rich pixel-level spatial detail: the short-range relevance context module obtains more spatial context by enlarging the feature receptive field, then filters and reselects the captured spatial information through different pooling operations and removes redundant information, which improves the accuracy of spatial detail and thus the segmentation quality on this street-scene dataset. In addition, we conduct comparative experiments on each category of the dataset. Table 8 in the appendix compares the per-class IoU and mIOU (%) of LSRRCNet and the other ten methods on the Cityscapes dataset. As can be seen from Table 8, our LSRRCNet leads on 13 classes of this dataset.

ADE20K

To further demonstrate the effectiveness of our proposed LSRRCNet, we conduct experiments on the more challenging ADE20K dataset. Table 5 compares the PA and mIOU of our proposed LSRRCNet with ten other methods on the ADE20K dataset. As can be seen from Table 5, the pixel accuracy (PA) of LSRRCNet is \(86.73\%\) and the mIOU is \(40.31\%\); both results are significantly better than those of the other ten methods. The ADE20K dataset has many semantic categories and contains rich category information. Unlike other methods, our LSRRCNet exploits this rich category information through its encoder-decoder structure. Our proposed LRRC module improves the utilization of category information by collecting local and global relevance information, and increases the accuracy of category segmentation. Therefore, our LSRRCNet achieves better segmentation performance and is superior to the other methods.

Table 8 Comparison of numerical results of IOU and mIOU (%) on Cityscapes dataset between LSRRCNet and other ten methods

Visual comparison

To illustrate the visual advantages of LSRRCNet, we compare it with three methods on the Cityscapes dataset in Fig. 7, namely PSPNet [31], OCNet [60] and DeepLabv3+ [30]. As can be seen from Fig. 7, the proposed LSRRCNet improves the utilization of relevance information through its encoder-decoder structure and successfully segments both small target cars and densely occluded crowds. In addition, the lack of local relevance information at the high-level stage is also successfully alleviated, as shown by the segmentation and localization of the "overlapping people and vehicles" and the "traffic warning lights" in the second row. Therefore, our proposed LSRRCNet can effectively alleviate the under-utilization of feature relevance information.

We also compare our proposed LSRRCNet with three other methods on the VOC2012 dataset in Fig. 8. As can be seen from Fig. 8, LSRRCNet successfully segments people, cars and animals, with correct semantic categories and clear outlines. Our proposed SRRC module transmits spatial information from each encoder stage to the corresponding decoder stage through skip connections, thus improving the spatial localization of each semantic category and producing clear boundary contours. We also randomly select six images from the VOC2012 dataset and compare the segmentation results of the AGLN [25] method and our proposed LSRRCNet. The results are shown in Fig. 9; our proposed LSRRCNet preserves better image details, such as the wings of an aircraft, the tail of a horse, and the contact surface between a person and a dining table, while correctly distinguishing categories. These comparisons demonstrate that our method increases the utilization of semantic context, improves the ability to distinguish complex categories, and increases the quality of semantic segmentation. Therefore, from the perspective of visual analysis, our proposed LSRRCNet is effective for semantic segmentation.

Fig. 7
figure 7

Comparison of image visual segmentation results between our proposed LSRRCNet and three other semantic segmentation methods on the Cityscapes dataset: a input; b ground truth; c PSPNet [31]; d OCNet [60]; e DeepLabv3+ [30]; f ours

Fig. 8
figure 8

Comparison of image visual segmentation results between our proposed LSRRCNet and three other semantic segmentation methods on the PASCAL VOC dataset: a image; b ground truth; c PSPNet [31]; d OCNet [60]; e DeepLabv3+ [30]; f ours

Fig. 9
figure 9

Visual segmentation results on AGLN and our proposed LSRRCNet: a image; b ground truth; c AGLN [25]; d our proposed LSRRCNet

Conclusion

In this paper, we propose a Long and Short-Range Relevance Context Network (LSRRCNet) for semantic segmentation. Specifically, our proposed LRRC module aggregates the long-range global semantic context and local spatial context of the high-level feature map, improves the utilization of relevance semantic context information, and guides the semantic classification of high-level features. Our proposed SRRC module captures stage-wise spatial context information from each low-level feature map and passes it on through skip connections, which improves the utilization of relevant spatial context and enhances the spatial localization of semantic categories. The whole network uses an encoder-decoder structure to make the most of relevant context information and thereby improve the segmentation results. Experimental results show that our proposed LSRRCNet is effective.

Our approach initially alleviates the problem of insufficient context information for simple images, but the results for complex-background image segmentation still need to be improved. Processing images with complex backgrounds requires not only sufficient context information but also a greater focus on pixel-to-pixel relationships. For example, overlapping targets, small targets and targets with varied shapes remain difficult cases for semantic segmentation of complex images and will be the focus of our future work.