Introduction

Semantic segmentation is a fundamental task in computer vision. It aims to classify each pixel in an image into a category and thereby divide the image into regions. Image segmentation has been applied to a variety of fields, such as medical image diagnosis [1,2,3,4], human posture estimation, autonomous driving [5,6,7,8] and remote sensing satellite surveys [9,10,11].

It is well known that pixels within a local neighborhood of an image are closely connected, and are also indirectly connected to the remaining pixels of the image. It is therefore important for semantic segmentation to capture both the relationships among pixels within a local neighborhood and the relationships between those pixels and pixels outside the neighborhood. The importance of context information extends beyond segmentation to other fields as well [12, 13]. In [14,15,16], the relationship between a pixel and other pixels is called relevance context information. In this paper, we call the relationships among all pixels within a local neighborhood short-range relevance information, and the relationships between pixels in the local neighborhood and pixels outside it long-range relevance information. However, most semantic segmentation algorithms based on deep convolutional neural networks fail to capture both kinds of relevance information, which can result in inaccurate segmentation of small or large targets and incomplete edges. For example, Fig. 1 shows segmentation results of FCN on the PASCAL VOC dataset. In the first row, the spatial details and semantic information of the horse's legs are missed; in the second row, when a large target overlaps a small target, the small target is ignored; in the fourth row, the overall contour of the sofa is confused and incomplete. It is evident that insufficient use of relevant pixel information seriously degrades segmentation quality.

Fig. 1
figure 1

Visualization examples of the segmentation results on PASCAL VOC dataset [17]. From left to right: a image; b FCN [18]; c ground truth

To address the above problems, one strategy is to use an encoder-decoder structure to capture rich context and improve the utilization of relevance semantic information. For example, U-Net [19] forms a U-shaped network that encodes and decodes semantic information through skip connections; SegNet [20] uses an encoder to downsample image features and a decoder to restore image information; the Second-Order Coding Network [21] builds its encoder from different pooling layers to collect semantic information and recovers part of the missing information by up-sampling in the decoder. The methods in [22,23,24] fuse information from different stages through an encoder-decoder structure for image semantic segmentation. AGLNNet [25] integrates encoder information through global enhancement and local refinement to improve segmentation quality. CGBNet [26] enhances the pixel pairing result through context encoding and multi-path decoding, so that it can adaptively select an appropriate score map from a rich set of feature scales. Built-in Depth-Semantic Coupled Encoding [27] selectively highlights discriminative depth features by coupling depth and semantic information during encoding; Attribute-Aware Feature Encoding [28] uses feature encoding regularization and auxiliary attribute learning to improve semantic segmentation. In Compensating for Local Ambiguity With Encoder-Decoder [29], a context aggregation module (CAM) built from an encoder-decoder with appropriate sampling scales captures semantic information. Although these encoder-decoder structures achieve good results through different designs, they still do not make full use of the spatial information in different feature maps, and therefore still face the problem of insufficient relevance semantic information across feature maps.

Other methods improve the capture of semantic information through specific semantic context modules. For example, the DeepLab series [30] uses atrous convolutions with different atrous rates to form an atrous spatial pyramid pooling (ASPP) module to obtain semantic information. The Pyramid Scene Parsing Network [31] improves the utilization of semantic information at different scales through a pyramid pooling module (PPM) with pooling of different sizes; DANet [32] combines spatial and channel attention modules for semantic segmentation; the Criss-Cross Attention Network [33] collects context information from all pixels of a given image by stacking two criss-cross attention modules; the Adaptive Pyramid Context Network [34] constructs multiscale context information for different regions by designing adaptive context modules; Co-occurrent Features [35] explores semantic context representation using co-occurrent features through an Aggregated Co-occurrent Feature (ACF) module; the Gated Path Selection Network [36] further introduces a prediction module to dynamically select the required semantic context; the Context-based Tandem Network [37] collects semantic feature maps and category-level context through spatial context (SCM) and channel context (CCM) modules. The Attention-guided Chained Context Aggregation Network [38] designs a context aggregation module (CAM) to propagate diverse feature contexts in a series-parallel hybrid manner. Designing a dedicated context module is an effective way to obtain semantic information, but such a module can only capture certain semantic information, and its use of the relevance spatial information contained in different stages is limited.

Failure to make full use of image pixel relevance information amounts to failing to capture rich semantic information and effective spatial details. These problems lead to unsatisfactory segmentation results, including misclassification of large objects, rough segmentation boundaries and neglect of small objects. To alleviate these problems, we propose a new network, the Long and Short-Range Relevance Context Network (LSRRCNet). Specifically, we adopt an encoder-decoder segmentation network to address the insufficient utilization of relevance information across different feature maps during segmentation. First, we design a Long-Range Relevance Context (LRRC) module that combines global and local contexts in the high-level feature map of the encoder to obtain the relevance context between pixels in a local neighborhood and pixels outside it, thereby improving the utilization of relevance information in the high-level feature map. In addition, a Short-Range Relevance Context (SRRC) module is proposed to capture the spatial details of pixels in the local neighborhood at each low-level encoding stage and provide them to each decoding stage through skip connections, so as to better recover the spatial locations of semantic categories. Our proposed LSRRCNet captures semantic and spatial information from feature maps at different encoding stages and transmits the relevant context to each decoding stage via skip connections, which improves information utilization, better restores image pixel information, and thus improves segmentation quality.

Our main contributions are as follows:

  1.

    We propose a Long and Short-Range Relevance Context Network (LSRRCNet). It aggregates long-range global and local relevance contexts through the encoder, and uses skip connections to pass spatial contexts to the decoder. The whole network uses an encoder-decoder structure to better alleviate the insufficient use of relevance context in image segmentation.

  2.

    We propose a Long-Range Relevance Context Module (LRRC), which consists of two parts: long-range global context and long-range local context. It combines global and local feature information from the high-level feature map of the encoder to provide rich semantic information for pixel classification.

  3.

    We design a Short-Range Relevance Context Module (SRRC), which is composed of different types of convolutions and pooling operations. It transfers the spatial information captured in the encoding stage to each decoding stage through skip connections, improves the utilization of spatial relevance information, and thus improves the spatial localization of semantic categories.

Related work

In this section, we introduce popular methods for capturing relevance global and local context.

Long-range relevance context information in semantic segmentation

Long-range relevance context information contains both global and local contexts. The effectiveness of long-range relevance context in semantic segmentation has been well demonstrated in recent years. For example, PSPNet [31] collects global context information through global average pooling, and this capture of global context improves the representation of multiscale features. GSCNet [39] designs a globally-guided global module and a globally-guided local module that flexibly select different global and local context information for each pixel. ParseNet [40] adds global context to a fully convolutional network, capturing global context at arbitrary sizes. ENCNet [41] captures the global context using an encoding module, improving the use of context information. These methods capture the global context well on different benchmarks and improve segmentation performance, but they ignore local context information while capturing the global context. In the same vein, our network also focuses on capturing global context information, but our emphasis is on combining global and local contexts to provide richer semantic information.

Short-range relevance context information in semantic segmentation

Short-range relevance context is mainly the local spatial context of a pixel, which is crucial for class localization and boundary delineation. Many methods use low-level feature maps to capture short-range relevance context. For example, RefineNet [42] obtains spatial detail by refining feature maps at different stages; this spatial context is derived from each low-level feature map and plays an important role in refining pixel-level spatial detail. DeepLabv3+ [30] fuses the low-level local context with the semantic context through a dedicated decoder design, thereby better recovering the spatial detail of images. FBSNet [43] directly designs a spatial detail branch that establishes short-range dependencies for each pixel to preserve spatial detail information. DLA-Net [44] proposes a dual local attention feature module to fully exploit the spatial information of the local neighborhood; such dedicated module design is also an important way to capture spatial detail. RandLA-Net [45] introduces a local feature aggregation module to enlarge the receptive field, thus effectively preserving spatial detail information. All of these methods rely on low-level local context to provide spatially detailed information about pixels, but little work has explored the same role for high-level local context. In this paper, we make use of not only the low-level local spatial context but also the high-level short-range local information, enhancing the spatial detail of image pixels with local spatial context from multiple stages to improve the spatial localization of pixels.

Fig. 2
figure 2

Overview of LSRRCNet. ResNet is the backbone of the encoding phase. The Long-Range Relevance Context Module obtains the long-range relevance context. The Short-Range Relevance Context Module extracts the spatial context at each encoding stage. The whole network adopts an encoder-decoder structure for semantic segmentation

Our proposed method

In this section, we first introduce the architecture of our Long and Short-Range Relevance Context Network (LSRRCNet), then introduce our proposed Long-Range Relevance Context Module (LRRC), and finally describe the proposed Short-Range Relevance Context Module (SRRC).

Overview

The structure of our proposed LSRRCNet is shown in Fig. 2. We use an encoder-decoder structure as the main architecture of the whole network. First, a pretrained residual network, ResNet [46], is used as the encoder to extract feature information at each stage. We use dilated convolution so that the feature resolutions of the ResNet blocks are 1/4, 1/8, 1/16 and 1/16 of the input. Because the last residual block is rich in semantic information, we apply the Long-Range Relevance Context Module (LRRC) to the last residual block to obtain rich long-range relevance semantic information. At the same time, to improve the spatial information utilization at each stage, we attach the Short-Range Relevance Context Module (SRRC) to the first three residual blocks and transmit the spatial information from each encoding stage to the corresponding decoder through skip connections. Finally, the semantic information captured by LRRC serves as the first decoder stage, and the spatial details transmitted by each skip connection are then fused stage by stage through Flow Fusion [47] to form the decoder. The whole encoder-decoder network is used for the dense pixel prediction segmentation task.

Our proposed LSRRCNet aims to improve information utilization and thereby perform better pixel segmentation. LRRC captures the long-range global and local relevance semantic information of the high-level feature map in the encoder for image pixel classification. SRRC captures the spatial information of the feature maps at each encoder stage and transmits it to the corresponding decoder stage through skip connections to recover spatial details. Our whole model adopts an encoder-decoder structure, with the backbone network used as the encoder to reduce the resolution and extract image features. The structure of each module is clear, simple and easy to implement, and can be plugged into any network end to end.
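To make the data flow above concrete, the following is a minimal PyTorch-style sketch of how the encoder, LRRC, the SRRC skip connections and the stage-wise decoder fusion could be wired together. It is our illustration, not the authors' released implementation; the backbone is assumed to return the four stage feature maps, and the LRRC, SRRC and fusion modules are passed in as components whose internals are described in the following subsections.

```python
import torch.nn as nn
import torch.nn.functional as F

class LSRRCNetSketch(nn.Module):
    """Rough wiring of the encoder-decoder flow described above; not the authors' code."""

    def __init__(self, backbone, lrrc, srrc_blocks, fuse_blocks, decoder_channels, num_classes):
        super().__init__()
        self.backbone = backbone                    # ResNet stages -> features at 1/4, 1/8, 1/16, 1/16
        self.lrrc = lrrc                            # long-range relevance context on the last stage
        self.srrcs = nn.ModuleList(srrc_blocks)     # one SRRC per earlier stage (skip connections)
        self.fuse = nn.ModuleList(fuse_blocks)      # stage-wise decoder fusion (e.g. Flow Fusion [47])
        self.classifier = nn.Conv2d(decoder_channels, num_classes, kernel_size=1)

    def forward(self, x):
        c1, c2, c3, c4 = self.backbone(x)           # encoder features, shallow -> deep
        out = self.lrrc(c4)                         # semantic context forms the first decoder stage
        for skip, srrc, fuse in zip((c3, c2, c1), self.srrcs, self.fuse):
            out = fuse(out, srrc(skip))             # inject short-range spatial context stage by stage
        logits = self.classifier(out)
        return F.interpolate(logits, size=x.shape[2:], mode='bilinear', align_corners=False)
```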

Fig. 3
figure 3

Overview of Long-Range Relevance Context Module (LRRC). It is mainly composed of two branches, the long-range global context (upper) and the long-range local context (lower), which are fused through matrix multiplication

Fig. 4
figure 4

Overview of long-range global context feature information. It first captures adjacent information by strip pooling, and then applies self-attention to obtain long-range global information

Fig. 5
figure 5

Overview of long-range local context feature information. It captures the surrounding information by two convolutions, and then obtains the long-range local information by channel weight reconstruction

Long-range relevance context

Semantic information is enriched as the resolution of the image decreases. Global context information is typically obtained from the relatively low-resolution feature stage, but the local context information at this stage is often ignored. The low-resolution feature map contains both global and local relevance context information [48]. If only the global semantic context of the low-resolution feature map is obtained while the local context is ignored, small target objects may be misclassified or the segmentation edges between objects may become blurred. To obtain the complete global and local relevance context information of the low-resolution feature map, we design a Long-Range Relevance Context Module (LRRC), whose structure is shown in Fig. 3. As can be seen from Fig. 3, the module processes the input in two different branches. Next, we introduce LRRC in detail.

LRRC is composed of two branches of feature information. The first branch captures relevant information through strip pooling and processes global image features with a self-attention mechanism, forming a complete long-range global map \(A_\textrm{feature}\). The second branch captures the local information around image pixels using standard convolution and dilated convolution to obtain the long-range local map \(B_\textrm{feature}\). Finally, the captured long-range global and local relevance context information is fused by matrix multiplication. To better propagate gradients, we also add a residual connection with the initial feature map. Formally, for the high-level feature map \(X_\textrm{input}\), the output \(O_\textrm{output}\) can be written as:

$$\begin{aligned} O_\textrm{output}=A_\textrm{feature}\otimes B_\textrm{feature}\oplus x, \end{aligned}$$
(1)

where, \(A_\textrm{feature}\) represents the long-range global context feature information, \(B_\textrm{feature}\) represents the long-range local context feature information, \(\otimes \) represents matrix multiplication, \(\oplus \) represents element-wise summation, and x is the input feature map \(X_\textrm{input}\).

Figure 4 shows an overview of the long-range global context feature information \(A_\textrm{feature}\). As can be seen from Fig. 4, for the input feature map \(X_\textrm{input}\), we first use horizontal and vertical strip pooling to capture strip information \(S_\textrm{h}\in R^{1\times H}\) and \(S_\textrm{v}\in R^{1\times W}\). We then expand these strips along their adjacent positions to obtain \(S_{i}^\textrm{h}\in R^{C\times H}\) and \(S_{j}^\textrm{v}\in R^{C\times W}\). To obtain more global information, we combine \(S_{i}^\textrm{h}\) and \(S_{j}^\textrm{v}\) to form \(S_\textrm{in}\in R^{C\times H\times W}\). The formula of \(S_\textrm{in}\) can be expressed as:

$$\begin{aligned} S_\textrm{in}=S_{i}^\textrm{h}+S_{j}^\textrm{v}, \end{aligned}$$
(2)

where, \(S_{i}^\textrm{h}\) and \(S_{j}^\textrm{v}\) are the horizontal and vertical adjacent features, respectively, which can be expressed as:

$$\begin{aligned} S_{i}^\textrm{h}= & {} \frac{1}{W}\sum _{0\le j<W}{x_{i,j}}, \end{aligned}$$
(3)
$$\begin{aligned} S_{j}^\textrm{v}= & {} \frac{1}{H}\sum _{0\le i<H}{x_{i,j}}, \end{aligned}$$
(4)

Then we reshape \(S_\textrm{in}\) to \( R^{C\times N}\), where \(N = H \times W\) is the number of pixels. We perform matrix multiplication between \(S_\textrm{in}\) and its transpose and apply a softmax layer to calculate the global attention map \(S_\textrm{mid}\in R^{N\times N}\):

$$\begin{aligned} S_{\textrm{mid}-ji}=\frac{\exp \left( S_{\textrm{in}-i}\cdot S_{{\text {in}}-j} \right) }{\sum _{i=1}^N{\exp \left( S_{\textrm{in}-i}\cdot S_{{\text {in}}-j} \right) }}, \end{aligned}$$
(5)

where, \(S_{\textrm{mid}-ji}\) measures the impact of the ith pixel on the jth pixel. The more similar the feature representations of the two pixels are, the greater the relevance between them. We then perform a matrix multiplication between the transpose of \(S_\textrm{in}\) and the transpose of \(S_\textrm{mid}\) and reshape the result to \(R^{C\times H\times W}\). Finally, we perform an element-wise sum with the input feature map \(X_\textrm{input}\) to obtain the final output \(A_\textrm{feature}\in R^{C\times H\times W}\) as follows:

$$\begin{aligned} A_\textrm{feature}=\sum _{i=1}^N{\left( S_{\textrm{mid}-ji}\cdot S_\textrm{in} \right) +X_\textrm{input}}. \end{aligned}$$
(6)
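The following is a minimal sketch of this global branch as we read Eqs. (2)-(6), assuming a PyTorch implementation: strip pooling is realized as mean pooling over one spatial dimension, and the attention follows a standard non-local formulation. The exact reshaping and normalization in the authors' code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LongRangeGlobalContext(nn.Module):
    """Sketch of the global branch A_feature (Eqs. 2-6); our reading, not the authors' code."""

    def forward(self, x):
        b, c, h, w = x.shape
        s_h = x.mean(dim=3, keepdim=True)                        # horizontal strip pooling, Eq. (3)
        s_v = x.mean(dim=2, keepdim=True)                        # vertical strip pooling,   Eq. (4)
        s_in = s_h.expand(b, c, h, w) + s_v.expand(b, c, h, w)   # Eq. (2), broadcast to B x C x H x W
        s_flat = s_in.view(b, c, -1)                             # B x C x N, with N = H * W
        energy = torch.bmm(s_flat.transpose(1, 2), s_flat)       # pairwise similarities, B x N x N
        attn = F.softmax(energy, dim=-1)                         # Eq. (5): global attention map
        out = torch.bmm(s_flat, attn.transpose(1, 2))            # aggregate features, Eq. (6) first term
        return out.view(b, c, h, w) + x                          # residual connection with the input
```

Since the N x N attention map grows quadratically with the number of pixels, this branch is only applied at the low-resolution (last) encoder stage, as described above.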

The high-level feature map contains rich long-range global semantic information, but the local feature information around a target pixel is under-utilized. To capture more long-range local feature information, we first capture the surrounding information of long-range pixels through standard convolution and dilated convolution, then obtain their channel weights through average pooling, and finally select channel weights through a residual connection to capture the long-range local correlation information. Figure 5 shows an overview of the long-range local context feature information \(B_\textrm{feature}\). The long-range local map \(B_\textrm{feature}\) can be written as:

$$\begin{aligned} B_\textrm{feature}={\text {Avg}}(C_3\left( x \right) \oplus C_{3}^{k}\left( x \right) )\otimes x, \end{aligned}$$
(7)

where, Avg represents average pooling, \(C_3\) represents the standard \(3 \times 3\) convolution, and \(C_{3}^{k}\) represents the \(3 \times 3\) dilated convolution with dilation rate k.
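The local branch of Eq. (7) and the fusion of Eq. (1) might be sketched as follows. This is our reading: average pooling yields per-channel weights that reweight the input, and the fusion is shown as an element-wise product because the exact reshaping behind the stated matrix multiplication is not given. The channel count and the dilation rate k = 2 are assumptions, not values fixed by the paper for this branch.

```python
import torch.nn as nn
import torch.nn.functional as F

class LongRangeLocalContext(nn.Module):
    """Sketch of the local branch B_feature (Eq. 7); channel count and k are assumptions."""

    def __init__(self, channels, k=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.dconv = nn.Conv2d(channels, channels, kernel_size=3, padding=k, dilation=k)

    def forward(self, x):
        ctx = self.conv(x) + self.dconv(x)          # surrounding information, C_3(x) + C_3^k(x)
        weights = F.adaptive_avg_pool2d(ctx, 1)     # per-channel weights via average pooling
        return weights * x                          # channel weight reconstruction of the input


def lrrc_fuse(a_feature, b_feature, x_input):
    """Eq. (1). An element-wise reading of the fusion; the paper states matrix
    multiplication, whose exact reshaping is not specified."""
    return a_feature * b_feature + x_input
```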

Note that our LRRC aims to acquire the long-range global and local relevance feature information of the high-level stage of the encoder. Strip Pooling Net [49] captures the correlation information of different regions through narrow kernel shapes, which helps capture the global context and prevents irrelevant regions from interfering with label prediction. Inspired by it, our global branch uses strip pooling to ensure sufficient pixel semantic information. Strip Pooling Net captures horizontal and vertical information with strip pooling and sums them as long-range relevance information. In contrast, we use strip pooling to capture adjacent pixel information and then apply a self-attention mechanism to capture long-range global relevance semantic information. In this way, even when the resolution is reduced, relevance is preserved between the pixels in a local neighborhood and pixels outside it; this relevance ensures the smoothness and completeness of the segmentation edges between categories. We not only capture the global context information at this stage but also focus on its local correlation information. In addition, we increase the utilization of adjacent pixels of the strip pooling and thus gain more global information. For the local relevance context at the high-level stage, we obtain local context information by enlarging the receptive field around each pixel. In this way, we capture not only rich semantic information but also some spatial information at the high-level feature stage. Capturing these two kinds of information ensures correct semantic classification of target objects and improves the accuracy of target edge segmentation.

Fig. 6
figure 6

Overview of Short-Range Relevance Context Module (SRRC). It captures spatial information through different types of convolutions and pooling to ensure the spatial position of pixels. \(X_\textrm{input}\) represents the input of each residual block, DConv represents \(3 \times 3\) dilated convolution, Conv represents standard convolution, Avg and Max represent average pooling and max pooling. \(O_ \textrm{Output}\) represents the processed feature map output

Short-range relevance context

In the process of network segmentation, as the image resolution decreases, some of the relevant spatial information among pixels in the local neighborhood is often lost, resulting in blurred category locations. To reduce the loss of spatial details, we construct a Short-Range Relevance Context Module (SRRC). Figure 6 shows an overview of the SRRC. As can be seen from Fig. 6, SRRC is an integrated structure and can be flexibly applied to any network. Next, we introduce SRRC in detail.

First, we take the stage features of each residual block as input. Since the number of channels differs across residual blocks, we use a standard 1 \(\times \) 1 convolution to unify the channel number. Second, to enlarge the receptive field, we process the input features with both standard convolution and dilated convolution. Because the number of channels is limited, we set the dilation rate of the dilated convolution to 2; a large dilation rate would lose part of the information and cause a gridding effect [50]. Finally, on the captured surrounding information, we use average pooling and max pooling to compress the channel dimension and retain more spatial details. At the end of the module, we restore the feature map to \(C \times H \times W\) through a connection with the initial features to ensure complete spatial information in the output. The whole SRRC is attached to the skip connection to transfer the captured spatial information to the decoder, which starts from the low-resolution stage. In this way, the skip connections at different stages provide not only the original information but also the spatial details of each stage. The output \(O_\textrm{output}\) of SRRC can be formally expressed as:

$$\begin{aligned} O_\textrm{output}=\textrm{Sig}\left( {\text {Max}}\left( X \right) \otimes \textrm{Avg}\left( X \right) \right) \otimes X, \end{aligned}$$
(8)

where, Max represents max pooling, Avg represents average pooling, Sig represents the sigmoid function, and the input X represents the fused surrounding pixel information, which is formally expressed as:

$$\begin{aligned} X=C_3\left( x \right) \oplus C_{3}^{k}(x), \end{aligned}$$
(9)

where, x represents the input feature \(X_ \textrm{input}\), \(C_3\) represents the standard \(3 \times 3\) convolution, and \(C_{3}^{k}\) represents a \(3 \times 3 \) dilated convolution with dilation rate k.
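Below is a sketch of SRRC under our reading of Eqs. (8)-(9) and Fig. 6: a 1 x 1 convolution unifies channels, standard and dilated 3 x 3 convolutions are summed (Eq. 9), channel-wise max and average pooling form a sigmoid gate (Eq. 8), and the result is combined with the initial features. Channel sizes and the exact form of the final restoration are assumptions, not specified by the paper.

```python
import torch
import torch.nn as nn

class SRRC(nn.Module):
    """Sketch of the Short-Range Relevance Context module (Eqs. 8-9, Fig. 6);
    channel sizes and the final restoration with the initial features are assumptions."""

    def __init__(self, in_channels, channels, k=2):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, channels, kernel_size=1)               # unify channel number
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)          # standard 3x3 conv
        self.dconv = nn.Conv2d(channels, channels, kernel_size=3, padding=k, dilation=k)  # dilated conv

    def forward(self, x_input):
        x0 = self.reduce(x_input)
        x = self.conv(x0) + self.dconv(x0)          # Eq. (9): fused surrounding information
        max_map, _ = x.max(dim=1, keepdim=True)     # max pooling over channels: B x 1 x H x W
        avg_map = x.mean(dim=1, keepdim=True)       # average pooling over channels
        gate = torch.sigmoid(max_map * avg_map)     # Eq. (8): spatial gate from the two pooled maps
        return gate * x + x0                        # gated features plus the initial connection
```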

Note that our Short-Range Relevance Context module (SRRC) aims to obtain effective spatial information from the residual blocks and pass it to the decoder through skip connections. Low-level feature maps require capturing more spatial information, while semantic information is more concentrated in high-level feature maps [10]. Therefore, SRRC only uses small convolutions to obtain short-range spatial information around the target, rather than large convolutions to capture long-range context. To obtain as much information as possible, we use standard convolution and dilated convolution to enlarge the receptive field. At the same time, to reduce information redundancy and extract effective spatial information, we use max pooling and average pooling to compress the channels, preserve more spatial location information, and thereby better provide spatial context for the network.

Experiments

In this section, we compare our LSRRCNet with ten other semantic segmentation methods on three standard segmentation datasets, namely PASCAL VOC2012 [17], Cityscapes [51] and ADE20K [52]. We first introduce the three experimental datasets and the experimental parameter settings, then conduct ablation experiments on the PASCAL VOC2012 dataset for our proposed modules. Finally, we compare specific segmentation results on the three datasets mentioned above.

Datasets

PASCAL VOC2012

This dataset is a well-known standard semantic segmentation benchmark and one of the most widely used datasets. It contains 20 foreground object classes and 1 background class. The whole dataset includes 1464 training images, 1449 validation images and 1456 test images.

Cityscapes

Cityscapes is a large high-resolution dataset for semantic understanding of driving and street scenes; all of its images are street views of real scenes. The dataset covers 50 European cities and contains 24,998 images of urban streets, urban backgrounds and different scenes. The images are annotated with 30 categories, including people, vehicles, traffic signs, buildings, ground, nature and sky. The dataset is divided into two groups according to annotation quality: 19,998 images with coarse annotations and 5000 images with fine annotations. For training, the finely annotated images are further divided into training, validation and test sets, with 2975, 500 and 1525 images, respectively.

ADE20K MIT

This dataset is a standard benchmark for visual scene parsing. It covers annotations of scenes, objects, and object parts. It contains 22K densely annotated images and 150 fine-grained semantic concepts. The training set and validation set contain 20K and 2K images, respectively.

Experimental settings

Following most previous work [32, 53,54,55], we use horizontal flipping and random scaling to augment the datasets, and use stochastic gradient descent (SGD) with cross-entropy loss to optimize the model [46]. We also adopt the poly learning rate strategy, in which the initial learning rate is multiplied by \( \left( 1-\frac{{\text {iter}}}{{\text {total}}\_{\text {iter}}} \right) ^\textrm{power} \) after each iteration, with the power set to 0.9. We use pixel accuracy (PA) and mean intersection over union (mIOU) as evaluation metrics [56].
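As a concrete illustration of the schedule and metrics above, the snippet below computes the poly learning rate and evaluates PA and mIOU from a confusion matrix; it is a standard formulation and not necessarily identical to the evaluation code used in the experiments.

```python
import numpy as np

def poly_lr(base_lr, cur_iter, total_iter, power=0.9):
    """Poly schedule used above: base_lr * (1 - iter / total_iter) ** power."""
    return base_lr * (1.0 - cur_iter / total_iter) ** power

def pa_and_miou(conf_matrix):
    """Pixel accuracy and mean IoU from a num_classes x num_classes confusion matrix
    (rows: ground truth, columns: prediction); a standard formulation."""
    conf = conf_matrix.astype(np.float64)
    tp = np.diag(conf)
    pa = tp.sum() / conf.sum()
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1e-12)
    return pa, float(np.mean(iou))
```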

We use an initial learning rate of 0.01 and a weight decay of 0.0005 on both the PASCAL VOC2012 and ADE20K datasets, with the batch size set to 16 and the number of training iterations to 160K. For the Cityscapes dataset, we set the weight decay to 0.0001 and the batch size to 8, and conduct experiments on the validation set. The results of ANN [16], DANet [32] and OCRNet [57] are taken from SA-FFNet [58]. To ensure a fair comparison, we implement the following methods with the same training settings on each dataset: FCN [18], DeepLab [59], PSPNet [31], DeepLabv3+ [30], OCNet [60], DenseASPP [61].

Table 1 The PA and mIOU of the LRRC module based on two backbone networks on the PASCAL VOC2012 dataset (LRG and LRL represent the long-range global and long-range local relevance contexts of our LRRC module)
Table 2 Experimental comparison of PA and mIOU results between the LRRC module we propose and other modules (PPM, ASPP, MPM) on the PASCAL VOC2012 dataset

Ablation analysis

In this subsection, we perform ablation experiments on the PASCAL VOC2012 dataset for the two proposed modules, namely the Long-Range Relevance Context module (LRRC) and the Short-Range Relevance Context module (SRRC). In the ablation study, we set the number of training iterations to 100K.

Ablation studies for LRRC

To verify the performance of the LRRC module in our proposed LSRRCNet, we conduct ablation experiments on it. Table 1 shows the PA and mIOU of the LRRC ablation experiments. We split LRRC into two parts for the ablation: one contains only the long-range global context (LRG), and the other only the long-range local context (LRL). We analyze both on the ResNet50 and ResNet101 backbones. As seen in Table 1, LRRC with ResNet50 as the base network (LRG only) reaches \(94.32\%\) PA and \(75.20\%\) mIOU. Compared to the baseline network, PA increases by \(2.79\%\) and mIOU by \(5.02\%\). In addition, LRRC with ResNet50 as the base network (LRL only) reaches a PA of \(94.09\%\) and a mIOU of \(74.38\%\). Compared to the baseline network, PA increases by \(2.56\%\) and mIOU by \(4.2\%\). Whether it includes only the long-range global context or only the long-range local context, its performance is better than the baseline network, which demonstrates the effectiveness of both parts. To further improve the utilization of relevance information, we fuse the two parts into LRRC to provide better image segmentation quality.

To further demonstrate the effectiveness of our proposed LRRC module, we conduct numerical experiments comparing LRRC with three modules of similar functionality, namely PPM, ASPP, and MPM. From their respective papers, PPM [31], ASPP [30], and MPM [49] are all effective at capturing rich semantic information at the low-resolution stage. To ensure fairness, we keep the parameter settings consistent across all comparative experiments. Table 2 shows the PA and mIOU of the LRRC comparison. As can be seen from Table 2, LSRRCNet based on ResNet50 reaches \(94.65\%\) PA and \(76.84\%\) mIOU, and the mIOU of our proposed LRRC is \(3.14\%\) higher than PPM. LSRRCNet based on ResNet101 reaches \(96.24\%\) PA and \(78.18\%\) mIOU, and the mIOU of LRRC is \(1.14\%\) higher than MPM. Both PA and mIOU of our proposed LRRC, based on ResNet50 and ResNet101, are superior to PPM, ASPP and MPM. The main reason is that the LRRC module captures not only the long-range global context information but also the long-range local relevance context information; the combination of these two types of information improves information utilization and provides richer context.

Table 3 Experimental comparison of PA and mIOU results between our proposed SRRC module and the SPM module on the PASCAL VOC2012 dataset

Ablation studies for SRRC

To demonstrate the advantages of the SRRC module, we conduct ablation experiments on different backbone networks (ResNet50 and ResNet101). Since SPM [49] can be easily embedded in any building block to capture spatial context information, we conduct a comparative experiment with our proposed SRRC module. Table 3 shows the PA and mIOU of the SRRC comparison. As can be seen from Table 3, the SRRC module is superior to SPM on both baseline networks (ResNet50 and ResNet101). Based on ResNet50, our SRRC module achieves a pixel accuracy of \(94.65\%\) and a mIOU of \(76.84\%\), which are \(0.74\%\) and \(3.03\%\) higher than the PA and mIOU of SPM, respectively. Based on ResNet101, our SRRC module achieves a pixel accuracy of \(96.24\%\) and a mIOU of \(78.18\%\). Our proposed SRRC is superior to SPM regardless of the baseline network. The main reason is that our SRRC module enlarges the receptive field of pixels to capture more spatial information, filters the channel information through different pooling operations, preserves the spatial location information of related pixels, and better transmits the spatial details to the decoder through skip connections to restore pixel location information, thereby maintaining the consistency of semantic and spatial information.

Table 4 Comparison of the PA and mIOU of our proposed LSRRCNet and ten other methods on the PASCAL VOC2012 dataset
Table 5 Comparison of the PA and mIOU of our proposed LSRRCNet and ten other methods on the ADE20K dataset
Table 6 Comparison of the PA and mIOU of our proposed LSRRCNet and ten other methods on the Cityscapes dataset
Table 7 Comparison of numerical results of IOU and mIOU (%) on PASCAL VOC2012 dataset between LSRRCNet and other ten methods

Segmentation performances and comparisons

In this section, we test the segmentation and visualization results of the proposed LSRRCNet on three standard semantic segmentation datasets, and compare it with ten semantic segmentation methods to demonstrate the effectiveness of LSRRCNet in semantic segmentation applications.

PASCAL VOC2012

We verify the effectiveness of our proposed LSRRCNet through comparative experiments on the PASCAL VOC2012 dataset. Table 4 compares the PA and mIOU of our proposed LSRRCNet with ten other methods on this dataset. As can be seen from Table 4, our numerical segmentation results are superior to the other methods. Using the same ResNet101 backbone, our LSRRCNet achieves a PA of \(96.24\%\) and a mIOU of \(78.18\%\), which is \(15.98\%\) higher than the classic FCN [18] network. Our LSRRCNet is also superior to three representative networks of recent years, namely DenseASPP [61], OCRNet [57] and OCNet [60]. Different from the context captured by other methods, our method gathers long-range context information in the low-resolution feature map, including both global and local context; it can model not only the relationships among global pixels but also the local semantic relationships between pixels. To further demonstrate the advantages of our method, Table 7 in the appendix compares the per-class IoU and mIOU (%) of LSRRCNet and the other ten methods on the PASCAL VOC2012 dataset. As can be seen from Table 7, our proposed LSRRCNet outperforms the other methods on 14 classes.

Cityscapes

In this section, we present the results of our proposed LSRRCNet on the Cityscapes dataset. Table 6 compares the PA and mIOU of our proposed LSRRCNet with ten other methods on the Cityscapes dataset. As can be seen from Table 6, our proposed LSRRCNet has clear advantages. With the ResNet101 backbone, our LSRRCNet reaches \(78.65\%\) mIOU, which is \(8.24\%\) higher than DeepLab [59]. At the same time, the PA of our LSRRCNet is \(97.41\%\), which is \(1.63\%\) higher than the other methods. Therefore, our LSRRCNet retains its advantage in both PA and mIOU. Unlike other methods, ours mainly provides rich pixel-level spatial detail: the short-range relevance context module obtains more spatial context by enlarging the feature receptive field, then filters and reselects the captured spatial information through different pooling operations and removes redundant information, which improves the accuracy of spatial detail and thus the segmentation quality on this street-scene dataset. In addition, we conduct comparative experiments on each category of the dataset. Table 8 in the appendix compares the per-class IoU and mIOU (%) of LSRRCNet and the other ten methods on the Cityscapes dataset. As can be seen from Table 8, our LSRRCNet leads on 13 classes of this dataset.

ADE20K

To further demonstrate the effectiveness of our proposed LSRRCNet, we conduct experiments on the more challenging ADE20K dataset. Table 5 compares the PA and mIOU of our proposed LSRRCNet with ten other methods on the ADE20K dataset. As can be seen from Table 5, the pixel accuracy (PA) of LSRRCNet is \(86.73\%\) and the mIOU is \(40.31\%\); both results are significantly better than those of the other ten methods. The ADE20K dataset has many semantic categories and contains rich category information. Unlike other methods, our LSRRCNet exploits this rich category information through its encoder-decoder structure. Our proposed LRRC module improves the utilization of category information by collecting local and global relevance information, and increases the accuracy of category segmentation. Therefore, our LSRRCNet achieves better segmentation performance and is superior to the other methods.

Table 8 Comparison of numerical results of IOU and mIOU (%) on Cityscapes dataset between LSRRCNet and other ten methods

Visual comparison

To illustrate the visual advantages of LSRRCNet, we compare it with three methods on the Cityscapes dataset in Fig. 7, namely PSPNet [31], OCNet [60] and DeepLabv3+ [30]. As can be seen from Fig. 7, the proposed LSRRCNet improves the utilization of relevance information through its encoder-decoder structure and successfully segments both small target cars and densely occluded crowds. In addition, the lack of local relevance information at the high-level stage is also successfully alleviated, as shown by the segmentation and localization of the "overlapping people and vehicles" and the "traffic warning lights" in the second row. Therefore, our proposed LSRRCNet can effectively alleviate the under-utilization of feature relevance information.

We also compare our proposed LSRRCNet with three other methods on the VOC2012 dataset in Fig. 8. As can be seen from Fig. 8, LSRRCNet successfully segments people, cars and animals, with correct semantic categories and clear outlines. Our proposed SRRC module transmits spatial information from each encoder stage to the corresponding decoder stage through skip connections, thus improving the spatial localization of each semantic category and producing clear boundary contours. We also randomly select six images from the VOC2012 dataset and compare the segmentation results of the AGLN [25] method and our proposed LSRRCNet. The results are shown in Fig. 9; our proposed LSRRCNet preserves better image details, such as the wings of an aircraft, the tail of a horse, and the contact surface between a person and a dining table, while correctly distinguishing categories. These comparisons demonstrate that our method increases the utilization of semantic context, improves the ability to distinguish complex categories, and increases the quality of semantic segmentation. Therefore, from the perspective of visual analysis, our proposed LSRRCNet is effective for semantic segmentation.

Fig. 7
figure 7

Comparison of image visual segmentation results between our proposed LSRRCNet and three other semantic segmentation methods on the Cityscapes dataset: a input; b ground truth; c PSPNet [31]; d OCNet [60]; e DeepLabv3+ [30]; f ours

Fig. 8
figure 8

Comparison of image visual segmentation results between our proposed LSRRCNet and three other semantic segmentation methods on the PASCAL VOC dataset: a image; b ground truth; c PSPNet [31]; d OCNet [60]; e DeepLabv3+ [30]; f ours

Fig. 9
figure 9

Visual segmentation results on AGLN and our proposed LSRRCNet: a image; b ground truth; c AGLN [25]; d our proposed LSRRCNet

Conclusion

In this paper, we propose a Long and Short-Range Relevance Context Network (LSRRCNet) for semantic segmentation. Specifically, our proposed LRRC module aggregates the long-range global semantic context and local spatial context of the high-level feature map, improves the utilization of relevance semantic context information, and guides the semantic classification of high-level features. Our proposed SRRC module captures stage-wise spatial context information from each low-level feature map and passes it on through skip connections, which improves the utilization of relevant spatial context and enhances the spatial localization of semantic categories. The whole network uses an encoder-decoder structure to make the most of relevant context information and thereby improve the segmentation results. Experimental results show that our proposed LSRRCNet is effective.

Our approach initially alleviates the problem of insufficient context information for simple images, but the results for complex-background image segmentation still need to be improved. Processing images with complex backgrounds requires not only sufficient context information but also a greater focus on pixel-to-pixel relationships. For example, overlapping targets, small targets and targets with varied shapes remain difficult cases for semantic segmentation of complex images and will be the focus of our future work.