Introduction

Accurate understanding of traffic scenes is essential for autonomous driving. Based on this understanding, autonomous vehicles predict the trajectories of surrounding vehicles and pedestrians; the quality of scene understanding therefore determines the accuracy of the predicted trajectories and, in turn, the safety of the vehicle. Compared with object detection based on laser radar, object detection on RGB images can operate in severe weather conditions such as fog, snow, and sandstorms, and its cost is low. However, object detection alone loses the relative position information of the scene, so accurate image semantic segmentation is very helpful for understanding traffic scenes. Semantic segmentation of elements such as lane marks, road arrows, vehicles, pedestrians, buildings, and sidewalks is a key technique for realizing autonomous driving and automatic high-precision map making. Since fully convolutional networks (FCN) [20] first applied deep learning to semantic segmentation, a series of deep learning networks have been proposed, including U-Net [27], SegNet [1], the DeepLab series [4,5,6,7], RefineNet [18], PSPNet [41], DeepMask [36], OCRNet [39], and HMS [32]. These methods indicate that a semantic segmentation network must obtain both sufficient semantic information and sufficient detailed information. Although these models achieve encouraging segmentation accuracy, they are too complex to meet real-time requirements.

For the semantic segmentation of road or street scenes, we do not want to spend too much inference time, and at the same time we want high accuracy, so as to meet real-time requirements such as autonomous driving and real-time high-precision map making. To satisfy real-time constraints or run on mobile devices, many efficient networks have been proposed. ENet [23] achieved a large speed improvement using a lightweight decoder and downsampling in the early stages of the network. ICNet [42] proposed an image cascade network for semantic segmentation, which combines semantic information from a low-resolution input with detailed information from a high-resolution input. MobileNetV2 [28] reduced the complexity of the overall model using depthwise separable convolutions. BiSeNetV2 [37] used two branches to obtain semantic information and detailed information, respectively. Although these networks adopt various strategies to capture semantic and detailed information, the problem of insufficient semantic or detailed information has not been completely solved. These works significantly reduce the latency and memory usage of segmentation models, but their limited accuracy restricts their real-world application.

To improve the accuracy of the model while maintaining its speed, we revisit dual-branch networks in this paper. To address the problem that real-time semantic segmentation networks cannot obtain both sufficient context information and sufficient detailed information, we propose a feature fusion module, along with several other improvements on this issue. In short, this paper proposes a network built on a multi-resolution feature fusion module, an auxiliary feature extraction module, an upsampling module, and a multi-ASPP (atrous spatial pyramid pooling) module. Compared with other lightweight real-time semantic segmentation networks, our network achieves the highest accuracy at a competitive speed. The main contributions of this paper are as follows:

  1. A novel multi-resolution feature fusion module that can be flexibly embedded into any network.

  2. An upsampling module that outperforms deconvolution and is very easy to use.

  3. A method of using multiple ASPP modules to improve network performance.

Related work

Since the design concept of real-time networks is very different from that of ordinary networks, we group the related works into two categories, i.e., high-performance semantic segmentation and real-time semantic segmentation.

High-accuracy semantic segmentation

Early semantic segmentation networks usually used the encoder–decoder architecture [1, 20, 27]. The encoder gradually expands its receptive field through convolution or pooling, and the decoder uses deconvolution or upsampling to recover the resolution from the encoder outputs. However, it is difficult to retain sufficient detail during the downsampling process of an encoder–decoder network. One strategy is to use dilated convolutions, which enlarge the field-of-view without reducing the spatial resolution. Based on this idea, the DeepLab series [4,5,6,7] achieved great improvements by using dilated convolutions with different dilation rates; however, DeepLab only applied this method after the last downsampling stage and did not explore it further. PSPNet [43] proposed a Pyramid Pooling Module (PPM) to process multi-scale context information. HRNet [34] adopts multiple paths with bilateral connections to learn and integrate features at different scales. Inspired by the self-attention mechanism [33] used in machine translation to capture context information, several meaningful works on semantic segmentation [10, 14, 34] introduced the non-local operation [35] into computer vision. A critical problem with the above methods is that they are very time-consuming at inference and cannot meet real-time requirements.

Fig. 1
figure 1

The network architecture proposed in this paper, M and N represent the length and width of the image respectively

Real-time semantic segmentation

Real-time semantic segmentation networks generally adopt one of two approaches: the encoder–decoder method and the two-pathway method.

SwiftNet [22] used a lightweight decoder, taking a low-resolution input to obtain semantic information and a high-resolution input to obtain detailed information. DFANet [16] used a lightweight backbone based on depthwise separable convolutions and reduced the input size to speed up inference. ShuffleSeg [11] used ShuffleNet [40] as its backbone; ShuffleNet combines channel shuffling with group convolution, which significantly reduces computing cost. However, these networks still adopt the encoder–decoder architecture: the encoder extracts contextual information with a deep network, and the decoder restores the resolution by interpolation or transposed convolution to complete the dense prediction. The encoder–decoder architecture loses some details during repeated downsampling, and this lost information cannot be completely recovered through upsampling, which impairs segmentation accuracy. With this consideration, BiSeNet [37, 38] proposed a two-branch network architecture that uses one pathway to extract semantic information and another, shallow pathway to extract detailed information as a supplement. Several subsequent works based on the two-branch architecture improved its representation capability or reduced its model complexity [12, 25, 26, 37]. However, the problem that the network cannot obtain sufficient context or detailed information still exists; these networks only weaken the information loss that occurs during feature extraction. To further alleviate this problem, this paper proposes a network based on a novel multi-resolution feature fusion module, an auxiliary feature extraction module, an upsampling module, and a multi-ASPP (atrous spatial pyramid pooling) module.

Method

As shown in Fig. 1, the network architecture proposed in this paper mainly comprises two branches: the detail branch and the semantic branch. The network modules whose names begin with “loss” are side-branch auxiliary feature extraction layers, which draw on the idea of the auxiliary classifier in Inception-v3 [31]; however, we made new discoveries that differ from that auxiliary classifier, described in detail in part 3.1. In the feature fusion stage, we adopt hierarchical cascade fusion based on the cascade feature fusion module (CFF), and the outputs of the CFF modules are further fused in cascade; the CFF module is described in detail in part 3.2. Another specific technique is the use of multiple ASPP modules in the feature extraction stage of the semantic branch, described in part 3.4. The upsampling module, named UCBA, is described in part 3.3.

The auxiliary feature extraction layer

Figure 2 shows the architecture of the auxiliary feature extraction layer used in our experiments. The parameters of the first convolution are: \(kernel\_size = 3\times 3\), \(out\_channels = 2N\), \(stride = 1\), \(padding = 1\); the parameters of the second convolution are: \(kernel\_size = 1\times 1\), \(out\_channels = C\), \(stride = 1\), \(padding = 0\). We then use an upsampling module to upsample the feature map from size \(H/n \times W/n\ (n=2, 4, 8, 16, 32)\) to size \(H \times W\). The upsampling module used here is UCBA, which is described in part 3.3. Finally, we use cross-entropy to compute the loss value. Compared with not using the auxiliary feature extraction layer, using it yields higher mIoU; the experimental results in Table 4 confirm this. While verifying the effect of this layer, we made a new discovery: the auxiliary feature extraction layer shows its advantage early in the training process, as shown in Fig. 3. Compared with the network without any auxiliary feature extraction layer, performance improves remarkably at the beginning of training.
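The tensor shapes flowing through this auxiliary head can be traced with a small sketch (pure Python; the helper name `aux_head_shapes` is ours, and any detail beyond the convolution parameters stated above is an assumption):

```python
def aux_head_shapes(H, W, N, C, n):
    """Trace (channels, height, width) through the auxiliary head.

    Input feature map: (N, H/n, W/n). A 3x3 conv with stride 1 and
    padding 1 keeps the spatial size; a 1x1 conv only changes channels;
    the UCBA module upsamples back to (H, W).
    """
    h, w = H // n, W // n
    shapes = [(N, h, w)]           # input feature from stage n
    shapes.append((2 * N, h, w))   # 3x3 conv, out_channels = 2N
    shapes.append((C, h, w))       # 1x1 conv, out_channels = C (classes)
    shapes.append((C, H, W))       # UCBA upsampling to full resolution
    return shapes
```

For example, with a \(512\times 512\) input, \(N=64\) channels, \(C=19\) classes, and \(n=8\), the head goes from (64, 64, 64) through (128, 64, 64) and (19, 64, 64) to (19, 512, 512), matching the cross-entropy loss computed at full resolution.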

Fig. 2
figure 2

The architecture of auxiliary feature extraction layer, N represents the number of channels, C represents the number of classes

Fig. 3
figure 3

Comparison of loss downward trend between using auxiliary feature extraction module and not using auxiliary feature extraction module

Fig. 4
figure 4

Cascade feature fusion module

It is well known that a side-branch auxiliary feature extraction layer has a regularization effect, as explained in detail in Inception-v3. More precisely, the regularization effect arises because the auxiliary layer increases the gradients of the parameters in the shallow layers, while the gradients of the deep-layer parameters are relatively reduced. Training a model essentially means fitting a function with a huge number of parameters to the real data. If this function depends more strongly on the shallow-layer parameters, adjusting those parameters faster speeds up the convergence of the network; if it depends more strongly on the deep-layer parameters, adjusting those faster speeds up convergence instead. Gradient descent adjusts each parameter of a deep learning network according to its gradient during training, and the auxiliary feature extraction layers change the gradients of the network's parameters, so they can control the relative adjustment speed of the shallow and deep layers and thereby influence convergence. The following conclusion can therefore be inferred: a side-branch auxiliary layer may either speed up or slow down the convergence of the network, depending on its location, its weight, and the training data. The experimental results of the Inception networks and the experiments in our study further confirm this.

Fig. 5
figure 5

Fusion mode of CFF module output

Fig. 6
figure 6

The architecture of UCBA upsampling module

Cascade feature fusion module

To obtain sufficient context information and sufficient detailed information, we designed a cascade feature fusion module, which we name CFF (Cascade Feature Fusion module). Figure 4 shows the architecture of the CFF module:

For the feature \(I_d\) with shape \(H\times W \times C\) generated by the detail branch (the detail branch in Fig. 1), the first down-sampling produces a feature \(D_1\) with shape \(H/2 \times W/2 \times C\), and a second down-sampling produces a feature \(D_2\) with shape \(H/4 \times W/4 \times C\). For the feature \(I_u\) with shape \(H/4 \times W/4 \times C\) generated by the semantic branch (the semantic branch in Fig. 1), the first upsampling produces a feature \(U_1\) with shape \(H/2 \times W/2 \times C\), and a second upsampling produces a feature \(U_2\) with shape \(H \times W \times C\). Using \(F_u\) to denote the upsampling function, the fused feature O can be expressed as Equation (1):

$$\begin{aligned} O=F_u(I_u+D_2) + F_u(U_1+D_1) + F_u(U_2+I_d) \end{aligned}$$
(1)

For the obtained feature O, we perform another average pooling operation to obtain the final feature fusion result. This architecture can not only fully integrate features of different sizes, but also expand the receptive field of features by downsampling, enabling the network to obtain richer contextual information.
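The fusion in Eq. (1) can be sketched for a single channel in pure Python. In this sketch, 2×2 average pooling stands in for the down-sampling, nearest-neighbour upsampling stands in for \(F_u\), and each fused term is brought to the full \(H \times W\) resolution (so no further upsampling is applied to the last term \(U_2+I_d\), which is already full size); the helper names and these interpretations are our assumptions, not the exact implementation:

```python
def avg_pool2(x):
    """2x2 average pooling: one downsampling step."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [[(x[2*i][2*j] + x[2*i][2*j+1] + x[2*i+1][2*j] + x[2*i+1][2*j+1]) / 4
             for j in range(w)] for i in range(h)]

def upsample2(x):
    """2x nearest-neighbour upsampling, standing in for F_u."""
    return [[x[i // 2][j // 2] for j in range(2 * len(x[0]))]
            for i in range(2 * len(x))]

def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(len(a[0]))] for i in range(len(a))]

def cff(I_d, I_u):
    """Single-channel CFF sketch: I_d is H x W, I_u is H/4 x W/4."""
    D1, U1 = avg_pool2(I_d), upsample2(I_u)      # both H/2 x W/2
    D2, U2 = avg_pool2(D1), upsample2(U1)        # H/4 x W/4 and H x W
    # Eq. (1): each fused pair is brought to full H x W resolution.
    O = add(upsample2(upsample2(add(I_u, D2))),  # H/4 -> H
            add(upsample2(add(U1, D1)),          # H/2 -> H
                add(U2, I_d)))                   # already H x W
    return O
```

Each output position thus mixes information from three scales before the final average pooling described below.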

We used the CFF module described above for feature fusion. As Fig. 1 shows, we use CFF modules to fuse the 1/32 and 1/8 features, the 1/16 and 1/4 features, and the 1/8 and 1/2 features, respectively, and denote the three fused features O1, O2, and O3. To fuse the features of the shallow layers with those of the deep layers, we fuse O1, O2, and O3 in series; the specific architecture is shown in Fig. 5.

The CFF module not only produces high-resolution features but also integrates information from low-resolution features into the high-resolution features, which reduces the excessive influence of noise contained in the high-resolution features on the result. It uses a parallel pooling layer to strengthen important information during down-sampling, and it integrates information from high-resolution features into low-resolution features before upsampling as a supplement. The CFF module can thus obtain richer context and detail information than general methods.

More effective upsampling module

Instead of using the common deconvolution [29] or a single bilinear interpolation as the upsampling module, we designed an upsampling module composed of bilinear interpolation, convolution, batch normalization, and ReLU, named UCBA, and verified its effectiveness through experiments. Figure 6 illustrates the architecture of this upsampling module:

Suppose we have a grayscale image with a shape of \(4\times 4\), we use a convolution kernel of \(3\times 3\) size to perform convolution operations on the image, and finally get a single-channel feature with a shape of \(2\times 2\). If we use matrix operations to represent this convolution operation process, the following operations will be performed:

Flatten the \(4\times 4\) grayscale image into a matrix X with a shape of \(16\times 1\).

$$\begin{aligned} X =[x_1,x_2, \ldots , x_{16}]^T \end{aligned}$$
(2)

We flatten the single-channel feature with the shape of \(2\times 2\) obtained by the convolution operation into a matrix Y with a shape of \(4\times 1\).

$$\begin{aligned} Y=[y_1,y_2,y_3,y_4]^T \end{aligned}$$
(3)

Suppose we use C to represent the convolution matrix, then the convolution operation can be expressed by the mathematical Equation (4):

$$\begin{aligned} Y=CX \end{aligned}$$
(4)

The matrix C is obtained by filling the nine parameters of the convolution kernel into the corresponding positions and setting all other entries to zero. Deconvolution is the inverse of the above matrix operation; the process of deconvolution can be expressed by the following Equation (5):

$$\begin{aligned} X=C^TY \end{aligned}$$
(5)

According to the above deconvolution calculation formula, we can get \(x_1\), we use \(C_1^T\) to represent the first row in the matrix \(C^T\), \(x_1\) can be expressed as Equation (6):

$$\begin{aligned} {x}_1=C_1^TY \end{aligned}$$
(6)

The other elements of the X matrix are calculated in the same way as \(x_1\). We can see that each element of the X matrix can depend on many elements of the Y matrix, so deconvolution establishes a global relationship; in other words, deconvolution has a global receptive field. Much theory and evidence indicate that a large receptive field benefits semantic information but harms detailed information. For the semantic segmentation task at hand, the result needs to contain more detailed information to obtain better segmentation; since a large receptive field does not capture detail well and deconvolution has a global receptive field, deconvolution is not well suited to upsampling in semantic segmentation. Therefore, the upsampling module in our network adopts the bilinear + convolution + batch normalization + ReLU structure. The subsequent experiments also prove the effectiveness of our upsampling method.
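The matrix argument above can be checked numerically. The sketch below (pure Python, an illustrative construction rather than any library's API) builds the convolution matrix C for a valid \(3\times 3\) convolution on a \(4\times 4\) single-channel image, computes \(Y = CX\), and forms the deconvolution estimate \(\hat{X} = C^TY\):

```python
def conv_matrix(kernel, in_size=4):
    """Build C such that Y = C @ X for a valid 3x3 convolution on an
    in_size x in_size image flattened row-major into X."""
    out = in_size - 2  # valid convolution with a 3x3 kernel
    C = [[0.0] * (in_size * in_size) for _ in range(out * out)]
    for i in range(out):
        for j in range(out):
            for a in range(3):
                for b in range(3):
                    C[i * out + j][(i + a) * in_size + (j + b)] = kernel[a][b]
    return C

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

# A 4x4 image flattened into X, and an all-ones 3x3 kernel.
X = [float(v) for v in range(1, 17)]
C = conv_matrix([[1.0] * 3 for _ in range(3)])
Y = matvec(C, X)                  # the convolution as a matrix product
X_hat = matvec(transpose(C), Y)   # "deconvolution": X_hat = C^T * Y
```

With the all-ones kernel, Y comes out as [54, 63, 90, 99]. Each element of \(\hat{X}\) mixes every output whose receptive field covers that pixel; for interior pixels this spans most of Y, which is the broad coupling the argument above refers to, whereas bilinear interpolation only mixes a fixed local neighbourhood.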

Improved ASPP module

The widely used ASPP module was proposed in the DeepLab networks; it uses atrous convolution to expand the receptive field of the convolution kernel without losing resolution. The following is the formula for the output size of an atrous convolution, with \(Input: (N, C_{in}, H_{in}, W_{in})\) and \(Output: (N, C_{out}, H_{out}, W_{out})\):

$$\begin{aligned}{} & {} H_1 = dilation[0]*(kernel\_size[0]-1) \end{aligned}$$
(7)
$$\begin{aligned}{} & {} H_2 = H_{in}+2*padding[0]-H_1-1 \end{aligned}$$
(8)
$$\begin{aligned}{} & {} H_{out} = \left\lfloor \frac{H_2}{stride[0]}\right\rfloor +1 \end{aligned}$$
(9)
$$\begin{aligned}{} & {} W_1 = dilation[1]*(kernel\_size[1]-1) \end{aligned}$$
(10)
$$\begin{aligned}{} & {} W_2 = W_{in}+2*padding[1]-W_{1}-1 \end{aligned}$$
(11)
$$\begin{aligned}{} & {} W_{out} = \left\lfloor \frac{W_2}{stride[1]}\right\rfloor +1 \end{aligned}$$
(12)
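These output sizes can be computed with a small helper (pure Python; it follows the standard dilated-convolution arithmetic used by common deep learning frameworks, where the dilated kernel has an effective extent of \(dilation \cdot (kernel\_size - 1) + 1\)):

```python
from math import floor

def atrous_out_size(in_size, kernel_size, stride=1, padding=0, dilation=1):
    """Output size of a dilated (atrous) convolution along one dimension."""
    effective = dilation * (kernel_size - 1) + 1  # span of the dilated kernel
    return floor((in_size + 2 * padding - effective) / stride) + 1
```

For example, a \(3\times 3\) kernel with dilation 2 and padding 2 preserves the spatial size (`atrous_out_size(65, 3, padding=2, dilation=2)` returns 65), which is what lets ASPP enlarge the receptive field without losing resolution.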

ASPP performs feature extraction by setting different atrous rates, generating convolution kernels with different receptive fields, and it also uses global average pooling to obtain a global receptive field. Intuitively, the purpose of fusing features of different resolutions is to fuse features of different receptive fields. To further integrate features of different receptive fields, we use atrous spatial pyramid pooling (ASPP) in the semantic branch, but more than once: ASPP modules are placed at different locations in the semantic branch.

Many works have shown that fusing features of different sizes or resolutions improves network performance, as in the DeepLab series, but the usual approach is to use the ASPP module only in the last layer of the feature extraction network. Although fusing features of different receptive fields or sizes improves performance, which layers should perform this fusion? No prior study answers this question, so we investigated it experimentally. We added an ASPP layer to each of the four stages of the semantic branch, with the dilation rates of ASPP set to [5, 11, 17], and added a shortcut connection that sums the original features and the features extracted by the ASPP layer. In this way, we fuse the large-receptive-field features extracted by the ASPP module with the relatively small-receptive-field features. Table 1 shows that performance is best when the ASPP module and shortcut connection are added in the last three stages of the semantic branch. Figure 7 shows how we embed the ASPP module:

Fig. 7
figure 7

ASPP module embedding. The meaning and location of S1, S2, S3, S4 in the model are shown in Fig. 1

Experiment

Datasets

We verified the effectiveness of our method on the public Cityscapes dataset [8], a semantic understanding image dataset of urban street scenes. It contains street scenes from 50 different cities and 5000 high-quality pixel-level annotated images of driving scenes in urban environments (2975 for training, 500 for validation, 1525 for testing; 19 categories in total). In addition, it provides 20,000 coarsely annotated images (gt coarse). Here we use only the 5000 finely labeled images to verify the effectiveness of our model: 2975 for training and 500 for validation. During training, we first scale each \(2048\times 1024\) image to a resolution of \(1024\times 512\) and then feed it into the network.

To further verify the effectiveness and wide applicability of the proposed network, we also conduct experiments on the Cambridge-driving Labeled Video Database (CamVid) [2] and the COCO-Stuff dataset [3]. CamVid is a dataset of road scenes from the perspective of autonomous vehicles. It contains 701 images with a resolution of \(960\times 720\); following other methods, we use 367 of them for training, 101 for validation, and 233 for testing, and use only 11 of the 32 candidate categories it provides to facilitate comparison with different methods. COCO-Stuff augments the popular COCO dataset with pixel-level labels; it contains 10K images, 91 thing classes, and 91 stuff classes. Following other methods, we use 9K of these images for training and 1K for testing.

Hyperparameters

We use stochastic gradient descent (SGD) with momentum 0.9 to train our model, with a batch size of 16. When training on the Cityscapes and CamVid datasets, we set the weight decay to 0.0005; when training on the COCO-Stuff dataset, we set it to 0.0001. We set the initial learning rate lr to 5e-2. The learning rate is decayed using the “poly” strategy during training, so the learning rate at each iteration is \((1-\frac{iter}{iters_{max}})^{power}*lr\). We train 20K iterations on each dataset separately. Inference speed is measured on an NVIDIA GeForce 1080Ti card.
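The schedule above can be written as a one-line helper (pure Python; the default power of 0.9 is the value commonly used with the “poly” strategy and is an assumption here, since the text does not state it):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """"Poly" learning-rate decay: lr = (1 - iter/max_iter)^power * base_lr."""
    return (1 - cur_iter / max_iter) ** power * base_lr
```

With `base_lr=5e-2` and 20K iterations, the rate starts at 0.05 and decays smoothly to 0 at iteration 20000.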

Ablation study

Table 1 reports the results on the Cityscapes validation data after adding ASPP layers to the network. The second to fifth columns indicate whether an ASPP module is used before extracting the 1/32, 1/16, 1/8, and 1/4 features of the semantic branch, respectively. As Table 1 shows, the network achieves the best performance when ASPP layers are added in the last three stages of the network.

Table 1 The impact of adding different numbers of ASPP layers on the validation data of the cityscapes dataset
Table 2 The impact of CFF module on the validation data of the cityscapes dataset
Table 3 The effect of UCBA layer on cityscapes val dataset

Table 2 shows the results on the Cityscapes validation data after replacing the feature fusion part of BiSeNetV2 with the CFF module. As can be seen from Table 2 below, the model with the CFF module achieves a total mIoU improvement of 3.76%.

Table 4 The effect of the model when the position and number of side branches differ. “Bisenetv2 backbone” denotes the semantic branch and detail branch of BiSeNetV2. The Bisenetv2 backbone with CFF obtained 73.26% mIoU; we use it as the baseline in this experiment

Table 3 shows the results on the Cityscapes validation data after using our UCBA module in the network. In this experiment, we set the “align_corners” parameter of the bilinear interpolation to true. As Table 3 shows, the model with the UCBA module achieves a total mIoU improvement of 0.2%.

Table 5 The results of different networks on the cityscapes validation dataset

Table 4 shows the results on the Cityscapes validation data after using the auxiliary feature extraction layer in the network. As Table 4 shows, the model with the auxiliary feature extraction layer achieves a total mIoU improvement of 4.47%; the network performs best when an auxiliary feature extraction layer is added at each stage of the semantic branch.

Comparison

Table 5 compares the results of our network and current mainstream semantic segmentation networks on the Cityscapes validation data. This result is achieved by the network based on the BiSeNetV2 backbone with the multi-resolution feature fusion module, the auxiliary feature extraction module, the upsampling module, and the multi-ASPP module proposed in this paper. During inference, the network outputs a single-channel image with a resolution of \(1024\times 512\); we resize it to a grayscale image with a resolution of \(2048\times 1024\) and compute the mIoU at that resolution.

Table 6 compares the results of our network and current mainstream semantic segmentation networks on the CamVid test data. This result is achieved by the network based on the BiSeNetV2 backbone with the modules proposed in this paper. During inference, the network outputs a single-channel image with a resolution of \(960\times 720\).

Table 6 The results of different networks on the CamVid test dataset
Table 7 The results of different networks on the COCO-Stuff test dataset
Fig. 8
figure 8

The left column is the original image, the second column is Ground Truth, the third column is the predictions of the network proposed in the paper, the fourth column is the predictions of Bisenetv2

Table 7 compares the results of our network and current mainstream semantic segmentation networks on the COCO-Stuff test data. This result is achieved by the network based on the BiSeNetV2 backbone with the modules proposed in this paper. During inference, the network outputs a single-channel image with a resolution of \(640\times 640\).

From the above results, we can see that the proposed network achieves 81.3% mIoU at 110 FPS on the Cityscapes validation set, 80.7% mIoU at 104 FPS on the CamVid test set, and 32.2% mIoU at 78 FPS on the COCO-Stuff test set, obtaining better results at a relatively fast speed. Figure 8 shows a visual comparison on the Cityscapes dataset.

Conclusion

In this paper, we conducted research and experiments on the side-branch auxiliary feature extraction module, the feature fusion module, the upsampling module, and the atrous spatial pyramid pooling module. We found that the role of the side branch goes beyond regularization: it may either slow down or accelerate convergence by influencing the gradients of different layers of the network, depending on the network parameters and the input data. We also proposed an upsampling module and a feature fusion module, and used the ASPP module to further optimize the network. We analyzed the effect of each improvement in detail through experiments and confirmed the effectiveness of the above methods. The network proposed in this paper achieves 81.3% mIoU at 110 FPS on the Cityscapes validation set, 80.7% mIoU at 104 FPS on the CamVid test set, and 32.2% mIoU at 78 FPS on the COCO-Stuff test set.