Introduction

Deep learning and related theories have been developing rapidly in recent years in research on transfer learning [1], object detection [2, 3], style transfer [4], nonlinear systems [5, 6] and so on. Semantic segmentation, one of the most fundamental tasks in the computer vision community, aims to assign a semantic class label to each pixel in a given image. It has been extensively studied and applied in a variety of fields, such as augmented reality [7, 8], autonomous driving [9, 10], medical images [11, 12], satellite imagery [13], video surveillance [14, 15] and so on. Many mobile terminal tasks have a great demand for segmentation speed, so real-time semantic segmentation [16] methods have emerged. Real-time semantic segmentation requires speed while ensuring accuracy, so the main challenge is to achieve the optimal balance between accuracy and efficiency. That is, it is urgent and important to build a real-time semantic segmentation method that achieves a good balance between the two.

In recent years, with the development of convolutional neural networks and the proposal of the fully convolutional network (FCN) [17], a series of real-time semantic segmentation methods [18,19,20] have been proposed. These methods have low latency and considerable segmentation accuracy. To capture both detail information and semantic information, two bilateral segmentation networks were proposed that construct a dual-branch architecture [21, 22]. One pathway is designed to capture spatial details, and the other pathway is introduced to extract categorical semantics. It is worth noting that the complete dual-branch design brings better segmentation accuracy but also more computational cost. Moreover, relying on upsampling alone is not enough to completely restore the lost spatial information, so it is necessary to introduce high-resolution feature maps.

Some typical real-time semantic segmentation networks have recently been proposed that choose lightweight backbones to enable real-time inference while using feature fusion or aggregation modules to compensate for the drop in accuracy [21, 23, 24]. However, the Short-Term Dense Concatenate (STDC) network (STDCNet) [25] argues that these lightweight backbones borrowed from image classification may not be the best choice for image segmentation, and it uses high-resolution feature maps as an auxiliary, as shown in Fig. 1a. Besides, real-time semantic segmentation demands efficient inference. In fact, two types of methods can be used to promote inference speed: (i) restricting the input image size, since a smaller input results in less computational cost with the same network architecture; (ii) channel pruning, since pruning channels in early stages can improve inference speed. Although these two manners can improve inference speed to a certain extent, they may also lead to a decrease in accuracy. To tackle this problem, as shown in Fig. 1b, BiSeNet [21] adopts a dual-branch architecture to fuse low-level details and high-level semantics. However, a complete dual branch may be time-consuming, and the auxiliary path is not effective enough due to the loss of detail information guidance. The method we propose differs from the methods in Fig. 1a, b. The method in Fig. 1a guides the fusion of semantic and spatial information by extracting spatial information from a low-level stage of the semantic branch. Compared with Fig. 1a, our method has an additional short-term spatial detail branch, which allows the network to extract richer spatial detail features. Compared with Fig. 1b, our method reduces the redundant parts of the spatial detail branch and transmits information between the two branches, so that spatial and semantic information are fused in a stepwise manner, which makes the fusion more effective.

Fig. 1
figure 1

Illustration of the architectures of STDCNet, BiSeNet and our proposed approach. a Short-Term Dense Concatenate network (STDCNet), which uses a Detail Guidance module to encode spatial information at one low-level feature stage. b Bilateral Segmentation Network (BiSeNet), which uses an extra Spatial Path to encode spatial information. c Our proposed method, which uses an interactive module to fuse high-level semantic information with low-level detail features

Inspired by the Short-Term Dense Concatenate module (STDC module [25]), we propose a real-time dual-branch architecture named CIDNet. As illustrated in Fig. 2, CIDNet adopts an encoder–decoder architecture. The compact interactive dual branch not only maintains the high resolution of the features but also ensures a certain speed. The last and most critical step is the fast and efficient integration of the semantic information and spatial details of the two branches. We use pyramid pooling to further enlarge the high-level semantics, while using the idea of self-attention to fuse the two branches quickly and effectively.

Our main contributions are summarized as follows:

  1.

    We propose an efficient and effective two-pathway architecture, termed the Compact Interactive Dual-Branch network, for real-time semantic segmentation. Due to its special structure, it saves a certain number of parameters and computation, and it is faster and more accurate than common two-pathway networks.

  2.

    We propose a Detail-Semantic Interactive Module (DSIM) to reduce the loss of two branches during fusion. It enables effective interaction between the semantic branch and the detail branch. Meanwhile, it can effectively guide the fusion of semantic and spatial information between high resolution and low resolution.

  3.

    We propose a Dual-Branch Contextual Attention Fusion Module (DBCAFM). It combines pyramid pooling and self-attention mechanism to integrate semantic branch and spatial branch deeply.

  4.

    We conduct extensive experiments to investigate the effectiveness of our method. Our method achieves impressive results on the Cityscapes and CamVid benchmarks. Specifically, our CID1-50 achieves 75.1% mIoU on the Cityscapes val set at a speed of 164.1 FPS on a Tesla V100 card. Under the same experimental setting, our CID2-75 achieves 77.7% mIoU at a speed of 92.9 FPS. At the same input image resolution, our models have 1.8 M and 5.2 M fewer parameters than STDCNet, respectively.

Related work

BiSeNetV2 [22] consists of a Detail Branch and a Semantic Branch, which are merged by an Aggregation Layer (BGA Module). Each layer of the detail branch consists of a convolution layer followed by batch normalization and an activation function; the semantic branch consists of several Inverted Bottleneck and Gather-and-Expansion Layers, which are inspired by lightweight recognition models, e.g., Xception [26], MobileNet [27] and ShuffleNet [28]. However, the complete dual-branch architecture may be inefficient due to repeated processing in the initial stage. STDCNet [25] is a single-stream method whose backbone consists of Short-Term Dense Concatenate modules (STDC Modules), but the number of output channels of each stage may be redundant.

In this section, our discussion mainly focuses on the three groups of methods most relevant to our work, i.e., generic semantic segmentation methods, real-time semantic segmentation methods, and feature fusion modules.

Generic semantic segmentation

After traditional segmentation methods, e.g., threshold selection [29], region growing [30], super-pixels [31,32,33] and graph algorithms [34, 35], many CNN-based methods have been proposed [36]. Recently, with the development of computer hardware and deep learning, semantic segmentation has made remarkable leaps forward. A series of methods based on FCN [17] keep improving state-of-the-art performance on various benchmarks. DeepLabv3 [19] abandons CRF [37] post-processing and devises an atrous spatial pyramid pooling module to capture multi-scale context. DeepLabv3plus [38] introduces a decoder to fuse upsampled feature maps with low-level feature maps. SegNet [39] utilizes the indices of the max-pooling operations in the encoder for the upsampling operations in the decoder. PSPNet [40] adopts a pyramid pooling module on the dilation backbone to capture local and global context information. Both the dilation backbone and the encoder–decoder structure are mainstream semantic segmentation architectures. Meanwhile, some methods introduce attention mechanisms to capture long-range dependencies, e.g., OCNet [41] and DFANet [24] use self-attention, PSANet [42] uses spatial attention and EncNet [20] uses channel attention. In this paper, we propose a novel and efficient architecture that achieves a good trade-off between speed and accuracy.

Real-time semantic segmentation

Many scene parsing tasks require real-time inference, so real-time semantic segmentation algorithms attract increasing attention. In this situation, there are three mainstream approaches to efficient and effective semantic segmentation, and most of them adopt a lightweight backbone. (i) Encoder–decoder architecture. The encoder–decoder structure is a good paradigm for effectively utilizing the multi-level image features extracted by the backbone network. The encoder usually uses a deep network to extract contextual information through convolution and downsampling operations. The decoder gradually recovers the resolution and fuses the multi-level feature maps extracted by the backbone network to guarantee dense predictions. STDCNet [25] adopts a pretrained STDC network as the encoder and uses the context path of BiSeNetV1 [21] to encode context information; meanwhile, the decoder fuses the learned detail features and the context features. SwiftNet [43] uses lightweight lateral connections to assist with upsampling. (ii) Multi-branch architecture. The encoder–decoder architecture saves computational cost to a certain extent, but the long-range dependency information lost during multiple downsampling operations cannot be recovered by upsampling or deconvolution, which whittles down segmentation accuracy. To alleviate this problem, the multi-branch architecture was proposed. BiSeNetV1 [21] and BiSeNetV2 [22] propose dual-branch networks whose two paths extract spatial information and context information, respectively. ICNet [44] fuses multi-scale feature maps in a cascaded manner to achieve a good speed-accuracy trade-off. DDRNet [45] uses a deep bilateral network and multiple bilateral fusions to improve segmentation efficiency. (iii) Lightweight encoder. Some semantic segmentation networks use encoders from the image classification task as the backbone, but these are not specifically designed for segmentation, so their channels may be redundant or insufficient. MobileNetV1 [27] uses depthwise separable convolutions to reduce parameters and computation. MobileNetV2 [46] introduces a bottleneck structure with very few channels. MobileNetV3 [47] combines network architecture search (NAS) and the NetAdapt algorithm, and applies Squeeze-and-Excite (SE) in the residual layers. ShuffleNet [28] utilizes the compactness of grouped convolutions and proposes a channel shuffle operation to stimulate information fusion between different groups.

Feature fusion module

Feature fusion is an important and common operation in semantic segmentation networks, which can strengthen the representation of the features extracted by the encoder. The basic fusion operation is element-wise summation or concatenation. BiSeNetV1 [21] uses the Attention Refinement Module (ARM) to facilitate multi-scale fusion in the context path, and the Feature Fusion Module (FFM) fuses the features of the spatial path and the context path. Although STDCNet [25] uses the same fusion method as BiSeNetV1 [21], it has only a detail path, and the spatial path is one of its stages. In BiSeNetV2 [22], the Bilateral Guided Aggregation (BGA) Layer fuses the complementary information from the detail branch (low-level) and the semantic branch (high-level). DFANet [24] fuses features in sub-network aggregation and sub-stage aggregation ways. DDRNet [45] has two parallel deep branches with different resolutions, and features are fused through multiple bilateral fusion operations.

Our proposed method

In this section, we first introduce the architecture of our proposed Compact Interactive Dual-Branch Network (CIDNet). We then describe our proposed Detail-Semantic Interactive Module (DSIM), which reduces the loss between the two branches of CIDNet during fusion, and further present the Dual-Branch Contextual Attention Fusion Module (DBCAFM) used to fuse the two branches for semantic segmentation.

The architecture of compact interactive dual-branch network (CIDNet)

Because the detail branch is time-consuming, the current complete dual-branch network has a large computational cost and is not easily trained [22]. In this work, we propose a novel Compact Interactive Dual-branch Network (CIDNet). The architecture of our proposed CIDNet consists of a compact interactive detail branch and a semantic branch, as shown in Fig. 2. The compact interactive detail branch shares the first three stages with the backbone network and branches off from the semantic branch at the S3 stage. Our compact interactive detail branch has two stages (D4 and D5), and each stage has two layers. Each layer of the detail branch is a convolution layer followed by batch normalization and an activation function. Each convolution layer has stride \(s=1\), which maintains high resolution. High resolution and high channel capacity enable the detail branch to encode rich spatial location information.

Note that an appropriate resolution is important for segmentation, because the downsampling operation changes the relative positions between pixels. That is, maintaining a certain resolution when extracting position information leads to a better segmentation effect. In fact, if a larger resolution is selected for the detail branch, the amount of computation increases sharply and the inference speed decreases; if a smaller resolution is selected, the ability to keep position information is lost. As in many segmentation networks (e.g., [21, 22, 25]), 1/8 of the original input is a good choice of resolution, and wide channels can retain more location detail information. On the other hand, to achieve a trade-off between inference speed and efficiency, we follow the BiSeNetV2 [22] philosophy of using wide channel dimensions and shallow layers. We choose 1/8 of the original input image resolution and 128 channels for the detail branch layers. Meanwhile, it should also be noted that our constructed detail branch differs from that of BiSeNetV2 [22] because our detail branch is short-term.
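To make the structure of one detail-branch stage concrete, the following minimal sketch builds a stage as two stride-1 Conv-BN-ReLU layers with 128 channels, so the 1/8 input resolution is preserved. It is written in PyTorch purely for illustration (the paper's implementation uses PaddlePaddle); the 3×3 kernel size and layer ordering are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """3x3 convolution followed by batch normalization and ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

class DetailStage(nn.Module):
    """One stage of the compact detail branch: two stride-1 layers,
    so the 1/8 input resolution and 128-channel width are preserved."""
    def __init__(self, channels=128):
        super().__init__()
        self.layer1 = ConvBNReLU(channels, channels, stride=1)
        self.layer2 = ConvBNReLU(channels, channels, stride=1)

    def forward(self, x):
        return self.layer2(self.layer1(x))

# Example: a 1/8-resolution feature map of a 512x1024 input
feat = torch.randn(1, 128, 64, 128)
print(DetailStage()(feat).shape)  # torch.Size([1, 128, 64, 128])
```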

Fig. 2
figure 2

Overview of the compact interactive dual-branch network. The CID network contains a compact interactive detail branch (the blue cubes) and a semantic branch (the green cubes). DSIM denotes the Detail-Semantic Interactive Module. DBCAFM denotes the Dual-Branch Contextual Attention Fusion Module. The number below each cube is the ratio of the feature map size to the input resolution. In addition, in the accelerated training part, we design three auxiliary segmentation heads to improve segmentation performance without additional inference cost. The auxiliary segmentation heads include two CrossEntropyLoss terms and one DetailAggregateLoss [25]. Seg Head denotes the segmentation head, and Detail Head denotes the Detail head [25]

Fig. 3
figure 3

The GSTDC module and CatBlock in the semantic branch. a The CatBlock, which adopts a keep-resolution strategy; this block can enlarge the receptive field. b The Short-Term Dense Concatenate module with groups (GSTDC module) used in our network. c The GSTDC module with stride = 2. Notation: Conv is the convolutional operation. BN is batch normalization. ReLU is the ReLU activation function. G denotes group convolution. AVGPool is average pooling. S denotes the stride. DWConvBN is a depthwise convolution operation with batch normalization. Concat means concatenation. “+” represents element-wise addition. Meanwhile, \(1\times 1\), \(3\times 3\) denote the kernel size, and \(\textit{H}\times \textit{W}\times \textit{C}\) means the tensor shape (height, width, depth)

The semantic branch consists of six stages: the stem is composed of stages S1–S2, stages S3–S5 are stacked from group short-term dense concatenate modules (GSTDC modules), and finally stage S6 is composed of two \(1\times 1\) convolutions and one \(3\times 3\) group convolution. To alleviate the channel redundancy problem and reduce the number of parameters, we construct a Group Short-Term Dense Concatenate module (GSTDC module). It is an extension of the Short-Term Dense Concatenate module (STDC module) [25]. As in previous work [22, 45, 48], the numbers of channels in the backbone network are selected as [32, 64, 128, 256, 512]. In our GSTDC module, we use stepped group convolutions instead of the standard \(3\times 3\) convolutions, add residual connections, and halve the number of input channels.

For clarity, we plot the CatBlock, GSTDC module and STDC module in Fig. 3. Each GSTDC module has four blocks and a fusion layer, as in the STDC module. The first block has one convolutional layer with kernel size \(1\times 1\), one batch normalization layer and a ReLU activation function. The last three blocks are group convolutions with kernel size \(3\times 3\) and [4, 2, 1] groups, respectively, each with one batch normalization layer and a ReLU activation function. The output channels of these four blocks are 1/2, 1/4, 1/8 and 1/8 of the module output channels, respectively. The fusion layer fuses the outputs of the previous four blocks by concatenation. Finally, the input is added to the concatenation result, following the idea of skip connections. It can be seen from Fig. 3b, c that, different from the STDC module, our two GSTDC modules adopt residual connections, and we use depthwise convolution in the downsampling module, so that the numbers of channels can be matched and the diversity between channels can be retained as much as possible.
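The following sketch illustrates the stride-1 GSTDC module as described above: a 1×1 block followed by three grouped 3×3 blocks with groups [4, 2, 1], whose outputs (1/2, 1/4, 1/8, 1/8 of the module width) are concatenated and added to the input. It is written in PyTorch for illustration, assumes equal input and output channel counts for the residual addition, and omits the stride-2 downsampling variant with depthwise convolution (Fig. 3c).

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, groups=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride=1, padding=k // 2,
                              groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

class GSTDCModule(nn.Module):
    """Stride-1 GSTDC sketch: a 1x1 block followed by three grouped 3x3
    blocks (groups 4, 2, 1) producing 1/2, 1/4, 1/8, 1/8 of the output
    channels, concatenated and added to the input (residual)."""
    def __init__(self, channels):
        super().__init__()
        c = channels
        self.block1 = ConvBNReLU(c, c // 2, k=1)
        self.block2 = ConvBNReLU(c // 2, c // 4, k=3, groups=4)
        self.block3 = ConvBNReLU(c // 4, c // 8, k=3, groups=2)
        self.block4 = ConvBNReLU(c // 8, c // 8, k=3, groups=1)

    def forward(self, x):
        y1 = self.block1(x)
        y2 = self.block2(y1)
        y3 = self.block3(y2)
        y4 = self.block4(y3)
        fused = torch.cat([y1, y2, y3, y4], dim=1)  # back to `channels`
        return fused + x                            # residual connection

x = torch.randn(1, 128, 64, 128)
print(GSTDCModule(128)(x).shape)  # torch.Size([1, 128, 64, 128])
```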

Note that, to improve inference speed and make the network more lightweight, our GSTDC module uses half of the output channels of the STDC module. Table 1 shows the detailed structure of our compact interactive dual branch. In the first two stages, we adopt a wide number of channels to maintain rich detail information, use max pooling for fast downsampling, and then use a CatBlock to expand the receptive field. In stages S3–S5, we only stack GSTDC basic blocks. The S6 stage adopts a bottleneck with a flexible structure, which can flexibly adjust the output channels and decide whether to continue downsampling the feature maps; we call the S6 stage the Context Expanding Module.

In the following subsection, we describe our proposed detail-semantic interactive module, which reduces the loss between the two branches during fusion.

Table 1 Instantiation of the compact interactive detail branch and the semantics branch

Detail-semantic interactive module

For dual-branch networks, it is important to effectively fuse the feature information extracted from the two branches. Current complete dual-branch structures [21, 22] have no interaction between the two paths, and thus their accuracy is not satisfactory. Although there is also a dual-branch structure with interaction [45], its fusion method directly adds the feature maps from the two sides. This fusion method may suffer from “fusion loss”, because the semantic branch contains rich and accurate semantic information, while the feature map from the detail branch may not match the information at the corresponding position. To alleviate these problems, in this work we propose a detail-semantic interactive module to guide the information exchange between the two branches. In this way, the resulting two branches share certain common features at different scales. Our detail-semantic interactive module contains two paths: (i) high-resolution maps in the detail branch are integrated into low-resolution feature maps in the semantic branch (the high-low path); (ii) low-resolution maps in the semantic branch are integrated into high-resolution feature maps in the detail branch (the low-high path).

We next describe the detailed structure of the detail-semantic interactive module, shown in Fig. 4. The high-resolution feature map retains a lot of position and detail information. In the high-low path, we adopt fast downsampling with channel expansion, and directly fuse the downsampled feature map with the low-resolution feature map by addition. In the low-high path, we use the attention guided strategy (AGS): the low-resolution features contain abundant semantic information, and to effectively integrate them with the high-resolution feature maps we draw on the self-attention mechanism. We first upsample the low-resolution feature map by bilinear interpolation. Then, the upsampled feature map is multiplied element-wise with the high-resolution feature map, and the result is summed along the channel dimension. The summed map is fed into a sigmoid function, yielding the relative attention mask \(\alpha \). After obtaining the attention mask, we perform pixel-level products between the mask and the two feature maps, and then perform a pixel-level sum between them to obtain the final result.

We can formulate the above procedure as follow:

$$\begin{aligned} {\left\{ \begin{array}{ll} F'_{\textrm{s}} = Downsample(F_{d})+F_{\textrm{s}},\\ F'_{d} = AGS(F_{d}, F_{\textrm{s}})+F_{d}, \end{array}\right. } \end{aligned}$$
(1)

where AGS is the attention guided strategy, \(F_{\textrm{s}}\) is the output of the previous stage of the semantic branch, and \(F'_{\textrm{s}}\) is the input to the next stage of the semantic branch. \(F_d\) is the output of the previous stage of the compact interactive detail branch, and \(F'_d\) is the input to the next stage of the compact interactive detail branch. In short, the procedure of the attention guided strategy (AGS) can be formulated as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} F_{M} = Multiply(F_{d}, { Upsample}(F_{\textrm{s}})), \\ F_{\textrm{sum}} = Sum(F_{M}), \\ \alpha = { Sigmoid}(F_{\textrm{sum}}), \\ F_{AGS} = F_{d}\cdot (1-\alpha )+F_{\textrm{s}}\cdot \alpha ,\\ F'_{d} = F_{AGS}+F_{d}, \end{array}\right. } \end{aligned}$$
(2)

where Multiply is the element-wise multiplication operation, Sum is the channel-wise summation operation, Upsample is the standard bilinear interpolation operation, and \(F_{AGS}\) is the output of the AGS. \(F_d\) is the output of the compact interactive detail branch, and \(\alpha \) is the relative attention mask computed from \(F_d\) and \(F_{\textrm{s}}\).
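A minimal sketch of Eqs. (1) and (2) is given below, written in PyTorch for illustration. The 1×1 projection that matches the semantic-branch channels to the detail-branch channels and the stride of the downsampling convolution are assumptions; the paper's exact projection layers are given in Fig. 4 rather than in the equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSIM(nn.Module):
    """Sketch of the Detail-Semantic Interactive Module (Eqs. 1-2).
    The channel-matching projections are assumptions."""
    def __init__(self, detail_ch, semantic_ch, down_factor=2):
        super().__init__()
        # high-low path: strided 3x3 conv downsamples the detail features
        # and expands them to the semantic width
        self.down = nn.Sequential(
            nn.Conv2d(detail_ch, semantic_ch, 3, stride=down_factor,
                      padding=1, bias=False),
            nn.BatchNorm2d(semantic_ch))
        # low-high path: 1x1 conv shrinks semantic features to the detail width
        self.proj = nn.Sequential(
            nn.Conv2d(semantic_ch, detail_ch, 1, bias=False),
            nn.BatchNorm2d(detail_ch))

    def forward(self, f_d, f_s):
        # Eq. (1), high-low path
        f_s_next = self.down(f_d) + f_s
        # Eq. (2), attention guided strategy (low-high path)
        f_s_up = F.interpolate(self.proj(f_s), size=f_d.shape[2:],
                               mode='bilinear', align_corners=False)
        f_m = f_d * f_s_up                                    # element-wise multiply
        alpha = torch.sigmoid(f_m.sum(dim=1, keepdim=True))   # channel-wise sum
        f_ags = f_d * (1 - alpha) + f_s_up * alpha
        f_d_next = f_ags + f_d
        return f_d_next, f_s_next

f_d = torch.randn(1, 128, 64, 128)   # detail features at 1/8 resolution
f_s = torch.randn(1, 256, 32, 64)    # semantic features at 1/16 resolution
d, s = DSIM(128, 256)(f_d, f_s)
print(d.shape, s.shape)  # (1, 128, 64, 128) (1, 256, 32, 64)
```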

Fig. 4
figure 4

Detailed design of the detail-semantic interactive module. Notation: Conv is the convolutional operation. BN is batch normalization. Sigmoid is the Sigmoid activation function. sum denotes the channel-wise summation operation. unsqueeze denotes inserting a dimension of size 1 at the specified location. \(\alpha \) denotes the weight output of the Sigmoid function. Up is the standard bilinear interpolation operation. “+” represents element-wise addition. “\(\times \)” represents the element-wise product. Meanwhile, \(1\times 1\), \(3\times 3\) denote the kernel size. “\(\times n\)” represents the number of repetitions, where “n” is determined by the feature map downsampling size

Dual-branch contextual attention fusion module

A large receptive field can capture higher-level semantic information. Large convolution kernels and dilated convolutions are often used to expand the receptive field, but these operations bring large delays and high computational complexity. Pyramid pooling has been shown to be an effective method for extracting high-level semantics [19, 43] for semantic segmentation [45]. Furthermore, the Deep Aggregation Pyramid Pooling Module (DAPPM) was proposed and effectively used in real-time semantic segmentation [45], but its structure is somewhat complex. In this work, we propose a simplified DAPPM to extract contextual information more efficiently from the low-resolution feature maps of the semantic branch before fusing the two branches.

Due to the different sampling methods of the compact interactive detail branch and the semantic branch, the two branches have different depths and widths. To better integrate the high-level semantic features extracted by the semantic branch and the spatial details extracted by the detail branch, we propose a Dual-Branch Contextual Attention Fusion Module (DBCAFM), which is inspired by the attention fusion module of AttaNet [49].

Fig. 5
figure 5

The detailed architecture of the dual-branch contextual attention fusion module. Notation: Conv is the convolutional operation, and \(1 \times 1\), \(3 \times 3\) denote the kernel size. Pool is the average-pooling operation, and \(5 \times 5\), \(9 \times 9\), \(17 \times 17\), \(H \times W\) denote the pooling kernel sizes (“H” indicates the height of the input feature and “W” indicates the width of the input feature). Up is the bilinear upsampling operation. “+” represents element-wise addition

For clarity, we plot the detailed structure of the dual-branch contextual attention fusion module in Fig. 5. Let \(F_d\) and \(F_{\textrm{s}}\) denote the output feature maps of the compact detail branch and the semantic branch, respectively. The input feature map \(F_{\textrm{s}}\) is at 1/32 of the image resolution and is first processed by pyramid pooling. In the pyramid pooling procedure, we use different large pooling kernels to generate feature maps: there are four average-pooling operations with kernel sizes \(5\times 5\), \(9\times 9\), \(17\times 17\) and \(H \times W\) (“H” indicates the height of the input feature and “W” indicates the width of the input feature), respectively. Afterwards, each output feature is followed by a convolution with kernel size \(1\times 1\) and batch normalization. All four features are then upsampled to the same resolution as the input. Finally, we obtain the high-level semantics \(F_{\textrm{sp}}\) by fusing these four features with one feature without pooling (which only has a \(1\times 1\) convolution). The output \(F_{\textrm{sp}}\) of the pyramid pooling is fed into a \(3\times 3\) convolution layer and added to \(F_{\textrm{s}}\) (after adjusting the channels), yielding the pyramid pooling result \(F_{\textrm{p}}\). We then upsample \(F_{\textrm{p}}\) to the same size as \(F_d\), concatenate it with \(F_d\), and feed the result into the attention layer. The attention layer consists of a convolution layer (kernel size \(1\times 1\) with BN and ReLU activation), a global average pooling layer (the pooling kernel size equals the feature map size), a convolution layer (kernel size \(1\times 1\) with BN), and a sigmoid activation function. The attention layer outputs a mask \(\alpha \); the attention map obtained by multiplying \(\alpha \) with \(F_{\textrm{p}}\) is added to \(F_{\textrm{p}}\) to get \(F_{\textrm{p}}'\). Similarly, (1-\(\alpha \)) is multiplied with \(F_d\) and the result is added to \(F_d\) to get \(F_d'\). Finally, \(F_{\textrm{p}}'\) and \(F_d'\) are fused by element-wise addition. This operation is defined as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} F_{\textrm{p}} = Pyramid Pooling(F_{\textrm{s}}), \\ F_{\textrm{output}} = Sum[Upsample(F_{\textrm{p}}\cdot \alpha +F_{\textrm{p}}), F_{d}\cdot (1-\alpha )+F_{d}], \end{array}\right. } \nonumber \\ \end{aligned}$$
(3)

where \(F_{\textrm{output}}\) denotes the output of the dual-branch contextual attention fusion module, Sum is the element-wise addition operation, Upsample is the standard bilinear interpolation operation, \(F_{\textrm{s}}\) is the output of the semantic branch, \(F_d\) is the output of the compact interactive detail branch, and \(F_{\textrm{p}}\) is the output of the pyramid pooling. \(\alpha \) is the relative attention mask computed from \(F_{\textrm{p}}\) and \(F_d\).
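The following sketch assembles the DBCAFM as described above and in Eq. (3): a simplified pyramid pooling over \(F_{\textrm{s}}\), a channel-adjusted residual to form \(F_{\textrm{p}}\), and an attention layer on the concatenated branches producing the mask \(\alpha \). It is written in PyTorch for illustration; the intermediate channel widths, the pooling strides and the exact placement of BN/ReLU are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(in_ch, out_ch, k=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch))

class DBCAFM(nn.Module):
    """Sketch of the Dual-Branch Contextual Attention Fusion Module."""
    def __init__(self, detail_ch, semantic_ch, mid_ch=128):
        super().__init__()
        # pyramid pooling: one branch without pooling plus four pooled branches
        self.scales = nn.ModuleList([conv_bn(semantic_ch, mid_ch) for _ in range(5)])
        self.pool_kernels = [5, 9, 17, None]          # None = global pooling
        self.post = conv_bn(mid_ch, mid_ch, k=3)
        self.compress = conv_bn(semantic_ch, mid_ch)  # channel-adjusted F_s
        self.proj = conv_bn(mid_ch, detail_ch)        # match F_p to F_d channels
        # attention layer: 1x1 Conv-BN-ReLU, global pooling, 1x1 Conv-BN, sigmoid
        self.attn = nn.Sequential(
            nn.Conv2d(2 * detail_ch, detail_ch, 1, bias=False),
            nn.BatchNorm2d(detail_ch), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(detail_ch, detail_ch, 1, bias=False),
            nn.BatchNorm2d(detail_ch), nn.Sigmoid())

    def forward(self, f_d, f_s):
        h, w = f_s.shape[2:]
        feats = [self.scales[0](f_s)]                 # branch without pooling
        for conv, k in zip(self.scales[1:], self.pool_kernels):
            pooled = F.adaptive_avg_pool2d(f_s, 1) if k is None else \
                F.avg_pool2d(f_s, kernel_size=k, stride=k // 2, padding=k // 2)
            feats.append(F.interpolate(conv(pooled), size=(h, w),
                                       mode='bilinear', align_corners=False))
        f_sp = sum(feats)                             # high-level semantics F_sp
        f_p = self.post(f_sp) + self.compress(f_s)    # pyramid pooling result F_p
        f_p = F.interpolate(self.proj(f_p), size=f_d.shape[2:],
                            mode='bilinear', align_corners=False)
        alpha = self.attn(torch.cat([f_p, f_d], dim=1))
        return (f_p * alpha + f_p) + (f_d * (1 - alpha) + f_d)   # Eq. (3)

m = DBCAFM(detail_ch=128, semantic_ch=512).eval()    # eval: BN uses running stats
with torch.no_grad():
    out = m(torch.randn(1, 128, 64, 128), torch.randn(1, 512, 16, 32))
print(out.shape)  # torch.Size([1, 128, 64, 128])
```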

Experiments

In this section, we test our proposed CIDNet on two datasets, Cityscapes [50] and CamVid [51], and compare it with 20 representative methods. First, we introduce the datasets and implementation details. We then conduct comparative tests on our backbone network to verify its effectiveness. Next, we investigate the validity of the compact interactive dual-branch structure and the effect of each module of our proposed method on the Cityscapes validation set with ablation studies. Finally, we report our final accuracy and speed (FPS) results on the different benchmarks by comparing with the other algorithms.

Datasets and evaluation metrics

Cityscapes

The Cityscapes [50] dataset focuses on semantic understanding of urban street scenes, which are road scenes taken from the perspective of a car, and it is one of the most commonly used datasets for segmentation tasks. This dataset provides 5000 images with high-quality dense pixel annotations, split into training, validation and test sets with 2975, 500 and 1525 images, respectively, which we use in our experiments to verify the effectiveness of our method. The annotated images include 30 classes, 19 of which are used for the semantic segmentation task, and the resolution of the images is as high as \(2048\times 1024\). The high resolution of this dataset poses a great challenge to real-time semantic segmentation methods.

CamVid

The Cambridge-driving Labeled Video Database (CamVid) [51] is a small-scale road scene segmentation dataset. It is similar to the Cityscapes dataset, but smaller in scale and resolution. There are 701 densely annotated frames extracted from video sequences, with 367 images for training, 101 for validation and 233 for testing. All images have the same \(960 \times 720\) resolution and 32 semantic categories, of which a subset of 11 classes is used for our segmentation experiments. To increase the number of training samples, we merge the training and validation sets for training and evaluate our method on the test set.

Evaluation metrics

On all the datasets we use, we adopt the standard metrics of mean intersection over union (mIoU) and frames per second (FPS) for evaluation. The mIoU is defined as the ratio of the intersection to the union of the prediction and the ground truth, summed over classes and then averaged. The FPS is defined as the number of image frames processed by the model per second. Let i denote the true class, j denote the predicted class, and \(p_{ij}\) denote the number of pixels of class i predicted as class j; mIoU can be expressed as:

$$\begin{aligned} mIoU = \frac{1}{k+1}\sum _{i=0}^{k}\frac{p_{ii}}{\sum _{j=0}^{k}p_{ij}+\sum _{j=0}^{k}p_{ji}-p_{ii}}. \end{aligned}$$
(4)
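As a concrete illustration of Eq. (4), the following Python snippet computes mIoU from flattened label arrays via a confusion matrix; ignoring absent classes is handled here by a simple clamp of the denominator, which is a simplification rather than the paper's evaluation script.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU following Eq. (4): IoU_i = p_ii / (sum_j p_ij + sum_j p_ji - p_ii),
    averaged over all classes."""
    # confusion matrix: rows = ground truth i, columns = prediction j
    conf = np.bincount(gt * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    denom = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = tp / np.maximum(denom, 1)   # avoid division by zero for absent classes
    return iou.mean()

# toy example with 3 classes
gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, gt, 3))  # 0.5
```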

Implementation details

Training settings

We choose the Adam algorithm as the optimizer, while utilizing the “polynomial decay” learning rate scheduler and a warm-up strategy. Since we only use a single GPU card, we use different batch sizes when training images of different resolutions. For the Cityscapes dataset, resolutions of \(768\times 1536\) and \(512\times 1024\) correspond to batch sizes of 10 and 24, respectively. The maximum number of iterations is 140,000, the initial learning rate is 0.005, and we apply the warm-up strategy in the first 2000 iterations. For the CamVid dataset, the batch size is 24 and the input resolution is \(720\times 960\). The maximum number of iterations is 60,000, the initial learning rate is 0.01, and we apply the warm-up strategy in the first 400 iterations.
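For reference, a minimal sketch of the learning-rate schedule described above (polynomial decay with linear warm-up) is given below; the decay power of 0.9 is an assumption, since the text does not state the exponent.

```python
def lr_at_iter(it, max_iter, base_lr, warmup_iters, power=0.9):
    """Polynomial-decay learning rate with linear warm-up (sketch)."""
    if it < warmup_iters:
        return base_lr * (it + 1) / warmup_iters          # linear warm-up
    progress = (it - warmup_iters) / (max_iter - warmup_iters)
    return base_lr * (1 - progress) ** power

# Cityscapes settings from the text: 140,000 iterations, base lr 0.005,
# 2000 warm-up iterations
for it in (0, 1000, 2000, 70000, 139999):
    print(it, round(lr_at_iter(it, 140000, 0.005, 2000), 6))
```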

For data augmentation, we utilize random scaling, random padding crop, random horizontal flipping, random color jittering and normalization. The random scale ranges are [0.25, 2.0] and [0.5, 2.5] for Cityscapes and CamVid, respectively. The training crop resolutions for Cityscapes are \(512\times 1024\) and \(768\times 1536\), and the training crop resolution for CamVid is \(720\times 960\). In all training experiments, we use PaddlePaddle [52] on an NVIDIA RTX 3090 GPU with CUDA 11.2 and CUDNN 8.1.

Table 2 Comparisons with complete dual-branch and compact interactive dual-branch of CIDNet1 and CIDNet2 on Cityscapes val set

Inference settings

We do not use any speed-up tricks or acceleration strategies, e.g., sliding-window evaluation or TensorRT acceleration. For Cityscapes we use resolutions of \(768 \times 1536\) and \(512\times 1024\) for inference. We take the \(1024\times 2048\)-resolution input, and the time for resizing is included in the inference time measurement: we first resize the input to \(768 \times 1536\) or \(512\times 1024\) resolution for inference and then resize the prediction back to the original input size. For CamVid, a resolution of \(960\times 720\) is used for inference. We conduct all inference experiments under CUDA 10.1 and CUDNN 7.6 on a Tesla V100 GPU. We use the standard metric of mean intersection over union (mIoU) for segmentation accuracy comparison and frames per second (FPS) for inference speed comparison.
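The following sketch shows how such an FPS measurement can be performed, with the resize to the inference resolution and the resize of the prediction back to the original size included inside the timed loop, as described above. It is written in PyTorch for illustration, assumes a CUDA device, and is not the exact benchmarking script used in the paper; the warm-up and iteration counts are assumptions.

```python
import time
import torch
import torch.nn.functional as F

def measure_fps(model, n_warmup=50, n_runs=200, device='cuda'):
    """Rough FPS measurement sketch for a 1024x2048 Cityscapes input.
    CUDA synchronisation is required for correct GPU timing."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, 1024, 2048, device=device)
    with torch.no_grad():
        for _ in range(n_warmup):                      # warm up the GPU
            model(F.interpolate(x, size=(512, 1024), mode='bilinear',
                                align_corners=False))
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_runs):
            small = F.interpolate(x, size=(512, 1024), mode='bilinear',
                                  align_corners=False)  # resize is timed
            pred = model(small)
            pred = F.interpolate(pred, size=(1024, 2048), mode='bilinear',
                                 align_corners=False)   # resize prediction back
        torch.cuda.synchronize()
    return n_runs / (time.time() - start)
```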

Table 3 Comparisons with different interactive module number of CIDNet1 and different fusion methods on Cityscapes val set

Experiments on cityscapes

Ablation study

In this section we perform ablation experiments to demonstrate the performance of our proposed CIDNet versus other network architectures, as well as the effectiveness of each component in our proposed CIDNet. In the following experiments, we train our methods on the Cityscapes [50] training set and evaluate on the Cityscapes validation set.

Effectiveness of compact interactive dual-branch

To demonstrate the effectiveness of our compact interactive dual-branch architecture, we compare it with the complete dual-branch network [21, 22], in which the input image is split into two paths from the beginning, as in BiSeNetV2 [22]. The experiments are performed with the same configuration and on the same platform. Table 2 reports the comparison results.

It can be seen from Table 2 that our compact interactive dual-branch architecture outperforms the complete dual-branch architecture in both speed and performance. So, our compact interactive dual-branch architecture is effective.

Table 4 Comparisons with other methods on Cityscapes
Fig. 6
figure 6

Example results of the different methods on the Cityscapes dataset. The first row shows the input images, and rows 2–5 display the results of BiSeNet V1, BiSeNet V2, STDCNet and CIDNet. The final row is the ground truth

Effectiveness of detail-semantic interactive module

To investigate the effectiveness of our proposed detail-semantic interactive module (DSIM), we conduct comparative trials using “Add” fusion and using “interactive” fusion, and the results are shown in Table 3. It can be seen that interactive operation has better segmentation effectiveness, and the performance based on two interactive operations is slightly better than one interactive operation.

Comparisons with state-of-the-arts

In this subsection, we demonstrate the capacity of CIDNet for semantic segmentation by comparing it with 20 representative models on the Cityscapes dataset. For a fair comparison, we evaluate our model with resolutions of \(512\times 1024\) and \(768\times 1536\), respectively. For clarity, we use CIDNet1 and CIDNet2 to denote our proposed CIDNet based on CID1 and CID2 in Table 1, respectively. Furthermore, we denote our proposed CIDNet1 on the two input sizes, 50% and 75% of the original images, as CIDNet1-50 and CIDNet1-75, respectively. Similarly, we denote our proposed CIDNet2 on the two input sizes as CIDNet2-50 and CIDNet2-75.

Table 5 Comparisons with state-of-the-art real-time methods on CamVid test set

Table 4 reports the comparison results of our proposed CIDNet and the 20 models. In Table 4 we present the model name, backbone name, segmentation accuracy, speed (FPS), input resolution, GFLOPs and parameters of the various approaches. Specifically, CIDNet1-50 achieves 164.1 FPS and 75.1% mIoU, and with a resolution of \(768\times 1536\), CIDNet2-75 achieves 92.6 FPS and 77.7% mIoU on the validation set. Our method outperforms STDCNet in both GFLOPs and Params, but is slightly inferior to STDCNet in inference speed due to the presence of high-resolution branches. As can be seen from Table 4, many models are pre-trained on ImageNet, which can significantly improve segmentation accuracy but is a very time-consuming process; our models are trained from scratch. Moreover, our CIDNet2-75 model is loaded with the training weights of CIDNet1-50. Due to the limitation of GPU memory, the training batch size is limited to a small value, while the training batch size of a small-size model (such as \(512\times 1024\)) is larger and its training gradient is more accurate. Therefore, we load the training weights of CIDNet1-50 into the large-size model CIDNet2-75. As shown in Table 4, on the Cityscapes val set our model is the fastest among those with the same performance, so our network achieves a trade-off between speed and performance. At the same input resolution, our models have fewer parameters. For example, at \(512 \times 1024\) resolution our CIDNet1-50 has 34.3% fewer parameters than STDC1-Seg50, and at \(1536\times 768\) resolution our CIDNet1-75 has 73.2% fewer parameters than STDC1-Seg75. To further demonstrate the superiority of our proposed method, we show the visualization results of several different methods on the Cityscapes dataset in Fig. 6. From these figures, it can be seen that our method is the closest to the ground truth, that is, our method is superior to the three comparable methods.

Experiments on CamVid

We also conduct experiments on the CamVid dataset to further demonstrate the performance of CIDNet. Like other works, the input resolution for training and inference is \(720\times 960\). Table 5 shows the comparison results with other methods.

In Table 5, CIDNet1 and CIDNet2 denote our proposed CIDNet based on CID1 and CID2 in Table 1, respectively. CIDNet2* achieves the best performance, 77.8% mIoU at 109.6 FPS, which further demonstrates the superior performance of our method. ‘*’ means the model is loaded with the training weights of CIDNet2-75 (trained on the Cityscapes dataset).

Conclusion

In this paper, we propose an efficient compact interactive dual-branch network (CIDNet) for real-time semantic segmentation. Considering that the complete dual-branch network is time-consuming, we first refine the traditional dual-branch network and construct a compact detail branch and a semantic branch. To prevent the loss of information, we propose a detail-semantic interactive module. Finally, we construct a dual-branch contextual attention fusion module to deeply fuse the extracted features and predict the final segmentation result. Experimental results on the Cityscapes and CamVid datasets demonstrate that the proposed CIDNet achieves a satisfactory trade-off between segmentation accuracy and inference speed, and outperforms 16 representative real-time semantic segmentation methods. Our network still has shortcomings: the spatial detail branch should occupy fewer computational resources, and the fusion efficiency of the two branches could be improved; these will be addressed in future work. In practical applications, we face edge devices, mobile terminals and other devices with relatively weak computing power, which requires our algorithm to be light enough and its computation to be as small as possible. Therefore, in future work, we will try to deploy our algorithm on mobile terminals to further improve lightweight and efficient semantic segmentation.