Introduction

Semantic segmentation is a dense classification task in computer vision that assigns a category to each pixel of the input image. Because it produces dense predictions, it has a wide range of real-world applications, including autonomous driving [1], virtual reality [2], and scene understanding [3].

Deep learning has made significant progress in many fields, including fault diagnosis [4], automation control [5], and semantic segmentation. Using Convolutional Neural Networks (CNNs), advanced semantic segmentation methods such as PSPNet [6], RefineNet [7], and DeepLab [8] have achieved significant gains in accuracy in recent years. However, they typically possess complex structures with hundreds of convolution layers and feature channels, which consume a large amount of computing resources and limit their wide application in the real world. Consequently, it remains a challenge to design a lightweight network that achieves real-time performance with satisfactory accuracy.

Fig. 1 Comparisons with other methods in terms of parameters and accuracy. Our EBUNet achieves a competitive result

At present, many lightweight semantic segmentation methods have been developed to achieve a good trade-off between inference speed and prediction accuracy. They can be broadly divided into two categories: (1) model compression, which eliminates unnecessary computation and reduces the amount of data to be stored by simplifying models, e.g., network pruning [9] and knowledge distillation [10, 11]; and (2) convolution decomposition, which builds shallow networks by reducing the computational cost of convolutions, e.g., depth-wise separable convolution and group convolution.

Based on the idea of convolution decomposition, MobileNet [12] adopted depth-wise convolutions to construct its backbone and achieved much faster inference than networks built on standard convolutions. DABNet [13] introduced a depth-wise asymmetric bottleneck, which achieved a well-balanced performance in terms of running speed and segmentation accuracy.

Furthermore, multi-scale feature fusion is often employed in the design of lightweight semantic segmentation networks. For example, LMFFNet [14] introduced a MAD module to combine different levels of features in one stage and generate more accurate attention maps. CFNet [15] implemented a channel attention mechanism and a cross-fusion module to enhance the fusion effect.

In this paper, we present a novel lightweight network called EBUNet, which employs an encoder–decoder architecture, to achieve real-time semantic segmentation. Our EBUNet is mainly composed of three modules: Efficient Bottleneck Unit (EBU), Image Partition Attention Mechanism (IPAM) and a Lightweight Attentional Decoder (LAD).

Using depth-wise separable convolution and dilated convolution simultaneously, we devise a novel residual-like structure named EBU, which achieves high accuracy at low computational cost. The IPAM module is designed to enhance the extracted features, and the LAD module recovers the spatial information and generates the segmentation results. We compare our method with other semantic segmentation methods in terms of parameters and mIoU; the results are shown in Fig. 1.

Our main contributions can be listed as follows:

  • A novel Efficient Bottleneck Unit (EBU) module is proposed to gather detailed and semantic information. Based on the EBU module, we have implemented two EBU blocks as the main part of our EBUNet.

  • We propose a novel Lightweight Attentional Decoder (LAD) to recover spatial information effectively. In the proposed LAD, different attention mechanisms are employed to refine different levels of feature maps.

  • We present a novel Image Partition Attention Module (IPAM) to refine the feature maps from different EBU blocks. The proposed IPAM module partitions the feature maps into sub-regions and generates an attention map based on the partitioned sub-regions.

  • Our EBUNet obtains competitive results: it achieves 73.4% and 72.2% mIoU on the Cityscapes and CamVid test sets at 152 and 147 FPS, respectively.

The structure of this paper is organized as follows: Section II introduces some previous work about residual structures, lightweight semantic methods, and feature fusion methods. Section III presents our method, including the EBU module, IPAM module, and LAD module. Section IV discusses the experiment details and results. Section V concludes the whole paper.

Related works

In this section, we will review some related works, including residual structure, lightweight semantic networks, and multi-scale feature fusion methods.

Residual structure

The residual structure has been proven to be an effective way to overcome gradient explosion and vanishing problems. It was originally proposed in ResNet [16], and many residual-like structures have been proposed for various computer vision tasks since then. For example, ShuffleNet [17, 18] developed lightweight backbone networks with depth-wise convolutions. LEDNet [19] introduced SS-nbt modules that combine factorized and dilated convolutions for feature extraction. FBSNet [20] employed BRU modules to capture rich contextual information. DABNet [13] utilized dilated convolutions in its DAB modules to enlarge the receptive field, which helps to improve fine-detail segmentation. MSCFNet [21] applied EAR modules to retrieve contextual and detailed information.

Lightweight semantic networks

In recent years, rapid-growing applications have required semantic segmentation approaches to run efficiently in real-world scenarios. The key concept of lightweight semantic segmentation is to maximize accuracy while minimizing feed-forward inference time.

A lot of attention has been paid to the design of lightweight semantic segmentation since ENet [22] was proposed. For example, BiseNet [23] introduced a dual-path structure, in which the context path extracts contextual information while the spatial path extracts spatial information. A number of other approaches have been developed based on BiseNet, including STDCNet [24] and BiseNet-v2 [25], which produce more efficient and accurate results than the original BiseNet. Many other lightweight networks have also been developed in recent years to improve efficiency and effectiveness. For example, JPANet [26] presented a joint feature pyramid module for learning multi-stage features, FPANet [27] employed a feature pyramid fusion module to fuse features from different stages, and RELAXNet [28] applied EBR and EABR modules to acquire contextual and detailed information.

Feature fusion

Feature fusion has different meanings in different fields. In signal processing, feature fusion combines time- and frequency-domain information to achieve high robustness [29]. In deep learning, feature fusion methods aim to fuse the feature maps from different stages.

There are two types of feature fusion that are commonly applied: channel-wise concatenation and element-wise addition. For example, in ContextNet [30] and Fast-SCNN [31], high-level feature maps are upsampled and then concatenated with low-level ones to achieve multi-scale feature fusion. FBSNet [20] utilized element-wise addition to combine features from different branches. Since the attention mechanism was proposed, many attentional methods have been devoted to improving multi-scale feature fusion. For example, LMFFNet [14] introduced an FFM module that employs an attention mechanism together with depth-wise separable convolutions to fuse different levels of feature maps, and ABCNet [32] utilized self-attention to fuse feature maps from different stages.

Methodology

In this section, we first examine the computational cost and parameters of the convolution operation. Then, we introduce the components of our EBUNet, including the EBU module, the IPAM module, and the LAD module. The architecture of EBUNet is discussed at the end of this section.

Computation complexity analysis

Convolutional Neural Networks (CNNs) are composed of convolutional layers and fully connected layers. In this section, we discuss the computational complexity of CNNs.

Before we start, we introduce some definitions to simplify the discussion. We define a transformation function \(\Phi \) that takes \({C_\mathrm{{in}}}\) feature maps with a spatial size of \(d \times d\) as input and outputs \({C_\mathrm{{out}}}\) feature maps of the same size, where \({C_\mathrm{{in}}}\) and \({C_\mathrm{{out}}}\) denote the numbers of input and output channels, respectively. The convolutional kernel size is \(k \times k\) and the stride is set to 1. We use square feature maps and convolutional kernels to simplify the discussion, and we omit the bias and batch normalization terms that are often used in modern CNNs.

In this case, the number of parameters in the convolution is \(k \times k \times {C_\mathrm{{in}}} \times {C_\mathrm{{out}}}\) and the computational complexity in terms of FLOPs is \(k \times k \times {C_\mathrm{{in}}} \times {C_\mathrm{{out}}} \times d \times d\).

Based on the above, reducing the multiplicative factor between \(k \times k \) and \({C_\mathrm{{in}}} \times {C_\mathrm{{out}}}\) is an effective way to cut down the size and computational burden of convolutions. Depth-wise separable convolution applies this idea to build compact models.

Compared to standard convolution, depth-wise separable convolution applies a single convolutional kernel independently to each input feature map, generating the same number of output channels, followed by a \(1\times 1\) convolution layer that merges the information across all output channels. In other words, it decomposes a standard convolution into a depth-wise convolution and a point-wise convolution. With depth-wise separable convolution, the number of parameters becomes:

$$\begin{aligned} k \times k \times {C_\mathrm{{in}}} + {C_\mathrm{{in}}} \times {C_\mathrm{{out}}} \end{aligned}$$
(1)

and the computation complexity becomes:

$$\begin{aligned} k \times k \times {C_\mathrm{{in}}} \times d \times d + {C_\mathrm{{in}}} \times {C_\mathrm{{out}}} \times d \times d \end{aligned}$$
(2)

Based on the above equations, both the number of parameters and the amount of computation are reduced by depth-wise separable convolution.
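For illustration, the following minimal PyTorch sketch compares the parameter counts of a standard \(3\times 3\) convolution and its depth-wise separable counterpart; the channel sizes are illustrative choices rather than the configuration used in EBUNet.

```python
import torch.nn as nn

c_in, c_out, k = 64, 64, 3  # illustrative channel and kernel sizes

# Standard convolution: k*k*C_in*C_out parameters (bias omitted, matching the analysis above)
standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# Depth-wise separable convolution: depth-wise 3x3 followed by point-wise 1x1, see Eq. (1)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # k*k*C_in
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # C_in*C_out
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 3*3*64*64 = 36864
print(count(separable))  # 3*3*64 + 64*64 = 4672
```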

Efficient bottleneck unit

The EBU module is designed to extract semantic information more efficiently and effectively. Previous residual-like designs, including the bottleneck [22], SS-nbt [19], and EAR [21] modules, have proven effective in lightweight semantic segmentation networks.

Fig. 2 Comparison of several residual-like structures: a bottleneck [22], b SS-nbt [19], c BRU [20], d our EBU. D denotes dilated convolution, DW denotes depth-wise convolution

As shown in Fig. 2d, we employ a \(3 \times 3\) standard convolution to generate features and reduce the channels by half at the beginning of each EBU module. The output of the convolutional operation is then split into two branches, where each branch has 1/4 channels of the original input.

A convolutional kernel of 3\(\times \)3 is used in the EBU module to preserve adequate spatial information for accurate segmentation. In order to improve computation efficiency, a \(3\times 3 \) depth-wise convolution is employed in the left branch to acquire local information.

The right branch is developed to obtain adequate contextual information. For the purpose of enlarging the receptive field, we use a depth-wise dilated convolution without adding any additional parameters in the right branch.

To share information between the two branches, we introduce a feature interaction operation implemented as element-wise addition between them, so that the two branches can complement each other.

At the end of the EBU module, another \(3\times 3\) regular convolution is employed to integrate the multi-scale features and restore the number of channels to that of the input. The whole procedure can be expressed as follows:

$$\begin{aligned} F= & {} {f_{3 \times 3}}(X) \end{aligned}$$
(3)
$$\begin{aligned} {F_1},{F_2}= & {} \text {Split}(F) \end{aligned}$$
(4)
$$\begin{aligned} {F_1}= & {} f_{3 \times 3}^{\text {DW}}({F_1}) \end{aligned}$$
(5)
$$\begin{aligned} {F_2}= & {} f_{3 \times 3}^{\text {DDW}}({F_2}) \end{aligned}$$
(6)
$$\begin{aligned} {F_2}= & {} {F_2} + {F_1}\nonumber \\ {F_1}= & {} {F_1} + {F_2} \end{aligned}$$
(7)
$$\begin{aligned} {F_\mathrm{{out}}}= & {} \text {Concat}({F_1},{F_2}) \end{aligned}$$
(8)
$$\begin{aligned} {F_\mathrm{{out}}}= & {} X + {F_\mathrm{{out}}}, \end{aligned}$$
(9)

where X denotes the input feature maps, \({F_1}\) and \({F_2}\) are the results of the splitting operation, \({f_{3 \times 3}}\) represents a standard \(3 \times 3\) convolution, and \(f_{3 \times 3}^{\text {DW}}\) and \(f_{3 \times 3}^{\text {DDW}}\) stand for depth-wise convolution and depth-wise dilated convolution, respectively. Concat denotes feature concatenation along the channel dimension.
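A simplified PyTorch sketch of the EBU computation in Eqs. (3)–(9) is given below; normalization and activation layers are omitted for brevity, and the feature interaction in Eq. (7) is applied in the order written.

```python
import torch
import torch.nn as nn

class EBU(nn.Module):
    """Simplified sketch of the Efficient Bottleneck Unit (channels must be divisible by 4)."""

    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        self.reduce = nn.Conv2d(channels, half, 3, padding=1, bias=False)      # Eq. (3)
        self.dw = nn.Conv2d(quarter, quarter, 3, padding=1,
                            groups=quarter, bias=False)                         # Eq. (5)
        self.ddw = nn.Conv2d(quarter, quarter, 3, padding=dilation,
                             dilation=dilation, groups=quarter, bias=False)     # Eq. (6)
        self.restore = nn.Conv2d(half, channels, 3, padding=1, bias=False)      # final 3x3 conv

    def forward(self, x):
        f = self.reduce(x)                        # halve the channels
        f1, f2 = torch.chunk(f, 2, dim=1)         # Eq. (4): each branch has 1/4 of the input channels
        f1 = self.dw(f1)                          # local information
        f2 = self.ddw(f2)                         # enlarged receptive field
        f2 = f2 + f1                              # Eq. (7): feature interaction between branches
        f1 = f1 + f2
        out = self.restore(torch.cat([f1, f2], dim=1))  # Eq. (8) + channel restoration
        return x + out                            # Eq. (9): residual connection
```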

Fig. 3 Illustration of the proposed IPAM module

Image partition attention module

The attention mechanism has been widely used in segmentation methods such as BiseNet [23], MSCFNet [21], and DFANet [33]. In this paper, we introduce an Image Partition Attention Module (IPAM).

As Fig. 3 shows, the input features are partitioned into four regions through an average pooling operation. Then, global average pooling operations are applied to each partitioned sub-region in parallel.

Fig. 4 Illustration of our LAD module

Each partitioned sub-region \({S_i}\) is subjected to global average pooling in parallel. For a feature map X with spatial size \(H \times W\), global average pooling is computed as follows:

$$\begin{aligned} {F_\mathrm{{avg}}} = \frac{1}{{H \times W}}\sum \limits _{i = 1}^H {\sum \limits _{j = 1}^W {X(i,j)} }. \end{aligned}$$
(10)

Additionally, global average pooling is applied to the original input feature to acquire global information. Element-wise addition is then used to fuse the sub-region pooled features and the globally pooled feature. This procedure is computed as follows:

$$\begin{aligned} T = \sum \limits _{i = 1}^{4} {{S_i}} + {F_\mathrm{{global}}}, \end{aligned}$$
(11)

where \({{S_i}}\) represents the pooled result of the i-th sub-region and \({F_\mathrm{{global}}}\) denotes the pooled result of the original input.

The result of the addition is then fed into a projection layer built with \(1\times 1\) convolutions to generate the attention weight vector w.

To be specific, the addition result is first compressed along the channel dimension, and the ReLU function is applied to introduce non-linearity. A channel-expansion layer then restores the number of channels to that of the original input, and a sigmoid function produces the attention weight vector w. The projection layer can be expressed as follows:

$$\begin{aligned} w = \sigma ({\pi _2}({\mathop {\textrm{ReLU}}\nolimits } ({\pi _1}(T)))), \end{aligned}$$
(12)

where \({\pi _1}\) and \({\pi _2}\) represent the channel reduction and expansion functions, implemented by two regular \(1\times 1\) convolutions, and \({\mathop {\textrm{ReLU}}\nolimits } \) denotes the Rectified Linear Unit.

At the end of the IPAM module, we can obtain the final output \({F_\mathrm{{out}}}\) as follows:

$$\begin{aligned} {F_\mathrm{{out}}} = {F_\mathrm{{in}}} \otimes w, \end{aligned}$$
(13)

where \({F_\mathrm{{in}}}\) and \(F_\mathrm{{out}}\) represent the input and output, respectively, and w denotes the attention weights generated by the IPAM module.
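A simplified sketch of the IPAM computation is shown below; the \(2\times 2\) spatial partition and the channel-reduction ratio of the projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IPAM(nn.Module):
    """Simplified sketch of the Image Partition Attention Module."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.partition_pool = nn.AdaptiveAvgPool2d(2)  # averages each of the four sub-regions
        self.global_pool = nn.AdaptiveAvgPool2d(1)     # global average pooling of the input
        self.pi1 = nn.Conv2d(channels, channels // reduction, 1)  # channel reduction (pi_1)
        self.pi2 = nn.Conv2d(channels // reduction, channels, 1)  # channel expansion (pi_2)

    def forward(self, x):
        # Eq. (11): sum of the four sub-region descriptors plus the global descriptor
        sub = self.partition_pool(x)                              # B x C x 2 x 2
        t = sub.sum(dim=(2, 3), keepdim=True) + self.global_pool(x)
        # Eq. (12): projection layer producing the attention weight vector w
        w = torch.sigmoid(self.pi2(torch.relu(self.pi1(t))))
        return x * w                                              # Eq. (13)
```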

Lightweight attentional decoder

There are different roles assigned to encoders and decoders in encoder–decoder segmentation frameworks. The encoder is responsible for producing dense feature maps, whereas the decoder is responsible for upsampling the resolution of feature maps to match the original input size. It is possible to improve the accuracy of prediction with the use of well-designed decoders.

In our paper, we present a novel lightweight attentional decoder (LAD). It consists of two blocks and can fuse different-level features effectively. A channel attention module is proposed for the refinement of high-level feature maps, while a spatial attention module is proposed for the refinement of low-level ones. The structure of our proposed LAD is shown in Fig. 4.

We present a spatial attention module (SAM) to make the low-level features focus on informative regions. Let \({X_L}\) denote the input low-level feature maps, \({f_\mathrm{{conv}}}\) the regular convolution operation, and \({f_\mathrm{{mean}}}\) and \({f_\mathrm{{max}}}\) the mean and maximum operations along the channel dimension, respectively. The spatial attention map S is computed as follows:

$$\begin{aligned} S = \sigma ({f_\mathrm{{conv}}}(\text {Concat}({f_{\text {mean}}}({X_L}),{f_{\text {max}}}({X_L})))), \end{aligned}$$
(14)

where \(\sigma (\cdot )\) represents the sigmoid function. After this transformation, the channel dimension is reduced from C to 1, so the attention map S has shape \(1 \times H \times W\) while the low-level features keep shape \(C \times H \times W\). Finally, we element-wise multiply the input low-level features \({X_L}\) by the spatial attention map S to obtain the refined features \(X_L^S\):

$$\begin{aligned} X_L^S = {X_L} \otimes S, \end{aligned}$$
(15)

where, \( \otimes \) denotes the element-wise multiplication.

Our channel attention module (CAM) uses global average pooling to obtain global contextual information and generates a channel attention map to refine the high-level features. Let \({X_H}\) denote the input high-level feature maps and \({X_H}(i,j)\) its value at pixel location (i, j). The global average pooling can be expressed as follows:

$$\begin{aligned} {F_{\text {avg}}} = \frac{1}{{H \times W}}\sum \limits _{i = 1}^H {\sum \limits _{j = 1}^W {{X_H}(i,j)} }. \end{aligned}$$
(16)

Consequently, the \(C \times H \times W\) high-level features are squeezed into a \(C \times 1 \times 1\) descriptor. Following that, \({F_{\text {avg}}}\) is fed into a convolution layer and then passed through a sigmoid to generate the channel attention map C:

$$\begin{aligned} C = \sigma (f_{\text {conv}}^{1 \times 1}({F_{\text {avg}}})). \end{aligned}$$
(17)

The final weighted high-level features are acquired by multiplying the feature map by the attention map:

$$\begin{aligned} X_H^C = {X_H} \otimes C. \end{aligned}$$
(18)

The spatial attention map produced from the low-level features indicates the importance of each pixel and focuses on locating objects and refining their shapes and boundaries with spatial details. In contrast, the squeezed channel attention map generated from the upsampled high-level features focuses on the global context and provides contextual information.

After that, the refined low-level and high-level features are concatenated along the channel dimension. Finally, another upsampling operation restores the feature map to the original input size.
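A simplified sketch of the LAD is given below; the kernel size of the spatial-attention convolution, the \(1\times 1\) classifier, and the final upsampling factor are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LAD(nn.Module):
    """Simplified sketch of the Lightweight Attentional Decoder."""

    def __init__(self, low_channels: int, high_channels: int, num_classes: int):
        super().__init__()
        self.sam_conv = nn.Conv2d(2, 1, 7, padding=3)               # spatial attention, Eq. (14)
        self.cam_conv = nn.Conv2d(high_channels, high_channels, 1)  # channel attention, Eq. (17)
        self.classifier = nn.Conv2d(low_channels + high_channels, num_classes, 1)

    def forward(self, x_low, x_high):
        # SAM on low-level features: channel-wise mean and max, Eqs. (14)-(15)
        s = torch.cat([x_low.mean(dim=1, keepdim=True),
                       x_low.max(dim=1, keepdim=True).values], dim=1)
        x_low = x_low * torch.sigmoid(self.sam_conv(s))
        # CAM on upsampled high-level features: global pooling, Eqs. (16)-(18)
        x_high = F.interpolate(x_high, size=x_low.shape[2:], mode='bilinear', align_corners=False)
        c = torch.sigmoid(self.cam_conv(F.adaptive_avg_pool2d(x_high, 1)))
        x_high = x_high * c
        # Fuse the refined features and restore the input resolution
        out = self.classifier(torch.cat([x_low, x_high], dim=1))
        return F.interpolate(out, scale_factor=4, mode='bilinear', align_corners=False)
```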

Fig. 5 Overall architecture of our EBUNet

Table 1 Overall architecture of EBUNet

Architecture design of EBUNet

The overall architecture of the proposed EBUNet is shown in Fig. 5 and is listed in Table 1.

An Initial Unit is employed at the beginning of EBUNet to adjust the resolution of the input images and eliminate redundant information. It is composed of three consecutive standard convolutions. Specifically, the first convolution reduces the image resolution by half and adjusts the number of feature channels to 32. Two subsequent 3 \(\times \) 3 convolutions are then used to obtain richer contextual information.

In addition, a downsampling operation is used to enlarge the receptive field. It is composed of two parallel branches: a standard 3 \(\times \) 3 convolution with a stride of 2 and a 2 \(\times \) 2 max-pooling operation. The outputs of the two branches are then concatenated along the channel dimension.
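A minimal sketch of this downsampling unit is shown below; assigning the remaining output channels to the convolution branch is an assumption for illustration, and the input spatial size is assumed to be even.

```python
import torch
import torch.nn as nn

class DownsamplingUnit(nn.Module):
    """Sketch of the two-branch downsampling operation (strided conv + max pooling)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Convolution branch produces the channels not covered by the pooling branch
        self.conv = nn.Conv2d(in_channels, out_channels - in_channels,
                              kernel_size=3, stride=2, padding=1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        # Both branches halve the spatial resolution; outputs are concatenated channel-wise
        return torch.cat([self.conv(x), self.pool(x)], dim=1)
```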

After that, the downsampled output of the Initial Unit is fed into the first EBU block for dense feature extraction. The first EBU block contains three EBU modules with a dilation rate of 2. The input feature map of the second EBU block is 1/8 of the input resolution; this block contains 10 consecutive EBU modules with gradually increasing dilation rates {2, 2, 4, 4, 6, 6, 8, 8, 16, 16}. The IPAM refines the features from EBU block 1 and EBU block 2. Finally, in the decoder phase, the LAD applies different attention mechanisms to the different levels of feature maps and produces more accurate outputs.

Experiments

In this section, we first provide brief information about the Cityscapes [34] and CamVid [35] datasets. We then introduce the training protocols for our experiments. Subsequently, ablation studies on several components of our EBUNet are discussed. At the end of this section, we evaluate the performance of our method in terms of prediction accuracy and running efficiency.

Datasets

We utilize Cityscapes and CamVid datasets in our training and testing experiments.

The Cityscapes dataset is a well-known dataset for semantic segmentation of urban scenes. It contains 5000 finely annotated images: 2975 for training, 500 for validation, and 1525 for testing. The original image resolution of Cityscapes is 2048\(\times \)1024. For fair comparisons, we use the full resolution for evaluation in the validation and testing phases, while in the training phase the resolution is resized to 512\(\times \)1024.

The CamVid dataset, derived from car-view videos, is another well-known urban scene dataset. The CamVid dataset consists of 701 images total: 367 images for the training phase, 101 images for the validation phase, and 233 images for the testing phase. The original resolution of CamVid dataset images is 720\(\times \)960.

Training protocols

All the experiments are performed with one NVIDIA RTX 3090 GPU, CUDA 11.6, and cuDNN v8 on the PyTorch platform, running Ubuntu 20.04 with 32 GB of memory.

We employ mini-batch Stochastic Gradient Descent (SGD) [36] as our optimization strategy for the Cityscapes dataset, where we set the batch size to 8, the weight decay to \(1 \times {10^{ - 4}}\), the momentum to 0.9, and the initial learning rate to \(4.5 \times {10^{ - 2}}\).

We train our EBUNet with the Adam optimizer for the experiments on the CamVid dataset, where the initial learning rate is set to \(1 \times {10^{ - 3}}\) and the weight decay to \(2 \times {10^{ - 4}}\).

Besides, a polynomial policy is employed to adjust the learning rate during training. It is expressed by the following formula:

$$\begin{aligned} \text {l}{\text {r}_{\text {cur}}} = \text {l}{\text {r}_{\text {init}}}*{\left( 1 - \frac{{\text {cur}\_\text {epoch}}}{{\max \_\text {epoch}}}\right) ^{0.9}}, \end{aligned}$$
(19)

where \(\text {l}{\text {r}_{\text {cur}}}\) represents the learning rate in the current epoch, \({\text {cur}\_\text {epoch}}\) is the current epoch, and \(\max \_\text {epoch}\) is the total number of epochs.
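For reference, Eq. (19) maps directly onto a LambdaLR scheduler; the sketch below uses a stand-in module in place of EBUNet and mirrors the Cityscapes SGD configuration described above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)  # stand-in module in place of EBUNet

# SGD settings used for Cityscapes (data loading and batching omitted)
optimizer = torch.optim.SGD(model.parameters(), lr=4.5e-2,
                            momentum=0.9, weight_decay=1e-4)

max_epoch = 1000
# LambdaLR multiplies the initial learning rate by the returned factor, i.e. Eq. (19)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda cur_epoch: (1 - cur_epoch / max_epoch) ** 0.9)

for epoch in range(max_epoch):
    # ... one training epoch over the dataset would go here ...
    scheduler.step()
```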

The \(\max \_\text {epoch}\) was set to 1000 during training for both the Cityscapes and CamVid datasets. During the training phase, data augmentation techniques such as random scaling, mean subtraction, and horizontal flipping are also applied. The training samples are randomly scaled by factors of 0.75, 1.0, 1.25, 1.5, 1.75, and 2.0. For the Cityscapes dataset, the training images and labels are randomly cropped from 2048\(\times \)1024 to 512\(\times \)1024.

Ablation studies

In this part, we design a series of ablation experiments to validate the effectiveness of the proposed components of our EBUNet. We conduct ablation studies on the EBU module and the LAD module, and we additionally investigate the influence of the depth of the EBU blocks. All ablation experiments are performed on the CamVid dataset.

Ablation on EBU module

The main part of our EBUNet is constructed with the EBU module. We devise two ablation strategies to verify its effectiveness. First, we design a series of experiments to investigate the influence of different dilation rates. Second, we compare our EBU module with other residual structures, including those of DABNet [13] and ERFNet [37]. The results can be seen in Tables 2 and 3.

Table 2 Experimental results for different dilation rates

To study the effect of dilation rates, we devised five sequences with varying dilation rates and compared them with the baseline. From Table 2, we can see that when all dilation rates in the EBU modules are set to 2, the accuracy is 1.8% lower than the baseline. When a larger dilation-rate sequence is used, the accuracy is 1.6% higher than R = 2 but still 0.2% lower than the baseline.

Additionally, we designed experiments to test network performance with excessive dilation rates (32 and 48). As shown in Table 2, when the dilation rate was set to 32, the mIoU decreased to 69.8% and the FPS dropped from 147 to 140. Both accuracy and speed also decreased when all dilation rates were set to 48. We conclude that overly large dilation rates cause heavy computational cost; furthermore, dilated convolutions operate on mutually independent subsets of pixels, which loses local information.

Table 3 Experiment results on residual structures

From Table 3, we observe that when the EBU modules are substituted with non-bottleneck modules, the forward inference speed is higher than with EBU modules, but the accuracy of our EBUNet is 1% higher. As a result, our EBU module strikes a good balance between accuracy and efficiency. A visual comparison was also conducted; the results can be seen in Fig. 6.

Fig. 6 Visual results of the ablation study on the EBU module. From the left column to the right column: input, ground truth, baseline, EBUNet with DAB module, and EBUNet with non-bottleneck

Table 4 Experiment results on LAD
Table 5 Experiment results on the attention mechanism used in LAD

Ablation on LAD module

The LAD is used to recover the spatial information to the original input resolution. Its ablation design is based on two strategies. First, we compare our LAD with DABNet's decoder. Then we discuss how our LAD is affected by the attention mechanisms; specifically, we perform ablation studies on the different attention mechanisms within LAD. The results are presented in Tables 4 and 5.

Fig. 7 Visual results of the ablation study on the LAD module. From the left column to the right column: input, ground truth, baseline, and the decoder in DABNet

Fig. 8 Visual comparisons of LAD. From left to right: a input, b ground truth, c LAD baseline, d only CAM used in LAD, e only SAM used in LAD, f no attention used in LAD

As shown in Table 4, the accuracy of LAD increases by 1.33% compared with the DABNet decoder, while the parameters increase by only 0.01 M, a negligible cost.

A visual comparison of the ablation results for the LAD module is also performed. The visual results can be seen in Figs. 7 and 8. The difference is highlighted by the yellow dashed line.

From Table 5, LAD achieves the highest mIoU when both SAM and CAM are used to refine the different-level feature maps. The mIoU of LAD is 0.5% lower when only CAM is used to refine high-level features, and 0.3% lower when only SAM is used to refine low-level features. Removing both SAM and CAM from the LAD leads to a 1% reduction in accuracy. We therefore conclude that the attention mechanisms effectively improve segmentation accuracy while consuming negligible computational resources.

Ablation on the depth of EBU Block

Two parameters, M and N, indicate the numbers of EBU modules contained in EBU block 1 and EBU block 2, respectively. To investigate the model performance in terms of segmentation accuracy (mIoU) and feed-forward speed (FPS), we devised a series of experiments using different values of M and N. The experimental settings and results are listed in Table 6.

According to Table 6, accuracy tends to improve as the depth of the EBU blocks increases. However, the improvement is only slight, and when the EBU blocks are deepened further, performance drops.

Table 6 The ablation results on the influence of the depth

In general, increasing the depth of the network initially improves performance to a certain degree, with a moderate increase in computational cost. However, making the network even deeper causes both accuracy and efficiency to fall.

Comparisons with other works

In this subsection, we compare the performance of our EBUNet with other state-of-the-art semantic segmentation methods on the Cityscapes and CamVid datasets. Similar to other lightweight semantic segmentation models, we downsample the input images of Cityscapes to a resolution of 512\(\times \)1024. For the CamVid dataset, we use the original 720\(\times \)960 resolution. In addition, the speed of our EBUNet is measured on three different GPUs: RTX 3090, RTX 2080Ti, and Titan XP.
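Since reported FPS depends heavily on the measurement protocol, we sketch a typical timing procedure below; the warm-up and run counts are illustrative choices rather than a fixed standard.

```python
import time
import torch

def measure_fps(model, input_size=(1, 3, 512, 1024), warmup=50, runs=200):
    """Measure feed-forward inference speed (FPS) on a single GPU."""
    model = model.cuda().eval()
    x = torch.randn(input_size, device='cuda')
    with torch.no_grad():
        for _ in range(warmup):      # warm up CUDA kernels and cuDNN autotuning
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()     # wait for all GPU work before stopping the timer
    return runs / (time.time() - start)
```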

Comparisons on Cityscapes

We select a number of recent semantic segmentation methods to demonstrate the effectiveness of our EBUNet, including ENet [22], ESPNet [38], CGNet [39], ContextNet [30], EDANet [40], ERFNet [37], Fast-SCNN [31], BiseNetV1 [23], ICNet [41], DABNet [13], LedNet [19], FBSNet [20], JPANet [26], MSCFNet [21], EDGENet [42], and FPANet [27]. Moreover, we also include non-real-time semantic segmentation methods, namely SegNet [43], DeepLabV2 [8], and RefineNet [7], to further demonstrate the advantages of our EBUNet.

Table 7 Comparison with state-of-the-art semantic segmentation methods on the Cityscapes test set

To provide a comprehensive comparison, we report the input size, parameters, computational complexity (FLOPs), forward inference speed (FPS), GPU platform, and accuracy (mIoU) for each model. The quantitative results are shown in Table 7. Our EBUNet achieves 73.4% mIoU at 152 FPS on a single RTX 3090 GPU. Speed evaluations of EBUNet on the Titan XP and RTX 2080Ti were also conducted and are reported in Table 7.

As shown in Table 7, our EBUNet can even outperform certain non-real-time approaches. It is worth noting that EBUNet runs at 98 FPS on an RTX 2080Ti, which is much faster than DeepLabV2 [8], while its accuracy is 3% higher. Compared with RefineNet, the proposed EBUNet achieves slightly lower accuracy (by 0.3%), but with approximately 20\(\times \) fewer parameters.

The number of parameters of our EBUNet is of the same order of magnitude as other lightweight, real-time semantic segmentation methods, yet EBUNet achieves a clear improvement in mIoU. Compared with ESNet [46], the mIoU is 2.7% higher and EBUNet has fewer parameters, making it more lightweight. Compared with MSCFNet [21], the parameters of our EBUNet increase by only 0.42 M, while the mIoU increases by 1.5%; meanwhile, the FPS of EBUNet on a Titan XP is 63, which is faster than MSCFNet. In comparison to AGLNet, our parameters increase by 0.45 M, but the mIoU improves by 2.1%; moreover, we reach 98 FPS on the same RTX 2080Ti GPU, which is 46 FPS faster than AGLNet. Compared with FPANet [27], our EBUNet achieves the same accuracy with faster speed, and the number of parameters in our EBUNet is only about 1/10 of that of FPANet.

Compared with fast semantic segmentation methods on the RTX 3090, including ContextNet [30], EDANet [45], and DABNet [13], the speed of our method decreases to some extent, but the accuracy improves significantly, by 7.3%, 6.1%, and 3.3%, respectively.

Additionally, we compare the per-class performance on the Cityscapes test set. The comparison results are shown in Table 8, from which we can see that our EBUNet achieves state-of-the-art results in 12 out of 19 semantic classes without any pre-training. In addition, EBUNet achieves significant improvements in the truck, sidewalk, and rider categories, which are 7%, 1.8%, and 1% higher than the second-best method, respectively. A visual comparison on the Cityscapes validation set is presented in Fig. 9.

Table 8 Per-class accuracy on the Cityscapes test set
Fig. 9 Visual results on Cityscapes. From the left column to the right column: input, ground truth, DABNet, CGNet, and our EBUNet

According to the discussion above, our EBUNet achieves a good balance between segmentation accuracy and running efficiency on the Cityscapes dataset.

Comparisons on CamVid

We also evaluate the proposed EBUNet on CamVid to further investigate its robustness. Table 9 reports the performance of our EBUNet and other methods (Fig. 10).

Table 9 Comparison results on the CamVid dataset with state-of-the-art works
Fig. 10 Visual results of several lightweight semantic segmentation methods on the CamVid test set. From left to right: input, ground truth, our EBUNet, DABNet [13], and CGNet [39]

Table 9 shows that our EBUNet achieves outstanding results: 72.2% mIoU at 147 FPS. A number of segmentation methods are compared on a comprehensive basis: pre-training, FPS, parameters, and accuracy (mIoU). Among these methods, the proposed EBUNet achieves the best performance in terms of speed and accuracy. Compared with AGLNet, the parameters of EBUNet increase by only 0.42 M, but the accuracy increases by 2.8%. Compared with LMFFNet, EBUNet is 27 FPS faster and 3.1% higher in mIoU.

Conclusion

In this paper, we proposed EBUNet for fast and accurate semantic segmentation. Our EBUNet consists of three main components: EBU blocks, the IPAM, and the LAD. The EBU module adopts depth-wise convolution and depth-wise dilated convolution simultaneously to acquire useful contextual information at a lower computational cost. The IPAM module refines the feature maps, improving segmentation accuracy with negligible additional parameters and cost. The LAD module recovers the spatial information to the original resolution. We designed a series of ablation studies to demonstrate the effectiveness of the EBU module and the LAD, and we made extensive comparisons with other methods on both the Cityscapes and CamVid datasets. Specifically, our EBUNet achieved 73.4% and 72.26% mIoU on these datasets at 152 and 147 FPS, respectively, on an NVIDIA RTX 3090, which is a competitive result. In conclusion, our EBUNet strikes a good trade-off between segmentation accuracy and efficiency.

However, our EBUNet still faces the following problems in practical applications. First, the memory of real-world devices is usually limited. Although we have adopted lightweight techniques to reduce computational complexity, our EBUNet still requires a considerable amount of memory, which is a burden for deployment hardware. In addition, limited storage space is another problem when deploying EBUNet in real applications. Even though depth-wise separable convolution reduces the number of parameters and the computational complexity, the model still requires about 6 MB of storage. Therefore, in future work we will explore novel architectures for semantic segmentation that achieve a better trade-off among running efficiency, segmentation accuracy, and memory consumption.