Introduction

Semantic segmentation is a dense classification task in computer vision that assigns a category to each pixel of the input image. Because it produces dense predictions, it has a wide range of real-world applications, including autonomous driving [1], virtual reality [2], and scene understanding [3].

Deep learning has made significant progress in many fields, including fault diagnosis [4], automation control [5], and semantic segmentation. Using Convolutional Neural Networks (CNNs), advanced semantic segmentation methods such as PSPNet [6], RefineNet [7], and DeepLab [8] have achieved significant gains in accuracy in recent years. However, they typically possess complex structures with hundreds of convolution layers and feature channels, which consume a large amount of computing resources and limit their wide application in the real world. Consequently, it remains a challenge to design a lightweight network that achieves real-time performance with satisfactory accuracy.

Fig. 1 Comparisons with other methods in terms of parameters and accuracy. Our EBUNet achieves a competitive result

At present, many lightweight semantic segmentation methods have been developed to achieve a good trade-off between inference speed and prediction accuracy. They can be broadly divided into two categories: (1) model compression, which eliminates unnecessary computation and reduces the amount of data to be stored by simplifying models, e.g., network pruning [9] and knowledge distillation [10, 11]; and (2) convolution decomposition, which builds shallow networks by reducing the computational cost of convolutions, e.g., depth-wise separable convolution and group convolution.

Based on the idea of convolution decomposition, MobileNet [12] adopted depth-wise convolutions to construct its backbone and achieved much faster inference than networks built on standard convolutions. DABNet [13] introduced a depth-wise asymmetric bottleneck, which achieved a well-balanced performance in terms of running speed and segmentation accuracy.

Furthermore, multi-scale feature fusion is often employed in the design of lightweight semantic segmentation networks. For example, LMFFNet [14] introduced a MAD module to combine different levels of features in one stage and generate more accurate attention maps. CFNet [15] implemented a channel attention mechanism and a cross-fusion module to enhance the fusion effect.

In this paper, we present a novel lightweight network called EBUNet, which employs an encoder–decoder architecture, to achieve real-time semantic segmentation. Our EBUNet is mainly composed of three modules: Efficient Bottleneck Unit (EBU), Image Partition Attention Mechanism (IPAM) and a Lightweight Attentional Decoder (LAD).

Using depth-wise separable convolution and dilated convolution simultaneously, we devise a novel residual-like structure named EBU, which achieves high accuracy at low computational cost. The IPAM module is designed to enhance the extracted features, and the LAD module recovers the spatial information and generates the segmentation results. We compare our method with other semantic segmentation methods in terms of parameters and mIoU; the results are shown in Fig. 1.

Our main contributions can be listed as follows:

  • A novel Efficient Bottleneck Unit (EBU) module is proposed to gather detailed and semantic information. Based on the EBU module, we have implemented two EBU blocks as the main part of our EBUNet.

  • We propose a novel Lightweight Attentional Decoder (LAD) to recover spatial information effectively. In the proposed LAD, different attention mechanisms are employed to refine different levels of feature maps.

  • We present a novel Image Partition Attention Module (IPAM) to refine the feature maps from different EBU blocks. The proposed IPAM module partitions the feature maps into sub-regions and generates an attention map based on the partitioned sub-regions.

  • Our EBUNet obtains competitive results: it achieves 73.4% and 72.2% mIoU on the Cityscapes and CamVid test sets at 152 and 147 FPS, respectively.

The structure of this paper is organized as follows: Section II introduces some previous work about residual structures, lightweight semantic methods, and feature fusion methods. Section III presents our method, including the EBU module, IPAM module, and LAD module. Section IV discusses the experiment details and results. Section V concludes the whole paper.

Related works

In this section, we will review some related works, including residual structure, lightweight semantic networks, and multi-scale feature fusion methods.

Residual structure

The residual structure has been proven to be an effective way to overcome gradient explosion and vanishing problems. It was originally proposed in ResNet [16], and many residual-like structures have been proposed for various computer vision tasks since then. For example, ShuffleNet [17, 18] developed lightweight backbone networks with depth-wise convolutions. LEDNet [19] introduced SS-nbt modules that combine factorized and dilated convolutions for feature extraction. FBSNet [20] employed BRU modules to capture rich contextual information. DABNet [13] utilized dilated convolutions in its DAB modules to enlarge the receptive field, which helps to improve fine-detail segmentation. MSCFNet [21] applied EAR modules to retrieve contextual and detailed information.

Lightweight semantic networks

In recent years, rapid-growing applications have required semantic segmentation approaches to run efficiently in real-world scenarios. The key concept of lightweight semantic segmentation is to maximize accuracy while minimizing feed-forward inference time.

A lot of attention has been paid to the design of lightweight semantic segmentation since ENet [22] was proposed. For example, BiseNet [23] introduced a dual-path structure, in which the context path extracts contextual information while the spatial path extracts spatial information. A number of other approaches have been developed based on BiseNet, including STDCNet [24] and BiseNet-v2 [25], which produce more efficient and accurate results than the original BiseNet. Many other lightweight networks have also been developed in recent years to improve efficiency and effectiveness. For example, JPANet [26] presented a joint feature pyramid module for learning multi-stage features, FPANet [27] employed a feature pyramid fusion module to fuse features from different stages, and RELAXNet [28] applied EBR and EABR modules to acquire contextual and detailed information.

Feature fusion

Feature fusion has different meanings in different fields. In signal processing, feature fusion combines time- and frequency-domain information to achieve high robustness [29]. In deep learning, feature fusion methods aim to fuse the feature maps from different stages.

There are two types of feature fusion that are commonly applied: channel-wise concatenation and element-wise addition. For example, in ContextNet [30] and Fast-SCNN [31], high-level feature maps are upsampled and then concatenated with low-level ones to achieve multi-scale feature fusion. FBSNet [20] utilized element-wise addition to combine features from different branches. Since the attention mechanism was proposed, many attentional methods have been devoted to improving multi-scale feature fusion. For example, LMFFNet [14] introduced an FFM module that employs an attention mechanism together with depth-wise separable convolutions to fuse different levels of feature maps, and ABCNet [32] utilized self-attention to fuse feature maps from different stages.

Methodology

In this section, we first examine the computational cost and parameters of the convolution operation. Then, we introduce the components of our EBUNet, including the EBU module, the IPAM module, and the LAD module. The architecture of EBUNet is discussed at the end of this section.

Computation complexity analysis

Convolutional Neural Networks (CNNs) are composed of convolutional layers and fully connected layers. In this section, we discuss the computational complexity of CNNs.

Before we start, we introduce some definitions to simplify the discussion. We define a transformation function \(\Phi \) that takes \({C_\mathrm{{in}}}\) feature maps with a spatial size of \(d \times d\) as input and outputs \({C_\mathrm{{out}}}\) feature maps of the same size, where \({C_\mathrm{{in}}}\) and \({C_\mathrm{{out}}}\) denote the numbers of input and output channels, respectively. The convolutional kernel size is \(k \times k\) and the stride is set to 1. We use square feature maps and convolutional kernels to simplify the discussion, and we omit the bias and batch normalization terms that are often used in modern CNNs.

In this case, the number of parameters in the convolution is \(k \times k \times {C_\mathrm{{in}}} \times {C_\mathrm{{out}}}\) and the computational complexity in terms of FLOPs is \(k \times k \times {C_\mathrm{{in}}} \times {C_\mathrm{{out}}} \times d \times d\).

Based on the above, reducing the multiplicative factor between \(k \times k \) and \({C_\mathrm{{in}}} \times {C_\mathrm{{out}}}\) is an effective way to cut down the size and computational burden of convolutions. Depth-wise separable convolution applies this idea to build compact models.

Compared to standard convolution, depth-wise separable convolution applies a single convolutional kernel independently to each input feature map, generating the same number of output channels, followed by a \(1\times 1\) convolution layer that merges the information across all output channels. In other words, it decomposes a standard convolution into a depth-wise convolution and a point-wise convolution. With depth-wise separable convolution, the number of parameters becomes:

$$\begin{aligned} k \times k \times {C_\mathrm{{in}}} + {C_\mathrm{{in}}} \times {C_\mathrm{{out}}} \end{aligned}$$
(1)

and the computation complexity becomes:

$$\begin{aligned} k \times k \times {C_\mathrm{{in}}} \times d \times d + {C_\mathrm{{in}}} \times {C_\mathrm{{out}}} \times d \times d \end{aligned}$$
(2)

Based on the above equations, both the number of parameters and the amount of computation are reduced by depth-wise separable convolution.
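For illustration, the following minimal PyTorch sketch compares the parameter counts of a standard \(3\times 3\) convolution and its depth-wise separable counterpart; the channel sizes are illustrative choices rather than the configuration used in EBUNet.

```python
import torch.nn as nn

c_in, c_out, k = 64, 64, 3  # illustrative channel and kernel sizes

# Standard convolution: k*k*C_in*C_out parameters (bias omitted, matching the analysis above)
standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# Depth-wise separable convolution: depth-wise 3x3 followed by point-wise 1x1, see Eq. (1)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # k*k*C_in
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # C_in*C_out
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 3*3*64*64 = 36864
print(count(separable))  # 3*3*64 + 64*64 = 4672
```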

Efficient bottleneck unit

The EBU module is designed to extract semantic information more efficiently and effectively. Previous residual-like designs, including the bottleneck [22], SS-nbt [19], and EAR [21] modules, have proven effective in lightweight semantic segmentation networks.

Fig. 2 Comparison of several residual-like structures: a bottleneck [22], b SS-nbt [19], c BRU [20], d our EBU. D denotes dilated convolution, DW denotes depth-wise convolution

As shown in Fig. 2d, we employ a \(3 \times 3\) standard convolution to generate features and reduce the channels by half at the beginning of each EBU module. The output of the convolutional operation is then split into two branches, where each branch has 1/4 channels of the original input.

A convolutional kernel of 3\(\times \)3 is used in the EBU module to preserve adequate spatial information for accurate segmentation. In order to improve computation efficiency, a \(3\times 3 \) depth-wise convolution is employed in the left branch to acquire local information.

The right branch is developed to obtain adequate contextual information. For the purpose of enlarging the receptive field, we use a depth-wise dilated convolution without adding any additional parameters in the right branch.

To share information between the two branches, we introduce a feature interaction operation implemented as element-wise addition between them, so that the two branches can complement each other.

At the end of the EBU module, another \(3\times 3\) regular convolution is employed to integrate the multi-scale features and restore the number of channels to that of the input. The whole procedure can be expressed as follows:

$$\begin{aligned} F= & {} {f_{3 \times 3}}(X) \end{aligned}$$
(3)
$$\begin{aligned} {F_1},{F_2}= & {} \text {Split}(F) \end{aligned}$$
(4)
$$\begin{aligned} {F_1}= & {} f_{3 \times 3}^{\text {DW}}({F_1}) \end{aligned}$$
(5)
$$\begin{aligned} {F_2}= & {} f_{3 \times 3}^{\text {DDW}}({F_2}) \end{aligned}$$
(6)
$$\begin{aligned} {F_2}= & {} {F_2} + {F_1}\nonumber \\ {F_1}= & {} {F_1} + {F_2} \end{aligned}$$
(7)
$$\begin{aligned} {F_\mathrm{{out}}}= & {} \text {Concat}({F_1},{F_2}) \end{aligned}$$
(8)
$$\begin{aligned} {F_\mathrm{{out}}}= & {} X + {F_\mathrm{{out}}}, \end{aligned}$$
(9)

where X denotes the input feature maps, \({F_1}\) and \({F_2}\) are the results of the splitting operation, \({f_{3 \times 3}}\) represents a standard \(3 \times 3\) convolution, and \(f_{3 \times 3}^{\text {DW}}\) and \(f_{3 \times 3}^{\text {DDW}}\) stand for depth-wise convolution and depth-wise dilated convolution, respectively. Concat denotes feature concatenation along the channel dimension.
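A simplified PyTorch sketch of the EBU computation in Eqs. (3)–(9) is given below; normalization and activation layers are omitted for brevity, and the feature interaction in Eq. (7) is applied in the order written.

```python
import torch
import torch.nn as nn

class EBU(nn.Module):
    """Simplified sketch of the Efficient Bottleneck Unit (channels must be divisible by 4)."""

    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        self.reduce = nn.Conv2d(channels, half, 3, padding=1, bias=False)      # Eq. (3)
        self.dw = nn.Conv2d(quarter, quarter, 3, padding=1,
                            groups=quarter, bias=False)                         # Eq. (5)
        self.ddw = nn.Conv2d(quarter, quarter, 3, padding=dilation,
                             dilation=dilation, groups=quarter, bias=False)     # Eq. (6)
        self.restore = nn.Conv2d(half, channels, 3, padding=1, bias=False)      # final 3x3 conv

    def forward(self, x):
        f = self.reduce(x)                        # halve the channels
        f1, f2 = torch.chunk(f, 2, dim=1)         # Eq. (4): each branch has 1/4 of the input channels
        f1 = self.dw(f1)                          # local information
        f2 = self.ddw(f2)                         # enlarged receptive field
        f2 = f2 + f1                              # Eq. (7): feature interaction between branches
        f1 = f1 + f2
        out = self.restore(torch.cat([f1, f2], dim=1))  # Eq. (8) + channel restoration
        return x + out                            # Eq. (9): residual connection
```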

Fig. 3 Illustration of the proposed IPAM module

Image partition attention module

The attention mechanism has been widely used in segmentation methods such as BiseNet [23], MSCFNet [21], and DFANet [33]. In this paper, we introduce an Image Partition Attention Module (IPAM).

As Fig. 3 shows, the input features are partitioned into four regions through an average pooling operation. Then, global average pooling operations are applied to each partitioned sub-region in parallel.

Fig. 4 Illustration of our LAD module

Each partitioned sub-region \({S_i}\) is subjected to global average pooling in parallel. For a feature map X with spatial size \(H \times W\), global average pooling is computed as follows:

$$\begin{aligned} {F_\mathrm{{avg}}} = \frac{1}{{H \times W}}\sum \limits _{i = 1}^H {\sum \limits _{j = 1}^W {X(i,j)} }. \end{aligned}$$
(10)

Additionally, global average pooling is applied to the original input feature to acquire global information. Element-wise addition is then used to fuse the sub-region pooled features and the globally pooled feature. This procedure is computed as follows:

$$\begin{aligned} T = \sum \limits _{i = 1}^{4} {{S_i}} + {F_\mathrm{{global}}}, \end{aligned}$$
(11)

where \({{S_i}}\) represents the pooled result of the i-th sub-region and \({F_\mathrm{{global}}}\) denotes the pooled result of the original input.

The result of the addition is then fed into a projection layer built with \(1\times 1\) convolutions to generate the attention weight vector w.

To be specific, the addition result is first compressed along the channel dimension, and the ReLU function is applied to introduce non-linearity. A channel-expansion layer then restores the number of channels to that of the original input, and a sigmoid function produces the attention weight vector w. The projection layer can be expressed as follows:

$$\begin{aligned} w = \sigma ({\pi _2}({\mathop {\textrm{ReLU}}\nolimits } ({\pi _1}(T)))), \end{aligned}$$
(12)

where \({\pi _1}\) and \({\pi _2}\) represent the channel reduction and expansion functions, implemented by two regular \(1\times 1\) convolutions, and \({\mathop {\textrm{ReLU}}\nolimits } \) denotes the Rectified Linear Unit.

At the end of the IPAM module, we can obtain the final output \({F_\mathrm{{out}}}\) as follows:

$$\begin{aligned} {F_\mathrm{{out}}} = {F_\mathrm{{in}}} \otimes w, \end{aligned}$$
(13)

where \({F_\mathrm{{in}}}\) and \(F_\mathrm{{out}}\) represent the input and output, respectively, and w denotes the attention weights generated by the IPAM module.
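A simplified sketch of the IPAM computation is shown below; the \(2\times 2\) spatial partition and the channel-reduction ratio of the projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IPAM(nn.Module):
    """Simplified sketch of the Image Partition Attention Module."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.partition_pool = nn.AdaptiveAvgPool2d(2)  # averages each of the four sub-regions
        self.global_pool = nn.AdaptiveAvgPool2d(1)     # global average pooling of the input
        self.pi1 = nn.Conv2d(channels, channels // reduction, 1)  # channel reduction (pi_1)
        self.pi2 = nn.Conv2d(channels // reduction, channels, 1)  # channel expansion (pi_2)

    def forward(self, x):
        # Eq. (11): sum of the four sub-region descriptors plus the global descriptor
        sub = self.partition_pool(x)                              # B x C x 2 x 2
        t = sub.sum(dim=(2, 3), keepdim=True) + self.global_pool(x)
        # Eq. (12): projection layer producing the attention weight vector w
        w = torch.sigmoid(self.pi2(torch.relu(self.pi1(t))))
        return x * w                                              # Eq. (13)
```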

Lightweight attentional decoder

There are different roles assigned to encoders and decoders in encoder–decoder segmentation frameworks. The encoder is responsible for producing dense feature maps, whereas the decoder is responsible for upsampling the resolution of feature maps to match the original input size. It is possible to improve the accuracy of prediction with the use of well-designed decoders.

In our paper, we present a novel lightweight attentional decoder (LAD). It consists of two blocks and can fuse different-level features effectively. A channel attention module is proposed for the refinement of high-level feature maps, while a spatial attention module is proposed for the refinement of low-level ones. The structure of our proposed LAD is shown in Fig. 4.

We present a spatial attention module (SAM) to make the low-level features focus on informative regions. Let \({X_L}\) denote the input low-level feature maps, \({f_\mathrm{{conv}}}\) the regular convolution operation, and \({f_\mathrm{{mean}}}\) and \({f_\mathrm{{max}}}\) the mean and maximum operations along the channel dimension, respectively. The spatial attention map S is computed as follows:

$$\begin{aligned} S = \sigma ({f_\mathrm{{conv}}}(\text {Concat}({f_{\text {mean}}}({X_L}),{f_{\text {max}}}({X_L})))), \end{aligned}$$
(14)

where \(\sigma (\cdot )\) represents the sigmoid function. After this transformation, the channel dimension is reduced from C to 1, so the attention map S has shape \(1 \times H \times W\) while the low-level features keep shape \(C \times H \times W\). Finally, we element-wise multiply the input low-level features \({X_L}\) by the spatial attention map S to obtain the refined features \(X_L^S\):

$$\begin{aligned} X_L^S = {X_L} \otimes S, \end{aligned}$$
(15)

where, \( \otimes \) denotes the element-wise multiplication.

Our channel attention module (CAM) uses global average pooling to obtain global contextual information and generates a channel attention map to refine the high-level features. Let \({X_H}\) denote the input high-level feature maps and \({X_H}(i,j)\) its value at pixel location (i, j). The global average pooling can be expressed as follows:

$$\begin{aligned} {F_{\text {avg}}} = \frac{1}{{H \times W}}\sum \limits _{i = 1}^H {\sum \limits _{j = 1}^W {{X_H}(i,j)} }. \end{aligned}$$
(16)

Consequently, the \(C \times H \times W\) high-level features are squeezed into a \(C \times 1 \times 1\) descriptor. Following that, \({F_{\text {avg}}}\) is fed into a convolution layer and then passed through a sigmoid to generate the channel attention map C:

$$\begin{aligned} C = \sigma (f_{\text {conv}}^{1 \times 1}({F_{\text {avg}}})). \end{aligned}$$
(17)

The final weighted high-level features are acquired by multiplying the feature map by the attention map:

$$\begin{aligned} X_H^C = {X_H} \otimes C. \end{aligned}$$
(18)

The spatial attention map produced from the low-level features indicates the importance of each pixel and focuses on locating objects and refining their shapes and boundaries with spatial details. In contrast, the squeezed channel attention map generated from the upsampled high-level features focuses on the global context and provides contextual information.

After that, the refined low-level and high-level features are concatenated along the channel dimension. Finally, another upsampling operation restores the feature map to the original input size.
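A simplified sketch of the LAD is given below; the kernel size of the spatial-attention convolution, the \(1\times 1\) classifier, and the final upsampling factor are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LAD(nn.Module):
    """Simplified sketch of the Lightweight Attentional Decoder."""

    def __init__(self, low_channels: int, high_channels: int, num_classes: int):
        super().__init__()
        self.sam_conv = nn.Conv2d(2, 1, 7, padding=3)               # spatial attention, Eq. (14)
        self.cam_conv = nn.Conv2d(high_channels, high_channels, 1)  # channel attention, Eq. (17)
        self.classifier = nn.Conv2d(low_channels + high_channels, num_classes, 1)

    def forward(self, x_low, x_high):
        # SAM on low-level features: channel-wise mean and max, Eqs. (14)-(15)
        s = torch.cat([x_low.mean(dim=1, keepdim=True),
                       x_low.max(dim=1, keepdim=True).values], dim=1)
        x_low = x_low * torch.sigmoid(self.sam_conv(s))
        # CAM on upsampled high-level features: global pooling, Eqs. (16)-(18)
        x_high = F.interpolate(x_high, size=x_low.shape[2:], mode='bilinear', align_corners=False)
        c = torch.sigmoid(self.cam_conv(F.adaptive_avg_pool2d(x_high, 1)))
        x_high = x_high * c
        # Fuse the refined features and restore the input resolution
        out = self.classifier(torch.cat([x_low, x_high], dim=1))
        return F.interpolate(out, scale_factor=4, mode='bilinear', align_corners=False)
```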

Fig. 5 Overall architecture of our EBUNet

Table 1 Overall architecture of EBUNet

Architecture design of EBUNet

The overall architecture of the proposed EBUNet is shown in Fig. 5 and is listed in Table 1.

An Initial Unit is employed at the beginning of EBUNet to adjust the resolution of the input images and eliminate redundant information. It is composed of three consecutive standard convolutions. Specifically, the first convolution reduces the image resolution by half and adjusts the number of feature channels to 32. Two subsequent 3 \(\times \) 3 convolutions are then used to obtain richer contextual information.

In addition, a downsampling operation is used to enlarge the receptive field. It is composed of two parallel branches: a standard 3 \(\times \) 3 convolution with a stride of 2 and a 2 \(\times \) 2 max-pooling operation. The outputs of the two branches are then concatenated along the channel dimension.
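A minimal sketch of this downsampling unit is shown below; assigning the remaining output channels to the convolution branch is an assumption for illustration, and the input spatial size is assumed to be even.

```python
import torch
import torch.nn as nn

class DownsamplingUnit(nn.Module):
    """Sketch of the two-branch downsampling operation (strided conv + max pooling)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Convolution branch produces the channels not covered by the pooling branch
        self.conv = nn.Conv2d(in_channels, out_channels - in_channels,
                              kernel_size=3, stride=2, padding=1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        # Both branches halve the spatial resolution; outputs are concatenated channel-wise
        return torch.cat([self.conv(x), self.pool(x)], dim=1)
```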

After that, the downsampled output of the Initial Unit is fed into the first EBU block for dense feature extraction. The first EBU block contains three EBU modules with a dilation rate of 2. The input feature map of the second EBU block is 1/8 of the input resolution; this block contains 10 consecutive EBU modules with gradually increasing dilation rates {2, 2, 4, 4, 6, 6, 8, 8, 16, 16}. The IPAM refines the features from EBU block 1 and EBU block 2. Finally, in the decoder phase, the LAD applies different attention mechanisms to the different levels of feature maps and produces more accurate outputs.

Experiments

In this section, we first provide brief information about the Cityscapes [34] and CamVid [35] datasets. We then introduce the training protocols for our experiments. Subsequently, ablation studies on several components of our EBUNet are discussed. At the end of this section, we evaluate the performance of our method in terms of prediction accuracy and running efficiency.

Datasets

We utilize Cityscapes and CamVid datasets in our training and testing experiments.

The Cityscapes dataset is a well-known dataset for semantic segmentation of urban scenes. It contains 5000 finely annotated images: 2975 for training, 500 for validation, and 1525 for testing. The original image resolution of Cityscapes is 2048\(\times \)1024. For fair comparisons, we use the full resolution for evaluation in the validation and testing phases, while in the training phase the resolution is resized to 512\(\times \)1024.

The CamVid dataset, derived from car-view videos, is another well-known urban scene dataset. The CamVid dataset consists of 701 images total: 367 images for the training phase, 101 images for the validation phase, and 233 images for the testing phase. The original resolution of CamVid dataset images is 720\(\times \)960.

Training protocols

All the experiments are performed with one NVIDIA RTX 3090 GPU, CUDA 11.6, and cuDNN v8 on the PyTorch platform, running Ubuntu 20.04 with 32 GB of memory.

We employ mini-batch Stochastic Gradient Descent (SGD) [36] as our optimization strategy for the Cityscapes dataset, where we set the batch size to 8, the weight decay to \(1 \times {10^{ - 4}}\), the momentum to 0.9, and the initial learning rate to \(4.5 \times {10^{ - 2}}\).

We train our EBUNet with the Adam optimizer for the experiments on the CamVid dataset, where the initial learning rate is set to \(1 \times {10^{ - 3}}\) and the weight decay to \(2 \times {10^{ - 4}}\).

Besides, a polynomial policy is employed to adjust the learning rate during training. It is expressed by the following formula:

$$\begin{aligned} \text {l}{\text {r}_{\text {cur}}} = \text {l}{\text {r}_{\text {init}}}*{\left( 1 - \frac{{\text {cur}\_\text {epoch}}}{{\max \_\text {epoch}}}\right) ^{0.9}}, \end{aligned}$$
(19)

where \(\text {l}{\text {r}_{\text {cur}}}\) represents the learning rate in the current epoch, \({\text {cur}\_\text {epoch}}\) is the current epoch, and \(\max \_\text {epoch}\) is the total number of epochs.
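For reference, Eq. (19) maps directly onto a LambdaLR scheduler; the sketch below uses a stand-in module in place of EBUNet and mirrors the Cityscapes SGD configuration described above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)  # stand-in module in place of EBUNet

# SGD settings used for Cityscapes (data loading and batching omitted)
optimizer = torch.optim.SGD(model.parameters(), lr=4.5e-2,
                            momentum=0.9, weight_decay=1e-4)

max_epoch = 1000
# LambdaLR multiplies the initial learning rate by the returned factor, i.e. Eq. (19)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda cur_epoch: (1 - cur_epoch / max_epoch) ** 0.9)

for epoch in range(max_epoch):
    # ... one training epoch over the dataset would go here ...
    scheduler.step()
```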

The \(\max \_\text {epoch}\) was set to 1000 during training for both the Cityscapes and CamVid datasets. During the training phase, data augmentation techniques such as random scaling, mean subtraction, and horizontal flipping are also applied. The training samples are randomly scaled by factors of 0.75, 1.0, 1.25, 1.5, 1.75, and 2.0. For the Cityscapes dataset, the training images and labels are randomly cropped from 2048\(\times \)1024 to 512\(\times \)1024.

Ablation studies

In this part, we design a series of ablation experiments to validate the effectiveness of the proposed components of our EBUNet. We conduct ablation studies on the EBU module and the LAD module, and we additionally investigate the influence of the depth of the EBU blocks. All ablation experiments are performed on the CamVid dataset.

Ablation on EBU module

The main part of our EBUNet is constructed with the EBU module. We devise two ablation strategies to verify its effectiveness. First, we design a series of experiments to investigate the influence of different dilation rates. Second, we compare our EBU module with other residual structures, including those of DABNet [13] and ERFNet [37]. The results can be seen in Tables 2 and 3.

Table 2 Experimental results for different dilation rates

To study the effect of dilation rates, we devised five sequences with varying dilation rates and compared them with the baseline. From Table 2, we can see that when all dilation rates in the EBU modules are set to 2, the accuracy is 1.8% lower than the baseline. When a larger dilation-rate sequence is used, the accuracy is 1.6% higher than R = 2 but still 0.2% lower than the baseline.

Additionally, we designed experiments to test network performance with excessive dilation rates (32 and 48). As shown in Table 2, when the dilation rate was set to 32, the mIoU decreased to 69.8% and the FPS dropped from 147 to 140. Both accuracy and speed also decreased when all dilation rates were set to 48. We conclude that overly large dilation rates cause heavy computational cost; furthermore, dilated convolutions operate on mutually independent subsets of pixels, which loses local information.

Table 3 Experiment results on residual structures

From Table 3, we observe that when the EBU modules are substituted with non-bottleneck modules, the forward inference speed is higher than with EBU modules, but the accuracy of our EBUNet is 1% higher. As a result, our EBU module strikes a good balance between accuracy and efficiency. A visual comparison was also conducted; the results can be seen in Fig. 6.

Fig. 6 Visual results of the ablation study on the EBU module. From the left column to the right column: input, ground truth, baseline, EBUNet with DAB module, and EBUNet with non-bottleneck

Table 4 Experiment results on LAD
Table 5 Experiment results on the attention mechanism used in LAD

Ablation on LAD module

The LAD is used to recover the spatial information to the original input resolution. Its ablation design is based on two strategies. First, we compare our LAD with DABNet's decoder. Then we discuss how our LAD is affected by the attention mechanisms; specifically, we perform ablation studies on the different attention mechanisms within LAD. The results are presented in Tables 4 and 5.

Fig. 7 Visual results of the ablation study on the LAD module. From the left column to the right column: input, ground truth, baseline, and the decoder in DABNet

Fig. 8 Visual comparisons of LAD. From left to right: a input, b ground truth, c LAD baseline, d only CAM used in LAD, e only SAM used in LAD, f no attention used in LAD

As shown in Table 4, the accuracy of LAD increases by 1.33% compared with the DABNet decoder, while the parameters increase by only 0.01 M, a negligible cost.

A visual comparison of the ablation results for the LAD module is also performed. The visual results can be seen in Figs. 7 and 8. The difference is highlighted by the yellow dashed line.

From Table 5, LAD achieves the highest mIoU when both SAM and CAM are used to refine the different-level feature maps. The mIoU of LAD is 0.5% lower when only CAM is used to refine high-level features, and 0.3% lower when only SAM is used to refine low-level features. Removing both SAM and CAM from the LAD leads to a 1% reduction in accuracy. We therefore conclude that the attention mechanisms effectively improve segmentation accuracy while consuming negligible computational resources.

Ablation on the depth of EBU Block

Two parameters, M and N, indicate the numbers of EBU modules contained in EBU block 1 and EBU block 2, respectively. To investigate the model performance in terms of segmentation accuracy (mIoU) and feed-forward speed (FPS), we devised a series of experiments using different values of M and N. The experimental settings and results are listed in Table 6.

According to Table 6, accuracy tends to improve as the depth of the EBU blocks increases. However, the improvement is only slight, and when the EBU blocks are deepened further, performance drops.

Table 6 The ablation results on the influence of the depth

In general, increasing the depth of the network initially improves performance to a certain degree, with a moderate increase in computational cost. However, making the network even deeper causes both accuracy and efficiency to fall.

Comparisons with other works

In this subsection, we compare the performance of our EBUNet with other state-of-the-art semantic segmentation methods on the Cityscapes and CamVid datasets. Similar to other lightweight semantic segmentation models, we downsample the input images of Cityscapes to a resolution of 512\(\times \)1024. For the CamVid dataset, we use the original 720\(\times \)960 resolution. In addition, the speed of our EBUNet is measured on three different GPUs: RTX 3090, RTX 2080Ti, and Titan XP.
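Since reported FPS depends heavily on the measurement protocol, we sketch a typical timing procedure below; the warm-up and run counts are illustrative choices rather than a fixed standard.

```python
import time
import torch

def measure_fps(model, input_size=(1, 3, 512, 1024), warmup=50, runs=200):
    """Measure feed-forward inference speed (FPS) on a single GPU."""
    model = model.cuda().eval()
    x = torch.randn(input_size, device='cuda')
    with torch.no_grad():
        for _ in range(warmup):      # warm up CUDA kernels and cuDNN autotuning
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()     # wait for all GPU work before stopping the timer
    return runs / (time.time() - start)
```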

Comparisons on Cityscapes

We select a number of recent semantic segmentation methods to demonstrate the effectiveness of our EBUNet, including ENet [22], ESPNet [38], CGNet [39], ContextNet [30], EDANet [40], ERFNet [37], Fast-SCNN [31], BiseNetV1 [23], ICNet [41], DABNet [13], LedNet [19], FBSNet [20], JPANet [26], MSCFNet [21], EDGENet [42], and FPANet [27]. Moreover, we also include non-real-time semantic segmentation methods, namely SegNet [43], DeepLabV2 [8], and RefineNet [7], to further demonstrate the advantages of our EBUNet.

Table 7 Comparison with state-of-the-art semantic segmentation methods on the Cityscapes test set

To provide a comprehensive comparison, we report the input size, parameters, computational complexity (FLOPs), forward inference speed (FPS), GPU platform, and accuracy (mIoU) for each model. The quantitative results are shown in Table 7. Our EBUNet achieves 73.4% mIoU at 152 FPS on a single RTX 3090 GPU. Speed evaluations of EBUNet on the Titan XP and RTX 2080Ti were also conducted and are reported in Table 7.

As shown in Table 7, our EBUNet can even outperform certain non-real-time approaches. It is worth noting that EBUNet runs at 98 FPS on an RTX 2080Ti, which is much faster than DeepLabV2 [8], while its accuracy is 3% higher. Compared with RefineNet, the proposed EBUNet achieves slightly lower accuracy (by 0.3%), but with approximately 20\(\times \) fewer parameters.

The number of parameters of our EBUNet is of the same order of magnitude as other lightweight, real-time semantic segmentation methods, yet EBUNet achieves a clear improvement in mIoU. Compared with ESNet [46], the mIoU is 2.7% higher and EBUNet has fewer parameters, making it more lightweight. Compared with MSCFNet [21], the parameters of our EBUNet increase by only 0.42 M, while the mIoU increases by 1.5%; meanwhile, the FPS of EBUNet on a Titan XP is 63, which is faster than MSCFNet. In comparison to AGLNet, our parameters increase by 0.45 M, but the mIoU improves by 2.1%; moreover, we reach 98 FPS on the same RTX 2080Ti GPU, which is 46 FPS faster than AGLNet. Compared with FPANet [27], our EBUNet achieves the same accuracy with faster speed, and the number of parameters in our EBUNet is only about 1/10 of that of FPANet.

Compared with fast semantic segmentation methods on the RTX 3090, including ContextNet [30], EDANet [45], and DABNet [13], the speed of our method decreases to some extent, but the accuracy improves significantly, by 7.3%, 6.1%, and 3.3%, respectively.

Additionally, we compare the per-class performance on the Cityscapes test set. The comparison results are shown in Table 8, from which we can see that our EBUNet achieves state-of-the-art results in 12 out of 19 semantic classes without any pre-training. In addition, EBUNet achieves significant improvements in the truck, sidewalk, and rider categories, which are 7%, 1.8%, and 1% higher than the second-best method, respectively. A visual comparison on the Cityscapes validation set is presented in Fig. 9.

Table 8 Per-class accuracy on the Cityscapes test set
Fig. 9 Visual results on Cityscapes. From the left column to the right column: input, ground truth, DABNet, CGNet, and our EBUNet

According to the discussion above, our EBUNet achieves a good balance between segmentation accuracy and running efficiency on the Cityscapes dataset.

Comparisons on CamVid

We also evaluate the proposed EBUNet on CamVid to further investigate its robustness. Table 9 reports the performance of our EBUNet and other methods (Fig. 10).

Table 9 Comparison results on the CamVid dataset with state-of-the-art works
Fig. 10 Visual results of several lightweight semantic segmentation methods on the CamVid test set. From left to right: input, ground truth, our EBUNet, DABNet [13], and CGNet [39]

Table 9 shows that our EBUNet achieves outstanding results: 72.2% mIoU at 147 FPS. A number of segmentation methods are compared on a comprehensive basis: pre-training, FPS, parameters, and accuracy (mIoU). Among these methods, the proposed EBUNet achieves the best performance in terms of speed and accuracy. Compared with AGLNet, the parameters of EBUNet increase by only 0.42 M, but the accuracy increases by 2.8%. Compared with LMFFNet, EBUNet is 27 FPS faster and 3.1% higher in mIoU.

Conclusion

In this paper, we proposed EBUNet for fast and accurate semantic segmentation. Our EBUNet consists of three main components: EBU blocks, the IPAM, and the LAD. The EBU module adopts depth-wise convolution and depth-wise dilated convolution simultaneously to acquire useful contextual information at a lower computational cost. The IPAM module refines the feature maps, improving segmentation accuracy with negligible additional parameters and cost. The LAD module recovers the spatial information to the original resolution. We designed a series of ablation studies to demonstrate the effectiveness of the EBU module and the LAD, and we made extensive comparisons with other methods on both the Cityscapes and CamVid datasets. Specifically, our EBUNet achieved 73.4% and 72.26% mIoU on these datasets at 152 and 147 FPS, respectively, on an NVIDIA RTX 3090, which is a competitive result. In conclusion, our EBUNet strikes a good trade-off between segmentation accuracy and efficiency.

However, our EBUNet still faces the following problems in practical applications. First, the memory of real-world devices is usually limited. Although we have adopted lightweight techniques to reduce computational complexity, our EBUNet still requires a considerable amount of memory, which is a burden for deployment hardware. In addition, limited storage space is another problem when deploying EBUNet in real applications. Even though depth-wise separable convolution reduces the number of parameters and the computational complexity, the model still requires about 6 MB of storage. Therefore, in future work we will explore novel architectures for semantic segmentation that achieve a better trade-off among running efficiency, segmentation accuracy, and memory consumption.