EBUNet: a fast and accurate semantic segmentation network with lightweight efficient bottleneck unit

Achieving a suitable balance between effectiveness and efficiency in lightweight semantic segmentation networks remains difficult. This work presents EBUNet, an efficient and reliable semantic segmentation method that aims at a favorable trade-off between inference speed and prediction accuracy. First, we develop an Efficient Bottleneck Unit (EBU) that employs depth-wise convolution and depth-wise dilated convolution to obtain adequate features at a moderate computation cost. Second, we develop a novel Image Partition Attention Module (IPAM), which divides feature maps into subregions and generates attention weights based on them. Third, we develop a novel lightweight attention decoder to recover spatial information effectively. Extensive experiments show that EBUNet achieves 73.4% mIoU at 152 FPS on the Cityscapes dataset and 72.2% mIoU at 147 FPS on the CamVid dataset with only 1.57 M parameters. These results confirm that the proposed model achieves a good trade-off among accuracy, inference speed, and model size. The source code of EBUNet is available at https://github.com/Skybird1101/EBUNet.


Introduction
Semantic segmentation assigns a category to each pixel in the input image and is therefore a dense classification task in computer vision. Because of its dense predictions, it has a wide range of real-world applications, including autonomous driving [1], virtual reality [2], and scene understanding [3].
Deep learning has made significant progress in many fields, including fault diagnosis [4], automation control [5], and semantic segmentation. Using Convolutional Neural Networks (CNNs), advanced semantic segmentation methods such as PSPNet [6], RefineNet [7], and DeepLab [8] have achieved significant progress in accuracy in recent years. However, they typically possess a complex structure with hundreds of convolution layers and feature channels, which consumes a large amount of computing resources and limits their wide application in the real world. Consequently, it remains a challenge to design a lightweight network that achieves both real-time performance and satisfactory accuracy.
At present, many lightweight semantic segmentation methods have been developed to achieve a good trade-off between inference speed and prediction accuracy. They can be broadly divided into two categories:
1. Model compression: eliminate unnecessary calculations and reduce the amount of data that needs to be stored by simplifying models, for example through network pruning [9] and knowledge distillation [10,11].
2. Convolution decomposition: build shallow networks by reducing the computational cost of convolutions, for example through depth-wise separable convolution and group convolution.
Based on the idea of convolution decomposition, MobileNet [12] adopted depth-wise convolution to construct the backbone and achieved a faster running speed than networks built on traditional convolution. DABNet [13] introduced a depth-wise asymmetric bottleneck, which achieved a well-balanced performance in terms of running speed and segmentation accuracy.
Furthermore, multi-scale feature fusion is often employed in the design of lightweight semantic segmentation networks. For example, a MAD module was introduced in LMFFNet [14] to combine features from different levels into one stage and generate more accurate attention maps. CFNet [15] implemented a channel attention mechanism and a cross-fusion module to enhance the fusion effect.
In this paper, we present a novel lightweight network called EBUNet, which employs an encoder-decoder architecture to achieve real-time semantic segmentation. EBUNet is mainly composed of three modules: the Efficient Bottleneck Unit (EBU), the Image Partition Attention Module (IPAM), and the Lightweight Attentional Decoder (LAD).
Using depth-wise convolution and dilated convolution simultaneously, we devise a novel residual-like structure named EBU, which achieves high accuracy with low computation costs. The IPAM module is designed to enhance the feature representation. The LAD module is presented to recover the spatial information and generate the segmentation results. We compare our method with other semantic segmentation methods in terms of parameters and mIoU. The results can be seen in Fig. 1.
Our main contributions correspond to the three modules above: the EBU, the IPAM, and the LAD.
The rest of this paper is organized as follows: Section II reviews previous work on residual structures, lightweight semantic segmentation methods, and feature fusion methods. Section III presents our method, including the EBU module, IPAM module, and LAD module. Section IV discusses the experiment details and results. Section V concludes the paper.

Related works
In this section, we will review some related works, including residual structure, lightweight semantic networks, and multiscale feature fusion methods.

Residual structure
The residual structure has been proven to be an effective way to overcome the gradient explosion and vanishing problems. It was originally proposed in ResNet [16], and many residual-like structures have been proposed for various computer vision tasks since then. For example, ShuffleNet [17,18] developed lightweight backbone networks with depth-wise convolutions. LEDNet [19] introduced SS-nbt modules that combine factorized and dilated convolution for feature extraction. FBSNet [20] employed BRU modules to capture rich contextual information. DABNet [13] utilized dilated convolution in DAB modules to enlarge the receptive field, which helped to improve the segmentation of details. MSCFNet [21] applied EAR modules to retrieve contextual and detailed information.

Lightweight semantic networks
In recent years, rapidly growing applications have required semantic segmentation approaches to run efficiently in real-world scenarios. The key concept of lightweight semantic segmentation is to maximize accuracy while minimizing feed-forward inference time.
A lot of attention has been paid to the design of lightweight semantic segmentation networks since ENet [22] was proposed. For example, BiSeNet [23] introduced a dual-path structure: the context path extracts contextual information, whereas the spatial path extracts spatial information. A number of other approaches have been developed based on BiSeNet, including STDCNet [24] and BiSeNet-v2 [25], which produce more efficient and accurate results than the original BiSeNet. Many other lightweight networks have been developed in recent years to improve efficiency and effectiveness. For example, JPANet [26] presented a joint feature pyramid module for learning multi-stage features. FPANet [27] employed a feature pyramid fusion module to fuse features from different stages. RELAXNet [28] applied EBR and EABR modules to acquire contextual and detailed information.

Feature fusion
Feature fusion has different meanings in different fields. In signal processing, feature fusion is used to achieve high robustness by combining time and frequency information [29]. In deep learning, feature fusion methods aim to fuse the feature maps from different stages.
Two types of feature fusion are commonly applied: channel-wise concatenation and element-wise addition. For example, in ContextNet [30] and Fast-SCNN [31], high-level feature maps are upsampled and then concatenated with low-level ones to achieve multi-scale feature fusion. FBSNet [20] utilized element-wise addition to combine features from its different branches. Since the attention mechanism was proposed, many attention-based methods have been devoted to improving multi-scale feature fusion. For example, LMFFNet [14] introduced an FFM module for fusing different levels of feature maps, which employs an attention mechanism as well as depth-wise separable convolutions, and ABCNet [32] utilized self-attention to fuse the feature maps from different stages.

Methodology
In this section, we first examine the computation cost and parameter count of the convolution operation. Then, we introduce the components of our EBUNet, including the EBU module, IPAM module, and LAD module. The architecture design of EBUNet is discussed at the end of this section.

Computation complexity analysis
Convolutional Neural Networks (CNNs) are composed of convolutional layers and fully connected layers. In this section, we discuss the computation complexity of CNNs.
Before starting, we introduce some definitions to simplify the discussion. Consider a transformation function that takes $C_{in}$ feature maps with a spatial size of $d \times d$ as input and outputs $C_{out}$ feature maps of the same size, where $C_{in}$ and $C_{out}$ are the numbers of input and output channels, respectively. The convolutional kernel size is $k \times k$ and the stride is set to 1. We use square feature maps and convolutional kernels for simplicity, and we omit the bias and batch normalization terms, which are often used in modern CNNs.
In this case, the number of parameters in the convolution is
$$k \times k \times C_{in} \times C_{out}$$
and the computational complexity in terms of FLOPs is
$$k \times k \times C_{in} \times C_{out} \times d \times d.$$
Both quantities are dominated by the product of $k \times k$ and $C_{in} \times C_{out}$, so breaking up this multiplication is an effective way to cut down the size and the computation burden of convolutions. The depth-wise separable convolution applies this approach to build compact models.
Compared to the standard convolution, the depth-wise separable convolution applies a single convolutional kernel independently to each input feature map, thus generating the same number of output channels. It is followed by a 1×1 convolution layer that merges the information of all output channels. In other words, the depth-wise separable convolution decomposes the standard convolution into a depth-wise convolution and a point-wise convolution. By applying depth-wise separable convolution, the number of parameters becomes
$$k \times k \times C_{in} + C_{in} \times C_{out}$$
and the computation complexity becomes
$$(k \times k \times C_{in} + C_{in} \times C_{out}) \times d \times d.$$
Based on the above equations, both the number of parameters and the amount of computation are reduced to a fraction $1/C_{out} + 1/k^2$ of those of the standard convolution.
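The cost comparison above can be checked numerically. The following is a minimal sketch under the same simplifications (stride 1, same-size output, no bias or batch normalization); the function names and the example values of k, C_in, C_out, and d are illustrative assumptions, not settings taken from EBUNet.

```python
def standard_conv_cost(k, c_in, c_out, d):
    """Parameters and FLOPs of a k x k standard convolution
    (stride 1, d x d output, no bias), as defined in the text."""
    params = k * k * c_in * c_out
    flops = params * d * d
    return params, flops


def depthwise_separable_cost(k, c_in, c_out, d):
    """Same quantities for a depth-wise k x k convolution followed by a
    point-wise 1 x 1 convolution."""
    params = k * k * c_in + c_in * c_out
    flops = params * d * d
    return params, flops


# Illustrative values (assumed, not taken from the paper).
k, c_in, c_out, d = 3, 64, 64, 128
std_params, std_flops = standard_conv_cost(k, c_in, c_out, d)
sep_params, sep_flops = depthwise_separable_cost(k, c_in, c_out, d)

# The reduction ratio equals 1/c_out + 1/k**2 for both parameters and FLOPs.
ratio = sep_params / std_params
print(std_params, sep_params, round(ratio, 4))
```

With a 3 × 3 kernel, the ratio is bounded below by 1/9, so the saving approaches roughly 9× as the number of output channels grows.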

Efficient bottleneck unit
The EBU module is designed to extract semantic information more efficiently and effectively. Previous residual-like designs, including the bottleneck [22], SS-nbt [19], and EAR modules [21], have proven to be effective in lightweight semantic segmentation.
As shown in Fig. 2d, we employ a 3 × 3 standard convolution at the beginning of each EBU module to generate features and reduce the channels by half. The output of this convolution is then split into two branches, each of which has 1/4 of the channels of the original input.
A 3 × 3 convolutional kernel is used in the EBU module to preserve adequate spatial information for accurate segmentation. To improve computation efficiency, a 3 × 3 depth-wise convolution is employed in the left branch to acquire local information.
The right branch is designed to obtain adequate contextual information. To enlarge the receptive field, we use a depth-wise dilated convolution in the right branch, which adds no additional parameters.
To share information between the two branches, we let their features interact through an element-wise addition, so that the two branches can complement each other.
At the end of the EBU module, another 3 × 3 regular convolution is employed to integrate the multi-scale features and restore the number of channels to that of the input. In the notation used to describe this procedure, $X$ is the input feature maps, $F_1$ and $F_2$ are the results of the splitting operation, $f_{3\times3}$ represents the standard 3 × 3 convolution, $f^{DW}_{3\times3}$ and $f^{DDW}_{3\times3}$ stand for the depth-wise convolution and the depth-wise dilated convolution, and Concat means feature concatenation along the channel dimension.
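To make the cost of this design concrete, the sketch below counts the convolution weights of one EBU module from the channel widths stated in the text (C to C/2, two branches of C/4, then C/2 back to C). It is only an illustration: the helper names, the channel width C = 128, and the reference block of two standard 3 × 3 convolutions are assumptions, and bias, batch normalization, and activations are ignored.

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weights of a k x k convolution without bias. The dilation rate does not
    appear because it only spaces out the kernel taps."""
    return k * k * (c_in // groups) * c_out


def ebu_params(channels):
    """Weight count of one EBU module as described in the text
    (bias, batch normalization and activations are ignored)."""
    half, quarter = channels // 2, channels // 4
    reduce_conv = conv_params(3, channels, half)                  # 3x3 conv, C -> C/2
    left_dw = conv_params(3, quarter, quarter, groups=quarter)    # depth-wise 3x3
    right_ddw = conv_params(3, quarter, quarter, groups=quarter)  # depth-wise dilated 3x3
    restore_conv = conv_params(3, half, channels)                 # 3x3 conv, C/2 -> C
    return reduce_conv + left_dw + right_ddw + restore_conv


def plain_residual_params(channels):
    """Reference block (an assumption for comparison): two standard 3x3
    convolutions that keep the channel count."""
    return 2 * conv_params(3, channels, channels)


C = 128  # assumed channel width, for illustration only
ebu = ebu_params(C)
plain = plain_residual_params(C)
print(ebu, plain, round(ebu / plain, 3))
```

Under these assumptions the two depth-wise branches contribute only a few hundred weights, so almost all parameters sit in the two 3 × 3 convolutions that halve and restore the channels.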

Image partition attention module
The attention mechanism has been widely used in various segmentation methods, such as BiSeNet [23], MSCFNet [21], and DFANet [33]. We introduce an Image Partition Attention Module (IPAM) in this paper.
As Fig. 3 shows, the input features are partitioned into four regions through an average pooling operation. Global average pooling is then applied to each partitioned sub-region in parallel.
Additionally, global average pooling is applied to the original input feature to acquire the global information. Following that, we use an element-wise addition to fuse the sub-region pooled features and the global pooled features. Here, $S_i$ represents the pooled result of the $i$-th sub-region, and $F_{global}$ indicates the pooled result of the original input.
The result of the addition is then sent into a projection layer built from 1×1 convolutions to generate the attention weight vector $w$.
Specifically, the addition result is first compressed along the channel dimension, and the ReLU function is then applied to introduce non-linearity. After that, a channel-expansion layer recovers the number of channels of the original input, and a sigmoid function generates the attention weight vector $w$. In the projection layer, $\pi_1$ and $\pi_2$ denote the channel reduction and expansion functions, each implemented by a regular 1×1 convolution, and ReLU denotes the Rectified Linear Unit function.
At the end of the IPAM module, the final output $F_{out}$ is obtained by applying the attention weights to the input, where $F_{in}$ and $F_{out}$ represent the input and output, respectively, and $w$ denotes the attention weights generated by the IPAM module.

Lightweight attentional decoder
Encoders and decoders play different roles in encoder-decoder segmentation frameworks. The encoder is responsible for producing dense feature maps, whereas the decoder is responsible for upsampling the feature maps to match the original input size. A well-designed decoder can improve prediction accuracy.
In this paper, we present a novel lightweight attentional decoder (LAD). It consists of two blocks and can fuse different-level features effectively. A channel attention module is proposed for the refinement of high-level feature maps, while a spatial attention module is proposed for the refinement of low-level feature maps.
We present a spatial attention module (SAM) to make the low-level features pay more attention to informative features. Let $X_L$ denote the input low-level feature maps, $f_{conv}$ the regular convolution operation, and $f_{mean}$ and $f_{max}$ the mean and maximum operations along the channel dimension, respectively. The spatial attention map $S$ is computed from $f_{mean}(X_L)$ and $f_{max}(X_L)$ through $f_{conv}$ and $\sigma(\cdot)$, where $\sigma(\cdot)$ represents the sigmoid function. After the transformation, the shape of the low-level features changes from $C \times H \times W$ to $1 \times H \times W$. Finally, we element-wise multiply the input low-level feature $X_L$ and the spatial weight map $S$ to get the refined feature $X_L^S$:
$$X_L^S = X_L \otimes S,$$
where $\otimes$ denotes element-wise multiplication.
Our channel attention module (CAM) uses global average pooling to obtain global contextual information and generates a channel attention map to refine the high-level features. Let $X_H$ represent the input high-level feature maps and $X_H(i, j)$ denote the values of $X_H$ at pixel location $(i, j)$. The global average pooling can be expressed as follows:
$$F_{avg} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_H(i, j).$$
Consequently, the shape of the high-level features changes from $C \times H \times W$ to $1 \times 1 \times C$.
Following that, $F_{avg}$ is fed into a convolution layer and then passed through a sigmoid to generate the channel attention map $C$:
$$C = \sigma(f_{conv}(F_{avg})).$$
The final weighted high-level features are acquired by multiplying the feature map and the attention map, i.e., $X_H \otimes C$. The spatial attention map abstracted from the low-level features identifies the importance of each pixel; it focuses on locating objects and refining their shapes and boundaries with spatial details. On the other hand, the squeezed channel attention map generated from the upsampled high-level features focuses on the global context to provide context information.
After that, the refined low-level and high-level features are concatenated along the channel dimension. Finally, another upsampling operation is utilized to restore the feature map to its original size.
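The two attention paths of the LAD can be illustrated on toy data. The following pure-Python sketch is not the authors' implementation: it assumes that $f_{conv}$ in the SAM acts as a 1×1 convolution over the channel-wise mean and max maps (with hypothetical weights `w_mean` and `w_max`), that $f_{conv}$ in the CAM acts as a C × C matrix on the pooled vector, and that the high-level features have already been upsampled to the low-level resolution.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(x_low, w_mean=1.0, w_max=1.0):
    """SAM sketch. x_low is a list of C channels, each an H x W nested list.
    Assumption: f_conv is a 1x1 convolution over the channel-wise mean and
    max maps, with hypothetical weights w_mean and w_max."""
    c, h, w = len(x_low), len(x_low[0]), len(x_low[0][0])
    s_map = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            column = [x_low[ch][i][j] for ch in range(c)]
            s_map[i][j] = sigmoid(w_mean * sum(column) / c + w_max * max(column))
    refined = [[[x_low[ch][i][j] * s_map[i][j] for j in range(w)]
                for i in range(h)] for ch in range(c)]
    return s_map, refined


def channel_attention(x_high, weights):
    """CAM sketch. weights is a C x C matrix standing in for f_conv applied
    to the pooled 1 x 1 x C vector (its real kernel size is not stated)."""
    c, h, w = len(x_high), len(x_high[0]), len(x_high[0][0])
    f_avg = [sum(sum(row) for row in x_high[ch]) / (h * w) for ch in range(c)]
    c_map = [sigmoid(sum(weights[o][i] * f_avg[i] for i in range(c)))
             for o in range(c)]
    refined = [[[x_high[ch][i][j] * c_map[ch] for j in range(w)]
                for i in range(h)] for ch in range(c)]
    return c_map, refined


# Toy inputs: 2 channels of size 2 x 2 for each feature level.
x_low = [[[1.0, 0.0], [0.0, 0.0]], [[3.0, 0.0], [0.0, 0.0]]]
x_high = [[[2.0, 2.0], [2.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]

s_map, low_refined = spatial_attention(x_low)
c_map, high_refined = channel_attention(x_high, identity)
fused = low_refined + high_refined  # concatenation along the channel dimension
print(len(fused), s_map[1][1], c_map[1])
```

The spatial map has one weight per pixel and the channel map one weight per channel, which matches the shape changes to $1 \times H \times W$ and $1 \times 1 \times C$ described above.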

Architecture design of EBUNet
The overall architecture of the proposed EBUNet is shown in Fig. 5 and is listed in Table 1.
An initial unit is employed at the beginning of EBUNet to adjust the resolution of the input images and eliminate redundant information. The initial unit is composed of three consecutive standard convolutions. Specifically, the first convolution reduces the image resolution by half.
Meanwhile, the channel number of the feature map is adjusted to 32. Afterwards, two 3 × 3 convolutions are utilized to obtain abundant contextual information. In addition, a downsampling operation is used to enlarge the receptive field. It is composed of two parallel branches: a standard 3 × 3 convolution with a stride of 2 and a 2 × 2 max pooling operation. The outputs of the two parallel branches are then concatenated along the channel dimension.
After that, the feature map obtained by downsampling the output of the initial unit is fed into the first EBU block for dense feature extraction. The first EBU block contains three EBU modules with a dilation rate of 2. The input feature map of the second EBU block is 1/8 of the input resolution; this block contains 10 consecutive EBU modules with gradually increasing dilation rates {2,2,4,4,6,6,8,8,16,16}. The IPAM is employed to refine the features from EBU block 1 and EBU block 2. Finally, in the decoder phase, the LAD employs different kinds of attention mechanisms for different-level feature maps and produces more accurate outputs.
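As a rough aid for reading the architecture description, the sketch below tabulates the feature-map size and dilation rates per encoder stage. The 1/2 and 1/8 scales and the rate sequences come from the text; the 1/4 scale for EBU block 1 is inferred from the single downsampling step after the initial unit, and the function names are hypothetical.

```python
def effective_kernel(k, rate):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (rate - 1)


def encoder_layout(height, width):
    """Feature-map size and dilation rates per encoder stage: initial unit at
    1/2, EBU block 1 at 1/4 (inferred), EBU block 2 at 1/8."""
    stages = [
        ("initial unit", 2, []),
        ("EBU block 1", 4, [2, 2, 2]),
        ("EBU block 2", 8, [2, 2, 4, 4, 6, 6, 8, 8, 16, 16]),
    ]
    layout = []
    for name, stride, rates in stages:
        layout.append({
            "stage": name,
            "size": (height // stride, width // stride),
            "rates": rates,
            "extents": [effective_kernel(3, r) for r in rates],
        })
    return layout


# Cityscapes training resolution from the text: 512 x 1024.
for row in encoder_layout(512, 1024):
    print(row["stage"], row["size"], row["extents"])
```

At a dilation rate of 16, a single 3 × 3 kernel already spans 33 positions of a 64 × 128 feature map, which helps explain why the ablation on much larger rates (32 and 48) shows no further benefit.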

Experiments
In this section, we first briefly describe the Cityscapes [34] and CamVid [35] datasets and then introduce the training protocols for our experiments. Subsequently, ablation studies on several components of our EBUNet are discussed. At the end of this section, we discuss the performance of our method in terms of prediction accuracy and running efficiency.

Datasets
We utilize Cityscapes and CamVid datasets in our training and testing experiments.
The Cityscapes dataset is a well-known dataset for semantic segmentation of urban scenes. It contains 5000 finely annotated images: 2975 for training, 500 for validation, and 1525 for testing. The original image resolution of Cityscapes is 2048×1024. For fair comparisons, we use the full resolution for performance evaluation in the validation and testing phases. In the training phase, the images are resized to 512×1024.
The CamVid dataset, derived from car-view videos, is another well-known urban scene dataset. It consists of 701 images in total: 367 for training, 101 for validation, and 233 for testing. The original resolution of the CamVid images is 720×960.

Training protocols
All the experiments are performed on one NVIDIA RTX 3090 GPU with CUDA 11.6 and cuDNN v8, using the PyTorch platform on the Ubuntu 20.04 operating system with 32 GB of memory.
We employ mini-batch Stochastic Gradient Descent (SGD) [36] as the optimization strategy for the Cityscapes dataset, with the batch size set to 8, the weight decay to $1 \times 10^{-4}$, the momentum to 0.9, and the initial learning rate to $4.5 \times 10^{-2}$.
We train our EBUNet with the Adam optimizer when running experiments on the CamVid dataset. The initial learning rate is set to $1 \times 10^{-3}$ and the weight decay is set to $2 \times 10^{-4}$.
Besides, a polynomial policy is employed to adjust the learning rate in the training phase. In this policy, $lr_{cur}$ represents the learning rate in the current epoch, cur_epoch stands for the current epoch, and max_epoch is the total number of epochs. max_epoch was set to 1000 during training for both the Cityscapes and CamVid datasets. During the training phase, data augmentation techniques such as random scaling, mean subtraction, and horizontal flipping are also applied. The random scale factors used to transform training samples include 0.75, 1.0, 1.25, 1.5, 1.75, and 2.0. We randomly cropped the training images and labels of the Cityscapes dataset from the resolution of 2048×1024 to 512×1024.
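For readers unfamiliar with the polynomial policy, a minimal sketch of its common form is given below. The initial learning rate and max_epoch are the Cityscapes settings from the text; the exponent `power=0.9` and the function name `poly_lr` are assumptions, since the exponent is not stated here.

```python
def poly_lr(base_lr, cur_epoch, max_epoch, power=0.9):
    """Polynomial decay: base_lr * (1 - cur_epoch / max_epoch) ** power.
    power=0.9 is an assumed value; the text does not state it."""
    return base_lr * (1.0 - cur_epoch / max_epoch) ** power


base_lr, max_epoch = 4.5e-2, 1000  # Cityscapes settings from the text
schedule = [poly_lr(base_lr, e, max_epoch) for e in (0, 250, 500, 750, 999)]
print([round(lr, 5) for lr in schedule])
```

The learning rate starts at its initial value and decays smoothly towards zero as cur_epoch approaches max_epoch.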

Ablation studies
In this part, we design a series of ablation experiments to validate the effectiveness of the proposed components of our EBUNet. We conduct ablation studies on the EBU module and the LAD module. Additionally, we investigate the influence of depth within the EBU blocks. All the ablation experiments are performed on the CamVid dataset.

Ablation on EBU module
The main part of our EBUNet is constructed using the EBU module. We devise two ablation strategies to verify its effectiveness. First, we design a series of experiments to investigate the influence of different dilation rates. Second, we compare our EBU module with other residual structures, including those of DABNet [13] and ERFNet [37]. The ablation study results can be seen in Tables 2 and 3.
To study the effects of dilation rates, we devised five sequences with varying dilation rates and compared them with the baseline. Table 2 shows that when all the dilation rates in the EBU modules are set to 2, the accuracy is 1.8% lower than the baseline. In addition, when a larger dilation rate sequence is used in the EBU modules, the accuracy is 1.6% higher than with R=2 but still 0.2% lower than the baseline.
Additionally, we designed experiments to test the network performance with excessive dilation rates (32 and 48). As shown in Table 2, when the dilation rate was set to 32, the mIoU decreased to 69.8% and the FPS decreased from 147 to 140. Both accuracy and speed also decreased when all dilation rates were set to 48. We can conclude that larger dilation rates cause a heavy computation cost. Furthermore, the outputs of a dilated convolution are computed from mutually independent subsets of the input, which loses local information.
Table 3 shows that when the EBU modules are replaced with non-bottleneck modules, the inference speed is higher than when EBU modules are used. However, the accuracy of our EBUNet is 1% higher. As a result, our EBU module strikes a good balance between accuracy and efficiency. A visual comparison was also conducted, and the results can be seen in Fig. 6.

Ablation on LAD module
The LAD is used to recover the spatial information at the original input resolution. The ablation design for the LAD is based on two strategies. First, we compare our LAD with DABNet's decoder. We then discuss how our LAD is affected by the attention mechanism; specifically, we performed ablation studies on the different attention mechanisms in our LAD. The results are presented in Tables 4 and 5. As shown in Table 4, the accuracy of the LAD is 1.33% higher than that of the DABNet decoder, while the number of parameters increases by only 0.01 M, so the cost is negligible.
A visual comparison of the ablation results for the LAD module is also performed. The visual results can be seen in Figs. 7 and 8. The differences are highlighted by the yellow dashed lines.
According to Table 5, the LAD achieves the highest mIoU when both SAM and CAM are used to refine different-level feature maps. The mIoU of the LAD is 0.5% lower when only CAM is used to refine the high-level features.
When only SAM is used to refine the low-level features, the mIoU is 0.3% lower than the LAD baseline. Removing both SAM and CAM from the LAD leads to a 1% reduction in accuracy. We can therefore conclude that the attention mechanism effectively improves segmentation accuracy while consuming negligible computation resources.

Ablation on the depth of EBU Block
There are two parameters M and N that indicate the number of EBU modules contained within EBU block 1 and EBU block 2. In order to investigate the model performance in terms of segmentation accuracy (mIoU) and feed forward speed (FPS), we devised a series of experiments using different values for M and N. The experiment settings and results are listed in Table 6.
According to Table 6, accuracy tends to improve as the depth inside the EBU blocks increases. However, the improvement is only slight, and when we continue to deepen the EBU blocks, performance drops.
In general, increasing the depth of the network at the beginning will improve network performance to a certain degree, with a moderate increase in computational cost.However, when we make the network deeper, the accuracy and efficiency of the network fall instead.

Comparisons with other works
In this subsection, we compare the performance of our EBUNet with other state-of-the-art semantic segmentation methods on the Cityscapes and CamVid datasets. Similar to other lightweight semantic segmentation models, we downsample the Cityscapes input images to a resolution of 512×1024. For the CamVid dataset, we use the original resolution of 720×960. In addition, the speed of our EBUNet is measured on three different GPUs: RTX3090, RTX2080Ti, and Titan XP.
To provide a comprehensive comparison, we report the input size, the number of parameters, the computational complexity (FLOPs), the forward inference speed (FPS), the GPU platform, and the accuracy (mIoU) for each model. The quantitative results are shown in Table 7. Our EBUNet achieves 73.4% mIoU at a speed of 152 FPS on a single RTX3090 GPU. The speed of EBUNet on both Titan XP and RTX2080Ti is also reported in Table 7.
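Since mIoU is the accuracy metric used throughout these comparisons, a short sketch of its standard definition may be helpful. The confusion-matrix layout, the function name, and the toy two-class numbers are illustrative assumptions and are unrelated to the reported results.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix, where
    confusion[t][p] counts pixels of true class t predicted as class p.
    Classes that never occur (zero union) are skipped."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (made-up pixel counts).
toy_confusion = [[3, 1],
                 [1, 5]]
print(round(mean_iou(toy_confusion), 4))
```

Per-class IoU values, such as those reported for individual Cityscapes classes, correspond to the entries of `ious` before averaging.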
As shown in Table 7, our EBUNet can even outperform certain non-real-time approaches. It is worth noting that EBUNet runs at 98 FPS on an RTX2080Ti, which is much faster than DeepLabV2 [8]. Moreover, the accuracy of EBUNet is 3% higher than that of DeepLabV2. Compared to RefineNet, the proposed EBUNet achieves a slightly lower accuracy (by 0.3%), but it has far fewer parameters, approximately 20× fewer than RefineNet.
Compared to lightweight and real-time semantic segmentation methods, the number of parameters of our EBUNet is in the same order of magnitude, but EBUNet achieves a clear improvement in mIoU. Compared with ESNet [46], the mIoU increases by 2.7%, and EBUNet has fewer parameters, making it more lightweight than ESNet. Compared to MSCFNet [21], the number of parameters of our EBUNet increases by only 0.42 M, while the mIoU increases by 1.5%. Meanwhile, the FPS of EBUNet on Titan XP is 63, which is faster than MSCFNet. In comparison to AGLNet, our parameters increase by 0.45 M, but the mIoU increases by 2.1%. Moreover, EBUNet reaches 98 FPS on the same GPU (RTX2080Ti), which is 46 FPS faster than AGLNet. Compared to FPANet [27], our EBUNet achieves the same accuracy at a faster speed, and its number of parameters is only 1/10 of that of FPANet.
Compared to fast semantic segmentation methods on RTX3090, including ContextNet [30], EDANet [45], and DABNet [13], the speed of our method decreases to some extent, but the accuracy improves significantly, by 7.3%, 6.1%, and 3.3%, respectively. Additionally, we compare the performance on the different semantic classes of the Cityscapes test set. The comparison results are shown in Table 8. Table 8 shows that our EBUNet achieves state-of-the-art results in 12 out of 19 semantic classes without requiring any pre-training. In addition, EBUNet achieves significant improvements in the three categories of trucks, sidewalks, and riders, which are 7%, 1.8%, and 1% higher than the second-best method, respectively. A visual comparison on the Cityscapes validation set is also presented in Fig. 9.
According to the discussion above, our EBUNet achieves a good balance between segmentation accuracy and running efficiency on Cityscapes dataset.

Comparisons on CamVid
We also evaluate the performance of the proposed EBUNet on CamVid to further investigate its robustness. Table 9 reports the performance of our EBUNet and other methods (Fig. 10).
Table 9 shows that our EBUNet achieves outstanding results: 72.2% mIoU at a speed of 147 FPS. A number of segmentation methods are selected and compared on a comprehensive basis: pre-training, FPS, parameters, and accuracy (mIoU). As shown in Table 9, among these methods, the proposed EBUNet achieves the best performance in terms of speed and accuracy. Compared to AGLNet, the number of parameters of EBUNet increases by only 0.42 M, but the accuracy increases by 2.8%. In comparison to LMFFNet, EBUNet achieves a faster speed (27 FPS faster) and higher accuracy (3.1% mIoU higher).

Conclusion
In this paper, we proposed EBUNet for fast and accurate semantic segmentation. EBUNet consists of three main components: EBU blocks, IPAM, and LAD. The EBU module adopts depth-wise convolution and depth-wise dilated convolution simultaneously to acquire rich contextual information at a lower computation cost.

Fig. 1 Comparisons with other methods in terms of parameters and accuracy. Our EBUNet achieved a competitive result

Fig. 4 Illustration of our LAD module

Fig. 6 Visual results of the ablation study on the EBU module. From the left column to the right column: input, ground truth, baseline, EBUNet with DAB module, and EBUNet with non-bottleneck

Fig. 7 Visual results of the ablation study on the LAD module. From the left column to the right column: input, ground truth, baseline, and decoder in DABNet

Fig. 8 Visual comparisons of the LAD. From left to right: (a) input, (b) ground truth, (c) baseline of LAD, (d) only CAM used in LAD, (e) only SAM used in LAD, (f) no attention used in LAD

Fig. 9 Visual results on Cityscapes. From the left column to the right column: input, ground truth, DABNet, CGNet, and our EBUNet

Table 1 Overall architecture of EBUNet

Table 4
Experiment

Table 6 The ablation results on the influence of the depth

Table 8 The individual class accuracy performance on the Cityscapes test set