
1 Introduction

Semantic image segmentation is a fundamental task in computer vision. It predicts dense labels for all pixels in an image and is key to deep understanding of scenes, objects, and humans. The development of deep convolutional neural networks (CNNs) has brought remarkable progress on semantic segmentation [1,2,3,4,5,6]. The effectiveness of these networks largely depends on sophisticated model design regarding depth and width, which typically involves many operations and parameters.

CNN-based semantic segmentation mainly exploits fully convolutional networks (FCNs). It is now common wisdom that higher accuracy almost always means more operations, especially for pixel-level prediction tasks like semantic segmentation. To illustrate this, we show in Fig. 1(a) the accuracy and inference time of different frameworks on the Cityscapes [7] dataset.

Status of Fast Semantic Segmentation. In contrast to the extraordinary development of high-quality semantic segmentation, research on making semantic segmentation run fast without sacrificing too much quality lags behind. We note that this line of work is similarly important, since it can inspire or enable many practical tasks in, for example, autonomous driving, robotic interaction, online video processing, and even mobile computing, where running time is a critical factor in evaluating system performance.

Our experiments show that the high-accuracy methods ResNet38 [6] and PSPNet [5] take around 1 second to predict a \(1024\times 2048\) high-resolution image on one Nvidia TitanX GPU card during testing. These methods fall into the area of Fig. 1(a) with high accuracy and low speed. The recent fast semantic segmentation methods ENet [8] and SQ [9], contrarily, take quite different positions in the plot. Speed is much improved, but accuracy drops: their final mIoUs are lower than 60%. These methods are located in the lower-right area of the figure.

Fig. 1.
figure 1

(a): Inference speed and mIoU performance on the Cityscapes [7] test set. Methods involved are PSPNet [5], ResNet38 [6], DUC [10], RefineNet [11], FRRN [12], DeepLabv2-CRF [13], Dilation10 [14], DPN [15], FCN-8s [1], DeepLab [2], CRF-RNN [16], SQ [9], ENet [8], SegNet [3], and our ICNet. (b): Time spent on PSPNet50 with dilation 8 for two input images. Roughly, running time is proportional to the pixel number and kernel number. (Blue ones are tested with downsampled images. Inference speed is reported with a single network forward, while the accuracy of several mIoU-aimed approaches (like PSPNet\(^\star \)) may involve testing tricks like multi-scale and flipping, resulting in much more time. See supplementary material for details.) (Color figure online)

Our Focus and Contributions. In this paper, we focus on building a practically fast semantic segmentation system with decent prediction accuracy. Our method is the first of its kind to land in the top-right area of Fig. 1(a) and is one of only two available real-time approaches. It achieves a decent trade-off between efficiency and accuracy.

Different from previous architectures, we comprehensively consider the two seemingly contradicting factors of speed and accuracy. We first make an in-depth analysis of the time budget in semantic segmentation frameworks and conduct extensive experiments to demonstrate the insufficiency of intuitive speedup strategies. This motivates the development of the image cascade network (ICNet), a high-efficiency segmentation system with decent quality. It exploits the efficiency of processing low-resolution images and the high inference quality of high-resolution ones. The idea is to let low-resolution images go through the full semantic perception network first for a coarse prediction map. Then a cascade feature fusion unit and a cascade label guidance strategy are proposed to integrate medium- and high-resolution features, which refine the coarse semantic map gradually. We make all our code and models publicly available. Our main contributions and performance statistics are the following.

  • We develop a novel and unique image cascade network for real-time semantic segmentation that efficiently utilizes semantic information in low resolution along with details from high-resolution images.

  • The developed cascade feature fusion unit together with cascade label guidance can recover and refine segmentation prediction progressively with a low computation cost.

  • Our ICNet achieves 5\(\times \) speedup of inference time and reduces memory consumption by 5\(\times \). It can run on high-resolution \(1024\times 2048\) images at 30 fps while accomplishing high-quality results. It yields real-time inference on various datasets including Cityscapes [7], CamVid [17] and COCO-Stuff [18].

2 Related Work

Traditional semantic segmentation methods [19] adopt handcrafted features to learn the representation. Recently, CNN-based methods have largely improved the performance.

High Quality Semantic Segmentation. FCN [1] is the pioneer work to replace the last fully-connected layers in classification with convolution layers. DeepLab [2, 13] and [14] used dilated convolution to enlarge the receptive field for dense labeling. Encoder-decoder structures [3, 4] can combine the high-level semantic information from later layers with the spatial information from earlier ones. Multi-scale feature ensembles are also used in [20,21,22]. In [2, 15, 16], conditional random fields (CRF) or Markov random fields (MRF) were used to model spatial relationship. Zhao et al. [5] used pyramid pooling to aggregate global and local context information. Wu et al. [6] adopted a wider network to boost performance. In [11], a multi-path refinement network combined multi-scale image features. These methods are effective, but preclude real-time inference.

High Efficiency Semantic Segmentation. In object detection, speed became an important factor in system design [23, 24]. Recent Yolo [25, 26] and SSD [27] are representative solutions. In contrast, high-speed inference in semantic segmentation is under-explored. ENet [8] and [28] are lightweight networks. These methods greatly raise efficiency with notably sacrificed accuracy.

Video Semantic Segmentation. Videos contain redundant information across frames, which can be utilized to reduce computation. Recent Clockwork [29] reuses feature maps given stable video input. Deep feature flow [30] is based on a small-scale optical flow network to propagate features from key frames to others. FSO [31] performs structured prediction with a dense CRF applied on optimized features to get temporally consistent predictions. NetWarp [32] utilizes optical flow of adjacent frames to warp internal features across time in video sequences. We note that when a good-accuracy fast image semantic segmentation framework comes into existence, video segmentation will also benefit.

3 Image Cascade Network

We start by analyzing the computation time budget of different components of the high-performance segmentation framework PSPNet [5] with experimental statistics. Then we introduce the image cascade network (ICNet) as illustrated in Fig. 2, along with the cascade feature fusion unit and cascade label guidance, for fast semantic segmentation.

3.1 Speed Analysis

In convolution, the transformation function \(\varPhi \) is applied to input feature map \(V \in \mathbb {R}^{c \times h \times w}\) to obtain the output map \(U \in \mathbb {R}^{c' \times h' \times w'}\), where c, h and w denote the feature channel number, height and width respectively. The transformation operation \(\varPhi : V \rightarrow U\) is achieved by applying \(c'\) 3D kernels \(K \in \mathbb {R}^{c \times k \times k}\), where \(k \times k\) (e.g., \(3 \times 3\)) is the kernel spatial size. Thus the total number of operations \(O(\varPhi )\) in a convolution layer is \(c'ck^{2}h'w'\). The spatial size of the output map, \(h'\) and \(w'\), is highly related to the input and controlled by the stride parameter s as \(h'=h/s, w'=w/s\), making

$$\begin{aligned} O(\varPhi ) \approx c'ck^{2}hw/s^{2}. \end{aligned}$$
(1)

The computation complexity is associated with the feature map resolution (e.g., h, w, s), the number of kernels and the network width (e.g., c, \(c'\)). Figure 1(b) shows the time cost of two resolution images in PSPNet50. The blue curve corresponds to high-resolution input with size \(1024 \times 2048\) and the green curve is for the image with resolution \(512 \times 1024\). Computation increases quadratically with image resolution. For either curve, feature maps in stage4 and stage5 have the same spatial resolution, i.e., 1/8 of the original input; but the computation in stage5 is four times heavier than that in stage4. This is because convolutional layers in stage5 double the number of kernels \(c'\) together with the input channel number c.
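To make the relation concrete, the short Python snippet below evaluates Eq. (1) for two hypothetical layers at 1/8 resolution. The channel numbers (256 vs. 512) are illustrative assumptions rather than the exact PSPNet50 configuration, but they reproduce the 4\(\times \) gap between stage4 and stage5 noted above.

```python
# Rough operation count of Eq. (1) for one convolution layer:
# O(Phi) ~ c' * c * k^2 * h * w / s^2.
def conv_ops(c_in, c_out, k, h, w, stride):
    """Multiply-accumulate count of a k x k convolution on a c_in x h x w map."""
    return c_out * c_in * k * k * (h // stride) * (w // stride)

# Two hypothetical layers at 1/8 resolution of a 1024 x 2048 input:
# a stage4-like layer (256 -> 256 channels) vs. a stage5-like layer (512 -> 512 channels).
h, w = 1024 // 8, 2048 // 8
ops_stage4 = conv_ops(256, 256, 3, h, w, stride=1)
ops_stage5 = conv_ops(512, 512, 3, h, w, stride=1)
print(ops_stage5 / ops_stage4)  # -> 4.0, matching the 4x gap discussed above
```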

Fig. 2.
figure 2

Network architecture of ICNet. ‘CFF’ stands for cascade feature fusion detailed in Sect. 3.3. Numbers in parentheses are feature map size ratios to the full-resolution input. Operations are highlighted in brackets. The final \(\times 4\) upsampling in the bottom branch is only used during testing.

3.2 Network Architecture

According to the above time budget analysis, we adopt intuitive speedup strategies in the experiments detailed in Sect. 5, including downsampling input, shrinking feature maps and conducting model compression. The corresponding results show that it is very difficult to keep a good balance between inference accuracy and speed. The intuitive strategies are effective in reducing running time, but they yield very coarse prediction maps. Directly feeding high-resolution images into a network is unbearable in computation.

Our proposed image cascade network (ICNet) does not simply choose either way. Instead it takes cascade image inputs (i.e., low-, medium- and high-resolution images), adopts the cascade feature fusion unit (Sect. 3.3) and is trained with cascade label guidance (Sect. 3.4). The new architecture is illustrated in Fig. 2. The input image with full resolution (e.g., \(1024 \times 2048\) in Cityscapes [7]) is downsampled by factors of 2 and 4, forming the cascade inputs to the medium- and low-resolution branches.

Segmenting the high-resolution input directly with classical frameworks like FCN is time consuming. To overcome this shortcoming, we perform semantic extraction on the low-resolution input, as shown in the top branch of Fig. 2. A 1/4-sized image is fed into PSPNet with downsampling rate 8, resulting in a 1/32-resolution feature map. To get high-quality segmentation, the medium- and high-resolution branches (middle and bottom parts in Fig. 2) help recover and refine the coarse prediction. Though some details are missing and blurry boundaries are generated in the top branch, it already harvests most semantic parts. Thus we can safely limit the number of parameters in both the middle and bottom branches. Lightweight CNNs (green dotted box) are adopted in the higher-resolution branches; different-branch output feature maps are fused by the cascade feature fusion unit (Sect. 3.3) and trained with cascade label guidance (Sect. 3.4).

Although the top branch is based on a full segmentation backbone, the input resolution is low, resulting in limited computation. Even for PSPNet with 50+ layers, inference time and memory are 18 ms and 0.6 GB for the large images in Cityscapes. Because weights and computation (in 17 layers) can be shared between the low- and medium-resolution branches, only 6 ms is spent to construct the fusion map. The bottom branch has even fewer layers. Although the resolution is high, inference only takes 9 ms. Details of the architecture are presented in the supplementary file. With all these three branches, our ICNet becomes a very efficient and memory-friendly architecture that can achieve good-quality segmentation.
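The following PyTorch-style sketch illustrates the cascade forward pass under simplifying assumptions: the weight sharing between the low- and medium-resolution branches is omitted, and `heavy_branch`, `light_medium`, `light_high`, `cff_24`, `cff_12` and `classifier` are hypothetical stand-in modules rather than the released ICNet definition.

```python
import torch.nn.functional as F
from torch import nn

class ICNetSketch(nn.Module):
    """Minimal sketch of the three-branch cascade forward pass."""
    def __init__(self, heavy_branch, light_medium, light_high, cff_24, cff_12, classifier):
        super().__init__()
        self.heavy = heavy_branch      # full PSPNet-style backbone, low-resolution input
        self.light_m = light_medium    # shallow CNN for the 1/2-resolution input
        self.light_h = light_high      # shallow CNN for the full-resolution input
        self.cff_24 = cff_24           # fuses low- and medium-resolution features (Sect. 3.3)
        self.cff_12 = cff_12           # fuses the result with high-resolution features
        self.classifier = classifier   # 1x1 convolution producing class scores

    def forward(self, x):
        # cascade inputs: full resolution, 1/2 and 1/4
        x_med = F.interpolate(x, scale_factor=0.5, mode='bilinear', align_corners=False)
        x_low = F.interpolate(x_med, scale_factor=0.5, mode='bilinear', align_corners=False)
        f_low = self.heavy(x_low)            # coarse but semantically rich features
        f_med = self.light_m(x_med)
        f_high = self.light_h(x)
        fused = self.cff_24(f_low, f_med)    # cascade feature fusion, Sect. 3.3
        fused = self.cff_12(fused, f_high)   # auxiliary label-guidance heads omitted here
        return self.classifier(fused)
```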

Fig. 3.
figure 3

Cascade feature fusion.

3.3 Cascade Feature Fusion

To combine cascade features from different-resolution inputs, we propose a cascade feature fusion (CFF) unit as shown in Fig. 3. The input to this unit contains three components: two feature maps \(F_1\) and \(F_2\) with sizes \(C_1 \times H_1 \times W_1\) and \(C_2 \times H_2 \times W_2\) respectively, and a ground-truth label with resolution \(1 \times H_2 \times W_2\). \(F_2\) has twice the spatial size of \(F_1\).

We first apply upsampling rate 2 on \(F_1\) through bilinear interpolation, yielding the same spatial size as \(F_2\). Then a dilated convolution layer with kernel size \(C_3 \times 3 \times 3\) and dilation 2 is applied to refine the upsampled features. The resulting feature has size \(C_3 \times H_2 \times W_2\). This dilated convolution combines feature information from several originally neighboring pixels. Compared with deconvolution, upsampling followed by dilated convolution only needs small kernels to harvest the same receptive field; to keep the same receptive field, deconvolution needs a larger kernel size (i.e., \(7 \times 7\) vs. \(3 \times 3\)), which causes more computation.

For feature \(F_2\), a projection convolution with kernel size \(C_3 \times 1 \times 1\) is utilized so that it has the same number of channels as the processed \(F_1\). Then two batch normalization layers are used to normalize the two processed features, as shown in Fig. 3. Followed by an element-wise 'sum' layer and a 'ReLU' layer, we obtain the fused feature \(F_2'\) of size \(C_3 \times H_2 \times W_2\). To enhance the learning of \(F_1\), we use auxiliary label guidance on the upsampled feature of \(F_1\).
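A minimal PyTorch-style sketch of the CFF unit is given below. The class and argument names are ours, and details such as bias-free convolutions follow common practice rather than the released implementation.

```python
import torch.nn.functional as F
from torch import nn

class CascadeFeatureFusion(nn.Module):
    """Sketch of the CFF unit of Sect. 3.3; channel sizes c1, c2, c3 are assumptions."""
    def __init__(self, c1, c2, c3, num_classes):
        super().__init__()
        # dilated 3x3 convolution applied to the upsampled low-resolution feature F1
        self.conv_low = nn.Conv2d(c1, c3, kernel_size=3, padding=2, dilation=2, bias=False)
        self.bn_low = nn.BatchNorm2d(c3)
        # 1x1 projection of the higher-resolution feature F2 to C3 channels
        self.conv_high = nn.Conv2d(c2, c3, kernel_size=1, bias=False)
        self.bn_high = nn.BatchNorm2d(c3)
        # auxiliary classifier on the upsampled F1, used for cascade label guidance
        self.aux_cls = nn.Conv2d(c1, num_classes, kernel_size=1)

    def forward(self, f1, f2):
        # upsample F1 by 2 to match F2, refine, project F2, then sum + ReLU
        f1_up = F.interpolate(f1, size=f2.shape[2:], mode='bilinear', align_corners=False)
        fused = F.relu(self.bn_low(self.conv_low(f1_up)) + self.bn_high(self.conv_high(f2)))
        aux = self.aux_cls(f1_up)   # supervised by a downsampled ground-truth map
        return fused, aux
```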

3.4 Cascade Label Guidance

To enhance the learning procedure in each branch, we adopt a cascade label guidance strategy. It utilizes different-scale (e.g., 1/16, 1/8, and 1/4) ground-truth labels to guide the learning stages of the low-, medium- and high-resolution inputs. Suppose there are \(\mathcal {T}\) branches (i.e., \(\mathcal {T}\) = 3) and \(\mathcal {N}\) categories. In branch t, the predicted feature map \(\mathcal {F}^t\) has spatial size \(\mathcal {Y}_t \times \mathcal {X}_t\). The value at position (n, y, x) is \(\mathcal {F}^{t}_{n,y,x}\). The corresponding ground-truth label for 2D position (y, x) is \(\hat{n}\). To train ICNet, we append a weighted softmax cross-entropy loss in each branch with related loss weight \(\lambda _t\). Thus we minimize the loss function \(\mathcal {L}\) defined as

$$\begin{aligned} \mathcal {L} = - \sum _{t=1}^{\mathcal {T}}\lambda _t\frac{1}{\mathcal {Y}_t \mathcal {X}_t}\sum _{y=1}^{\mathcal {Y}_t}\sum _{x=1}^{\mathcal {X}_t}\log {\frac{e^{\mathcal {F}^{t}_{\hat{n},y,x}}}{\sum _{n=1}^{\mathcal {N}} e^{\mathcal {F}^{t}_{n,y,x}}}}. \end{aligned}$$
(2)

In the testing phase, the low- and medium-resolution guidance operations are simply abandoned and only the high-resolution branch is retained. This strategy makes gradient optimization smoother for easy training. With more powerful learning ability in each branch, the final prediction map is not dominated by any single branch.
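The sketch below shows one way to implement the loss of Eq. (2) over the three branches. The default branch weights follow the \(\lambda _t\) values given in Sect. 5.1, while the nearest-neighbor label downsampling and the ignore_index value are assumptions based on common practice.

```python
import torch.nn.functional as F

def cascade_loss(branch_logits, labels, weights=(0.4, 0.4, 1.0), ignore_index=255):
    """Weighted sum of per-branch softmax cross-entropy losses, following Eq. (2).

    branch_logits: list of T score maps, each of shape (B, N, Y_t, X_t).
    labels: full-resolution ground truth of shape (B, H, W) with integer class ids.
    """
    total = 0.0
    for logits, w in zip(branch_logits, weights):
        # downsample labels to the branch resolution with nearest-neighbor sampling
        gt = F.interpolate(labels.unsqueeze(1).float(), size=logits.shape[2:],
                           mode='nearest').squeeze(1).long()
        total = total + w * F.cross_entropy(logits, gt, ignore_index=ignore_index)
    return total
```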

4 Structure Comparison and Analysis

Now we illustrate the difference between ICNet and existing cascade architectures for semantic segmentation. Typical structures in previous semantic segmentation systems are illustrated in Fig. 4. Our proposed ICNet (Fig. 4(d)) is by nature different from the others. Previous frameworks all involve relatively intensive computation given the high-resolution input. In our cascade structure, only the lowest-resolution input is fed into the heavy CNN, with much reduced computation, to get the coarse semantic prediction. The higher-resolution inputs are designed to recover and refine the prediction progressively regarding blurred boundaries and missing details, so they are processed by lightweight CNNs. The newly introduced cascade feature fusion unit and cascade label guidance strategy integrate medium- and high-resolution features to refine the coarse semantic map gradually. With this special design, ICNet achieves high-efficiency inference with reasonable-quality segmentation results.

Fig. 4.
figure 4

Comparison of semantic segmentation frameworks. (a) Intermediate skip connection used by FCN [1] and Hypercolumns [21]. (b) Encoder-decoder structure incorporated in SegNet [3], DeconvNet [4], UNet [33], ENet [8], and step-wise reconstruction & refinement from LRR [34] and RefineNet [11]. (c) Multi-scale prediction ensemble adopted by DeepLab-MSC [2] and PSPNet-MSC [5]. (d) Our ICNet architecture.

5 Experimental Evaluation

Our method is effective for high-resolution images. We evaluate the architecture on three challenging datasets, including the urban-scene understanding dataset Cityscapes [7] with image resolution \(1024 \times 2048\), CamVid [17] with image resolution \(720 \times 960\) and the stuff understanding dataset COCO-Stuff [18] with image resolution up to \(640 \times 640\). There is a notable difference between COCO-Stuff and the object/scene segmentation datasets VOC2012 [35] and ADE20K [36]. In the latter two sets, most images are of low resolution (e.g., \(300 \times 500\)) and can already be processed quickly. In COCO-Stuff, by contrast, most images are larger, making it more difficult to achieve real-time performance.

In the following, we first show intuitive speedup strategies and their drawbacks, then reveal our improvement with quantitative and visual analysis.

5.1 Implementation Details

We conduct experiments on the Caffe [37] platform. All experiments are on a workstation with Maxwell TitanX GPU cards under CUDA 7.5 and cuDNN v5. Our testing uses only one card. To measure the forward inference time, we use the tool 'caffe time' and set the number of repeated iterations to 100 to eliminate accidental errors during testing. All the parameters in batch normalization layers are merged into the preceding convolution layers.
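The merge of batch normalization into convolution can be expressed in a few lines of NumPy. The function below is a generic sketch of this folding for inference, not the exact Caffe routine we use.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution (inference only).

    W: conv weights of shape (c_out, c_in, k, k); b: conv bias of shape (c_out,).
    gamma, beta, mean, var: per-channel BN parameters of shape (c_out,).
    Returns the merged weights and bias.
    """
    scale = gamma / np.sqrt(var + eps)            # per-output-channel scaling
    W_folded = W * scale[:, None, None, None]     # scale each output filter
    b_folded = (b - mean) * scale + beta          # shift the bias accordingly
    return W_folded, b_folded
```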

For the training hyper-parameters, the mini-batch size is set to 16. The base learning rate is 0.01 and the 'poly' learning rate policy is adopted with power 0.9, together with the maximum iteration number set to 30K for Cityscapes, 10K for CamVid and 30K for COCO-Stuff. Momentum is 0.9 and weight decay is 0.0001. Data augmentation contains random mirroring and random resizing between 0.5 and 2. The auxiliary loss weights are empirically set to 0.4 for \(\lambda _1\) and \(\lambda _2\), and 1 for \(\lambda _3\) in Eq. 2, as adopted in [5]. For evaluation, both the mean of class-wise intersection over union (mIoU) and the network forward time (Time) are used.
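For reference, the 'poly' policy can be written as a one-line Python function; the numbers in the usage example simply restate the Cityscapes setting above.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning rate policy: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - float(cur_iter) / max_iter) ** power

# e.g., base_lr = 0.01, power = 0.9 and max_iter = 30000 (Cityscapes setting)
print(poly_lr(0.01, 15000, 30000))  # ~ 0.0054 at the halfway point
```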

5.2 Cityscapes

We first apply our framework to the recent urban scene understanding dataset Cityscapes [7]. This dataset contains high-resolution \(1024 \times 2048\) images, which makes it a big challenge for fast semantic segmentation. It contains 5,000 finely annotated images split into training, validation and testing sets with 2,975, 500, and 1,525 images respectively. The dense annotation covers 30 common classes such as road, person and car; 19 of them are used in training and testing.

Intuitive Speedup. According to the time complexity shown in Eq. (1), we do intuitive speedup in three aspects, namely downsampling input, downsampling feature, and model compression.

Fig. 5.
figure 5

Downsampling input: prediction of PSPNet50 on the validation set of Cityscapes. Values in the parentheses are the inference time and mIoU.

Table 1. Left: Downsampling feature with factors 8, 16 and 32. Right: model compression with kernel keeping rates 1, 0.5 and 0.25.

Downsampling Input. Image resolution is the most critical factor that affects running speed, as analyzed in Sect. 3.1. A simple approach is to use a small-resolution image as input. We test downsampling the image with ratios 1/2 and 1/4, feeding the resulting images into PSPNet50, and directly upsampling prediction results to the original size. This approach empirically has several drawbacks, as illustrated in Fig. 5. With scaling ratio 0.25, although the inference time is reduced by a large margin, the prediction map is very coarse, missing many small but important details compared to the higher-resolution prediction. With scaling ratio 0.5, the prediction recovers more information compared to the 0.25 case. Unfortunately, the person and traffic light far from the camera are still missing and object boundaries are blurred. To make things worse, the running time is still too long for a real-time system.

Downsampling Feature. Besides directly downsampling the input image, another simple choice is to scale down the feature map by a large ratio in the inference process. FCN [1] downsampled it by a factor of 32 and DeepLab [2] by 8. We test PSPNet50 with downsampling ratios of 1:8, 1:16 and 1:32 and show results on the left of Table 1. A smaller feature map yields faster inference at the cost of prediction accuracy. The lost information is mostly detail contained in low-level layers. Also, even with the smallest resulting feature map under ratio 1:32, the system still takes 131 ms for inference.

Model Compression. Apart from the above two strategies, another natural way to reduce network complexity is to trim kernels in each layer. Model compression has become an active research topic in recent years due to high demand. The solutions [38,39,40,41] can reduce a complicated network to a lighter one under user-controlled accuracy reduction. We adopt the recent effective classification model compression strategy of [41] on our segmentation models. For each filter, we first calculate the sum of its kernel \(\ell _1\)-norms. Then we sort these sums in descending order and keep only the most significant filters. Disappointingly, this strategy does not meet our requirement either, given the compressed models listed on the right of Table 1. Even by keeping only a quarter of the kernels, the inference time is still too long. Meanwhile the corresponding mIoU is intolerably low; it cannot produce reasonable segmentation for many applications.
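The core of this ranking step can be sketched in a few lines of NumPy. The function below illustrates \(\ell _1\)-based filter selection for a single layer and is only a simplified stand-in for the full pipeline of [41].

```python
import numpy as np

def prune_filters_l1(W, keep_rate=0.5):
    """Rank the filters of one conv layer by L1 norm and keep the top fraction.

    W: weights of shape (c_out, c_in, k, k). Returns the kept filters and their
    indices (also needed to slice the next layer's input channels).
    """
    l1 = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)               # per-filter L1 norm
    order = np.argsort(-l1)                                          # descending significance
    keep = np.sort(order[:max(1, int(round(keep_rate * W.shape[0])))])
    return W[keep], keep
```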

Table 2. Performance of ICNet with different branches on the validation set of Cityscapes. The baseline method is PSPNet50 compressed to a half. 'sub4', 'sub24' and 'sub124' represent predictions in the low-, medium-, and high-resolution branches respectively.
Table 3. Effectiveness of cascade feature fusion unit (CFF) and cascade label guidance (CLG). ‘DC3’, ‘DC5’ and ‘DC7’ denote replacing ‘bilinear upsampling + dilated convolution’ with deconvolution operation with kernels \(3 \times 3\), \(5 \times 5\) and \(7 \times 7\) respectively.

Cascade Branches. We conduct an ablation study on cascade branches; the results are shown in Table 2. Our baseline is the half-compressed PSPNet50, which yields 170 ms inference time with mIoU reduced to 67.9%. This indicates that model compression has almost no chance to achieve real-time performance while keeping decent segmentation quality. Based on this baseline, we test our ICNet with different branches. To show the effectiveness of the proposed cascade framework, we denote the outputs of the low-, medium- and high-resolution branches as 'sub4', 'sub24' and 'sub124', where the numbers stand for the information used. The setting 'sub4' only uses the top branch with the low-resolution input. 'sub24' and 'sub124' respectively contain the top two and all three branches.

We test these three settings on the validation set of Cityscapes and list the results in Table 2. With just the low-resolution input branch, although running time is short, the result quality drops to 59.6%. Using two and three branches, we increase mIoU to 66.5% and 67.7% respectively, while the running time only increases by 7 ms and 8 ms. Note that our segmentation quality nearly stays the same as the baseline, and yet is 5.2\(\times \) faster. The memory consumption is significantly reduced by 5.8\(\times \).

Cascade Structure. We also conduct an ablation study on the cascade feature fusion unit and cascade label guidance. The results are shown in Table 3. Compared to deconvolution layers with \(3 \times 3\) and \(5 \times 5\) kernels, the cascade feature fusion unit obtains higher mIoU with similar inference efficiency. Compared to a deconvolution layer with a larger \(7 \times 7\) kernel, the mIoU performance is close, while the cascade feature fusion unit yields faster processing speed. Without cascade label guidance, the performance drops a lot, as shown in the last row.

Table 4. Predicted mIoU and inference time on the Cityscapes test set with image resolution \(1024 \times 2048\). 'DR' stands for the image downsampling ratio during testing (e.g., DR = 4 represents testing at resolution \(256 \times 512\)). Methods trained using both fine and coarse data are marked with '\(\dag \)'.

Methods Comparison. We finally list the mIoU performance and inference time of our proposed ICNet on the test set of Cityscapes. It is trained on the training and validation sets of Cityscapes for 90K iterations. Results are included in Table 4. The reported mIoUs and running times of other methods are taken from the official Cityscapes leaderboard. For fairness, we do not include methods that do not report running time. Many of these methods may have adopted time-consuming multi-scale testing for the best result quality.

Our ICNet yields mIoU 69.5%. It is even quantitatively better than several methods that do not care about speed. It is about 10 points higher than ENet [8] and SQ [9]. Training with both fine and coarse data boosts mIoU performance to 70.6%. ICNet is a 30 fps method on \(1024 \times 2048\) resolution images using only one TitanX GPU card. A video example can be accessed online.

Fig. 6.
figure 6

Visual prediction improvement of ICNet in each branch on Cityscapes dataset.

Fig. 7.
figure 7

Visual prediction improvement of ICNet. White regions in ‘diff1’ and ‘diff2’ denote prediction difference between ‘sub24’ and ‘sub4’, and between ‘sub124’ and ‘sub24’ respectively. (Color figure online)

Visual Improvement. Figures 6 and 7 show the visual results of ICNet on Cityscapes. With the proposed gradual feature fusion steps and cascade label guidance structure, we produce decent prediction results. Intriguingly, the output of the 'sub4' branch can already capture most semantically meaningful objects. But the prediction is coarse due to the low-resolution input, and it misses a few small but important regions, such as poles and traffic signs.

With the help of medium-resolution information, many of these regions are re-estimated and recovered, as shown in the 'sub24' branch. It is noticeable that objects far from the camera, such as a few persons, are still missing and have blurry boundaries. The 'sub124' branch with full-resolution input helps refine these details; the output of this branch is undoubtedly the best. This demonstrates that the different-resolution information is properly exploited in our framework.

Fig. 8.
figure 8

Quantitative analysis of accuracy change in connected components.

Table 5. Results on CamVid test set with time reported on resolution \(720 \times 960\).
Table 6. Results on COCO-Stuff test set with time reported on resolution \(640 \times 640\).

Quantitative Analysis. To further understand the accuracy gain in each branch, we quantitatively analyze the predicted label maps based on connected components. For each connected region \(R_{i}\), we calculate the number of pixels it contains, denoted as \(S_{i}\). Then we count the number of correctly predicted pixels in the corresponding map as \(s_{i}\). The predicted region accuracy \(p_{i}\) in \(R_{i}\) is thus \({s_{i}}/{S_{i}}\). According to the region size \(S_{i}\), we project these regions onto a histogram \(\mathcal {H}\) with interval \(\mathcal {K}\) and average all related region accuracies \(p_{i}\) as the value of the current bin.

In experiments, we set the number of histogram bins to 30 and the interval \(\mathcal {K}\) to 3,000, thus covering region sizes \(S_{i}\) between 1 and 90K. Regions with size exceeding 90K are ignored. Figure 8 shows the accuracy change in each bin. The blue histogram stands for the difference between 'sub24' and 'sub4' while the green histogram shows the difference between 'sub124' and 'sub24'. For both histograms, the large differences mainly lie in the front bins with small region sizes. This manifests that small-region objects like traffic lights and poles are well improved in our framework. The front changes are large positives, proving that 'sub24' can restore much information on small objects on top of 'sub4'. 'sub124' is also very useful compared to 'sub24'.
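For completeness, the following NumPy/SciPy sketch outlines how such a region-accuracy histogram can be computed. The default arguments mirror the settings above, while the 4-connectivity and the per-class grouping of connected components are assumptions of this sketch.

```python
import numpy as np
from scipy import ndimage

def region_accuracy_histogram(gt, pred, num_bins=30, interval=3000):
    """Mean accuracy of connected ground-truth regions, grouped by region size.

    gt, pred: 2D integer label maps of the same shape. A region's accuracy is the
    fraction of its pixels predicted with the region's label.
    """
    sums = np.zeros(num_bins)
    counts = np.zeros(num_bins)
    for cls in np.unique(gt):
        comps, n = ndimage.label(gt == cls)          # connected components of this class
        for i in range(1, n + 1):
            mask = comps == i
            size = mask.sum()
            if size > num_bins * interval:           # ignore regions larger than 90K pixels
                continue
            acc = (pred[mask] == cls).mean()
            b = min(int((size - 1) // interval), num_bins - 1)
            sums[b] += acc
            counts[b] += 1
    return sums / np.maximum(counts, 1)              # mean region accuracy per bin
```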

5.3 CamVid

The CamVid [17] dataset contains images extracted from high-resolution video sequences with resolution up to \(720 \times 960\). For easy comparison with prior work, we adopt the split of Sturgess et al. [42], which partitions the dataset into 367, 100, and 233 images for training, validation and testing respectively. 11 semantic classes are used for evaluation.

The testing results are listed in Table 5; our base model is the uncompressed PSPNet50. ICNet achieves much faster inference than other methods at this high resolution, reaching a real-time speed of 27.8 fps, 5.7 times faster than the second fastest method and 5.1 times faster than the base model. Apart from its high efficiency, it also accomplishes high-quality segmentation. Visual results are provided in the supplementary material.

5.4 COCO-Stuff

COCO-Stuff [18] is a recently labeled dataset based on MS-COCO [43] for stuff segmentation in context. We evaluate ICNet following the split in [18], where 9K images are used for training and another 1K for testing. This dataset is much more complex, with up to 182 classes used for evaluation, including 91 thing and 91 stuff classes.

Table 6 shows the testing results. ICNet still performs satisfactorily on common thing and stuff understanding. It is more efficient and accurate than modern segmentation frameworks such as FCN and DeepLab. Compared to our baseline model, it achieves a 5.4 times speedup. Visual predictions are provided in the supplementary material.

6 Conclusion

We have proposed the real-time semantic segmentation system ICNet. It incorporates effective strategies to accelerate network inference without sacrificing much performance. The major contributions include the new framework that saves operations across multiple resolutions and the powerful fusion unit.

We believe this balance of speed and accuracy makes our system valuable, since it can benefit many other tasks that require fast scene and object segmentation. It greatly enhances the practicality of semantic segmentation in other disciplines.