1 Introduction

Semantic segmentation, whose goal is to assign a semantic label to every pixel in an image [1,2,3,4,5], is one of the fundamental topics in computer vision. Deep convolutional neural networks [6,7,8,9,10] based on the Fully Convolutional Neural Network [8, 11] show striking improvement over systems relying on hand-crafted features [12,13,14,15,16,17] on benchmark tasks. In this work, we consider two types of neural networks for semantic segmentation, those that use a spatial pyramid pooling module [18,19,20] and those that use an encoder-decoder structure [21, 22]: the former captures rich contextual information by pooling features at different resolutions, while the latter is able to obtain sharp object boundaries.

Fig. 1.

We improve DeepLabv3, which employs the spatial pyramid pooling module (a), with the encoder-decoder structure (b). The proposed model, DeepLabv3+, contains rich semantic information from the encoder module, while the detailed object boundaries are recovered by the simple yet effective decoder module. The encoder module allows us to extract features at an arbitrary resolution by applying atrous convolution.

In order to capture the contextual information at multiple scales, DeepLabv3 [23] applies several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP), while PSPNet [24] performs pooling operations at different grid scales. Even though rich semantic information is encoded in the last feature map, detailed information related to object boundaries is missing due to the pooling or convolutions with striding operations within the network backbone. This could be alleviated by applying atrous convolution to extract denser feature maps. However, given the design of state-of-the-art neural networks [7, 9, 10, 25, 26] and limited GPU memory, it is computationally prohibitive to extract output feature maps that are 8, or even 4, times smaller than the input resolution. Taking ResNet-101 [25] as an example, when applying atrous convolution to extract output features that are 16 times smaller than the input resolution, features within the last 3 residual blocks (9 layers) have to be dilated. Even worse, 26 residual blocks (78 layers!) will be affected if output features that are 8 times smaller than the input are desired. Thus, it is computationally intensive to extract denser output features for this type of model. On the other hand, encoder-decoder models [21, 22] lend themselves to faster computation (since no features are dilated) in the encoder path and gradually recover sharp object boundaries in the decoder path. Attempting to combine the advantages of both methods, we propose to enrich the encoder module in encoder-decoder networks by incorporating the multi-scale contextual information.

In particular, our proposed model, called DeepLabv3+, extends DeepLabv3 [23] by adding a simple yet effective decoder module to recover the object boundaries, as illustrated in Fig. 1. The rich semantic information is encoded in the output of DeepLabv3, with atrous convolution allowing one to control the density of the encoder features, depending on the budget of computation resources. Furthermore, the decoder module allows detailed object boundary recovery.

Motivated by the recent success of depthwise separable convolution [26,27,28,29,30], we also explore this operation and show improvement in terms of both speed and accuracy by adapting the Xception model [26], similar to [31], for the task of semantic segmentation, and by applying atrous separable convolution to both the ASPP and decoder modules. Finally, we demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets and attain test set performances of 89.0% and 82.1%, respectively, without any post-processing, setting a new state-of-the-art.

In summary, our contributions are:

  • We propose a novel encoder-decoder structure which employs DeepLabv3 as a powerful encoder module and a simple yet effective decoder module.

  • In our structure, one can arbitrarily control the resolution of the extracted encoder features by atrous convolution to trade off precision and runtime, which is not possible with existing encoder-decoder models.

  • We adapt the Xception model for the segmentation task and apply depthwise separable convolution to both ASPP module and decoder module, resulting in a faster and stronger encoder-decoder network.

  • Our proposed model attains new state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes datasets. We also provide a detailed analysis of design choices and model variants.

  • We make our TensorFlow-based implementation of the proposed model publicly available at https://github.com/tensorflow/models/tree/master/research/deeplab.

2 Related Work

Models based on Fully Convolutional Networks (FCNs) [8, 11] have demonstrated significant improvement on several segmentation benchmarks [1,2,3,4,5]. Several model variants have been proposed to exploit the contextual information for segmentation [12,13,14,15,16,17, 32, 33], including those that employ multi-scale inputs (i.e., image pyramid) [34,35,36,37,38,39] and those that adopt probabilistic graphical models (such as DenseCRF [40] with an efficient inference algorithm [41]) [37, 39, 42,43,44,45,46,47,48,49,50,51]. In this work, we mainly discuss the models that use spatial pyramid pooling and encoder-decoder structures.

Spatial Pyramid Pooling: Models such as PSPNet [24] or DeepLab [23, 39] perform spatial pyramid pooling [18, 19] at several grid scales (including image-level pooling [52]) or apply several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP). These models have shown promising results on several segmentation benchmarks by exploiting the multi-scale information.

Encoder-Decoder: Encoder-decoder networks have been successfully applied to many computer vision tasks, including human pose estimation [53], object detection [54,55,56], and semantic segmentation [11, 21, 22, 57,58,59,60,61,62,63,64]. Typically, an encoder-decoder network contains (1) an encoder module that gradually reduces the feature maps and captures higher-level semantic information, and (2) a decoder module that gradually recovers the spatial information. Building on top of this idea, we propose to use DeepLabv3 [23] as the encoder module and add a simple yet effective decoder module to obtain sharper segmentations.

Depthwise Separable Convolution: Depthwise separable convolution [27, 28] or group convolution [7, 65] is a powerful operation that reduces the computation cost and number of parameters while maintaining similar (or slightly better) performance. This operation has been adopted in many recent neural network designs [26, 29,30,31, 66,67,68]. In particular, we explore the Xception model [26], as [31] did for their COCO 2017 detection challenge submission, and show improvement in terms of both accuracy and speed for the task of semantic segmentation.

3 Methods

In this section, we briefly introduce atrous convolution [8, 42, 69,70,71] and depthwise separable convolution [26,27,28,29, 67]. We then review DeepLabv3 [23] which is used as our encoder module before discussing the proposed decoder module appended to the encoder output. We also present a modified Xception model [26, 31] which further improves the performance with faster computation.

3.1 Encoder-Decoder with Atrous Convolution

Atrous Convolution: Atrous convolution, a powerful tool that allows us to explicitly control the resolution of features computed by deep convolutional neural networks and to adjust the filter's field-of-view in order to capture multi-scale information, generalizes the standard convolution operation. In the case of two-dimensional signals, for each location \(\mathbf{i}\) on the output feature map \(\mathbf{y}\) and a convolution filter \(\mathbf{w}\), atrous convolution is applied over the input feature map \(\mathbf{x}\) as follows:

$$\mathbf{y}[\mathbf{i}] = \sum_{\mathbf{k}} \mathbf{x}[\mathbf{i} + r \cdot \mathbf{k}] \, \mathbf{w}[\mathbf{k}] \qquad (1)$$

where the atrous rate r determines the stride with which we sample the input signal. We refer interested readers to [39] for more details. Note that standard convolution is a special case in which rate \(r = 1\). The filter’s field-of-view is adaptively modified by changing the rate value.
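For concreteness, the following is a minimal NumPy sketch of Eq. (1) for a single-channel two-dimensional input with 'valid' padding; the function name and the loop-based form are ours for illustration only, since in practice the operation is exposed through the dilation arguments of standard convolution ops.

import numpy as np

def atrous_conv2d(x, w, rate=1):
    # Eq. (1): y[i] = sum_k x[i + r*k] * w[k], cross-correlation form as in
    # deep learning frameworks; rate=1 recovers the standard convolution.
    kh, kw = w.shape
    eff_h = (kh - 1) * rate + 1  # effective kernel height after dilation
    eff_w = (kw - 1) * rate + 1
    out_h = x.shape[0] - eff_h + 1
    out_w = x.shape[1] - eff_w + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            for ki in range(kh):
                for kj in range(kw):
                    y[i, j] += x[i + rate * ki, j + rate * kj] * w[ki, kj]
    return y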

Depthwise Separable Convolution: Depthwise separable convolution, factorizing a standard convolution into a depthwise convolution followed by a pointwise convolution (i.e., a \(1\times 1\) convolution), drastically reduces computation complexity. Specifically, the depthwise convolution performs a spatial convolution independently for each input channel, while the pointwise convolution is employed to combine the output from the depthwise convolution. In the TensorFlow [72] implementation of depthwise separable convolution, atrous convolution is supported in the depthwise convolution (i.e., the spatial convolution), as illustrated in Fig. 3. In this work, we refer to the resulting convolution as atrous separable convolution, and we find that atrous separable convolution significantly reduces the computation complexity of the proposed model while maintaining similar (or better) performance.
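As an illustration of the TensorFlow support mentioned above, the sketch below applies an atrous separable convolution with rate 2 through the low-level op; the tensor shapes and random weights are placeholders of ours, not taken from the released implementation.

import tensorflow as tf

x = tf.random.normal([1, 65, 65, 256])            # NHWC input feature map
depthwise_w = tf.random.normal([3, 3, 256, 1])    # one 3x3 spatial filter per channel
pointwise_w = tf.random.normal([1, 1, 256, 256])  # 1x1 convolution mixing channels
y = tf.nn.separable_conv2d(
    x, depthwise_w, pointwise_w,
    strides=[1, 1, 1, 1], padding='SAME',
    dilations=[2, 2])  # atrous rate r = 2 applied in the depthwise step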

Fig. 2.

Our proposed DeepLabv3+ extends DeepLabv3 by employing an encoder-decoder structure. The encoder module encodes multi-scale contextual information by applying atrous convolution at multiple scales, while the simple yet effective decoder module refines the segmentation results along object boundaries.

DeepLabv3 as Encoder: DeepLabv3 [23] employs atrous convolution [8, 69,70,71] to extract the features computed by deep convolutional neural networks at an arbitrary resolution. Here, we denote \({output\ stride}\) as the ratio of input image spatial resolution to the final output resolution (before global pooling or fully-connected layer). For the task of image classification, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution and thus \({output\ stride}=32\). For the task of semantic segmentation, one can adopt \({output\ stride}=16\) (or 8) for denser feature extraction by removing the striding in the last one (or two) block(s) and applying the atrous convolution correspondingly (e.g., we apply \(rate=2\) and \(rate=4\) to the last two blocks respectively for \({output\ stride}=8\)). Additionally, DeepLabv3 augments the Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with the image-level features [52]. We use the last feature map before logits in the original DeepLabv3 as the encoder output in our proposed encoder-decoder structure. Note the encoder output feature map contains 256 channels and rich semantic information. Besides, one could extract features at an arbitrary resolution by applying the atrous convolution, depending on the computation budget.
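The sketch below outlines the encoder head described above (an ASPP with image-level pooling). The atrous rates (6, 12, 18) are the values commonly used with \({output\ stride}=16\), and batch normalization is omitted for brevity, so this should be read as an approximation rather than our exact implementation.

import tensorflow as tf

def aspp_head(features, rates=(6, 12, 18), depth=256):
    # Parallel branches: one 1x1 conv, three 3x3 atrous convs, image-level pooling.
    branches = [tf.keras.layers.Conv2D(depth, 1, padding='same', activation='relu')(features)]
    for r in rates:
        branches.append(tf.keras.layers.Conv2D(
            depth, 3, padding='same', dilation_rate=r, activation='relu')(features))
    size = tf.shape(features)[1:3]
    pooled = tf.reduce_mean(features, axis=[1, 2], keepdims=True)   # image-level features
    pooled = tf.keras.layers.Conv2D(depth, 1, activation='relu')(pooled)
    branches.append(tf.image.resize(pooled, size, method='bilinear'))
    x = tf.concat(branches, axis=-1)
    # Project the concatenated features to 256 channels (the encoder output).
    return tf.keras.layers.Conv2D(depth, 1, padding='same', activation='relu')(x)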

Proposed Decoder: The encoder features from DeepLabv3 are usually computed with \({output\ stride}=16\). In the work of [23], the features are bilinearly upsampled by a factor of 16, which could be considered a naive decoder module. However, this naive decoder module may not successfully recover object segmentation details. We thus propose a simple yet effective decoder module, as illustrated in Fig. 2. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features [73] from the network backbone that have the same spatial resolution (e.g., Conv2 before striding in ResNet-101 [25]). We apply another \(1\times 1\) convolution on the low-level features to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512) which may outweigh the importance of the rich encoder features (only 256 channels in our model) and make the training harder. After the concatenation, we apply a few \(3\times 3\) convolutions to refine the features followed by another simple bilinear upsampling by a factor of 4. We show in Sect. 4 that using \({output\ stride}=16\) for the encoder module strikes the best trade-off between speed and accuracy. The performance is marginally improved when using \({output\ stride}=8\) for the encoder module at the cost of extra computation complexity.
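A minimal functional sketch of this decoder is given below, assuming encoder features at \({output\ stride}=16\) and low-level features at \({output\ stride}=4\); batch normalization and the exact layer hyper-parameters of our released code are omitted.

import tensorflow as tf

def deeplabv3plus_decoder(encoder_features, low_level_features, num_classes):
    # Reduce the low-level channels with a [1x1, 48] convolution.
    low = tf.keras.layers.Conv2D(48, 1, padding='same', activation='relu')(low_level_features)
    size = tf.shape(low)[1:3]
    # Upsample the encoder output by 4 and concatenate with the low-level features.
    x = tf.image.resize(encoder_features, size, method='bilinear')
    x = tf.concat([x, low], axis=-1)
    # Refine with two [3x3, 256] convolutions, predict, and upsample by 4 again.
    x = tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    logits = tf.keras.layers.Conv2D(num_classes, 1, padding='same')(x)
    return tf.image.resize(logits, 4 * size, method='bilinear')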

Fig. 3.

\(3\times 3\) Depthwise separable convolution decomposes a standard convolution into (a) a depthwise convolution (applying a single filter for each input channel) and (b) a pointwise convolution (combining the outputs from depthwise convolution across channels). In this work, we explore atrous separable convolution where atrous convolution is adopted in the depthwise convolution, as shown in (c) with \(rate=2\).

Fig. 4.

We modify the Xception as follows: (1) more layers (same as MSRA’s modification except the changes in Entry flow), (2) all the max pooling operations are replaced by depthwise separable convolutions with striding, and (3) extra batch normalization and ReLU are added after each \(3\times 3\) depthwise convolution, similar to MobileNet.

3.2 Modified Aligned Xception

The Xception model [26] has shown promising image classification results on ImageNet [74] with fast computation. More recently, the MSRA team [31] modified the Xception model (called Aligned Xception) and further pushed the performance on the task of object detection. Motivated by these findings, we work in the same direction to adapt the Xception model for the task of semantic image segmentation. In particular, we make a few more changes on top of MSRA's modifications, namely (1) a deeper Xception, the same as in [31], except that we do not modify the entry flow network structure for fast computation and memory efficiency; (2) all max pooling operations are replaced by depthwise separable convolutions with striding, which enables us to apply atrous separable convolution to extract feature maps at an arbitrary resolution (another option is to extend the atrous algorithm to max pooling operations); and (3) extra batch normalization [75] and ReLU activation are added after each \(3\times 3\) depthwise convolution, similar to the MobileNet design [29]. See Fig. 4 for details.
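To make modifications (2) and (3) concrete, the sketch below shows one separable-convolution unit in the style of the modified Xception; the surrounding block structure (residual connections, channel widths) is omitted and the function name is ours.

import tensorflow as tf

def xception_sep_unit(x, filters, stride=1, rate=1):
    # 3x3 depthwise conv followed by extra BN + ReLU (modification (3)); a stride of 2
    # here replaces a max pooling operation (modification (2)), while a rate > 1 turns
    # the unit into an atrous separable convolution (use stride or rate, not both).
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding='same',
                                        dilation_rate=rate, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # 1x1 pointwise conv, again followed by BN + ReLU.
    x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)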

4 Experimental Evaluation

We employ ImageNet-1k [74] pretrained ResNet-101 [25] or modified aligned Xception [26, 31] to extract dense feature maps by atrous convolution. Our implementation is built on TensorFlow [72] and is made publicly available.

The proposed models are evaluated on the PASCAL VOC 2012 semantic segmentation benchmark [1] which contains 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level annotated images. We augment the dataset by the extra annotations provided by [76], resulting in 10,582 (trainaug) training images. The performance is measured in terms of pixel intersection-over-union averaged across the 21 classes (mIOU).

We follow the same training protocol as in [23] and refer the interested readers to [23] for details. In short, we employ the same learning rate schedule (i.e., “poly” policy [52] and same initial learning rate 0.007), crop size \(513\times 513\), fine-tuning batch normalization parameters [75] when \({output\ stride}=16\), and random scale data augmentation during training. Note that we also include batch normalization parameters in the proposed decoder module. Our proposed model is trained end-to-end without piecewise pretraining of each component.
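For reference, a sketch of the "poly" learning rate policy with the initial learning rate above is shown below; the power of 0.9, the number of training steps, and the momentum value are assumptions on our part (see [23, 52] for the full protocol).

import tensorflow as tf

class PolyDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    # "poly" policy: lr = base_lr * (1 - step / max_steps) ** power.
    def __init__(self, base_lr=0.007, max_steps=30000, power=0.9):
        self.base_lr, self.max_steps, self.power = base_lr, max_steps, power

    def __call__(self, step):
        frac = tf.cast(step, tf.float32) / float(self.max_steps)
        return self.base_lr * (1.0 - frac) ** self.power

# Momentum value below is an assumption, not taken from the text.
optimizer = tf.keras.optimizers.SGD(learning_rate=PolyDecay(), momentum=0.9)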

4.1 Decoder Design Choices

We define “DeepLabv3 feature map” as the last feature map computed by DeepLabv3 (i.e., the features containing ASPP features and image-level features), and \([k\times k, f]\) as a convolution operation with kernel \(k\times k\) and f filters.

When employing \({output\ stride}=16\), ResNet-101 based DeepLabv3 [23] bilinearly upsamples the logits by 16 during both training and evaluation. This simple bilinear upsampling could be considered a naive decoder design, attaining a performance of \(77.21\%\) [23] on the PASCAL VOC 2012 val set, which is \(1.2\%\) better than not using this naive decoder during training (i.e., downsampling the groundtruth during training). To improve over this naive baseline, our proposed model "DeepLabv3+" adds the decoder module on top of the encoder output, as shown in Fig. 2. In the decoder module, we consider three places for different design choices, namely (1) the \(1\times 1\) convolution used to reduce the channels of the low-level feature map from the encoder module, (2) the \(3\times 3\) convolution used to obtain sharper segmentation results, and (3) which encoder low-level features should be used.

To evaluate the effect of the \(1\times 1\) convolution in the decoder module, we employ \([3\times 3, 256]\) and the Conv2 features from the ResNet-101 network backbone, i.e., the last feature map in the res2x residual block (to be concrete, we use the feature map before striding). As shown in Table 1, reducing the channels of the low-level feature map from the encoder module to either 48 or 32 leads to better performance. We thus adopt \([1\times 1, 48]\) for channel reduction.

We then design the \(3\times 3\) convolution structure for the decoder module and report the findings in Table 2. We find that after concatenating the Conv2 feature map (before striding) with the DeepLabv3 feature map, it is more effective to employ two \(3\times 3\) convolutions with 256 filters than simply one or three such convolutions. Changing the number of filters from 256 to 128 or the kernel size from \(3\times 3\) to \(1\times 1\) degrades performance. We also experiment with the case where both the Conv2 and Conv3 feature maps are exploited in the decoder module. In this case, the decoder feature map is gradually upsampled by a factor of 2, concatenated with Conv3 first and then with Conv2, and each concatenation is refined by the \([3\times 3, 256]\) operation. The whole decoding procedure is then similar to the U-Net/SegNet design [21, 22]. However, we have not observed significant improvement. Thus, in the end, we adopt the very simple yet effective decoder module: the concatenation of the DeepLabv3 feature map and the channel-reduced Conv2 feature map is refined by two \([3\times 3, 256]\) operations. Note that our proposed DeepLabv3+ model has \({output\ stride}=4\). We do not pursue even denser output feature maps (i.e., \({output\ stride} < 4\)) given the limited GPU resources.

Table 1. PASCAL VOC 2012 val set. Effect of the decoder \(1\times 1\) convolution used to reduce the channels of the low-level feature map from the encoder module. We fix the other components in the decoder structure as using \([3\times 3, 256]\) and Conv2.
Table 2. Effect of the decoder structure when fixing \([1\times 1, 48]\) to reduce the encoder feature channels. We found that it is most effective to use the Conv2 (before striding) feature map and two extra \([3\times 3, 256]\) operations. Performance on the VOC 2012 val set.

4.2 ResNet-101 as Network Backbone

To compare the model variants in terms of both accuracy and speed, we report mIOU and Multiply-Adds in Table 3 when using ResNet-101 [25] as network backbone in the proposed DeepLabv3+ model. Thanks to atrous convolution, we are able to obtain features at different resolutions during training and evaluation using a single model.

Baseline: The first row block in Table 3 contains the results from [23] showing that extracting denser feature maps during evaluation (i.e., \({eval\ output\ stride} = 8\)) and adopting multi-scale inputs increases performance. Besides, adding left-right flipped inputs doubles the computation complexity with only marginal performance improvement.

Adding Decoder: The second row block in Table 3 contains the results when adopting the proposed decoder structure. The performance is improved from \(77.21\%\) to \(78.85\%\) when using \({eval\ output\ stride} = 16\), and from \(78.51\%\) to \(79.35\%\) when using \({eval\ output\ stride} = 8\), at the cost of about 20B extra Multiply-Adds of computation overhead. The performance is further improved when using multi-scale and left-right flipped inputs.

Coarser Feature Maps: We also experiment with the case when using \({train\ output\ stride} = 32\) (i.e., no atrous convolution at all during training) for fast computation. As shown in the third row block in Table 3, adding the decoder brings about 2% improvement while only 74.20B Multiply-Adds are required. However, the performance is always about 1% to 1.5% below the case in which we employ \({train\ output\ stride} = 16\) and different \({eval\ output\ stride}\) values. We thus prefer using \({output\ stride} = 16\) or 8 during training or evaluation depending on the complexity budget.

4.3 Xception as Network Backbone

We further employ the more powerful Xception [26] as network backbone. Following [31], we make a few more changes, as described in Sect. 3.2.

Table 3. Inference strategy on the PASCAL VOC 2012 val set using ResNet-101. train OS: The \({output\ stride}\) used during training. eval OS: The \({output\ stride}\) used during evaluation. Decoder: Employing the proposed decoder structure. MS: Multi-scale inputs during evaluation. Flip: Adding left-right flipped inputs.
Table 4. Single-model error rates on ImageNet-1K validation set.

ImageNet Pretraining: The proposed Xception network is pretrained on the ImageNet-1k dataset [74] with a training protocol similar to [26]. Specifically, we adopt the Nesterov momentum optimizer with momentum = 0.9, initial learning rate = 0.05, rate decay = 0.94 every 2 epochs, and weight decay \(4e-5\). We use asynchronous training with 50 GPUs, each with batch size 32 and image size \(299\times 299\). We did not tune the hyper-parameters very hard, as the goal is to pretrain the model on ImageNet for semantic segmentation. We report the single-model error rates on the validation set in Table 4, along with the baseline reproduced ResNet-101 [25] under the same training protocol. We observe 0.75% and 0.29% performance degradation in Top1 and Top5 accuracy, respectively, when not adding the extra batch normalization and ReLU after each \(3\times 3\) depthwise convolution in the modified Xception.
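The optimizer settings described above correspond roughly to the following sketch; the number of steps per two epochs is a placeholder that depends on the input pipeline, and weight decay would in practice be applied as L2 regularization on the convolution kernels.

import tensorflow as tf

steps_per_two_epochs = 80000  # placeholder: depends on dataset size and total batch size
lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.05,   # initial learning rate from the text
    decay_steps=steps_per_two_epochs,
    decay_rate=0.94,              # rate decay of 0.94 every 2 epochs
    staircase=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9, nesterov=True)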

The results of using the proposed Xception as network backbone for semantic segmentation are reported in Table 5.

Baseline: We first report the results without using the proposed decoder in the first row block in Table 5, which shows that employing Xception as network backbone improves the performance by about 2% when \({train\ output\ stride} = {eval\ output\ stride} = 16\) over the case where ResNet-101 is used. Further improvement can also be obtained by using \({eval\ output\ stride} = 8\), multi-scale inputs during inference and adding left-right flipped inputs. Note that we do not employ the multi-grid method [23, 77, 78], which we found does not improve the performance.

Adding Decoder: As shown in the second row block in Table 5, adding the decoder brings about 0.8% improvement when using \({eval\ output\ stride}=16\) for all the different inference strategies. The improvement becomes smaller when using \({eval\ output\ stride} = 8\).

Using Depthwise Separable Convolution: Motivated by the efficient computation of depthwise separable convolution, we further adopt it in the ASPP and the decoder modules. As shown in the third row block in Table 5, the computation complexity in terms of Multiply-Adds is significantly reduced by 33% to 41%, while similar mIOU performance is obtained.

Pretraining on COCO: For comparison with other state-of-the-art models, we further pretrain our proposed DeepLabv3+ model on the MS-COCO dataset [79], which yields an extra improvement of about 2% for all the different inference strategies.

Pretraining on JFT: Similar to [23], we also employ the proposed Xception model that has been pretrained on both the ImageNet-1k [74] and JFT-300M [26, 80, 81] datasets, which brings an extra 0.8% to 1% improvement.

Test Set Results: Since computation complexity is not considered in the benchmark evaluation, we opt for the best-performing model and train it with \({output\ stride}=8\) and frozen batch normalization parameters. In the end, our 'DeepLabv3+' achieves test set performances of 87.8% and 89.0% without and with JFT dataset pretraining, respectively.

Qualitative Results: We provide visual results of our best model in Fig. 6. As shown in the figure, our model is able to segment objects very well without any post-processing.

Failure Mode: As shown in the last row of Fig. 6, our model has difficulty segmenting (a) sofa vs. chair, (b) heavily occluded objects, and (c) objects in rare views.

Table 5. Inference strategy on the PASCAL VOC 2012 val set when using modified Xception. train OS: The \({output\ stride}\) used during training. eval OS: The \({output\ stride}\) used during evaluation. Decoder: Employing the proposed decoder structure. MS: Multi-scale inputs during evaluation. Flip: Adding left-right flipped inputs. SC: Adopting depthwise separable convolution for both ASPP and decoder modules. COCO: Models pretrained on MS-COCO. JFT: Models pretrained on JFT.
Table 6. PASCAL VOC 2012 test set results with top-performing models.
Fig. 5.

(a) mIOU as a function of trimap band width around the object boundaries when employing \({train\ output\ stride} = {eval\ output\ stride} = 16\). BU: Bilinear upsampling. (b) Qualitative effect of employing the proposed decoder module compared with the naive bilinear upsampling (denoted as BU). In the examples, we adopt Xception as feature extractor and \({train\ output\ stride} = {eval\ output\ stride} = 16\).

4.4 Improvement Along Object Boundaries

In this subsection, we evaluate the segmentation accuracy with the trimap experiment [14, 39, 40] to quantify the accuracy of the proposed decoder module near object boundaries. Specifically, we apply morphological dilation to the 'void' label annotations on the val set, which typically occur around object boundaries. We then compute the mean IOU for those pixels that lie within the dilated band (called the trimap) of 'void' labels. As shown in Fig. 5(a), employing the proposed decoder for both the ResNet-101 [25] and Xception [26] network backbones improves the performance compared to the naive bilinear upsampling. The improvement is more significant when the dilated band is narrow. We observe 4.8% and 5.4% mIOU improvement for ResNet-101 and Xception, respectively, at the smallest trimap width shown in the figure. We also visualize the effect of employing the proposed decoder in Fig. 5(b).
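The band used in this experiment can be obtained as in the sketch below; the 'void' label id of 255 and the use of binary dilation with the default cross-shaped structuring element are assumptions on our part, and the mIOU itself is then computed only over the pixels inside the returned mask.

import numpy as np
from scipy import ndimage

def trimap_band(label, void_id=255, band_width=5):
    # Dilate the 'void' annotations by band_width pixels; segmentation accuracy
    # is then evaluated only on pixels inside this band around object boundaries.
    void = (label == void_id)
    return ndimage.binary_dilation(void, iterations=band_width)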

Fig. 6.

Visualization results on val set. The last row shows a failure mode.

Table 7. (a) DeepLabv3+ on the Cityscapes val set when trained with train_fine set. (b) DeepLabv3+ on Cityscapes test set. Coarse: Use train_extra set (coarse annotations) as well. Only a few top models are listed in this table.

4.5 Experimental Results on Cityscapes

In this section, we evaluate DeepLabv3+ on the Cityscapes dataset [3], a large-scale dataset containing high-quality pixel-level annotations of 5000 images (2975, 500, and 1525 for the training, validation, and test sets, respectively) and about 20000 coarsely annotated images.

As shown in Table 7(a), employing the proposed Xception model as network backbone (denoted as X-65) on top of DeepLabv3 [23], which includes the ASPP module and image-level features [52], attains a performance of 77.33% on the validation set. Adding the proposed decoder module significantly improves the performance to 78.79% (a 1.46% improvement). We notice that removing the augmented image-level features improves the performance to 79.14%, showing that in the DeepLab model, the image-level features are more effective on the PASCAL VOC 2012 dataset. We also find that on the Cityscapes dataset, it is effective to add more layers in the entry flow of Xception [26], the same as what [31] did for the object detection task. The resulting model, built on top of the deeper network backbone (denoted as X-71 in the table), attains the best performance of 79.55% on the validation set.

After finding the best model variant on the val set, we further fine-tune the model on the coarse annotations in order to compete with other state-of-the-art models. As shown in Table 7(b), our proposed DeepLabv3+ attains a performance of 82.1% on the test set, setting a new state-of-the-art performance on Cityscapes.

5 Conclusion

Our proposed model “DeepLabv3+” employs the encoder-decoder structure where DeepLabv3 is used to encode the rich contextual information and a simple yet effective decoder module is adopted to recover the object boundaries. One could also apply the atrous convolution to extract the encoder features at an arbitrary resolution, depending on the available computation resources. We also explore the Xception model and atrous separable convolution to make the proposed model faster and stronger. Finally, our experimental results show that the proposed model sets a new state-of-the-art performance on PASCAL VOC 2012 and Cityscapes datasets.