Real-Time Semantic Segmentation via Auto Depth, Downsampling Joint Decision and Feature Aggregation

Sun, Peng; Wu, Jiaxiang; Li, Songyuan; Lin, Peiwen; Huang, Junzhou; Li, Xi

doi:10.1007/s11263-021-01433-3

Published: 19 February 2021

International Journal of Computer Vision, Volume 129, pages 1506–1525 (2021)

Peng Sun¹, Jiaxiang Wu², Songyuan Li¹, Peiwen Lin³, Junzhou Huang⁴ & Xi Li¹,⁵ (ORCID: orcid.org/0000-0003-3023-1662)

Abstract

To satisfy the stringent computational constraints of real-time semantic segmentation, most approaches focus on the hand-crafted design of lightweight segmentation networks. To automate model design, Neural Architecture Search (NAS) has been introduced to search for the optimal building blocks of networks. However, the network depth, downsampling strategy, and feature aggregation method are still set in advance and cannot be adjusted during the search. Moreover, these key properties are highly correlated and essential for a strong real-time segmentation model. In this paper, we propose a joint search framework, called AutoRTNet, to automate all of these key properties in semantic segmentation. Specifically, we propose hyper-cells to jointly decide the network depth and the downsampling strategy via a novel cell-level pruning process. Furthermore, we propose an aggregation cell to achieve automatic multi-scale feature aggregation. Extensive experimental results on the Cityscapes and CamVid datasets demonstrate that the proposed AutoRTNet achieves a new state-of-the-art trade-off between accuracy and speed. Notably, AutoRTNet achieves 73.9% mIoU on Cityscapes at 110.0 FPS on an NVIDIA Titan Xp GPU with input images at a resolution of \(768 \times 1536\).
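The two headline numbers in the abstract, mIoU and FPS, can be made concrete with a small sketch. The code below uses the standard definition of mean intersection-over-union computed from a confusion matrix, and converts throughput to per-frame latency. It is not the authors' evaluation code: the function names `mean_iou` and `latency_ms` and the toy confusion matrix are assumptions introduced here for illustration, and the paper's exact evaluation protocol is not shown on this page.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] counts pixels of ground-truth class i predicted as class j.
    Classes that appear in neither ground truth nor prediction are skipped.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        gt_total = sum(confusion[c])                   # row sum: ground-truth pixels
        pred_total = sum(row[c] for row in confusion)  # column sum: predicted pixels
        union = gt_total + pred_total - true_pos
        if union > 0:
            ious.append(true_pos / union)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)


def latency_ms(fps):
    """Average per-frame latency in milliseconds for a given throughput."""
    return 1000.0 / fps


# Toy 3-class confusion matrix (illustrative numbers, NOT taken from the paper).
toy = [
    [50, 5, 0],
    [10, 30, 5],
    [0, 5, 45],
]

print(f"toy mIoU: {mean_iou(toy):.3f}")
print(f"latency at 110 FPS: {latency_ms(110.0):.2f} ms")
print(f"pixels per 768 x 1536 frame: {768 * 1536}")
```

Read this way, the reported 110.0 FPS corresponds to roughly 9 ms per frame, during which the network labels every pixel of a \(768 \times 1536\) input.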

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant U20A20222, the Zhejiang Provincial Natural Science Foundation of China under Grant LR19F020004, the National Key Research and Development Program of China under Grant 2020AAA0107400, and a key scientific and technological innovation research project of the Ministry of Education.

Author information

Authors and Affiliations

1. College of Computer Science and Technology, Zhejiang University, Hangzhou, China: Peng Sun, Songyuan Li & Xi Li
2. Tencent AI Lab, Shenzhen, China: Jiaxiang Wu
3. Sensetime Research, Beijing, China: Peiwen Lin
4. University of Texas at Arlington, Arlington, TX, USA: Junzhou Huang
5. Shanghai Institute for Advanced Study, Zhejiang University, Shanghai, China: Xi Li

Corresponding author

Correspondence to Xi Li.

Additional information

Communicated by Antonio Torralba.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sun, P., Wu, J., Li, S. et al. Real-Time Semantic Segmentation via Auto Depth, Downsampling Joint Decision and Feature Aggregation. Int J Comput Vis 129, 1506–1525 (2021). https://doi.org/10.1007/s11263-021-01433-3

Received: 20 February 2020
Accepted: 06 January 2021
Published: 19 February 2021
Issue Date: May 2021
DOI: https://doi.org/10.1007/s11263-021-01433-3
