Lightweight and Progressively-Scalable Networks for Semantic Segmentation

International Journal of Computer Vision

Abstract

Multi-scale learning frameworks are widely regarded as a capable class of models for semantic segmentation. Deploying them is nevertheless non-trivial, especially in real-world settings that demand low inference latency. In this paper, we thoroughly analyze the design of convolutional blocks (the type of convolution and the number of channels) and the ways of interacting across multiple scales, all from a lightweight standpoint for semantic segmentation. From these in-depth comparisons we derive three principles and accordingly devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand network complexity in a greedy manner. Technically, LPS-Net first capitalizes on the principles to build a tiny network. It then progressively scales the tiny network to larger ones by expanding a single dimension (the number of convolutional blocks, the number of channels, or the input resolution) at a time to reach the best speed/accuracy tradeoff. Extensive experiments on three datasets consistently demonstrate the superiority of LPS-Net over several efficient semantic segmentation methods. Most notably, LPS-Net achieves 73.4% mIoU on the Cityscapes test set at 413.5 FPS on an NVIDIA GTX 1080Ti, a 1.5% gain in mIoU and a 65% speed-up over the state-of-the-art STDC. Code is available at https://github.com/YihengZhang-CV/LPS-Net.
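
The greedy, one-dimension-at-a-time expansion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Config` class, the step sizes, the selection rule (accuracy gain per unit of added latency), and the `toy_accuracy` / `toy_latency` proxies are all assumptions made for illustration. In practice each candidate network would be trained and timed on real hardware rather than scored with closed-form proxies.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Config:
    blocks: int      # number of convolutional blocks
    channels: int    # number of channels per block
    resolution: int  # input resolution in pixels


def candidates(cfg: Config) -> List[Config]:
    # Each candidate expands exactly one dimension; step sizes are assumptions.
    return [
        replace(cfg, blocks=cfg.blocks + 1),
        replace(cfg, channels=cfg.channels + 8),
        replace(cfg, resolution=cfg.resolution + 64),
    ]


def greedy_scale(
    start: Config,
    accuracy: Callable[[Config], float],
    latency: Callable[[Config], float],
    budget: float,
) -> List[Config]:
    # Returns the sequence of networks visited, from the tiny start onward.
    path = [start]
    current = start
    while True:
        best: Optional[Config] = None
        best_gain = 0.0
        for cand in candidates(current):
            extra_latency = latency(cand) - latency(current)
            if latency(cand) > budget or extra_latency <= 0:
                continue  # over budget, or no measurable cost increase
            # Assumed criterion: accuracy gained per unit of added latency.
            gain = (accuracy(cand) - accuracy(current)) / extra_latency
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:
            return path  # no affordable expansion improves accuracy
        current = best
        path.append(current)


# Hypothetical proxies, standing in for training and timing real networks.
def toy_latency(c: Config) -> float:
    return c.blocks * c.channels ** 2 * c.resolution ** 2 / 1e9


def toy_accuracy(c: Config) -> float:
    return 10 * math.log(c.blocks) + 8 * math.log(c.channels) + 12 * math.log(c.resolution)


path = greedy_scale(Config(1, 8, 256), toy_accuracy, toy_latency, budget=0.05)
for cfg in path:
    print(cfg, round(toy_latency(cfg), 4))
```

Every entry in `path` differs from its predecessor in exactly one dimension, so the sequence forms a family of progressively larger networks rather than a single final design.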

Data Availability

The image data that support the findings of this study are available in Cityscapes (Cordts et al., 2016) (https://www.cityscapes-dataset.com/), BDD100K (Yu et al., 2018b) (https://bdd-data.berkeley.edu/), CamVid (Brostow et al., 2008) (http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/), COCO (Lin et al., 2014) (https://cocodataset.org/), and DUTS (Wang et al., 2017) (http://saliencydetection.net/duts/).

References

  • Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. European Conference on Computer Vision (ECCV), 5302, 44–57. https://doi.org/10.1007/978-3-540-88682-2_5

  • Chen, L.C., Yang, Y., Wang, J., Xu, W., & Yuille, A.L. (2016). Attention to scale: Scale-aware semantic image segmentation. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3640–3649). https://doi.org/10.1109/CVPR.2016.396.

  • Chen, L. C., Collins, M., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., Adam, H., & Shlens, J. (2018). Searching for efficient multi-scale architectures for dense image prediction. Advances in Neural Information Processing Systems (NeurIPS), 31, 8713–8724.

  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184

  • Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. European Conference on Computer Vision (ECCV), 11211, 801–818. https://doi.org/10.1007/978-3-030-01234-2_49

  • Chen, W., Gong, X., Liu, X., Zhang, Q., Li, Y., & Wang, Z. (2020). Fasterseg: Searching for faster real-time semantic segmentation. In International conference on learning representations (ICLR)

  • Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1800–1807). https://doi.org/10.1109/CVPR.2017.195.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3213–3223). https://doi.org/10.1109/CVPR.2016.350.

  • Ding, M., Lian, X., Yang, L., Wang, P., Jin, X., Lu, Z., & Luo, P. (2021a). Hr-nas: Searching efficient high-resolution neural architectures with lightweight transformers. In 2021 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2982–2992). https://doi.org/10.1109/CVPR46437.2021.00300.

  • Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., & Sun, J. (2021b). Repvgg: Making vgg-style convnets great again. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 13728–13737). https://doi.org/10.1109/CVPR46437.2021.01352.

  • Fan, M., Lai, S., Huang, J., Wei, X., Chai, Z., Luo, J., & Wei, X. (2021). Rethinking bisenet for real-time semantic segmentation. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 9711–9720). https://doi.org/10.1109/CVPR46437.2021.00959

  • Fu, J., Liu, J., Wang, Y., Li, Y., Bao, Y., Tang, J., & Lu, H. (2019). Adaptive context network for scene parsing. In 2019 IEEE/CVF international conference on computer vision (ICCV) (pp. 6747–6756). https://doi.org/10.1109/ICCV.2019.00685.

  • Ghiasi, G., & Fowlkes, C. C. (2016). Laplacian pyramid reconstruction and refinement for semantic segmentation. European Conference on Computer Vision (ECCV), 9907, 519–534. https://doi.org/10.1007/978-3-319-46487-9_32

  • Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P., Zhao, S., & Keutzer, K. (2018). Squeezenext: Hardware-aware neural network design. In 2018 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW) (pp. 1719–171909). https://doi.org/10.1109/CVPRW.2018.00215.

  • Goldberg, D.E., & Deb, K. (1990). A comparative analysis of selection schemes used in genetic algorithms. In Foundations of genetic algorithms (pp. 69–93). https://doi.org/10.1016/b978-0-08-050684-5.50008-2.

  • Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., & Xu, C. (2020). Ghostnet: More features from cheap operations. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1577–1586). https://doi.org/10.1109/CVPR42600.2020.00165.

  • Han, S., Mao, H., & Dally, W.J. (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International conference on learning representations (ICLR).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90.

  • Hong, Y., Pan, H., Sun, W., & Jia, Y. (2021). Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. ArXiv:2101.06085.

  • Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. ArXiv:1704.04861

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), 25, 1106–1114.

  • Ladický, L., Russell, C., Kohli, P., & Torr, P.H. (2009). Associative hierarchical crfs for object class image segmentation. In 2009 IEEE 12th international conference on computer vision (pp. 739–746). https://doi.org/10.1109/ICCV.2009.5459248

  • Li, G., Yun, I., Kim, J., & Kim, J. (2019). Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. In British machine vision conference (BMVC)

  • Li, H., Xiong, P., Fan, H., & Sun, J. (2019a). Dfanet: Deep feature aggregation for real-time semantic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 9514–9523). https://doi.org/10.1109/CVPR.2019.00975.

  • Li, P., Dong, X., Yu, X., & Yang, Y. (2020a). When humans meet machines: Towards efficient segmentation networks. In British machine vision conference (BMVC)

  • Li, X., Zhou, Y., Pan, Z., & Feng, J. (2019b). Partial order pruning: For best speed/accuracy trade-off in neural architecture search. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 9137–9145). https://doi.org/10.1109/CVPR.2019.00936.

  • Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., Tan, S., & Tong, Y. (2020). Semantic flow for fast and accurate scene parsing. European Conference on Computer Vision (ECCV), 12346, 775–793. https://doi.org/10.1007/978-3-030-58452-8_45

  • Li, X., Zhang, J., Yang, Y., Cheng, G., Yang, K., Tong, Y., & Tao, D. (2022). Sfnet: Faster, accurate, and domain agnostic semantic segmentation via semantic flow. ArXiv:2207.04415

  • Lin, D., Shen, D., Shen, S., Ji, Y., Lischinski, D., Cohen-Or, D., & Huang, H. (2019). Zigzagnet: Fusing top-down and bottom-up context for object segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7482–7491). https://doi.org/10.1109/CVPR.2019.00767.

  • Lin, P., Sun, P., Cheng, G., Xie, S., Li, X., & Shi, J. (2020). Graph-guided architecture search for real-time semantic segmentation. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4202–4211). https://doi.org/10.1109/CVPR42600.2020.00426.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. European Conference on Computer Vision (ECCV), 8693, 740–755. https://doi.org/10.1007/978-3-319-10602-1_48

  • Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L. J., Fei-Fei, L., Yuille, A. L., Huang, J., & Murphy, K. P. (2017). Progressive neural architecture search. European Conference on Computer Vision (ECCV), 11205, 19–35. https://doi.org/10.1007/978-3-030-01246-5_2

  • Liu, C., Chen, L.C., Schroff, F., Adam, H., Hua, W., Yuille, A.L., & Fei-Fei, L. (2019a). Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 82–92). https://doi.org/10.1109/CVPR.2019.00017.

  • Liu, H., Simonyan, K., & Yang, Y. (2019b). DARTS: Differentiable architecture search. In International conference on learning representations (ICLR)

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In 2015 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3431–3440). https://doi.org/10.1109/CVPR.2015.7298965.

  • Ma, N., Zhang, X., Zheng, H. T., & Sun, J. (2018). Shufflenet v2: Practical guidelines for efficient CNN architecture design. European Conference on Computer Vision (ECCV), 11218, 116–131. https://doi.org/10.1007/978-3-030-01264-9_8

  • Nirkin, Y., Wolf, L., & Hassner, T. (2021). Hyperseg: Patch-wise hypernetwork for real-time semantic segmentation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4060–4069). https://doi.org/10.1109/CVPR46437.2021.00405.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., & Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS) (Vol. 32, pp. 8024–8035).

  • Peng, C., Zhang, X., Yu, G., Luo, G., & Sun, J. (2017) Large kernel matters—Improve semantic segmentation by global convolutional network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1743–1751). https://doi.org/10.1109/CVPR.2017.189.

  • Poudel, R.P., Liwicki, S., & Cipolla, R. (2019). Fast-scnn: Fast semantic segmentation network. In British Machine Vision Conference (BMVC)

  • Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O. R., & Jagersand, M. (2020). U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106, 107404. https://doi.org/10.1016/j.patcog.2020.107404

  • Qiu, Z., Yao, T., & Mei, T. (2017). Learning deep spatio-temporal dependence for semantic video segmentation. IEEE Transactions on Multimedia, 20(4), 939–949. https://doi.org/10.1109/TMM.2017.2759504

  • Qiu, Z., Yao, T., Zhang, Y., Zhang, Y., & Mei, T. (2019). Scheduled differentiable architecture search for visual recognition. ArXiv:1909.10236

  • Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q.V., & Kurakin, A. (2017). Large-scale evolution of image classifiers. In 34th International Conference on Machine Learning (ICML) (Vol. 70, pp. 2902–2911).

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). https://doi.org/10.1109/CVPR.2018.00474

  • Si, H., Zhang, Z., & Lu, F. (2020). Real-time semantic segmentation via multiply spatial fusion network. In British Machine Vision Conference (BMVC)

  • Sun, P., Wu, J., Li, S., Lin, P., Huang, J., & Li, X. (2021). Real-time semantic segmentation via auto depth, downsampling joint decision and feature aggregation. International Journal of Computer Vision (IJCV), 129(5), 1506–1525. https://doi.org/10.1007/s11263-021-01433-3

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2818–2826). https://doi.org/10.1109/CVPR.2016.308.

  • Tan, M., & Le, Q.V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In 36th International Conference on Machine Learning (ICML) (Vol. 97, pp. 6105–6114)

  • Tao, A., Sapra, K., & Catanzaro, B. (2020). Hierarchical multi-scale attention for semantic segmentation. ArXiv:2005.10821

  • Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., & Xiao, B. (2021). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364. https://doi.org/10.1109/TPAMI.2020.2983686

  • Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., & Ruan, X. (2017). Learning to detect salient objects with image-level supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 136–145). https://doi.org/10.1109/CVPR.2017.404.

  • Wu, B., Li, C., Zhang, H., Dai, X., Zhang, P., Yu, M., Wang, J., Lin, Y., & Vajda, P. (2021a). Fbnetv5: Neural architecture search for multiple tasks in one run. ArXiv:2111.10007

  • Wu, Y., Li, X., Shi, C., Tong, Y., Hua, Y., Song, T., Ma, R., & Guan, H. (2021b). Fast and accurate scene parsing via bi-direction alignment networks. In 2021 IEEE International Conference on Image Processing (ICIP) (pp. 2508–2512). https://doi.org/10.1109/ICIP42928.2021.9506720

  • Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., & Sang, N. (2018). Bisenet: Bilateral segmentation network for real-time semantic segmentation. European Conference on Computer Vision (ECCV), 11217, 325–341. https://doi.org/10.1007/978-3-030-01261-8_20

  • Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., & Sang, N. (2021). Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision (IJCV), 129(11), 3051–3068. https://doi.org/10.1007/s11263-021-01515-2

  • Yu, C., Xiao, B., Gao, C., Yuan, L., Zhang, L., Sang, N., & Wang, J. (2021b). Lite-hrnet: A lightweight high-resolution network. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 10435–10445). https://doi.org/10.1109/CVPR46437.2021.01030.

  • Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 636–644). https://doi.org/10.1109/CVPR.2017.75.

  • Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., & Darrell, T. (2018b). Bdd100k: A diverse driving video database with scalable annotation tooling. ArXiv:1805.04687

  • Yuan, Y., Huang, L., Guo, J., Zhang, C., Chen, X., & Wang, J. (2021). Ocnet: Object context for semantic segmentation. International Journal of Computer Vision (IJCV), 129(8), 2375–2398. https://doi.org/10.1007/s11263-021-01465-9

  • Zhang, J., Pan, Y., Yao, T., Zhao, H., & Mei, T. (2019a). dabnn: A super fast inference framework for binary neural networks on arm devices. In Proceedings of the 27th ACM international conference on multimedia (pp. 2272–2275). https://doi.org/10.1145/3343031.3350534.

  • Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6848–6856). https://doi.org/10.1109/CVPR.2018.00716

  • Zhang, X., Xu, H., Mo, H., Tan, J., Yang, C., Wang, L., & Ren, W. (2021). Dcnas: Densely connected neural architecture search for semantic image segmentation. In 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 13956–13967). https://doi.org/10.1109/CVPR46437.2021.01374.

  • Zhang, Y., Qiu, Z., Liu, J., Yao, T., Liu, D., & Mei, T. (2019b). Customizable architecture search for semantic segmentation. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11633–11642). https://doi.org/10.1109/CVPR.2019.01191.

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6230–6239). https://doi.org/10.1109/CVPR.2017.660.

  • Zhao, H., Qi, X., Shen, X., Shi, J., & Jia, J. (2018). Icnet for real-time semantic segmentation on high-resolution images. European Conference on Computer Vision (ECCV), 11207, 405–420. https://doi.org/10.1007/978-3-030-01219-9_25

  • Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. ArXiv:1904.07850

  • Zoph, B., & Le, Q.V. (2017). Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR)

  • Zoph, B., Vasudevan, V., Shlens, J., & Le, Q.V. (2018). Learning transferable architectures for scalable image recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8697–8710). https://doi.org/10.1109/CVPR.2018.00907.

Author information

Corresponding author

Correspondence to Ting Yao.

Additional information

Communicated by Jifeng Dai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Yao, T., Qiu, Z. et al. Lightweight and Progressively-Scalable Networks for Semantic Segmentation. Int J Comput Vis 131, 2153–2171 (2023). https://doi.org/10.1007/s11263-023-01801-1

  • DOI: https://doi.org/10.1007/s11263-023-01801-1