Abstract
Multi-scale learning frameworks are widely regarded as an effective class of models for semantic segmentation. Designing them is nevertheless non-trivial, especially for real-world deployments, which often demand low inference latency. In this paper, we thoroughly analyze the design of convolutional blocks (the type of convolutions and the number of channels in them) and the ways of interaction across multiple scales, all from a lightweight standpoint for semantic segmentation. From these in-depth comparisons, we derive three principles and accordingly devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand network complexity in a greedy manner. Technically, LPS-Net first capitalizes on the principles to build a tiny network. LPS-Net then progressively scales the tiny network to larger ones by expanding a single dimension (the number of convolutional blocks, the number of channels, or the input resolution) at a time to reach the best speed/accuracy tradeoff. Extensive experiments on three datasets consistently demonstrate the superiority of LPS-Net over several efficient semantic segmentation methods. More remarkably, LPS-Net achieves 73.4% mIoU on the Cityscapes test set at 413.5 FPS on an NVIDIA GTX 1080Ti, a performance improvement of 1.5% and a 65% speed-up over the state-of-the-art STDC. Code is available at https://github.com/YihengZhang-CV/LPS-Net.
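To make the greedy, one-dimension-at-a-time expansion concrete, the following is a minimal Python sketch rather than the authors' implementation. The `Config` class, the step sizes, the selection rule (accuracy gain per unit of added latency), and the toy accuracy and latency functions are all assumptions introduced for illustration; in practice the two functions would be replaced by real training and timing measurements.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Config:
    """Hypothetical network description with three scalable dimensions."""
    blocks: int      # number of convolutional blocks
    channels: int    # number of channels
    resolution: int  # input resolution in pixels


def expand_options(cfg: Config) -> List[Config]:
    """Candidates that each grow exactly one dimension (step sizes are assumed)."""
    return [
        replace(cfg, blocks=cfg.blocks + 1),
        replace(cfg, channels=cfg.channels + 8),
        replace(cfg, resolution=cfg.resolution + 64),
    ]


def greedy_scale(
    start: Config,
    accuracy: Callable[[Config], float],
    latency: Callable[[Config], float],
    steps: int,
) -> List[Config]:
    """Grow the network one dimension per step, keeping the best trade-off."""
    path = [start]
    current = start
    for _ in range(steps):
        base_acc, base_lat = accuracy(current), latency(current)

        def gain(candidate: Config) -> float:
            # Assumed criterion: accuracy gained per unit of extra latency.
            extra = max(latency(candidate) - base_lat, 1e-9)
            return (accuracy(candidate) - base_acc) / extra

        current = max(expand_options(current), key=gain)
        path.append(current)
    return path


def toy_accuracy(cfg: Config) -> float:
    """Made-up proxy with diminishing returns; not a measurement."""
    return (10 * math.log(cfg.blocks)
            + 5 * math.log(cfg.channels)
            + 8 * math.log(cfg.resolution))


def toy_latency(cfg: Config) -> float:
    """Made-up proxy: cost grows with depth, width and pixel count."""
    return cfg.blocks * cfg.channels * cfg.resolution ** 2 * 1e-6


if __name__ == "__main__":
    for cfg in greedy_scale(Config(1, 8, 128), toy_accuracy, toy_latency, steps=5):
        print(cfg)
```

Each step evaluates three candidates and commits to one, so the number of evaluations grows linearly with the number of steps instead of exponentially, as it would if all combinations of the three dimensions were searched jointly.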
Data Availability
The image data that support the findings of this study are available in Cityscapes (Cordts et al., 2016) (https://www.cityscapes-dataset.com/), BDD100K (Yu et al., 2018b) (https://bdd-data.berkeley.edu/), CamVid (Brostow et al., 2008) (http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/), COCO (Lin et al., 2014) (https://cocodataset.org/), and DUTS (Wang et al., 2017) (http://saliencydetection.net/duts/).
References
Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. European Conference on Computer Vision (ECCV), 5302, 44–57. https://doi.org/10.1007/978-3-540-88682-2_5
Chen, L.C., Yang, Y., Wang, J., Xu, W., & Yuille, A.L. (2016). Attention to scale: Scale-aware semantic image segmentation. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3640–3649). https://doi.org/10.1109/CVPR.2016.396.
Chen, L. C., Collins, M., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., Adam, H., & Shlens, J. (2018). Searching for efficient multi-scale architectures for dense image prediction. Advances in Neural Information Processing Systems (NeurIPS), 31, 8713–8724.
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184
Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. European Conference on Computer Vision (ECCV), 11211, 801–818. https://doi.org/10.1007/978-3-030-01234-2_49
Chen, W., Gong, X., Liu, X., Zhang, Q., Li, Y., & Wang, Z. (2020). Fasterseg: Searching for faster real-time semantic segmentation. In International conference on learning representations (ICLR).
Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1800–1807). https://doi.org/10.1109/CVPR.2017.195.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3213–3223). https://doi.org/10.1109/CVPR.2016.350.
Ding, M., Lian, X., Yang, L., Wang, P., Jin, X., Lu, Z., & Luo, P. (2021a). Hr-nas: Searching efficient high-resolution neural architectures with lightweight transformers. In 2021 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2982–2992). https://doi.org/10.1109/CVPR46437.2021.00300.
Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., & Sun, J. (2021b). Repvgg: Making vgg-style convnets great again. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 13728–13737). https://doi.org/10.1109/CVPR46437.2021.01352.
Fan, M., Lai, S., Huang, J., Wei, X., Chai, Z., Luo, J., & Wei, X. (2021). Rethinking bisenet for real-time semantic segmentation. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 9711–9720). https://doi.org/10.1109/CVPR46437.2021.00959
Fu, J., Liu, J., Wang, Y., Li, Y., Bao, Y., Tang, J., & Lu, H. (2019). Adaptive context network for scene parsing. In 2019 IEEE/CVF international conference on computer vision (ICCV) (pp. 6747–6756). https://doi.org/10.1109/ICCV.2019.00685.
Ghiasi, G., & Fowlkes, C. C. (2016). Laplacian pyramid reconstruction and refinement for semantic segmentation. European Conference on Computer Vision (ECCV), 9907, 519–534. https://doi.org/10.1007/978-3-319-46487-9_32
Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P., Zhao, S., & Keutzer, K. (2018). Squeezenext: Hardware-aware neural network design. In 2018 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW) (pp. 1719–171909). https://doi.org/10.1109/CVPRW.2018.00215.
Goldberg, D.E., & Deb, K. (1990). A comparative analysis of selection schemes used in genetic algorithms. In Foundations of genetic algorithms (pp. 69–93). https://doi.org/10.1016/b978-0-08-050684-5.50008-2.
Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., & Xu, C. (2020). Ghostnet: More features from cheap operations. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1577–1586). https://doi.org/10.1109/CVPR42600.2020.00165.
Han, S., Mao, H., & Dally, W.J. (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International conference on learning representations (ICLR).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90.
Hong, Y., Pan, H., Sun, W., & Jia, Y. (2021). Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. ArXiv:2101.06085.
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. ArXiv:1704.04861
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), 25, 1106–1114.
Ladický, L., Russell, C., Kohli, P., & Torr, P.H. (2009). Associative hierarchical crfs for object class image segmentation. In 2009 IEEE 12th international conference on computer vision (pp. 739–746). https://doi.org/10.1109/ICCV.2009.5459248
Li, G., Yun, I., Kim, J., & Kim, J. (2019). Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. In British machine vision conference (BMVC).
Li, H., Xiong, P., Fan, H., & Sun, J. (2019a). Dfanet: Deep feature aggregation for real-time semantic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 9514–9523). https://doi.org/10.1109/CVPR.2019.00975.
Li, P., Dong, X., Yu, X., & Yang, Y. (2020a). When humans meet machines: Towards efficient segmentation networks. In British machine vision conference (BMVC).
Li, X., Zhou, Y., Pan, Z., & Feng, J. (2019b). Partial order pruning: For best speed/accuracy trade-off in neural architecture search. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 9137–9145). https://doi.org/10.1109/CVPR.2019.00936.
Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., Tan, S., & Tong, Y. (2020). Semantic flow for fast and accurate scene parsing. European Conference on Computer Vision (ECCV), 12346, 775–793. https://doi.org/10.1007/978-3-030-58452-8_45
Li, X., Zhang, J., Yang, Y., Cheng, G., Yang, K., Tong, Y., & Tao, D. (2022). Sfnet: Faster, accurate, and domain agnostic semantic segmentation via semantic flow. ArXiv:2207.04415
Lin, D., Shen, D., Shen, S., Ji, Y., Lischinski, D., Cohen-Or, D., & Huang, H. (2019). Zigzagnet: Fusing top-down and bottom-up context for object segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7482–7491). https://doi.org/10.1109/CVPR.2019.00767.
Lin, P., Sun, P., Cheng, G., Xie, S., Li, X., & Shi, J. (2020). Graph-guided architecture search for real-time semantic segmentation. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4202–4211). https://doi.org/10.1109/CVPR42600.2020.00426.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. European Conference on Computer Vision (ECCV), 8693, 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L. J., Fei-Fei, L., Yuille, A. L., Huang, J., & Murphy, K. P. (2017). Progressive neural architecture search. European Conference on Computer Vision (ECCV), 11205, 19–35. https://doi.org/10.1007/978-3-030-01246-5_2
Liu, C., Chen, L.C., Schroff, F., Adam, H., Hua, W., Yuille, A.L., & Fei-Fei, L. (2019a). Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 82–92). https://doi.org/10.1109/CVPR.2019.00017.
Liu, H., Simonyan, K., & Yang, Y. (2019b). DARTS: Differentiable architecture search. In International conference on learning representations (ICLR).
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In 2015 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3431–3440). https://doi.org/10.1109/CVPR.2015.7298965.
Ma, N., Zhang, X., Zheng, H. T., & Sun, J. (2018). Shufflenet v2: Practical guidelines for efficient CNN architecture design. European Conference on Computer Vision (ECCV), 11218, 116–131. https://doi.org/10.1007/978-3-030-01264-9_8
Nirkin, Y., Wolf, L., & Hassner, T. (2021). Hyperseg: Patch-wise hypernetwork for real-time semantic segmentation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4060–4069). https://doi.org/10.1109/CVPR46437.2021.00405.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS) (Vol. 32, pp. 8024–8035).
Peng, C., Zhang, X., Yu, G., Luo, G., & Sun, J. (2017). Large kernel matters—Improve semantic segmentation by global convolutional network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1743–1751). https://doi.org/10.1109/CVPR.2017.189.
Poudel, R.P., Liwicki, S., & Cipolla, R. (2019). Fast-scnn: Fast semantic segmentation network. In British Machine Vision Conference (BMVC).
Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O. R., & Jagersand, M. (2020). U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106, 107404. https://doi.org/10.1016/j.patcog.2020.107404
Qiu, Z., Yao, T., & Mei, T. (2017). Learning deep spatio-temporal dependence for semantic video segmentation. IEEE Transactions on Multimedia, 20(4), 939–949. https://doi.org/10.1109/TMM.2017.2759504
Qiu, Z., Yao, T., Zhang, Y., Zhang, Y., & Mei, T. (2019). Scheduled differentiable architecture search for visual recognition. ArXiv:1909.10236
Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q.V., & Kurakin, A. (2017). Large-scale evolution of image classifiers. In 34th International Conference on Machine Learning (ICML) (Vol. 70, pp. 2902–2911).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). https://doi.org/10.1109/CVPR.2018.00474
Si, H., Zhang, Z., & Lu, F. (2020). Real-time semantic segmentation via multiply spatial fusion network. In British Machine Vision Conference (BMVC).
Sun, P., Wu, J., Li, S., Lin, P., Huang, J., & Li, X. (2021). Real-time semantic segmentation via auto depth, downsampling joint decision and feature aggregation. International Journal of Computer Vision (IJCV), 129(5), 1506–1525. https://doi.org/10.1007/s11263-021-01433-3
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2818–2826). https://doi.org/10.1109/CVPR.2016.308.
Tan, M., & Le, Q.V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In 36th International Conference on Machine Learning (ICML) (Vol. 97, pp. 6105–6114).
Tao, A., Sapra, K., & Catanzaro, B. (2020). Hierarchical multi-scale attention for semantic segmentation. ArXiv:2005.10821
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., & Xiao, B. (2021). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364. https://doi.org/10.1109/TPAMI.2020.2983686
Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., & Ruan, X. (2017). Learning to detect salient objects with image-level supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 136–145). https://doi.org/10.1109/CVPR.2017.404.
Wu, B., Li, C., Zhang, H., Dai, X., Zhang, P., Yu, M., Wang, J., Lin, Y., & Vajda, P. (2021a). Fbnetv5: Neural architecture search for multiple tasks in one run. ArXiv:2111.10007
Wu, Y., Li, X., Shi, C., Tong, Y., Hua, Y., Song, T., Ma, R., & Guan, H. (2021b). Fast and accurate scene parsing via bi-direction alignment networks. In 2021 IEEE International Conference on Image Processing (ICIP) (pp. 2508–2512). https://doi.org/10.1109/ICIP42928.2021.9506720
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., & Sang, N. (2018). Bisenet: Bilateral segmentation network for real-time semantic segmentation. European Conference on Computer Vision (ECCV), 11217, 325–341. https://doi.org/10.1007/978-3-030-01261-8_20
Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., & Sang, N. (2021). Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision (IJCV), 129(11), 3051–3068. https://doi.org/10.1007/s11263-021-01515-2
Yu, C., Xiao, B., Gao, C., Yuan, L., Zhang, L., Sang, N., & Wang, J. (2021b). Lite-hrnet: A lightweight high-resolution network. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 10435–10445). https://doi.org/10.1109/CVPR46437.2021.01030.
Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 636–644). https://doi.org/10.1109/CVPR.2017.75.
Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., & Darrell, T. (2018b). Bdd100k: A diverse driving video database with scalable annotation tooling. ArXiv:1805.04687
Yuan, Y., Huang, L., Guo, J., Zhang, C., Chen, X., & Wang, J. (2021). Ocnet: Object context for semantic segmentation. International Journal of Computer Vision (IJCV), 129(8), 2375–2398. https://doi.org/10.1007/s11263-021-01465-9
Zhang, J., Pan, Y., Yao, T., Zhao, H., & Mei, T. (2019a). dabnn: A super fast inference framework for binary neural networks on arm devices. In Proceedings of the 27th ACM international conference on multimedia (pp. 2272–2275). https://doi.org/10.1145/3343031.3350534.
Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6848–6856). https://doi.org/10.1109/CVPR.2018.00716
Zhang, X., Xu, H., Mo, H., Tan, J., Yang, C., Wang, L., & Ren, W. (2021). Dcnas: Densely connected neural architecture search for semantic image segmentation. In 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 13956–13967). https://doi.org/10.1109/CVPR46437.2021.01374.
Zhang, Y., Qiu, Z., Liu, J., Yao, T., Liu, D., & Mei, T. (2019b). Customizable architecture search for semantic segmentation. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11633–11642). https://doi.org/10.1109/CVPR.2019.01191.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6230–6239). https://doi.org/10.1109/CVPR.2017.660.
Zhao, H., Qi, X., Shen, X., Shi, J., & Jia, J. (2018). Icnet for real-time semantic segmentation on high-resolution images. European Conference on Computer Vision (ECCV), 11207, 405–420. https://doi.org/10.1007/978-3-030-01219-9_25
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as points. ArXiv:1904.07850
Zoph, B., & Le, Q.V. (2017). Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR).
Zoph, B., Vasudevan, V., Shlens, J., & Le, Q.V. (2018). Learning transferable architectures for scalable image recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8697–8710). https://doi.org/10.1109/CVPR.2018.00907.
Additional information
Communicated by Jifeng Dai.
About this article
Cite this article
Zhang, Y., Yao, T., Qiu, Z. et al. Lightweight and Progressively-Scalable Networks for Semantic Segmentation. Int J Comput Vis 131, 2153–2171 (2023). https://doi.org/10.1007/s11263-023-01801-1
DOI: https://doi.org/10.1007/s11263-023-01801-1