ICNet for Real-Time Semantic Segmentation on High-Resolution Images

  • Hengshuang Zhao (corresponding author)
  • Xiaojuan Qi
  • Xiaoyong Shen
  • Jianping Shi
  • Jiaya Jia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)

Abstract

We focus on the challenging task of real-time semantic segmentation in this paper. It finds many practical applications, yet poses the fundamental difficulty of substantially reducing the computation required for pixel-wise label inference. We propose an image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address this challenge. We provide in-depth analysis of our framework and introduce the cascade feature fusion unit to quickly achieve high-quality segmentation. Our system yields real-time inference on a single GPU card with decent-quality results on challenging datasets such as Cityscapes, CamVid, and COCO-Stuff.
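To make the cascade idea concrete, below is a minimal pure-Python toy sketch, not the authors' network: a full-resolution input is downsampled into 1/2- and 1/4-resolution branches, and the coarsest output is upsampled and fused with each finer branch in turn, mimicking the paper's cascade feature fusion. The function names (`downsample`, `fuse`, `icnet_cascade`) are hypothetical, and the heavy/light per-branch processing and projection convolutions are omitted.

```python
# Toy illustration of an ICNet-style image cascade on a 2-D grid
# (list of lists). Coarse branches are cheap to compute; their output
# is upsampled and fused with progressively finer branches.

def downsample(img, factor):
    """Average-pool a 2-D grid by an integer factor."""
    h, w = len(img) // factor, len(img[0]) // factor
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [img[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def upsample(img, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    return [[v for v in row for _ in range(factor)]
            for row in img for _ in range(factor)]

def fuse(low, high):
    """Cascade-style fusion: upsample the coarse map by 2x and add it
    element-wise to the finer branch (projection convs omitted)."""
    up = upsample(low, 2)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, high)]

def icnet_cascade(img):
    """Three-branch cascade returning a full-resolution map."""
    quarter = downsample(img, 4)   # in ICNet, the deep branch runs here
    half = downsample(img, 2)      # medium (lighter) branch
    fused_half = fuse(quarter, half)
    return fuse(fused_half, img)   # lightest full-resolution branch

if __name__ == "__main__":
    image = [[float(i + j) for j in range(8)] for i in range(8)]
    out = icnet_cascade(image)
    print(len(out), len(out[0]))  # full resolution preserved: 8 8
```

The speed benefit in the real network comes from running the expensive backbone only on the 1/4-resolution branch, while the finer branches stay shallow; the cascade fusion recovers detail that the coarse branch alone would lose.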

Keywords

Real-time · High-resolution · Semantic segmentation

Supplementary material

474178_1_En_25_MOESM1_ESM.pdf — Supplementary material 1 (PDF, 2.6 MB)

References

  1. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
  2. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: ICLR (2015)
  3. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561 (2015)
  4. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV (2015)
  5. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
  6. Wu, Z., Shen, C., van den Hengel, A.: Wider or deeper: revisiting the ResNet model for visual recognition. arXiv:1611.10080 (2016)
  7. Cordts, M., et al.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
  8. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: a deep neural network architecture for real-time semantic segmentation. arXiv:1606.02147 (2016)
  9. Treml, M., et al.: Speeding up semantic segmentation for autonomous driving. In: NIPS Workshop (2016)
  10. Wang, P., et al.: Understanding convolution for semantic segmentation. arXiv:1702.08502 (2017)
  11. Lin, G., Milan, A., Shen, C., Reid, I.D.: RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In: CVPR (2017)
  12. Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. In: CVPR (2017)
  13. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915 (2016)
  14. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
  15. Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Semantic image segmentation via deep parsing network. In: ICCV (2015)
  16. Zheng, S., et al.: Conditional random fields as recurrent neural networks. In: ICCV (2015)
  17. Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: a high-definition ground truth database. Pattern Recognit. Lett. 30, 88–97 (2009)
  18. Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: thing and stuff classes in context. arXiv:1612.03716 (2016)
  19. Liu, C., Yuen, J., Torralba, A.: Nonparametric scene parsing via label transfer. TPAMI 33, 2368–2382 (2011)
  20. Chen, L., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: scale-aware semantic image segmentation. In: CVPR (2016)
  21. Hariharan, B., Arbeláez, P.A., Girshick, R.B., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR (2015)
  22. Xia, F., Wang, P., Chen, L.-C., Yuille, A.L.: Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 648–663. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_39
  23. Girshick, R.: Fast R-CNN. In: ICCV (2015)
  24. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
  25. Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
  26. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: CVPR (2017)
  27. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
  28. Romera, E., Alvarez, J.M., Bergasa, L.M., Arroyo, R.: Efficient ConvNet for real-time semantic segmentation. In: Intelligent Vehicles Symposium (IV) (2017)
  29. Shelhamer, E., Rakelly, K., Hoffman, J., Darrell, T.: Clockwork convnets for video semantic segmentation. In: Hua, G., Jégou, H. (eds.) ECCV 2016 Workshops. LNCS, vol. 9915, pp. 852–868. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_69
  30. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: CVPR (2017)
  31. Kundu, A., Vineet, V., Koltun, V.: Feature space optimization for semantic video segmentation. In: CVPR (2016)
  32. Gadde, R., Jampani, V., Gehler, P.V.: Semantic video CNNs through representation warping. In: ICCV (2017)
  33. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
  34. Ghiasi, G., Fowlkes, C.C.: Laplacian pyramid reconstruction and refinement for semantic segmentation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 519–534. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_32
  35. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV 88, 303–338 (2010)
  36. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ADE20K dataset. arXiv:1608.05442 (2016)
  37. Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: ACM MM (2014)
  38. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360 (2016)
  39. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: ICLR (2016)
  40. Han, S., et al.: DSD: regularizing deep neural networks with dense-sparse-dense training flow. In: ICLR (2017)
  41. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: ICLR (2017)
  42. Sturgess, P., Alahari, K., Ladicky, L., Torr, P.H.: Combining appearance and structure from motion features for road scene understanding. In: BMVC (2009)
  43. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Hengshuang Zhao (1) — corresponding author
  • Xiaojuan Qi (1)
  • Xiaoyong Shen (2)
  • Jianping Shi (3)
  • Jiaya Jia (1, 2)

  1. The Chinese University of Hong Kong, Shatin, Hong Kong
  2. Tencent Youtu Lab, Shenzhen, China
  3. SenseTime Research, Beijing, China