Semantic Flow for Fast and Accurate Scene Parsing

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)

Abstract

In this paper, we focus on designing an effective method for fast and accurate scene parsing. A common practice for improving performance is to obtain high-resolution feature maps with strong semantic representation. The two widely used strategies, atrous convolutions and feature pyramid fusion, are either computation intensive or ineffective. Inspired by optical flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn the semantic flow between feature maps of adjacent levels and to broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our module into a common feature pyramid structure yields superior performance over other real-time methods, even on lightweight backbone networks such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid. In particular, our network is the first to achieve 80.4% mIoU on Cityscapes at a frame rate of 26 FPS. The code is available at https://github.com/donnyyou/torchcv.
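The Flow Alignment Module described above predicts a flow field between adjacent pyramid levels and warps the coarse, semantically strong features onto the high-resolution grid. The sketch below is a minimal, illustrative NumPy version of that warping mechanics only: the function names are hypothetical, the learned flow-prediction convolution is replaced by a placeholder linear map (`flow_weights`), and nearest-neighbour upsampling stands in for the module's learned upsampling. It is not the paper's implementation.

```python
import numpy as np

def bilinear_warp(feat, flow):
    """Warp a feature map feat (C, H, W) by a flow field flow (2, H, W).

    Each output pixel (y, x) bilinearly samples feat at
    (y + flow[1, y, x], x + flow[0, y, x]), clipped to the image border,
    mirroring the grid_sample-style warping step of flow-based alignment.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.clip(xs + flow[0], 0, W - 1)
    src_y = np.clip(ys + flow[1], 0, H - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx = src_x - x0  # horizontal interpolation weights
    wy = src_y - y0  # vertical interpolation weights
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0]
            + (1 - wy) * wx * feat[:, y0, x1]
            + wy * (1 - wx) * feat[:, y1, x0]
            + wy * wx * feat[:, y1, x1])

def flow_align(high_res, low_res, flow_weights):
    """Toy flow alignment between two pyramid levels.

    Upsample low_res (C_l, h, w) to the size of high_res (C_h, H, W),
    predict a 2-channel flow from the concatenated features with a plain
    1x1 linear map flow_weights of shape (2, C_h + C_l) (a stand-in for
    the module's learned convolution), then warp the upsampled map.
    Assumes H, W are integer multiples of h, w.
    """
    C_l, h, w = low_res.shape
    C_h, H, W = high_res.shape
    # Nearest-neighbour upsampling of the coarse, semantically strong map.
    up = low_res[:, np.repeat(np.arange(h), H // h), :]
    up = up[:, :, np.repeat(np.arange(w), W // w)]
    cat = np.concatenate([high_res, up], axis=0)             # (C_h + C_l, H, W)
    flow = np.tensordot(flow_weights, cat, axes=([1], [0]))  # (2, H, W)
    return bilinear_warp(up, flow)
```

With zero `flow_weights` the predicted flow is zero and the module reduces to plain upsampling; a trained flow field instead shifts each coarse feature toward its spatially aligned position in the fine map.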

Keywords

Scene parsing · Semantic flow · Flow alignment module

Supplementary material

Supplementary material 1 (PDF, 710 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, Beijing, China
  2. Huazhong University of Science and Technology, Wuhan, China
  3. DeepMotion, Wuhan, China
