World Wide Web, Volume 22, Issue 2, pp 555–570

Multi-scale deep context convolutional neural networks for semantic segmentation

  • Quan Zhou
  • Wenbing Yang
  • Guangwei Gao
  • Weihua Ou
  • Huimin Lu
  • Jie Chen
  • Longin Jan Latecki
Part of the following topical collections:
  1. Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications


Recent years have witnessed great progress in semantic segmentation using deep convolutional neural networks (DCNNs). This paper presents a novel fully convolutional network for semantic segmentation that uses multi-scale contextual convolutional features. Since objects in natural images tend to appear at various scales and aspect ratios, capturing rich contextual information is critical for dense pixel prediction. On the other hand, as convolutional layers go deeper, the feature maps of traditional DCNNs gradually become coarser, which may be harmful for semantic segmentation. Based on these observations, we design a multi-scale deep context convolutional network (MDCCNet), which combines the feature maps from different levels of the network in a holistic manner for semantic segmentation. The segmentation outputs of MDCCNet are further enhanced using densely connected conditional random fields (CRF). The proposed network allows us to fully exploit local and global contextual information, ranging from the entire scene to every single pixel, to perform pixel-wise label estimation. Experimental results demonstrate that our method outperforms or is comparable to state-of-the-art methods on the PASCAL VOC 2012 and SIFTFlow semantic segmentation datasets.
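The idea of combining feature maps from different network levels can be illustrated with a minimal sketch. It assumes nearest-neighbour upsampling to the finest resolution followed by channel-wise concatenation, which are common choices but are not stated in the abstract. The names `upsample_nearest` and `fuse_multi_scale` and the toy inputs are hypothetical, and the sketch is not the MDCCNet architecture itself.

```python
from typing import List

# A feature map is indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def upsample_nearest(fmap: FeatureMap, out_h: int, out_w: int) -> FeatureMap:
    """Resize every channel to (out_h, out_w) by nearest-neighbour lookup."""
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [channel[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)
        ]
        for channel in fmap
    ]


def fuse_multi_scale(fmaps: List[FeatureMap]) -> FeatureMap:
    """Bring all maps to the finest resolution and stack them channel-wise.

    Assumption: concatenation is used as the fusion operator here purely
    for illustration; the paper may fuse features differently.
    """
    out_h = max(len(f[0]) for f in fmaps)
    out_w = max(len(f[0][0]) for f in fmaps)
    fused: FeatureMap = []
    for fmap in fmaps:
        fused.extend(upsample_nearest(fmap, out_h, out_w))
    return fused


# Toy inputs: a fine 1-channel 4x4 map (shallow layer) and a coarse
# 2-channel 2x2 map (deep layer).
fine = [[[float(4 * y + x) for x in range(4)] for y in range(4)]]
coarse = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

fused = fuse_multi_scale([fine, coarse])
print(len(fused), len(fused[0]), len(fused[0][0]))  # channels, height, width
```

After fusion, every pixel carries both the fine-grained features of the shallow layer and the coarser, more contextual features of the deep layer, which a per-pixel classifier could then consume.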


Multi-scale context, MDCNNs, Semantic segmentation, CRF



The authors would like to thank all the anonymous reviewers for their valuable comments and suggestions. This work was partly supported by the National Science Foundation (Grant No. IIS-1302164), the National Natural Science Foundation of China (Grant No. 61401228, 61402238, 61762021, 61571240, 61501247, 61501259, 61671253, 61402122), China Postdoctoral Science Foundation (Grant No. 2015M581841), Natural Science Foundation of Jiangsu Province (Grant No. BK20150849, BK20160908), Postdoctoral Science Foundation of Jiangsu Province (Grant No. 1501019A), Open Research Fund of National Engineering Research Center of Communications and Networking (Nanjing University of Posts and Telecommunications) (Grant No. TXKY17009), Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (Grant No. MJUKF201710), Open Fund Project of Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education (Nanjing University of Science and Technology) (Grant No. JYB201709, JYB201710), Natural Science Foundation of Guizhou Province (Grant No. [2017]1130), and the 2014 Ph.D Recruitment Program of Guizhou Normal University.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. National Engineering Research Center of Communications and Networking, Nanjing University of Posts and Telecommunications, Nanjing, China
  2. Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou, China
  3. Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology, Nanjing, China
  4. School of Big Data and Computer Science, Guizhou Normal University, Guiyang, China
  5. Department of Mechanical and Control Engineering, Kyushu Institute of Technology, Kitakyushu, Japan
  6. Huawei Technologies Co. Ltd., Shenzhen, China
  7. Department of Computer and Information Sciences, Temple University, Philadelphia, USA
