Multi-scale Context Intertwining for Semantic Segmentation

  • Di Lin
  • Yuanfeng Ji
  • Dani Lischinski
  • Daniel Cohen-Or
  • Hui HuangEmail author
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)


Accurate semantic image segmentation requires the joint consideration of local appearance, semantic information, and global scene context. In today’s age of pre-trained deep networks and their powerful convolutional features, state-of-the-art semantic segmentation approaches differ mostly in how they choose to combine together these different kinds of information. In this work, we propose a novel scheme for aggregating features from different scales, which we refer to as Multi-Scale Context Intertwining (MSCI). In contrast to previous approaches, which typically propagate information between scales in a one-directional manner, we merge pairs of feature maps in a bidirectional and recurrent fashion, via connections between two LSTM chains. By training the parameters of the LSTM units on the segmentation task, the above approach learns how to extract powerful and effective features for pixel-level semantic segmentation, which are then combined hierarchically. Furthermore, rather than using fixed information propagation routes, we subdivide images into super-pixels, and use the spatial relationship between them in order to perform image-adapted context aggregation. Our extensive evaluation on public benchmarks indicates that all of the aforementioned components of our approach increase the effectiveness of information propagation throughout the network, and significantly improve its eventual segmentation accuracy.


Semantic segmentation Deep learning Convolutional neural network Long short-term memory 



We thank the anonymous reviewers for their constructive comments. This work was supported in part by NSFC (61702338, 61522213, 61761146002, 61861130365), 973 Program (2015CB352501), Guangdong Science and Technology Program (2015A030312015), Shenzhen Innovation Program (KQJSCX20170727101233642, JCYJ20151015151249564), and ISF-NSFC Joint Research Program (2472/17).


  1. 1.
    Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. IJCV 88, 303–338 (2010)CrossRefGoogle Scholar
  2. 2.
    Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). Scholar
  3. 3.
    Mottaghi, R., et al.: The role of context for object detection and semantic segmentation in the wild. In: CVPR (2014)Google Scholar
  4. 4.
    Cordts, M., et al.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)Google Scholar
  5. 5.
    Chen, H., Qi, X., Yu, L., Dou, Q., Qin, J., Heng, P.A.: DCAN: deep contour-aware networks for object instance segmentation from histology images. Med. Image Anal. 36, 135–146 (2017)CrossRefGoogle Scholar
  6. 6.
    Yoon, Y., Jeon, H.G., Yoo, D., Lee, J.Y., Kweon, I.S.: Light-field image super-resolution using convolutional neural network. IEEE Signal Process. Lett. 24, 848–852 (2017)CrossRefGoogle Scholar
  7. 7.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)Google Scholar
  8. 8.
    Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV (2015)Google Scholar
  9. 9.
    Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv (2016)Google Scholar
  10. 10.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  11. 11.
    Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR (2015)Google Scholar
  12. 12.
    Zheng, S., et al.: Conditional random fields as recurrent neural networks. In: ICCV (2015)Google Scholar
  13. 13.
    Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Semantic image segmentation via deep parsing network. In: ICCV (2015)Google Scholar
  14. 14.
    Papandreou, G., Chen, L.C., Murphy, K., Yuille, A.L.: Weakly-and semi-supervised learning of a DCNN for semantic image segmentation. arXiv preprint arXiv:1502.02734 (2015)
  15. 15.
    Lin, D., Dai, J., Jia, J., He, K., Sun, J.: ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In: CVPR (2016)Google Scholar
  16. 16.
    Lin, G., Shen, C., van den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: CVPR (2016)Google Scholar
  17. 17.
    Lin, G., Milan, A., Shen, C., Reid, I.: RefineNet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv (2016)Google Scholar
  18. 18.
    Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. arXiv (2016)Google Scholar
  19. 19.
    Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters-improve semantic segmentation by global convolutional network. arXiv (2017)Google Scholar
  20. 20.
    Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv (2017)Google Scholar
  21. 21.
    Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. In: CVPR (2017)Google Scholar
  22. 22.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)CrossRefGoogle Scholar
  23. 23.
    Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). Scholar
  24. 24.
    Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: a RGB-D scene understanding benchmark suite. In: CVPR (2015)Google Scholar
  25. 25.
    Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
  26. 26.
    Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NIPS (2014)Google Scholar
  27. 27.
    Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR (2017)Google Scholar
  28. 28.
    Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: ICCV (2017)Google Scholar
  29. 29.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)Google Scholar
  30. 30.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  31. 31.
    Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)Google Scholar
  32. 32.
    Liang, X., Shen, X., Feng, J., Lin, L., Yan, S.: Semantic object parsing with graph LSTM. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 125–143. Springer, Cham (2016). Scholar
  33. 33.
    Liang, X., Shen, X., Xiang, D., Feng, J., Lin, L., Yan, S.: Semantic object parsing with local-global long short-term memory. In: CVPR, pp. 3185–3193 (2016)Google Scholar
  34. 34.
    Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)Google Scholar
  35. 35.
    Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: MICCAI (2015)Google Scholar
  36. 36.
    Gadde, R., Jampani, V., Kiefel, M., Kappler, D., Gehler, P.V.: Superpixel convolutional networks using bilateral inceptions. In: ECCV (2016)Google Scholar
  37. 37.
    Bell, S., Lawrence Zitnick, C., Bala, K., Girshick, R.: Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: CVPR (2016)Google Scholar
  38. 38.
    Zeng, X.: Crafting GBD-Net for object detection. PAMI 40, 2109–2123 (2017)CrossRefGoogle Scholar
  39. 39.
    Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: ACM International Conference on Multimedia (2014)Google Scholar
  40. 40.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv (2014)Google Scholar
  41. 41.
    Dollár, P., Zitnick, C.L.: Structured forests for fast edge detection. In: ICCV (2013)Google Scholar
  42. 42.
    Wang, P., et al.: Understanding convolution for semantic segmentation. arXiv preprint arXiv:1702.08502 (2017)
  43. 43.
    Sun, H., Xie, D., Pu, S.: Mixed context networks for semantic segmentation. arXiv preprint arXiv:1610.05854 (2016)
  44. 44.
    Wu, Z., Shen, C., Hengel, A.v.d.: Wider or deeper: revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080 (2016)
  45. 45.
    Shen, F., Gan, R., Yan, S., Zeng, G.: Semantic segmentation via structured patch prediction, context CRF and guidance CRF. In: CVPR (2017)Google Scholar
  46. 46.
    Wang, G., Luo, P., Lin, L., Wang, X.: Learning object interactions and descriptions for semantic image segmentation. In: CVPR (2017)Google Scholar
  47. 47.
    Fu, J., Liu, J., Wang, Y., Lu, H.: Stacked deconvolutional network for semantic segmentation. arXiv preprint arXiv:1708.04943 (2017)
  48. 48.
    Luo, P., Wang, G., Lin, L., Wang, X.: Deep dual learning for semantic image segmentation. In: CVPR (2017)Google Scholar
  49. 49.
    Dai, J., He, K., Sun, J.: BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: ICCV (2015)Google Scholar
  50. 50.
    Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV (2015)Google Scholar
  51. 51.
    Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv (2015)Google Scholar
  52. 52.
    He, Y., Chiu, W.C., Keuper, M., Fritz, M.: RGBD semantic segmentation using spatio-temporal data-driven pooling. arXiv (2016)Google Scholar
  53. 53.
    Wu, Z., Shen, C., Hengel, A.V.D.: High-performance semantic segmentation using very deep fully convolutional networks. arXiv preprint arXiv:1604.04339 (2016)
  54. 54.
    Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture. In: ACCV (2016)Google Scholar
  55. 55.
    Lin, D., Chen, G., Cohen-Or, D., Heng, P.A., Huang, H.: Cascaded feature network for semantic segmentation of RGB-D images. In: ICCV (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Di Lin
    • 1
  • Yuanfeng Ji
    • 1
  • Dani Lischinski
    • 2
  • Daniel Cohen-Or
    • 1
    • 3
  • Hui Huang
    • 1
    Email author
  1. 1.Shenzhen UniversityShenzhenChina
  2. 2.The Hebrew University of JerusalemJerusalemIsrael
  3. 3.Tel Aviv UniversityTel AvivIsrael

Personalised recommendations