Abstract
Semantic segmentation is a fundamental task in image analysis. The issue of semantic segmentation is to extract discriminative features for distinguishing different objects and recognizing hard examples. However, most existing methods have limitations on resolving this problem. To tackle this problem, we identify the contributions of the edge and saliency information for segmentation and present a novel end-to-end network, termed cross-guidance network (CGNet) to leverage them to benefit the semantic segmentation. The edge and saliency detection network are unified into the CGNet, and model the intrinsic information among them, guiding the process of extracting discriminative features. Specifically, the CGNet attempts to extract segmentation, edge, and salient features, simultaneously. Then it transfers them into the cross-guidance module (CGM) to generate the pre-knowledge features based on the modeled information, optimizing the context feature extraction process. The proposed approach is extensively evaluated on PASCAL VOC 2012, PASCAL-Person-Part, and Cityscapes, and achieves state-of-the-art performance, demonstrating the superiority of the proposed approach.
This is a preview of subscription content, access via your institution.
References
- 1
Geng Q C, Zhou Z, Cao X C. Survey of recent progress in semantic image segmentation with CNNs. Sci China Inf Sci, 2018, 61: 051101
- 2
Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 640–651
- 3
He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 1904–1916
- 4
Zhao H, Shi J, Qi X, et al. Pyramid scene parsing network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 6230–6239
- 5
Chen L-C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 834–848
- 6
Chen L-C, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation. 2017. ArXiv: 1706.05587
- 7
Chen L-C, Zhu Y, Papandreou G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 833–851
- 8
Joachims T, Finley T, Yu C-N J. Cutting-plane training of structural SVMs. Mach Learn, 2009, 77: 27–59
- 9
Lin T-Y, Goyal P, Girshick R, et al. Focal loss for dense object detection. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 2999–3007
- 10
Wu Z, Shen C, Hengel A. High-performance semantic segmentation using very deep fully convolutional networks. 2016. ArXiv: 1604.04339
- 11
Kokkinos I. UberNet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 5454–5463
- 12
Sun H Q, Pang Y W. GlanceNets efficient convolutional neural networks with adaptive hard example mining. Sci China Inf Sci, 2018, 61: 109101
- 13
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations, San Diego, 2015
- 14
He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 770–778
- 15
Huang G, Liu Z, Maaten L, et al. Densely connected convolutional networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2261–2269
- 16
Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 1800–1807
- 17
Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2481–2495
- 18
Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision, Santiago, 2015. 1520–1528
- 19
Yu F, Koltun V, Funkhouser T A. Dilated residual networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 636–644
- 20
Lin G, Milan A, Shen C, et al. RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 5168–5177
- 21
Zhang H, Dana K, Shi J, et al. Context encoding for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 7151–7160
- 22
Huang Z, Wang X, Huang L, et al. CCNet: criss-cross attention for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision, Seoul, 2019
- 23
Jégou S, Drozdzal M, Vázquez D, et al. The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, 2017. 1175–1183
- 24
Yang M, Yu K, Zhang C, et al. DenseASPP for semantic segmentation in street scenes. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 3684–3692
- 25
Zhang Z, Zhang X, Peng C, et al. ExFuse: enhancing feature fusion for semantic segmentation. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 273–288
- 26
Zhao H, Qi X, Shen X, et al. ICNet for real-time semantic segmentation on high-resolution images. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 418–434
- 27
Li H, Xiong P, An J, et al. Pyramid attention network for semantic segmentation. In: Proceedings of British Machine Vision Conference, Newcastle, 2018. 285
- 28
Peng C, Zhang X, Yu G, et al. Large kernel matters-improve semantic segmentation by global convolutional network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 1743–1751
- 29
Wei Z, Sun Y, Wang J. Learning adaptive receptive fields for deep image parsing network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 3947–3955
- 30
Pang Y, Wang T, Anwer R M, et al. Efficient featurized image pyramid network for single shot detector. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 7336–7344
- 31
Deng R, Shen C, Liu S, et al. Learning to predict crisp boundaries. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 570–586
- 32
Xie S, Tu Z. Holistically-nested edge detection. Int J Comput Vis, 2017, 125: 3–18
- 33
Liu Y, Cheng M-M, Hu X, et al. Richer convolutional features for edge detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 5872–5881
- 34
Liu Y, Lew M S. Learning relaxed deep supervision for better edge detection. In: Proceedings of IEEE Conference on Computer Vision, Las Vegas, 2016. 231–240
- 35
Shen W, Wang X, Wang Y, et al. DeepContour: a deep convolutional feature learned by positive-sharing loss for contour detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3982–3991
- 36
Wang T-C, Liu M-Y, Zhu J-Y, et al. High-resolution image synthesis and semantic manipulation with conditional GANs. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 8798–8807
- 37
Wang W, Lai Q, Fu H, et al. Salient object detection in the deep learning era: an in-depth survey. 2019. ArXiv: 1904.09146
- 38
Liu N, Han J. DHSNet: deep hierarchical saliency network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 678–686
- 39
Wang W, Shen J, Dong X, et al. Salient object detection driven by fixation prediction. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 1711–1720
- 40
Wang W, Shen J, Yang R, et al. Saliency-aware video object segmentation. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 20–33
- 41
Wang W, Shen J, Dong X, et al. Inferring salient objects from human fixations. IEEE Trans Pattern Anal Mach Intell, 2019. doi: https://doi.org/10.1109/TPAMI.2019.2905607
- 42
Liu N, Han J, Yang M-H. PiCANet: learning pixel-wise contextual attention for saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 3089–3098
- 43
Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 7132–7141
- 44
Fu J, Liu J, Tian H, et al. Dual attention network for scene segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 3146–3154
- 45
Wang X, Girshick R, Gupta A, et al. Non-local neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 7794–7803
- 46
Zhang X, Wang T, Qi J, et al. Progressive attention guided recurrent network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 714–722
- 47
Zhang X, Xiong H, Zhou W, et al. Picking deep filter responses for fine-grained image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 1134–1142
- 48
Everingham M, van Gool L, Williams C K I, et al. The pascal visual object classes (VOC) challenge. Int J Comput Vis, 2010, 88: 303–338
- 49
Xia F, Wang P, Chen X, et al. Joint multi-person pose estimation and semantic part segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 6080–6089
- 50
Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 3213–3223
- 51
Hariharan B, Arbelaez P, Bourdev L D, et al. Semantic contours from inverse detectors. In: Proceedings of the IEEE International Conference on Computer Vision, Barcelona, 2017. 991–998
- 52
Zheng S, Jayasumana S, Romera-Paredes B. Conditional random fields as recurrent neural networks. In: Proceedings of International Conference on Computer Vision, Santiago, 2015. 1529–1537
- 53
Liu Z, Li X, Luo P, et al. Semantic image segmentation via deep parsing network. In: Proceedings of International Conference on Computer Vision, Santiago, 2015. 1377–1385
- 54
Lin G, Shen C, Hengel A, et al. Efficient piecewise training of deep structured models for semantic segmentation. In: Proceedings of Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 3194–3203
- 55
Ke T-W, Hwang J-J, Liu Z, et al. Adaptive affinity fields for semantic segmentation. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 605–621
- 56
Wu Z, Shen C, van den Hengel A. Wider or deeper: revisiting the ResNet model for visual recognition. Pattern Recogn, 2019, 90: 119–133
- 57
Xia F, Wang P, Chen L-C, et al. Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 648–663
- 58
Chen L-C, Yang Y, Wang J, et al. Attention to scale: scale-aware semantic image segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 3640–3649
- 59
Liang X, Shen X, Xiang D, et al. Semantic object parsing with local-global long short-term memory. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 3185–3193
- 60
Gong K, Liang X, Zhang D, et al. Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 6757–6765
- 61
Luo Y, Zheng Z, Zheng L, et al. Macro-micro adversarial network for human parsing. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 424–440
- 62
Liang X, Shen X, Feng J, et al. Semantic object parsing with graph LSTM. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 125–143
- 63
Zhao J, Li J, Nie X, et al. Self-supervised neural aggregation networks for human parsing. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, 2017. 1595–1603
- 64
Liang X, Lin L, Shen X, et al. Interpretable structure-evolving LSTM. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2175–2184
- 65
Nie X, Feng J, Yan S. Mutual learning to adapt for joint human parsing and pose estimation. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 519–534
- 66
Zhu B, Chen Y, Tang M, et al. Progressive cognitive human parsing. In: Proceedings of AAAI Conference on Artificial Intelligence, New Orleans, 2018. 7607–7614
- 67
Li Q Z, Arnab A, Torr P H S. Holistic, instance-level human parsing. In: Proceedings of British Machine Vision Conference, London, 2017
- 68
Fang H, Lu G, Fang X, et al. Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 70–78
- 69
Gong K, Liang X, Li Y, et al. Instance-level human parsing via part grouping network. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 805–822
- 70
Liang X, Zhou H, Xing E. Dynamic-structure semantic propagation network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 752–761
- 71
Wang P, Chen P, Yuan Y, et al. Understanding convolution for semantic segmentation. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, 2018. 1451–1460
- 72
Zhang R, Tang S, Zhang Y, et al. Scale-adaptive convolutions for scene parsing. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 2050–2058
- 73
Yu C, Wang J, Peng C, et al. BiSeNet: bilateral segmentation network for real-time semantic segmentation. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 334–349
- 74
Yu C, Wang J, Peng C, et al. Learning a discriminative feature network for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 1857–1866
- 75
Zhao H, Zhang Y, Liu S, et al. PSANet: point-wise spatial attention network for scene parsing. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 270–286
- 76
Zhu Z, Xu M, Bai S, et al. Asymmetric non-local neural networks for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision, Seoul, 2019. 593–602
Acknowledgements
This work was supported in part by the Science and Technology Innovation 2030-Major Project of Artificial Intelligence of the Ministry of Science and Technology of China (Grant No. 2018AAA01028) and in part by National Natural Science Foundation of China (Grant No. 61632018).
Author information
Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Zhang, Z., Pang, Y. CGNet: cross-guidance network for semantic segmentation. Sci. China Inf. Sci. 63, 120104 (2020). https://doi.org/10.1007/s11432-019-2718-7
Received:
Revised:
Accepted:
Published:
Keywords
- semantic segmentation
- fully convolutional networks
- pyramid network
- edge detection
- saliency detection
- cross-guidance