GAN and DCN Based Multi-step Supervised Learning for Image Semantic Segmentation

  • Jie FangEmail author
  • Xiaoqian Cao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11257)


Image semantic segmentation contains two sub-tasks, segmenting and labeling. However, the recent fully convolutional network (FCN) based methods often ignore the first sub-task and consider it as a direct labeling one. Even though these methods have achieved competitive performances, they obtained spatially fragmented and disconnected outputs. The reason is that, pixel-level relationships inside the deepest layers become inconsistent since traditional FCNs do not have any explicit pixel grouping mechanism. To address this problem, a multi-step supervised learning method, which contains image-level supervised learning step and pixel-level supervised learning step, is proposed. Specifically, as for the visualized result of image semantic segmentation, it is actually an image-to-image transformation problem, from RGB domain to category label domain. The recent conditional generative adversarial network (cGAN) has achieved significant performance for image-to-image generation task, and the generated image remains good regional connectivity. Therefore, a cGAN supervised by RGB-category label map is used to obtain a coarse segmentation mask, which avoids generating disconnected segmentation results to a certain extent. Furthermore, an interaction information (II) loss term is proposed for cGAN to remain the spatial structure of the segmentation mask. Additionally, dilated convolutional networks (DCNs) have achieved significant performance in object detection field, especially for small objects because of its special receptive field settings. Specific to image semantic segmentation, if each pixel is seen as an object, this task can be transformed to object detection. In this case, combined with the segmentation mask from cGAN, a DCN supervised by the pixel-level label is used to finalize the category recognition of each pixel in the image. The proposed method achieves satisfactory performances on three public and challenging datasets for image semantic segmentation.


cGAN DCN Image semantic segmentation Multi-step supervised learning 


  1. 1.
    Dai, J., He, K., Sun, J.: BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation, pp. 1635–1643 (2015)Google Scholar
  2. 2.
    Donahue, J., et al.: DeCAF: a deep convolutional activation feature for generic visual recognition. In: International Conference on Machine Learning, p. I-647 (2014)Google Scholar
  3. 3.
    Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network, pp. 2366–2374 (2014)Google Scholar
  4. 4.
    Goodfellow, I.J., et al.: Generative adversarial nets. In: International Conference on Neural Information Processing Systems, pp. 2672–2680 (2014)Google Scholar
  5. 5.
    Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). Scholar
  6. 6.
    Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: International Conference on Computer Vision, pp. 991–998 (2011)Google Scholar
  7. 7.
    He, Y., Chiu, W.C., Keuper, M., Fritz, M.: STD2P: RGBD semantic segmentation using spatio-temporal data-driven pooling, pp. 7158–7167 (2016)Google Scholar
  8. 8.
    Holschneider, M., Kronland-Martinet, R., Morlet, J., Tchamitchian, P.: A real-time algorithm for signal analysis with the help of the wavelet transform. In: Combes, J.M., Grossmann, A., Tchamitchian, P. (eds.) Wavelets. IPTI, pp. 286–297. Springer, Heidelberg (1990). Scholar
  9. 9.
    Hong, S., You, T., Kwak, S., Han, B.: Online tracking by learning discriminative saliency map with convolutional neural network, pp. 597–606 (2015)Google Scholar
  10. 10.
    Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5967–5976 (2017)Google Scholar
  11. 11.
    Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221 (2013)CrossRefGoogle Scholar
  12. 12.
    Kemker, R., Salvaggio, C., Kanan, C.: High-resolution multispectral dataset for semantic segmentation (2017)Google Scholar
  13. 13.
    Li, J., Wu, Y., Zhao, J., Guan, L., Ye, C., Yang, T.: Pedestrian detection with dilated convolution, region proposal network and boosted decision trees. In: International Joint Conference on Neural Networks, pp. 4052–4057 (2017)Google Scholar
  14. 14.
    Lin, G., Shen, C., Van Den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation, pp. 3194–3203 (2015)Google Scholar
  15. 15.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)Google Scholar
  16. 16.
    Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: IEEE International Conference on Computer Vision, pp. 1520–1528 (2015)Google Scholar
  17. 17.
    Shensa, M.J.: The discrete wavelet transform: wedding the a trous and Mallat algorithms. IEEE Trans. Sig. Process. 40(10), 2464–2482 (1992)CrossRefGoogle Scholar
  18. 18.
    Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). Scholar
  19. 19.
    Tighe, J., Lazebnik, S.: SuperParsing: scalable nonparametric image parsing with superpixels. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 352–365. Springer, Heidelberg (2010). Scholar
  20. 20.
    Wang, P., et al.: Understanding convolution for semantic segmentation (2017)Google Scholar
  21. 21.
    Yasrab, R.: DCSeg: decoupled CNN for classification and semantic segmentation. In: IEEE Sponsored International Conference on Knowledge and Smart Technologies (2017)Google Scholar
  22. 22.
    Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions (2015)Google Scholar
  23. 23.
    Zheng, S., et al.: Conditional random fields as recurrent neural networks, pp. 1529–1537 (2015)Google Scholar
  24. 24.
    Zhou, H., Zhang, J., Lei, J., Li, S., Tu, D.: Image semantic segmentation based on FCN-CRF model. In: International Conference on Image, Vision and Computing, pp. 9–14 (2016)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Center for OPTical IMagery Analysis and Learning (OPTIMAL), Xi’an Institute of Optics and Precision MechanicsChinese Academy of SciencesXi’anPeople’s Republic of China
  2. 2.University of Chinese Academy of SciencesBeijingPeople’s Republic of China
  3. 3.College of Electrical and Information EngineeringShaanxi University of Science and TechnologyXi’anPeople’s Republic of China

Personalised recommendations