Multi-scale Autoencoders in Autoencoder for Semantic Image Segmentation

  • John Paul T. Yusiong
  • Prospero C. Naval Jr.
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11431)


Semantic image segmentation is essential for scene understanding. Several state-of-the-art deep learning-based approaches have achieved remarkable results by increasing network depth. Following this principle, we introduce SAsiANet, a novel encoder-decoder network architecture for semantic image segmentation of outdoor scenes. SAsiANet uses multi-scale cascaded autoencoders in the decoder section of an autoencoder to achieve highly accurate pixel-wise prediction: it exploits features across multiple scales when upsampling the encoder output, which yields better spatial and contextual information. The proposed network is trained with the cross-entropy loss function, without incorporating any class-balancing technique into the loss. Experimental results on two challenging outdoor-scene datasets, the CamVid urban scenes dataset and the Freiburg forest dataset, demonstrate that SAsiANet produces accurate segmentation maps: it achieved state-of-the-art results on the test sets of both datasets, \(72.40\%\) mIoU and \(89.90\%\) mIoU, respectively.
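The two quantities named in the abstract, the unweighted cross-entropy loss and the mean intersection-over-union (mIoU) metric, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `cross_entropy` and `mean_iou`, the flat-list data layout, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, and the paper may aggregate mIoU differently (for example, over a dataset-wide confusion matrix).

```python
import math


def cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy with no class weighting.

    probs:  one list of class probabilities per pixel (hypothetical layout).
    labels: one ground-truth class index per pixel.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        # Every pixel contributes equally, whatever its class frequency.
        total += -math.log(max(p[y], eps))
    return total / len(labels)


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
miou = mean_iou(pred, target, num_classes=2)  # (1/2 + 2/3) / 2

loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. A class-balanced variant would multiply each pixel's loss term by a per-class weight; the abstract states that no such weighting is used.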


Autoencoders in autoencoder, Semantic image segmentation, Stacked autoencoders



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • John Paul T. Yusiong (1, 2)
  • Prospero C. Naval Jr. (1)
  1. Computer Vision and Machine Intelligence Group, Department of Computer Science, College of Engineering, University of the Philippines, Quezon City, Philippines
  2. Division of Natural Sciences and Mathematics, University of the Philippines Visayas Tacloban College, Tacloban City, Philippines
