
STFCN: Spatio-Temporal Fully Convolutional Neural Network for Semantic Segmentation of Street Scenes

Part of the Lecture Notes in Computer Science book series (LNIP, volume 10116)

Abstract

This paper presents a novel method that combines spatial and temporal features for the semantic segmentation of street scenes. Recent work on convolutional neural networks (CNNs) has shown that CNNs provide strong spatial features that support high-performing solutions to the semantic segmentation task. We investigate how adding temporal features also benefits the segmentation of video data. We propose a module based on the long short-term memory (LSTM) architecture of a recurrent neural network for interpreting the temporal characteristics of video frames over time. Our system takes the frames of a video as input and produces a correspondingly-sized output; to segment the video, our method combines three components: first, the regional spatial features of each frame are extracted using a CNN; then, temporal features are added using an LSTM; finally, pixel-wise predictions are produced by deconvolving the spatio-temporal features. Our key insight is to build spatio-temporal convolutional networks (spatio-temporal CNNs) with an end-to-end architecture for semantic video segmentation. We fully adapted well-known convolutional network architectures (such as FCN-AlexNet and FCN-VGG16), as well as dilated convolution, into our spatio-temporal CNNs. Our spatio-temporal CNNs achieve state-of-the-art semantic segmentation, as demonstrated on the CamVid and NYUDv2 datasets.
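The three-stage pipeline described above (per-frame CNN features, an LSTM over the temporal axis at each spatial location, then deconvolution to pixel-wise class scores) can be sketched as follows. This is a minimal illustrative sketch, not the paper's model: the toy two-layer encoder, channel widths, hidden size, and class count are placeholder assumptions standing in for the FCN-AlexNet/FCN-VGG16 backbones and configuration the paper actually uses.

```python
# Minimal sketch of the spatio-temporal pipeline: per-frame CNN features
# -> LSTM over time at each spatial location -> deconvolution back to
# pixel-wise class scores. All layer sizes here are illustrative.
import torch
import torch.nn as nn

class STFCNSketch(nn.Module):
    def __init__(self, num_classes=11, feat_ch=64, hidden=64):
        super().__init__()
        # Toy convolutional encoder standing in for an FCN backbone
        # (downsamples by 4; real FCN backbones downsample far more).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # LSTM applied to the feature vector at each spatial location,
        # with the clip's frames forming the sequence dimension.
        self.lstm = nn.LSTM(feat_ch, hidden, batch_first=True)
        # Transposed convolution ("deconvolution") back to input resolution.
        self.decoder = nn.ConvTranspose2d(hidden, num_classes, 4, stride=4)

    def forward(self, clip):
        # clip: (batch, time, 3, H, W)
        b, t, c, h, w = clip.shape
        feats = self.encoder(clip.reshape(b * t, c, h, w))  # (b*t, F, h', w')
        _, f, hh, ww = feats.shape
        # Treat each spatial location as an independent sequence over time.
        seq = feats.reshape(b, t, f, hh, ww).permute(0, 3, 4, 1, 2)
        seq = seq.reshape(b * hh * ww, t, f)
        out, _ = self.lstm(seq)                             # (b*h'*w', t, hidden)
        out = out.reshape(b, hh, ww, t, -1).permute(0, 3, 4, 1, 2)
        out = out.reshape(b * t, -1, hh, ww)
        # Upsample spatio-temporal features to per-pixel class scores.
        return self.decoder(out).reshape(b, t, -1, h, w)

model = STFCNSketch()
scores = model(torch.randn(2, 3, 3, 32, 32))  # 2 clips of 3 frames each
print(scores.shape)  # torch.Size([2, 3, 11, 32, 32])
```

The shape of `scores` matches the input clip spatially and temporally, with one score map per class, which is the "correspondingly-sized output" property the abstract claims for the end-to-end architecture.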

Keywords

  • Video Frame
  • Convolutional Neural Network
  • Context Module
  • Advanced Driver Assistance System
  • Convolutional Layer



Notes

  1. Available at https://github.com/junhyukoh/caffe-lstm.

  2. Our modified Caffe distribution and STFCN models are publicly available at https://github.com/MohsenFayyaz89/STFCN.

  3. Available at mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/.

  4. Available at https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html.


Author information

Correspondence to Mohsen Fayyaz.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Fayyaz, M., Saffar, M.H., Sabokrou, M., Fathy, M., Huang, F., Klette, R. (2017). STFCN: Spatio-Temporal Fully Convolutional Neural Network for Semantic Segmentation of Street Scenes. In: Chen, CS., Lu, J., Ma, KK. (eds) Computer Vision – ACCV 2016 Workshops. ACCV 2016. Lecture Notes in Computer Science(), vol 10116. Springer, Cham. https://doi.org/10.1007/978-3-319-54407-6_33

  • DOI: https://doi.org/10.1007/978-3-319-54407-6_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54406-9

  • Online ISBN: 978-3-319-54407-6

  • eBook Packages: Computer Science (R0)