Learning Background Subtraction by Video Synthesis and Multi-scale Recurrent Networks

  • Sungkwon Choo
  • Wonkyo Seo
  • Dong-ju Jeong
  • Nam Ik Cho
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11366)


This paper addresses the segmentation of moving objects in videos, i.e., background subtraction (BGS), using a deep network. The proposed structure uses convolutional Long Short-Term Memory (LSTM) to learn temporal associativity without losing spatial information. It learns spatial relations by forming spatial receptive fields of various sizes through multi-scale recurrent networks. The most serious problem in training the proposed network is that pixel-level labeled video datasets of sufficient size are very difficult to find or create. To overcome this limitation, we generate many training frames by combining annotated foreground objects from available datasets with the background of the target video. The contribution of this paper is the first multi-scale recurrent network for BGS. It works well for many kinds of surveillance videos and achieves the best performance on CDnet 2014, a benchmark widely used for BGS evaluation.
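The training-data synthesis described above (pasting annotated foreground objects onto the target video's background) can be illustrated with a minimal sketch. It assumes a binary foreground mask from an annotated source dataset and a background frame from the target video. The helper name `composite_frame` and the use of plain mask-based pasting (no blending, rescaling, or randomized placement) are illustrative assumptions, not the authors' exact procedure. The pasted mask doubles as the pixel-level label of the synthesized frame, which is what removes the need for manual annotation of the target video.

```python
import numpy as np


def composite_frame(background, fg_image, fg_mask, top, left):
    """Paste an annotated foreground object onto a target-video background.

    Hypothetical helper: hard mask-based pasting, no blending or rescaling.

    background: (H, W, 3) uint8 frame from the target video
    fg_image:   (h, w, 3) uint8 crop containing the foreground object
    fg_mask:    (h, w) bool pixel-level annotation of the object
    top, left:  placement of the crop inside the background
    Returns the synthesized frame and its (H, W) uint8 label (1 = foreground).
    """
    h, w = fg_mask.shape
    H, W = background.shape[:2]
    if top < 0 or left < 0 or top + h > H or left + w > W:
        raise ValueError("foreground crop must lie inside the background")

    frame = background.copy()  # never modify the caller's background
    label = np.zeros((H, W), dtype=np.uint8)

    # Basic slicing returns views, so boolean assignment writes through.
    region = frame[top:top + h, left:left + w]
    region[fg_mask] = fg_image[fg_mask]  # copy only the object pixels
    label[top:top + h, left:left + w][fg_mask] = 1
    return frame, label


# Usage: synthesize a short labeled clip by sliding one object across a
# static background (random data stands in for real frames).
rng = np.random.default_rng(0)
bg = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
obj = np.full((20, 10, 3), 255, dtype=np.uint8)
mask = np.ones((20, 10), dtype=bool)
clip = [composite_frame(bg, obj, mask, top=50, left=10 + 5 * t) for t in range(8)]
```

Moving the paste position from frame to frame is one simple way to give the synthesized clip the motion that a recurrent model is meant to exploit; a fuller pipeline would presumably also vary object scale and appearance.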


Keywords: Background subtraction · Convolutional LSTM · Video augmentation
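The abstract relies on the convolutional LSTM to keep spatial information while modeling time. As a rough illustration, the sketch below shows one cell step in plain NumPy. It follows the common ConvLSTM formulation in which the matrix products of an ordinary LSTM are replaced by 2-D convolutions, so the hidden and cell states keep an H × W × C layout. Peephole terms are omitted. The helper names `conv2d_same` and `convlstm_step`, the kernel size, channel counts, and random initialization are all illustrative assumptions. The paper's actual layer configuration and its multi-scale arrangement are not shown.

```python
import numpy as np


def conv2d_same(x, w):
    """2-D cross-correlation with zero 'same' padding.

    x: (H, W, C_in) input, w: (k, k, C_in, C_out) kernel with odd k.
    Returns (H, W, C_out).
    """
    k = w.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))  # zero padding
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # Shifted input window times the (C_in, C_out) kernel tap.
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def convlstm_step(x, h, c, w, b):
    """One ConvLSTM time step without peephole connections.

    x: (H, W, C_x) input features; h, c: (H, W, C_h) hidden and cell states.
    w: (k, k, C_x + C_h, 4 * C_h) kernel for all four gates; b: (4 * C_h,).
    """
    z = conv2d_same(np.concatenate([x, h], axis=-1), w) + b
    i, f, o, g = np.split(z, 4, axis=-1)  # input, forget, output, candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


# Usage: run the cell over a short random clip; the state stays spatial.
rng = np.random.default_rng(0)
T, H, W, C_x, C_h, k = 5, 16, 16, 3, 8, 3
w = 0.1 * rng.standard_normal((k, k, C_x + C_h, 4 * C_h))
b = np.zeros(4 * C_h)
h = np.zeros((H, W, C_h))
c = np.zeros((H, W, C_h))
for t in range(T):
    h, c = convlstm_step(rng.standard_normal((H, W, C_x)), h, c, w, b)
print(h.shape)  # (16, 16, 8)
```

Because every gate is computed by a convolution, each output pixel depends only on a k × k neighborhood per step. Stacking such cells at several resolutions is one plausible way to obtain receptive fields of various sizes, as the abstract describes, but this sketch does not attempt to reproduce that design.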



This research was supported in part by Projects for Research and Development of Police science and Technology under Center for Research and Development of Police science and Technology and Korean National Police Agency (PA-C000001), and in part by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 1711075689, Decentralised cloud technologies for edge/IoT integration in support of AI applications).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Sungkwon Choo (1)
  • Wonkyo Seo (1)
  • Dong-ju Jeong (1)
  • Nam Ik Cho (1), corresponding author
  1. Seoul National University, Seoul, Korea
