Adversarial Training for Video Disentangled Representation

  • Renjie Xie
  • Yuancheng Wang
  • Tian Xie
  • Yuhao Zhang
  • Li Xu
  • Jian Lu
  • Qiao Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)


The strong demand for video analytics is driven largely by the widespread deployment of CCTV. Encoding moving objects and scenes of varying size and complexity in an unsupervised manner remains a challenge, and it seriously affects the quality of video prediction and subsequent analysis. In this paper, we introduce adversarial training to improve DrNet, which disentangles a video into stationary-scene and moving-object representations, while taking tiny objects and complex scenes into account. These representations can be used in downstream industrial applications such as vehicle density estimation and video retrieval. Our experiments on the LASIESTA database confirm the validity of this method in both reconstruction and prediction performance. We also propose an experiment that sets one of the codes to zero and reconstructs the images from the concatenation of the zeroed and non-zero codes. This experiment evaluates the coding quality of the moving object and of the scene separately, and shows that adversarial training achieves notably good visual reconstruction quality despite complex scenes and tiny objects.
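The code-zeroing experiment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ablate_and_concat`, the scene-first concatenation order, the code lengths, and the `toy_decoder` stand-in for a trained decoder are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def ablate_and_concat(scene_code: Sequence[float],
                      object_code: Sequence[float],
                      keep: str) -> Vector:
    """Zero out one code and concatenate it with the other, untouched code.

    The scene-first ordering is an assumption; the abstract only says the
    zero and non-zero codes are concatenated before reconstruction.
    """
    if keep == "scene":
        # Keep the scene code, replace the moving-object code with zeros.
        return list(scene_code) + [0.0] * len(object_code)
    if keep == "object":
        # Keep the moving-object code, replace the scene code with zeros.
        return [0.0] * len(scene_code) + list(object_code)
    if keep == "both":
        # Ordinary reconstruction input: both codes unchanged.
        return list(scene_code) + list(object_code)
    raise ValueError("keep must be 'scene', 'object', or 'both'")


def toy_decoder(code: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a trained decoder (here: scale by 2)."""
    return [2.0 * c for c in code]


# Made-up codes for a single frame.
scene_code = [0.5, -1.0, 2.0]
object_code = [1.5, 0.25]

scene_only = toy_decoder(ablate_and_concat(scene_code, object_code, "scene"))
object_only = toy_decoder(ablate_and_concat(scene_code, object_code, "object"))
```

With a real decoder in place of `toy_decoder`, `scene_only` would be inspected for how well the stationary scene is recovered without the moving object, and `object_only` for the reverse.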


Keywords: Disentangled representation, Video reconstruction, Adversarial training, Video analysis



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Renjie Xie (1)
  • Yuancheng Wang (1)
  • Tian Xie (1)
  • Yuhao Zhang (1)
  • Li Xu (2)
  • Jian Lu (1)
  • Qiao Wang (1)
  1. School of Information Science and Engineering and Shing-Tung Yau Center, Southeast University, Nanjing, China
  2. Intel (China) Corporation, Shanghai, China
