Adversarial Training for Video Disentangled Representation

  • Conference paper
  • MultiMedia Modeling (MMM 2019)
  • Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11296)

Abstract

The strong demand for video analytics stems largely from the widespread deployment of CCTV. Encoding moving objects and scenes of varying size and complexity in an unsupervised manner remains a challenge and seriously limits the quality of video prediction and subsequent analysis. In this paper, we introduce adversarial training to improve DrNet, which disentangles a video into a stationary scene representation and a moving object representation, while taking tiny objects and complex scenes into account. These representations can be used for downstream industrial applications such as vehicle density estimation and video retrieval. Our experiments on the LASIESTA database confirm the validity of this method in both reconstruction and prediction performance. In addition, we propose an evaluation in which one of the codes is zeroed out and the images are reconstructed by concatenating the zero and non-zero codes. This experiment evaluates the coding quality of the moving object and the scene separately, and shows that adversarial training achieves significantly better visual reconstruction quality, even for complex scenes and tiny objects.
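The zero-code evaluation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the real model uses convolutional encoders and decoders, whereas here random linear maps and the dimensions (`FRAME`, `SCENE`, `OBJ`) are hypothetical stand-ins chosen only to show the mechanics of concatenating and zeroing the two codes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a frame flattened to 64 values, an 8-D scene
# (content) code and an 8-D moving-object code. Random linear maps
# stand in for the trained encoders and decoder.
FRAME, SCENE, OBJ = 64, 8, 8
W_scene = rng.standard_normal((SCENE, FRAME))
W_obj = rng.standard_normal((OBJ, FRAME))
W_dec = rng.standard_normal((FRAME, SCENE + OBJ))

def encode(frame):
    """Split a frame into a scene code and a moving-object code."""
    return W_scene @ frame, W_obj @ frame

def decode(scene_code, obj_code):
    """Reconstruct a frame from the concatenated codes."""
    return W_dec @ np.concatenate([scene_code, obj_code])

frame = rng.standard_normal(FRAME)
scene_code, obj_code = encode(frame)

# Full reconstruction uses both codes.
full = decode(scene_code, obj_code)

# "Vanishing" one code and reconstructing from the concatenation of the
# zero and non-zero codes isolates what each code captures on its own.
scene_only = decode(scene_code, np.zeros(OBJ))
object_only = decode(np.zeros(SCENE), obj_code)

# With a linear decoder the two partial reconstructions add up to the
# full one, which makes the separation easy to sanity-check.
assert np.allclose(scene_only + object_only, full)
```

In the paper's setting the partial reconstructions are inspected visually instead: a clean scene-only image and a clean object-only image indicate that the adversarial training has kept the two codes disentangled.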



Author information

Correspondence to Qiao Wang.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Xie, R. et al. (2019). Adversarial Training for Video Disentangled Representation. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, W.H., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol 11296. Springer, Cham. https://doi.org/10.1007/978-3-030-05716-9_43

  • DOI: https://doi.org/10.1007/978-3-030-05716-9_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05715-2

  • Online ISBN: 978-3-030-05716-9

  • eBook Packages: Computer Science, Computer Science (R0)
