
Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos

International Journal of Computer Vision

Abstract

Understanding human behavior in videos is a complex, still unsolved problem that requires accurately modeling motion at both the local (pixel-wise dense prediction) and global (aggregation of motion cues) levels. Current approaches based on supervised learning require large amounts of annotated data, whose scarcity is one of the main factors limiting the development of general solutions. Unsupervised learning, instead, can leverage the vast amount of video available on the web and is a promising route to overcoming this limitation. In this paper, we propose a GAN-based adversarial framework that learns video representations and dynamics through a self-supervision mechanism in order to perform both dense and global prediction in videos. Our approach synthesizes videos by (1) factorizing the process into the generation of static visual content and of motion, (2) learning a suitable representation of a motion latent space so as to enforce spatio-temporal coherency of object trajectories, and (3) incorporating motion estimation and pixel-wise dense prediction into the training procedure. Self-supervision is enforced by using the motion masks produced by the generator, as a by-product of its generation process, to supervise the discriminator network in performing dense prediction. Performance evaluation, carried out on standard benchmarks, shows that our approach is able to learn both local and global video dynamics in an unsupervised way. The learned representations, in turn, support the training of video object segmentation methods with considerably fewer annotations (about 50%), yielding performance comparable to the state of the art. Furthermore, the proposed method achieves promising results in generating realistic videos, outperforming state-of-the-art approaches especially on motion-related metrics.
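
To make the described factorization and self-supervision mechanism concrete, the sketch below shows a minimal PyTorch-style implementation of a two-stream generator (static content plus motion, with the motion mask emitted as a by-product) and a discriminator with both a global real/fake head and a dense prediction head supervised by the generator's masks. All module names, layer sizes (a 16-frame, 32 × 32 setting), and the loss weight lam are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Two-stream generator: a 2D 'content' stream renders a static background,
    a 3D 'motion' stream renders the moving foreground plus a soft motion mask."""

    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        # Motion/foreground stream: 3D deconvolutions over (time, height, width).
        self.fg = nn.Sequential(
            nn.ConvTranspose3d(z_dim, ch * 4, (2, 4, 4)), nn.ReLU(True),
            nn.ConvTranspose3d(ch * 4, ch * 2, 4, 2, 1), nn.ReLU(True),
            nn.ConvTranspose3d(ch * 2, ch, 4, 2, 1), nn.ReLU(True),
        )
        self.to_rgb = nn.ConvTranspose3d(ch, 3, 4, 2, 1)   # foreground frames
        self.to_mask = nn.ConvTranspose3d(ch, 1, 4, 2, 1)  # motion mask (by-product)
        # Content stream: 2D deconvolutions producing one static background image.
        self.bg = nn.Sequential(
            nn.ConvTranspose2d(z_dim, ch * 4, 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z):                                  # z: (B, z_dim)
        h = self.fg(z.view(z.size(0), -1, 1, 1, 1))        # (B, ch, 8, 16, 16)
        fg = torch.tanh(self.to_rgb(h))                    # (B, 3, 16, 32, 32)
        mask = torch.sigmoid(self.to_mask(h))              # (B, 1, 16, 32, 32)
        bg = self.bg(z.view(z.size(0), -1, 1, 1))          # (B, 3, 32, 32)
        video = mask * fg + (1 - mask) * bg.unsqueeze(2)   # composite per frame
        return video, mask


class Discriminator(nn.Module):
    """3D-conv critic with two heads: a scalar real/fake score (global) and a
    dense per-pixel motion-mask prediction (local)."""

    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv3d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv3d(ch * 2, ch * 4, 4, 2, 1), nn.LeakyReLU(0.2, True),
        )
        self.score = nn.Conv3d(ch * 4, 1, (2, 4, 4))       # real/fake logit
        self.dense = nn.Sequential(                        # dense-prediction decoder
            nn.ConvTranspose3d(ch * 4, ch * 2, 4, 2, 1), nn.ReLU(True),
            nn.ConvTranspose3d(ch * 2, ch, 4, 2, 1), nn.ReLU(True),
            nn.ConvTranspose3d(ch, 1, 4, 2, 1),            # mask logits, full size
        )

    def forward(self, video):                              # (B, 3, 16, 32, 32)
        h = self.encoder(video)                            # (B, ch*4, 2, 4, 4)
        return self.score(h).view(-1), self.dense(h)


def train_step(G, D, real, opt_g, opt_d, z_dim=100, lam=1.0):
    """One adversarial step; the generator's masks self-supervise the
    discriminator's dense head at no annotation cost."""
    bce = nn.BCEWithLogitsLoss()
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake, mask = G(z)

    # Discriminator update: adversarial loss on real/fake videos, plus dense
    # prediction supervised by the generator's (detached) motion masks.
    s_real, _ = D(real)
    s_fake, dense_fake = D(fake.detach())
    d_loss = (bce(s_real, torch.ones_like(s_real))
              + bce(s_fake, torch.zeros_like(s_fake))
              + lam * bce(dense_fake, mask.detach()))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: fool the discriminator's global head.
    s_fake, _ = D(fake)
    g_loss = bce(s_fake, torch.ones_like(s_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The point the sketch makes explicit is that the dense targets come for free from the generation process: no manual mask annotation enters the loop, which is what allows the learned features to later support segmentation training with roughly half the usual annotations.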


Notes

  1. Note the distinction between unsupervised learning (the class of approaches that train machine learning models without annotations) and what the video segmentation literature calls unsupervised segmentation (the class of methods that, at inference time, perform segmentation without any input on object location other than the video frames themselves).


Author information

Corresponding author

Correspondence to S. Palazzo.

Additional information

Communicated by Xavier Alameda-Pineda, Elisa Ricci, Albert Ali Salah, Nicu Sebe, Shuicheng Yan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Spampinato, C., Palazzo, S., D’Oro, P. et al. Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos. Int J Comput Vis 128, 1378–1397 (2020). https://doi.org/10.1007/s11263-019-01246-5

