Simultaneous context and motion learning in video prediction

  • Original Paper
  • Published in: Signal, Image and Video Processing (2023)

Abstract

Video prediction aims to generate future frames from several given past frames. It has many applications, including abnormal action recognition, future traffic prediction, long-term planning, and autonomous driving. Recently, various deep learning-based methods have been proposed to address this task. However, these methods tend to focus only on improving network performance while ignoring their computational cost. Several methods even require two separate networks operating on two different input types, such as RGB frames, temporal gradients, or optical flow. This makes them increasingly complex and demands a huge amount of computation and memory. In this paper, we introduce a simple yet robust approach that learns both appearance and motion features simultaneously in a single network, regardless of the diversity of input video modalities. Moreover, we present a lightweight autoencoder network to address this issue. Our framework is evaluated on various benchmarks, including the KTH, KITTI, and BAIR datasets. The experimental results show that our approach achieves competitive performance compared to state-of-the-art video prediction methods with only 34.24 MB of memory and 2.59 GFLOPs. With a smaller model size and lower computational cost, our framework runs faster, with a shorter inference time, than the other methods. Moreover, requiring only 2.934 s to predict the next frame, our framework is a promising approach for deployment on embedded or mobile devices without a GPU in real time.
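As a concrete illustration of the idea described above, the following is a minimal, hypothetical PyTorch sketch of a single lightweight autoencoder that takes several past RGB frames together with their temporal differences and predicts the next frame. The network, its layer sizes, and all names (TinyVideoPredictor, num_past, base_channels) are illustrative assumptions for exposition only, not the authors' actual architecture.

# Minimal, hypothetical sketch (not the authors' architecture): a single
# lightweight autoencoder that maps several past RGB frames to the next frame.
# Appearance (per-frame content) and motion (frame-to-frame differences) are
# fed to one network by stacking them along the channel axis.
import torch
import torch.nn as nn


class TinyVideoPredictor(nn.Module):
    def __init__(self, num_past=4, base_channels=32):
        super().__init__()
        # Input: num_past RGB frames plus (num_past - 1) temporal differences.
        in_channels = 3 * num_past + 3 * (num_past - 1)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base_channels, base_channels * 2, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base_channels, 3, 4, stride=2, padding=1),
            nn.Sigmoid(),  # frames assumed normalized to [0, 1]
        )

    def forward(self, frames):
        # frames: (batch, num_past, 3, H, W)
        b, t, c, h, w = frames.shape
        motion = frames[:, 1:] - frames[:, :-1]        # temporal differences
        appearance = frames.reshape(b, t * c, h, w)    # stacked RGB frames
        motion = motion.reshape(b, (t - 1) * c, h, w)
        x = torch.cat([appearance, motion], dim=1)     # one joint input
        return self.decoder(self.encoder(x))           # predicted next frame


if __name__ == "__main__":
    model = TinyVideoPredictor(num_past=4)
    past = torch.rand(1, 4, 3, 64, 64)                 # e.g. 64x64 clips
    next_frame = model(past)
    print(next_frame.shape)                            # torch.Size([1, 3, 64, 64])
    n_params = sum(p.numel() for p in model.parameters())
    print(f"parameters: {n_params / 1e6:.2f} M")

Feeding the stacked frames and their differences into one encoder is what allows a single network to see both appearance and motion without a second stream such as optical flow; the parameter count printed at the end is how one would check that such a model stays within a small memory budget.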


Data availability

All datasets used in this work are publicly available on the Internet.


Funding

This work is supported in part by the Thai Nguyen University of Education under Grant TNUE-2022-03.

Author information


Contributions

(1) Quang proposed the idea. (2) All authors implemented the code, conducted the experiments, and compared the results of the proposed approach with state-of-the-art methods. (3) All authors wrote the draft of the manuscript and prepared the tables. (4) Quang prepared all figures. (5) All authors revised and proofread the manuscript. (6) All authors reviewed the manuscript. (7) Trang submitted the manuscript.

Corresponding author

Correspondence to Trang Phung T. Thu.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Vu, DQ., Thu, T.P.T. Simultaneous context and motion learning in video prediction. SIViP 17, 3933–3942 (2023). https://doi.org/10.1007/s11760-023-02623-x

