Simultaneous context and motion learning in video prediction

  • Original Paper
  • Published in: Signal, Image and Video Processing (2023)

Abstract

Video prediction aims to generate future frames from several given past frames. It has many applications, including abnormal action recognition, future traffic prediction, long-term planning, and autonomous driving. Recently, various deep learning-based methods have been proposed to address this task. However, these methods tend to focus only on improving network performance while ignoring their computational cost. Several methods even require two separate networks operating on two different input types, such as RGB frames, temporal gradients, or optical flow. This makes them increasingly complex and demands a huge amount of computation and memory. In this paper, we introduce a simple yet robust approach that learns both appearance and motion features simultaneously in a single network, regardless of the diversity of input video modalities. Moreover, we present a lightweight autoencoder network to address this issue. Our framework is evaluated on various benchmarks, including the KTH, KITTI, and BAIR datasets. The experimental results show that our approach achieves competitive performance compared to state-of-the-art video prediction methods with only 34.24 MB of memory and 2.59 GFLOPs. With a smaller model size and lower computational cost, our framework runs faster, with a shorter inference time, than the other methods. Moreover, requiring only 2.934 s to predict the next frame, our framework is a promising approach for deployment on embedded or mobile devices without a GPU in real time.
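As a concrete illustration of the idea described above, the following is a minimal, hypothetical PyTorch sketch of a single lightweight autoencoder that takes several past RGB frames together with their temporal differences and predicts the next frame. The network, its layer sizes, and all names (TinyVideoPredictor, num_past, base_channels) are illustrative assumptions for exposition only, not the authors' actual architecture.

# Minimal, hypothetical sketch (not the authors' architecture): a single
# lightweight autoencoder that maps several past RGB frames to the next frame.
# Appearance (per-frame content) and motion (frame-to-frame differences) are
# fed to one network by stacking them along the channel axis.
import torch
import torch.nn as nn


class TinyVideoPredictor(nn.Module):
    def __init__(self, num_past=4, base_channels=32):
        super().__init__()
        # Input: num_past RGB frames plus (num_past - 1) temporal differences.
        in_channels = 3 * num_past + 3 * (num_past - 1)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base_channels, base_channels * 2, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base_channels, 3, 4, stride=2, padding=1),
            nn.Sigmoid(),  # frames assumed normalized to [0, 1]
        )

    def forward(self, frames):
        # frames: (batch, num_past, 3, H, W)
        b, t, c, h, w = frames.shape
        motion = frames[:, 1:] - frames[:, :-1]        # temporal differences
        appearance = frames.reshape(b, t * c, h, w)    # stacked RGB frames
        motion = motion.reshape(b, (t - 1) * c, h, w)
        x = torch.cat([appearance, motion], dim=1)     # one joint input
        return self.decoder(self.encoder(x))           # predicted next frame


if __name__ == "__main__":
    model = TinyVideoPredictor(num_past=4)
    past = torch.rand(1, 4, 3, 64, 64)                 # e.g. 64x64 clips
    next_frame = model(past)
    print(next_frame.shape)                            # torch.Size([1, 3, 64, 64])
    n_params = sum(p.numel() for p in model.parameters())
    print(f"parameters: {n_params / 1e6:.2f} M")

Feeding the stacked frames and their differences into one encoder is what allows a single network to see both appearance and motion without a second stream such as optical flow; the parameter count printed at the end is how one would check that such a model stays within a small memory budget.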


Data availability

All datasets used in this work are publicly available on the Internet.


Funding

This work is supported in part by the Thai Nguyen University of Education under Grant TNUE-2022-03.

Author information


Contributions

(1) Quang proposed the idea. (2) All authors implemented the code, conducted the experiments, and compared the results of the proposed approach with state-of-the-art methods. (3) All authors wrote the draft of the manuscript and prepared the tables. (4) Quang prepared all figures. (5) All authors revised and proofread the manuscript. (6) All authors reviewed the manuscript. (7) Trang submitted the manuscript.

Corresponding author

Correspondence to Trang Phung T. Thu.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Vu, DQ., Thu, T.P.T. Simultaneous context and motion learning in video prediction. SIViP 17, 3933–3942 (2023). https://doi.org/10.1007/s11760-023-02623-x

