Abstract
Spatiotemporal predictive learning is a deep learning approach that generates future frames from historical frames in a self-supervised manner. Existing studies struggle to capture long-term dependencies and to produce accurate predictions over extended time horizons. To address these limitations, this paper introduces a nested attention module, a special attention mechanism that captures the spatiotemporal correlations of input historical frames. The nested attention module decomposes temporal attention into inter-frame channel attention and spatiotemporal attention, and uses a nested attention mechanism to capture long-term temporal dependencies, improving the model's performance and generalization ability. Furthermore, to prevent overfitting, a new regularization method is proposed that considers both the intra-frame spatial error and the inter-frame temporal evolution error of sequence frames, enhancing the model's robustness to dropout operations. The proposed model achieves state-of-the-art performance on four benchmark datasets: the Moving MNIST handwritten digit dataset, the Human3.6M dataset, the sea surface temperature (SST) dataset, and the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) dataset. Extended experiments demonstrate the generalization and extensibility of the nested attention module on real-world datasets. When predicting 10 frames on Moving MNIST, the model reduces mean squared error by 31.7% and mean absolute error by 26.9%. Our proposed model provides a new baseline for future research in spatiotemporal predictive learning tasks.
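To illustrate the two ideas summarized above, the following is a minimal NumPy sketch, not the authors' implementation: it composes an inter-frame channel attention with a spatiotemporal attention over flattened space-time positions, and combines an intra-frame spatial error with an inter-frame temporal evolution error in one loss. All function names, the softmax-based weighting, and the unweighted sum of the two error terms are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x):
    """Inter-frame channel attention (sketch). x: (T, C, H, W)."""
    s = x.mean(axis=(2, 3))           # squeeze spatial dims -> (T, C)
    w = softmax(s, axis=1)            # per-frame channel weights
    return x * w[:, :, None, None]    # reweight channels of every frame

def spatiotemporal_attention(x):
    """Attention over all T*H*W space-time positions (sketch)."""
    T, C, H, W = x.shape
    q = x.transpose(1, 0, 2, 3).reshape(C, T * H * W)   # (C, THW)
    attn = softmax(q.T @ q / np.sqrt(C), axis=-1)       # (THW, THW)
    out = (q @ attn.T).reshape(C, T, H, W)
    return out.transpose(1, 0, 2, 3)                    # back to (T, C, H, W)

def nested_attention(x):
    # nest the two attentions: channel first, then spatiotemporal
    return spatiotemporal_attention(channel_attention(x))

def spatiotemporal_loss(pred, target):
    """Intra-frame spatial error + inter-frame temporal evolution error."""
    spatial = np.mean((pred - target) ** 2)
    # compare frame-to-frame differences, i.e. how the sequences evolve
    temporal = np.mean((np.diff(pred, axis=0) - np.diff(target, axis=0)) ** 2)
    return spatial + temporal
```

The dense (THW x THW) attention matrix is only practical for tiny inputs; it is meant to show the decomposition, not an efficient implementation.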
Data availability
All datasets used in this study have been publicly released or are available upon request. The MovingMNIST dataset is available for research purposes and can be downloaded from the GitHub repository https://github.com/tychovdo/MovingMNIST and from http://www.cs.toronto.edu/~nitish/unsupervised_video/. The original MNIST dataset, from which MovingMNIST is derived, can be found at http://yann.lecun.com/exdb/mnist/. The Human3.6M dataset is available for non-commercial research purposes and can be accessed upon signing a license agreement; to obtain it, visit the official Human3.6M website at http://vision.imar.ro/human3.6m/ and follow the instructions for requesting access. The KITTI dataset is available for non-commercial use and can be downloaded from the official KITTI Vision Benchmark Suite website at http://www.cvlibs.net/datasets/kitti/. The SST (NOAA Optimum Interpolation SST V2) dataset is provided by NOAA PSL, Boulder, Colorado, USA (https://psl.noaa.gov); it is available for research purposes and can be downloaded from https://psl.noaa.gov/data/gridded/data.noaa.oisst.v2.html. The BAIR Robot Pushing dataset is provided by Berkeley AI Research (BAIR), is available for research purposes, and can be accessed from the official website at http://rail.eecs.berkeley.edu/datasets/.
Acknowledgements
We wish to express our heartfelt gratitude to the many individuals and teams who supported this research through their pioneering work, open-source datasets, and thoughtful feedback. In particular, we are grateful to the researchers in spatiotemporal predictive learning whose seminal work laid the foundation for, and inspired, our own. We also sincerely thank the providers of the MovingMNIST, Human3.6M, KITTI, and SST datasets, without which this research would not have been feasible. Finally, we appreciate the guidance and encouragement of the colleagues, reviewers, and others who helped strengthen this paper.
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Authors and Affiliations
Contributions
SW: Data curation, Methodology, Formal analysis and investigation, Writing—original draft preparation, Modification. RH: Conceptualization, Writing—review and editing, Resources, Supervision.
Corresponding author
Ethics declarations
Competing interests
The authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, S., Han, R. Enhancing spatiotemporal predictive learning: an approach with nested attention module. J Intell Manuf (2024). https://doi.org/10.1007/s10845-023-02318-7