Abstract
Spatiotemporal predictive learning is a deep learning approach that generates future frames from historical frames in a self-supervised manner. Existing studies struggle to capture long-term dependencies and to produce accurate predictions over extended time horizons. To address these limitations, this paper introduces a nested attention module, a special attention mechanism that captures the spatiotemporal correlations of input historical frames. The nested attention module decomposes temporal attention into inter-frame channel attention and spatiotemporal attention, and uses a nested attention mechanism to capture long-term temporal dependencies, improving the model's performance and generalization ability. Furthermore, to prevent overfitting, a new regularization method is proposed that considers both the intra-frame spatial error and the inter-frame temporal evolution error of sequence frames, enhancing the model's robustness to dropout operations. The proposed model achieves state-of-the-art performance on four benchmark datasets: the Moving MNIST handwritten digit dataset, the Human3.6M dataset, the sea surface temperature (SST) dataset, and the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) dataset. Extended experiments demonstrate the generalization and extensibility of the nested attention module on real-world datasets. When predicting 10 frames on Moving MNIST, the model reduces mean squared error by 31.7% and mean absolute error by 26.9%. Our proposed model provides a new baseline for future research in spatiotemporal predictive learning tasks.
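To illustrate the two ideas summarized above, the following is a minimal NumPy sketch, not the authors' implementation: it composes an inter-frame channel attention with a spatiotemporal attention over flattened space-time positions, and combines an intra-frame spatial error with an inter-frame temporal evolution error in one loss. All function names, the softmax-based weighting, and the unweighted sum of the two error terms are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x):
    """Inter-frame channel attention (sketch). x: (T, C, H, W)."""
    s = x.mean(axis=(2, 3))           # squeeze spatial dims -> (T, C)
    w = softmax(s, axis=1)            # per-frame channel weights
    return x * w[:, :, None, None]    # reweight channels of every frame

def spatiotemporal_attention(x):
    """Attention over all T*H*W space-time positions (sketch)."""
    T, C, H, W = x.shape
    q = x.transpose(1, 0, 2, 3).reshape(C, T * H * W)   # (C, THW)
    attn = softmax(q.T @ q / np.sqrt(C), axis=-1)       # (THW, THW)
    out = (q @ attn.T).reshape(C, T, H, W)
    return out.transpose(1, 0, 2, 3)                    # back to (T, C, H, W)

def nested_attention(x):
    # nest the two attentions: channel first, then spatiotemporal
    return spatiotemporal_attention(channel_attention(x))

def spatiotemporal_loss(pred, target):
    """Intra-frame spatial error + inter-frame temporal evolution error."""
    spatial = np.mean((pred - target) ** 2)
    # compare frame-to-frame differences, i.e. how the sequences evolve
    temporal = np.mean((np.diff(pred, axis=0) - np.diff(target, axis=0)) ** 2)
    return spatial + temporal
```

The dense (THW x THW) attention matrix is only practical for tiny inputs; it is meant to show the decomposition, not an efficient implementation.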
Data availability
All datasets used in this study have been publicly released or are available upon request. The MovingMNIST dataset is available for research purposes and can be downloaded from the GitHub repository https://github.com/tychovdo/MovingMNIST and from http://www.cs.toronto.edu/~nitish/unsupervised_video/. The original MNIST dataset, from which MovingMNIST is derived, can be found at http://yann.lecun.com/exdb/mnist/. The Human3.6M dataset is available for non-commercial research purposes and can be accessed upon signing a license agreement; to obtain it, visit the official Human3.6M website at http://vision.imar.ro/human3.6m/ and follow the instructions for requesting access. The KITTI dataset is available for non-commercial use and can be downloaded from the official KITTI Vision Benchmark Suite website at http://www.cvlibs.net/datasets/kitti/. The SST (NOAA Optimum Interpolation SST V2) dataset is provided by NOAA PSL, Boulder, Colorado, USA (https://psl.noaa.gov); it is available for research purposes and can be downloaded from https://psl.noaa.gov/data/gridded/data.noaa.oisst.v2.html. The BAIR Robot Pushing dataset is provided by Berkeley AI Research (BAIR), is available for research purposes, and can be accessed from the official website at http://rail.eecs.berkeley.edu/datasets/.
Acknowledgements
We wish to express our heartfelt gratitude to the many individuals and teams who supported this research through their pioneering work, open-source datasets, and thoughtful feedback. In particular, we are grateful to the researchers in spatiotemporal predictive learning whose seminal work laid the foundation for, and inspired, our own. We also sincerely thank the providers of the MovingMNIST, Human3.6M, KITTI, and SST datasets, without which this research would not have been feasible. Finally, we appreciate the guidance and encouragement of the colleagues, reviewers, and others who helped strengthen this paper.
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Authors and Affiliations
Contributions
SW: Data curation, Methodology, Formal analysis and investigation, Writing—original draft preparation, Modification. RH: Conceptualization, Writing—review and editing, Resources, Supervision.
Corresponding author
Ethics declarations
Competing interests
The authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, S., Han, R. Enhancing spatiotemporal predictive learning: an approach with nested attention module. J Intell Manuf (2024). https://doi.org/10.1007/s10845-023-02318-7