Abstract
Predictive learning is receiving growing interest and has wide applications. By combining RNNs and CNNs, recent works attempt to capture temporal dependencies and spatial correlations simultaneously. However, these methods tend to be deep networks without feature compression across scales, and they easily incur high computational costs, especially on high-resolution frames. To reduce this resource burden, we introduce a novel and efficient network named CASTnet for spatiotemporal predictive learning. In this paper, we present a concise architecture that cascades multiple stages: the first stage accepts raw frames directly without information loss, and the following stages compress features, leading to relatively low computation. We then adopt a dedicated prediction head that aggregates multi-level features into predictions, which helps our model capture multi-scale targets. For the spatiotemporal block, we adopt bidirectional convolutional attention (BCA) operations to capture local and long-range information simultaneously without quadratic computational complexity. To further improve model performance without much additional computational burden, we propose frame-wise knowledge distillation, which enables each low-level frame to learn from its corresponding high-level frame. To evaluate our model, we conduct quantitative, qualitative, and ablation experiments on the Moving MNIST and Radar Echo datasets. The results show that our CASTnet achieves competitive results with lower computational costs and fewer network parameters than state-of-the-art models.
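The frame-wise distillation described above pairs each frame produced at a low-level stage with its counterpart at a high-level stage. As a minimal sketch only: the per-frame loss below assumes a simple mean-squared error between paired frames, which may differ from the loss actually used in the paper; the function names are illustrative, not from the source.

```python
def frame_mse(low_frame, high_frame):
    """Mean squared error between two flattened frames (illustrative)."""
    assert len(low_frame) == len(high_frame)
    return sum((lo - hi) ** 2 for lo, hi in zip(low_frame, high_frame)) / len(low_frame)


def framewise_distillation_loss(low_frames, high_frames):
    """Frame-wise distillation sketch: each low-level frame learns from
    the high-level frame at the same time step, and the per-frame losses
    are averaged over the sequence.  The MSE form is an assumption."""
    assert len(low_frames) == len(high_frames)
    per_frame = [frame_mse(lo, hi) for lo, hi in zip(low_frames, high_frames)]
    return sum(per_frame) / len(per_frame)
```

The key point the sketch illustrates is the pairing: supervision is applied per time step rather than on a pooled representation, so each frame of the cheaper branch gets a direct teaching signal.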
Availability of data and materials
The datasets used in our experiments are available at http://www.cs.toronto.edu/~nitish/unsupervised_video and https://tianchi.aliyun.com/competition/entrance/231662/information
Code Availability
The code will be made available after this paper is published.
Funding
Not applicable
Author information
Authors and Affiliations
Contributions
Main work: Fengzhen Sun; Supervision: Weidong Jin.
Corresponding author
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest.
Ethics approval
Not applicable
Consent to participate
Not applicable
Consent for publication
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, F., Jin, W. CAST: A convolutional attention spatiotemporal network for predictive learning. Appl Intell 53, 23553–23563 (2023). https://doi.org/10.1007/s10489-023-04750-x