CAST: A convolutional attention spatiotemporal network for predictive learning

Abstract

Predictive learning is receiving growing interest and has wide applications. By combining RNNs and CNNs, recent works attempt to capture temporal dependencies and spatial correlations simultaneously. However, these methods tend to be deep networks without feature compression across scales, and they easily incur high computational costs, especially for high-resolution frames. To reduce this resource burden, we introduce a novel and efficient network named CASTnet for spatiotemporal predictive learning. In this paper, we present a concise architecture that cascades multiple stages: the first stage directly accepts the raw frames without information loss, and the following stages compress features, which keeps computation relatively low. We then adopt a dedicated prediction head that aggregates multi-level features into predictions, helping our model capture multi-scale targets. For the spatiotemporal block, we adopt bidirectional convolutional attention (BCA) operations to capture local and long-range information simultaneously without quadratic computational complexity. To further improve model performance without adding much computational burden, we propose frame-wise knowledge distillation, which enables each frame at the low level to learn from its counterpart at the high level. To evaluate our model, we conduct quantitative, qualitative, and ablation experiments on the MovingMNIST and Radar Echo datasets. The results show that CASTnet achieves competitive results with lower computational costs and fewer network parameters than state-of-the-art models.
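
The abstract names two ideas behind CASTnet's efficiency claim: convolutional attention that mixes local and long-range context without quadratic cost, and frame-wise knowledge distillation between feature levels. The paper's exact BCA and distillation formulations are not reproduced on this page, so the PyTorch sketches below are illustrative only; the module and function names (ConvAttention, frame_distill_loss) and all hyperparameters are hypothetical, and the bidirectional (forward/backward over the sequence) aspect of BCA is omitted.

First, a minimal sketch of a convolutional attention block in the general spirit described: a depthwise convolution supplies local spatial mixing at O(HW) cost, while a squeeze-and-excitation-style global gate supplies image-wide context at a cost independent of resolution, so no HW-by-HW attention matrix is ever formed.

```python
# Hypothetical sketch of convolutional attention: local context via depthwise
# convolution, long-range context via global channel gating. Not the paper's BCA.
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        # Local branch: depthwise conv captures neighborhood structure in O(HW).
        self.local = nn.Conv2d(channels, channels, kernel_size,
                               padding=kernel_size // 2, groups=channels)
        # Global branch: squeeze-and-excitation style channel attention,
        # whose cost does not depend on spatial resolution.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.local(x)            # local spatial mixing
        gate = self.gate(self.pool(x))   # global per-channel gating
        return self.proj(local * gate) + x  # residual connection

x = torch.randn(2, 32, 64, 64)
print(ConvAttention(32)(x).shape)  # torch.Size([2, 32, 64, 64])
```

Second, a hedged sketch of frame-wise distillation as the abstract characterizes it (each low-level frame learning from its high-level counterpart): per-frame student features are regressed toward detached teacher features, with bilinear upsampling to reconcile stage resolutions. Matching channel counts are assumed here; a real implementation might insert a 1x1 projection.

```python
# Hypothetical frame-wise distillation loss under the assumptions stated above.
import torch
import torch.nn.functional as F

def frame_distill_loss(low_feats, high_feats):
    """Average per-frame MSE between low-level (student) features and detached
    high-level (teacher) features; assumes matching channel counts."""
    loss = 0.0
    for lo, hi in zip(low_feats, high_feats):
        hi = hi.detach()  # gradients flow only into the low-level features
        if lo.shape[-2:] != hi.shape[-2:]:
            # Upsample the (typically downsampled) high-level map to match.
            hi = F.interpolate(hi, size=lo.shape[-2:], mode="bilinear",
                               align_corners=False)
        loss = loss + F.mse_loss(lo, hi)
    return loss / max(len(low_feats), 1)

# Toy usage: 10 frames of (batch=2, channels=16) features at two resolutions.
low = [torch.randn(2, 16, 64, 64) for _ in range(10)]
high = [torch.randn(2, 16, 32, 32) for _ in range(10)]
print(frame_distill_loss(low, high))
```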

Availability of data and materials

The datasets used in our experiments are available at http://www.cs.toronto.edu/~nitish/unsupervised_video and https://tianchi.aliyun.com/competition/entrance/231662/information

Code Availability

The code will be made available once this paper is published.

Funding

Not applicable

Author information

Authors and Affiliations

Authors

Contributions

Main work: Fengzhen Sun; supervision: Weidong Jin.

Corresponding author

Correspondence to Fengzhen Sun.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest.

Ethics approval

Not applicable

Consent to participate

Not applicable

Consent for publication

Not applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, F., Jin, W. CAST: A convolutional attention spatiotemporal network for predictive learning. Appl Intell 53, 23553–23563 (2023). https://doi.org/10.1007/s10489-023-04750-x
