Abstract
Predictive learning is receiving growing interest and has wide applications. By combining RNNs and CNNs, recent works attempt to capture temporal dependencies and spatial correlations simultaneously. However, these methods tend to be deep networks without feature compression across scales, and they easily incur high computational costs, especially on high-resolution frames. To reduce this resource burden, we introduce a novel and efficient network named CASTnet for spatiotemporal predictive learning. In this paper, we present a concise architecture that cascades multiple stages: the first stage accepts raw frames directly without information loss, and the following stages compress features, leading to relatively low computation. We then adopt a dedicated prediction head that aggregates multi-level features into predictions, which helps our model capture multi-scale targets. For the spatiotemporal block, we adopt bidirectional convolutional attention (BCA) operations to capture local and long-range information simultaneously without quadratic computational complexity. To further improve model performance without much additional computational burden, we propose frame-wise knowledge distillation, which enables each low-level frame to learn from its corresponding high-level frame. To evaluate our model, we conduct quantitative, qualitative, and ablation experiments on the Moving MNIST and Radar Echo datasets. The results show that our CASTnet achieves competitive results with lower computational costs and fewer network parameters than state-of-the-art models.
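The frame-wise distillation described above pairs each frame produced at a low-level stage with its counterpart at a high-level stage. As a minimal sketch only: the per-frame loss below assumes a simple mean-squared error between paired frames, which may differ from the loss actually used in the paper; the function names are illustrative, not from the source.

```python
def frame_mse(low_frame, high_frame):
    """Mean squared error between two flattened frames (illustrative)."""
    assert len(low_frame) == len(high_frame)
    return sum((lo - hi) ** 2 for lo, hi in zip(low_frame, high_frame)) / len(low_frame)


def framewise_distillation_loss(low_frames, high_frames):
    """Frame-wise distillation sketch: each low-level frame learns from
    the high-level frame at the same time step, and the per-frame losses
    are averaged over the sequence.  The MSE form is an assumption."""
    assert len(low_frames) == len(high_frames)
    per_frame = [frame_mse(lo, hi) for lo, hi in zip(low_frames, high_frames)]
    return sum(per_frame) / len(per_frame)
```

The key point the sketch illustrates is the pairing: supervision is applied per time step rather than on a pooled representation, so each frame of the cheaper branch gets a direct teaching signal.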
Availability of data and materials
The datasets used in our experiments are available at http://www.cs.toronto.edu/~nitish/unsupervised_video and https://tianchi.aliyun.com/competition/entrance/231662/information
Code Availability
The code will be made available after this paper is published.
Funding
Not applicable
Author information
Authors and Affiliations
Contributions
Main work: Fengzhen Sun; Supervision: Weidong Jin.
Corresponding author
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest.
Ethics approval
Not applicable
Consent to participate
Not applicable
Consent for publication
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, F., Jin, W. CAST: A convolutional attention spatiotemporal network for predictive learning. Appl Intell 53, 23553–23563 (2023). https://doi.org/10.1007/s10489-023-04750-x