
Transformer with Spatio-Temporal Representation for Video Anomaly Detection

  • Conference paper
  • Part of: Structural, Syntactic, and Statistical Pattern Recognition (S+SSPR 2022)

Abstract

With the popularity of smart surveillance devices and the increase in people's security awareness, video anomaly detection has become an important task. However, learning rich multi-scale spatio-temporal information from high-dimensional videos to predict anomalous behaviors is challenging due to the large local redundancy and complex global dependencies among video frames. Although Convolutional Neural Networks (CNNs) have strong inductive biases, their inherent locality limits their ability to capture long-range spatio-temporal features. Therefore, we propose a Transformer with spatio-temporal representation for video anomaly detection. The network combines convolution with Transformer operations: convolution extracts shallow spatial features that facilitate the recovery of sampled images, while the Transformer encodes patches and efficiently captures long-range dependencies through a self-attention mechanism, reducing the limitations imposed by local redundancy. Experimental results on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets demonstrate the effectiveness of the proposed network.
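
As a rough illustration of the hybrid design described in the abstract, the Python/PyTorch sketch below pairs a convolutional stem with a Transformer encoder over patch tokens to predict the next frame of a short clip; a large prediction error then signals an anomaly. This is a minimal sketch, not the authors' implementation: every module name, layer size, and the simple MSE anomaly score are assumptions made here for illustration.

```python
# Minimal sketch (assumed, not the authors' code): convolutional stem for
# shallow spatial features + Transformer encoder over patch tokens for
# long-range dependencies, in a future-frame-prediction setup.
import torch
import torch.nn as nn


class HybridPredictor(nn.Module):
    def __init__(self, in_frames=4, channels=3, dim=256, heads=8, depth=4, patch=8):
        super().__init__()
        # Convolutional stem: shallow spatial features from the stacked input frames.
        self.stem = nn.Sequential(
            nn.Conv2d(in_frames * channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Patch embedding: non-overlapping patches become tokens for the Transformer.
        self.to_tokens = nn.Conv2d(dim, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        # Self-attention captures global dependencies that local convolutions miss.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Lightweight decoder upsamples the tokens back to a full-resolution frame.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 64, kernel_size=patch, stride=patch),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, clip):                      # clip: (B, T*C, H, W)
        feat = self.stem(clip)                    # (B, dim, H, W)
        tok = self.to_tokens(feat)                # (B, dim, H/p, W/p)
        b, d, h, w = tok.shape
        tok = tok.flatten(2).transpose(1, 2)      # (B, h*w, dim) token sequence
        tok = self.encoder(tok)                   # global self-attention over patches
        tok = tok.transpose(1, 2).reshape(b, d, h, w)
        return self.decoder(tok)                  # predicted next frame (B, C, H, W)


# At test time, a large prediction error flags an anomalous frame.
model = HybridPredictor()
clip = torch.randn(1, 4 * 3, 128, 128)            # four stacked RGB frames
target = torch.randn(1, 3, 128, 128)              # ground-truth next frame
score = torch.mean((model(clip) - target) ** 2)   # higher error -> more anomalous
```

The paper's full model presumably uses richer multi-scale features and training objectives than this sketch; the point here is only how a convolutional front end and patch-wise self-attention can be combined in a single frame predictor.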



Acknowledgment

This work is supported in part by the National Natural Science Foundation of China under Grants 61871241, 61971245, and 61976120, in part by the Nantong Science and Technology Program under Grant JC2021131, and in part by the Postgraduate Research and Practice Innovation Program of Jiangsu Province under Grant KYCX21_3084.

Author information

Corresponding author

Correspondence to Hongjun Li.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sun, X., Chen, J., Shen, X., Li, H. (2022). Transformer with Spatio-Temporal Representation for Video Anomaly Detection. In: Krzyzak, A., Suen, C.Y., Torsello, A., Nobile, N. (eds) Structural, Syntactic, and Statistical Pattern Recognition. S+SSPR 2022. Lecture Notes in Computer Science, vol 13813. Springer, Cham. https://doi.org/10.1007/978-3-031-23028-8_22

  • DOI: https://doi.org/10.1007/978-3-031-23028-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23027-1

  • Online ISBN: 978-3-031-23028-8

  • eBook Packages: Computer Science (R0)
