Fine-Tuning Pre-trained Vision Transformer Model for Anomaly Detection in Video Sequences

  • Conference paper

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 625)

Abstract

Detecting anomalies in video sequences is one of the most popular topics in computer vision. It is considered a challenging task in video analysis because the definition of an anomaly is subjective or context-dependent. Various deep learning models, such as convolutional neural networks (CNNs), have previously been used for this purpose. This paper proposes a novel solution based on the Vision Transformer, a state-of-the-art deep learning model, chosen for its current prominence and its performance. We fine-tune a pre-trained Vision Transformer model on the UCSD dataset, which enables the automatic classification of video frames as normal or abnormal. The evaluation shows that the model achieves a good accuracy score.
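The frame-level evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each video frame carries a binary ground-truth label and that the classifier outputs one predicted label per frame. The label encoding (0 for normal, 1 for abnormal), the function names, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence

NORMAL, ABNORMAL = 0, 1  # hypothetical label encoding, not taken from the paper


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Counter:
    """Count (true_label, predicted_label) pairs over all frames."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of frames whose predicted label equals the ground truth."""
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty sequence")
    counts = confusion_counts(y_true, y_pred)
    correct = counts[(NORMAL, NORMAL)] + counts[(ABNORMAL, ABNORMAL)]
    return correct / len(y_true)


# Made-up labels for eight frames, for illustration only.
frame_labels = [0, 0, 0, 1, 1, 0, 1, 0]
frame_preds = [0, 0, 1, 1, 1, 0, 0, 0]
print(accuracy(frame_labels, frame_preds))  # 0.75
```

Accuracy alone can look favorable when abnormal frames are rare, so the per-class counts returned by `confusion_counts` are worth inspecting alongside the single score.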



Author information

Corresponding author

Correspondence to Abdelhafid Berroukham.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Berroukham, A., Housni, K., Lahraichi, M. (2023). Fine-Tuning Pre-trained Vision Transformer Model for Anomaly Detection in Video Sequences. In: Lazaar, M., En-Naimi, E.M., Zouhair, A., Al Achhab, M., Mahboub, O. (eds) Proceedings of the 6th International Conference on Big Data and Internet of Things. BDIoT 2022. Lecture Notes in Networks and Systems, vol 625. Springer, Cham. https://doi.org/10.1007/978-3-031-28387-1_24
