Vehicle theft recognition from surveillance video based on spatiotemporal attention

Abstract

Frequent vehicle thefts have a highly detrimental impact on public safety. Thanks to the surveillance equipment distributed throughout a city, a large number of videos that can be used to recognize vehicle theft are available. However, vehicle theft is characterized by a small criminal target and subtle movements, so existing action recognition algorithms cannot be applied to it directly. In this paper, we propose a vehicle theft recognition method based on a spatiotemporal attention mechanism. First, a vehicle theft database is established by collecting videos from the Internet and from an existing dataset. Then, we build a vehicle theft recognition network that applies a spatiotemporal attention mechanism when extracting the spatiotemporal features of theft. By learning adaptive feature weights, the network emphasizes the features that contribute most to recognition. Simulation experiments show that the proposed algorithm achieves 97.04% accuracy on the collected vehicle theft database.
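
To make the adaptive spatiotemporal feature weighting described above concrete, the sketch below shows a minimal attention block of the kind commonly inserted into a 3D convolutional backbone. It is an illustrative example under assumptions, not the authors' implementation: the module name, the squeeze-and-excitation style channel gate, the convolutional spatiotemporal gate, and the reduction ratio are all choices made for this sketch (PyTorch).

import torch
import torch.nn as nn

class SpatiotemporalAttention(nn.Module):
    """Illustrative attention block for 3D CNN features of shape (B, C, T, H, W)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel gate: global average pooling over (T, H, W), then two 1x1x1 convolutions.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatiotemporal gate: one attention weight per (t, h, w) location.
        self.spatiotemporal_gate = nn.Sequential(
            nn.Conv3d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)                        # re-weight channels
        avg_map = x.mean(dim=1, keepdim=True)               # (B, 1, T, H, W)
        max_map = x.amax(dim=1, keepdim=True)               # (B, 1, T, H, W)
        gate = self.spatiotemporal_gate(torch.cat([avg_map, max_map], dim=1))
        return x * gate                                     # re-weight locations

# Example with dummy clip features from a hypothetical 3D backbone.
features = torch.randn(2, 64, 8, 28, 28)                    # (batch, channels, frames, height, width)
weighted = SpatiotemporalAttention(channels=64)(features)
print(weighted.shape)                                       # torch.Size([2, 64, 8, 28, 28])

In a design of this kind, channels and spatiotemporal locations that respond to the small theft target and its subtle motion can receive larger learned weights, which is the role the attention mechanism plays in the method summarized above.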

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (61671365, U1903213) and by the Key Research and Development Program of Shaanxi Province (2020KW-009).

Author information

Corresponding author

Correspondence to Fan Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

He, L., Wen, S., Wang, L. et al. Vehicle theft recognition from surveillance video based on spatiotemporal attention. Appl Intell 51, 2128–2143 (2021). https://doi.org/10.1007/s10489-020-01933-8
