A deep unified framework for suspicious action recognition

  • Amine Ilidrissi
  • Joo Kooi Tan
Original Article


As the field of action recognition evolves under the influence of the recent deep learning trend, and while research in areas such as background subtraction, object segmentation, and action classification is steadily progressing, experiments devoted to evaluating a combination of these fields, whether from a speed or a performance perspective, are few and far between. In this paper, we propose a deep, unified framework for suspicious action recognition that takes advantage of recent discoveries, fully leverages the power of convolutional neural networks, and strikes a balance between speed and accuracy not accounted for in most research. We carry out a performance evaluation on the KTH dataset and attain 95.4% accuracy in 200 ms of computational time, which compares favorably to other state-of-the-art methods. We also apply our framework to a video surveillance dataset and obtain 91.9% accuracy on suspicious actions in 205 ms of computational time.
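The framework's first stage, background subtraction, separates moving actors from a static scene before flow estimation and classification. As a minimal illustration (not the authors' CNN-based implementation, and with all function names hypothetical), a classical running-average background model can be sketched as:

```python
# Hedged sketch: running-average background subtraction, the classical
# form of the pipeline's first stage. Frames are small grayscale grids
# (lists of lists); real systems would operate on camera images.

def update_background(bg, frame, alpha=0.05):
    """Blend the new frame into the running background model."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]

def foreground_mask(bg, frame, thresh=30):
    """Flag pixels that deviate from the background by more than thresh."""
    return [[1 if abs(f - b) > thresh else 0 for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]

# Toy 3x3 scene: static frames, then a bright "object" enters.
static = [[10, 10, 10]] * 3
moving = [[10, 200, 10], [10, 200, 10], [10, 10, 10]]

bg = static
for _ in range(20):  # background model converges on the static scene
    bg = update_background(bg, static)

print(foreground_mask(bg, moving))  # the bright column is flagged as foreground
```

The resulting mask would then feed the motion-estimation and classification stages; deep variants replace the hand-tuned threshold with a learned, scene-specific model.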


Keywords: Suspicious action recognition · Deep learning · Convolutional neural networks · Background subtraction · Optical flow estimation · Action classification



This research was supported by JSPS Kakenhi, Grant number 16K01554.



Copyright information

© International Society of Artificial Life and Robotics (ISAROB) 2018

Authors and Affiliations

  1. Kyushu Institute of Technology, Kitakyushu, Japan
