Video spatiotemporal mapping for human action recognition by convolutional neural network

  • Amin Zare
  • Hamid Abrishami Moghaddam
  • Arash Sharifi
Theoretical advances


This paper presents a 2D representation of a video clip called the video spatiotemporal map (VSTM). The VSTM is a compact representation that captures both the spatial and temporal properties of a clip. It is created by vertically concatenating feature vectors generated from successive frames. The feature vector for each frame is generated by applying a wavelet transform to that frame (or to its difference from the subsequent frame) and computing the vertical and horizontal projections of the quantized coefficients of selected wavelet subbands. The VSTM enables convolutional neural networks (CNNs) to process a video clip for human action recognition (HAR). The proposed approach benefits from the power of CNNs to analyze visual patterns and attempts to overcome CNN challenges such as variable video length and the lack of training data, which leads to over-fitting. The VSTM presents a sequence of frames to a CNN without imposing any additional computational cost on the CNN learning algorithm. Experimental results on the KTH, Weizmann, and UCF Sports HAR benchmark datasets show that the proposed method outperforms state-of-the-art CNN-based HAR methods.
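The construction described in the abstract can be illustrated with a minimal sketch. The abstract does not name the wavelet, the decomposition level, the subbands, or the quantizer, so everything below is an assumption made for illustration: a one-level Haar transform, the two mixed subbands (labelled `lh` and `hl` here; naming conventions vary), uniform quantization of coefficient magnitudes with a hypothetical `step` parameter, and hypothetical function names. It uses plain Python lists to stay dependency-free and is not the authors' implementation.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel values


def haar_level1(frame: Frame):
    """One-level 2D Haar transform (assumed wavelet); returns (ll, lh, hl, hh)."""
    # Row pass: pairwise average (low-pass) and half-difference (high-pass).
    lo = [[(r[i] + r[i + 1]) / 2 for i in range(0, len(r) - 1, 2)] for r in frame]
    hi = [[(r[i] - r[i + 1]) / 2 for i in range(0, len(r) - 1, 2)] for r in frame]

    def column_pass(m):
        pairs = range(0, len(m) - 1, 2)
        avg = [[(m[j][c] + m[j + 1][c]) / 2 for c in range(len(m[0]))] for j in pairs]
        dif = [[(m[j][c] - m[j + 1][c]) / 2 for c in range(len(m[0]))] for j in pairs]
        return avg, dif

    ll, lh = column_pass(lo)
    hl, hh = column_pass(hi)
    return ll, lh, hl, hh


def frame_feature(frame: Frame, step: float = 8.0) -> List[int]:
    """Feature vector: projections of quantized coefficients of two subbands."""
    _, lh, hl, _ = haar_level1(frame)  # subband choice is an assumption
    feature: List[int] = []
    for band in (lh, hl):
        # Uniform quantization of coefficient magnitudes (assumed quantizer).
        q = [[int(abs(v) // step) for v in row] for row in band]
        feature += [sum(row) for row in q]        # horizontal projection (per row)
        feature += [sum(col) for col in zip(*q)]  # vertical projection (per column)
    return feature


def build_vstm(frames: List[Frame], use_difference: bool = True,
               step: float = 8.0) -> List[List[int]]:
    """Stack per-frame feature vectors vertically: one VSTM row per input."""
    if use_difference:
        # Difference between each frame and the one that follows it.
        inputs = [[[b - a for a, b in zip(row0, row1)]
                   for row0, row1 in zip(f0, f1)]
                  for f0, f1 in zip(frames, frames[1:])]
    else:
        inputs = frames
    return [frame_feature(f, step) for f in inputs]
```

In this sketch a clip of T frames yields a map with T rows (or T − 1 when frame differences are used), and the row width depends only on the frame size, so the resulting 2D array can be passed to an ordinary image CNN. How the paper normalizes the map for clips of different lengths is not stated in the abstract and is left out here.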


Keywords: Video spatiotemporal mapping; Convolutional neural network; Data augmentation; Human action recognition




Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  • Amin Zare (1)
  • Hamid Abrishami Moghaddam (2, corresponding author)
  • Arash Sharifi (1)
  1. Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
  2. Faculty of Electrical and Computer Engineering, K.N. Toosi University of Technology, Tehran, Iran
