Multimedia Tools and Applications, Volume 77, Issue 20, pp 26901–26918

Human action recognition based on quaternion spatial-temporal convolutional neural network and LSTM in RGB videos

  • Bo Meng
  • XueJun Liu
  • Xiaolin Wang


Convolutional neural networks (CNNs) are the state-of-the-art method for action recognition on various kinds of datasets. However, most existing CNN models are based on lower-level handcrafted features from gray or RGB image sequences from small datasets, and generalize poorly to various realistic scenarios. Therefore, we propose a new deep learning network for action recognition, called QST-CNN-LSTM, that integrates a quaternion spatial-temporal convolutional neural network (QST-CNN) with a Long Short-Term Memory (LSTM) network. Unlike a traditional CNN, a QST-CNN takes a quaternion expression of an RGB image as input, and the values of the red, green, and blue channels are considered simultaneously as a whole in a spatial convolutional layer, avoiding the loss of spatial features. Because the raw images in video datasets are large and have background redundancy, we pre-extract key motion regions from RGB videos using an improved codebook algorithm. Furthermore, the QST-CNN is combined with an LSTM to capture the dependencies between different video clips. Experiments demonstrate that QST-CNN-LSTM improves recognition rates on the Weizmann, UCF Sports, and UCF11 datasets.
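The idea of treating the three color channels as a single quantity can be illustrated with a minimal sketch. It assumes the commonly used pure-quaternion encoding of a color pixel (zero real part, with R, G, and B as the i, j, and k components) and the standard Hamilton product. The abstract does not give the network's exact formulation, so the function names below are hypothetical and this is not the authors' implementation.

```python
from typing import Tuple

# A quaternion stored as (real, i, j, k).
Quaternion = Tuple[float, float, float, float]


def rgb_to_quaternion(r: float, g: float, b: float) -> Quaternion:
    """Encode one RGB pixel as a pure quaternion 0 + r*i + g*j + b*k.

    Assumption: this is the usual pure-quaternion color encoding,
    not necessarily the exact one used in the paper.
    """
    return (0.0, float(r), float(g), float(b))


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Multiply two quaternions with the standard Hamilton product."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


# One hypothetical quaternion filter weight applied to one pixel.
pixel = rgb_to_quaternion(255, 128, 0)
weight = (0.5, 0.1, 0.0, 0.0)
response = hamilton_product(weight, pixel)
```

Each component of `response` depends on more than one color channel of `pixel`, which is the sense in which a quaternion-valued filter handles the channels jointly rather than one at a time.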


Keywords: Human action recognition, Convolutional neural network, Quaternion, Long short-term memory network, Codebook



This work was supported by the National Natural Science Foundation of China (61602108), the Jilin Science and Technology Innovation Developing Scheme (20166016), and the Electric Power Intelligent Robot Collaborative Innovation Group.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Information Engineering, Northeast Electric Power University, Jilin, China
