Spatiotemporal neural networks for action recognition based on joint loss

  • Chao Jing
  • Ping Wei
  • Hongbin Sun
  • Nanning Zheng
Emerging Trends of Applied Neural Computation - E_TRAINCO

Abstract

Action recognition is an important and challenging problem in many significant fields, such as intelligent robotics and video surveillance. In recent years, deep learning and neural network techniques have been widely applied to action recognition and have achieved remarkable results. However, it remains difficult to recognize actions in complicated scenes with varying illumination conditions, similar motions, and background noise. In this paper, we present a spatiotemporal neural network model with a joint loss to recognize human actions in videos. The network comprises two connected substructures. The first is a two-stream network that extracts optical-flow and appearance features from each video frame, characterizing the human actions in the spatial dimension. The second is a group of Long Short-Term Memory (LSTM) structures following the spatial network, which captures the temporal and transition information in videos. We also present a joint loss function for training the spatiotemporal model; introducing this loss function improves action recognition performance. The proposed method was tested on video samples from two challenging datasets, and the experiments demonstrate that our approach outperforms the baseline comparison methods.
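The abstract describes the pipeline only at a high level: per-frame spatial features feed an LSTM over time, and training uses a joint loss. As a purely illustrative aid, the sketch below shows one plausible shape of that pipeline in NumPy. All dimensions, the LSTM details, and the exact loss composition (cross-entropy plus a center-style pulling term, weighted by `lam`) are assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, H, C = 8, 16, 32, 5          # frames, feature dim, hidden dim, classes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_seq, Wx, Wh, b):
    """Minimal LSTM over a sequence of per-frame spatial features."""
    h, c = np.zeros(H), np.zeros(H)
    for x in x_seq:
        z = Wx @ x + Wh @ h + b                     # stacked gates: i, f, o, g
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c = f * c + i * g                           # cell state update
        h = o * np.tanh(c)                          # hidden state
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_loss(h, W_cls, label, centers, lam=0.1):
    """Cross-entropy on class scores plus a center-style term
    pulling the video-level feature toward its class center."""
    probs = softmax(W_cls @ h)
    ce = -np.log(probs[label] + 1e-12)
    center = 0.5 * np.sum((h - centers[label]) ** 2)
    return ce + lam * center

# Stand-in per-frame features (imagine appearance + optical flow concatenated).
frames = rng.standard_normal((T, D))
Wx = rng.standard_normal((4 * H, D)) * 0.1
Wh = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
W_cls = rng.standard_normal((C, H)) * 0.1
centers = rng.standard_normal((C, H)) * 0.1

h = lstm_forward(frames, Wx, Wh, b)                 # temporal aggregation
loss = joint_loss(h, W_cls, label=2, centers=centers)
print(float(loss) > 0.0)
```

Both loss terms are non-negative, so the joint loss is always positive here; in training, gradients of the sum would update the network and the class centers together.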

Keywords

Action recognition · Spatiotemporal architecture · LSTM · Joint loss

Acknowledgements

This work was supported by National Natural Science Foundation of China Grants Nos. 61876149 and 61790563, China Postdoctoral Science Foundation Grant 2018M643657, and the Fundamental Research Funds for the Central Universities (xzy012019035).

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Xi’an Jiaotong University, Xi’an, China