A New Hybrid Architecture for Human Activity Recognition from RGB-D Videos

  • Srijan Das
  • Monique Thonnat
  • Kaustubh Sakhalkar
  • Michal Koperski
  • Francois Bremond
  • Gianpiero Francesca
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)


Activity recognition from RGB-D videos remains an open problem because of the large variety of actions. In this work, we propose a new architecture that combines a high-level handcrafted strategy with machine learning techniques. To address the large variety of actions, we propose a novel two-level fusion strategy that combines features from different cues. Because similar actions are common in daily living activities, we also propose a mechanism for discriminating between similar actions. We validate our approach on four public datasets, CAD-60, CAD-120, MSRDailyActivity3D, and NTU-RGB+D, improving the state-of-the-art results on each of them.
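The abstract does not specify how the two fusion levels are built, so the following is only a minimal sketch of one common reading of "two-level fusion": a feature-level (early) fusion of two cues, followed by a score-level (late) fusion with a third cue. The cue names (`appearance`, `motion`, `scores_temporal`), the random stand-in classifier `W`, and the mixing weight `alpha` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def early_fusion(cue_a, cue_b):
    """Level 1 (assumed): L2-normalise each cue, then concatenate features."""
    def l2_normalise(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    return np.concatenate([l2_normalise(cue_a), l2_normalise(cue_b)], axis=1)


def late_fusion(scores_a, scores_b, alpha=0.5):
    """Level 2 (assumed): convex combination of per-class scores."""
    return alpha * scores_a + (1.0 - alpha) * scores_b


# Synthetic data standing in for per-video features (hypothetical sizes).
rng = np.random.default_rng(0)
n_videos, n_classes = 4, 3
appearance = rng.normal(size=(n_videos, 8))
motion = rng.normal(size=(n_videos, 5))

# Level 1: fuse appearance and motion features, then score them with a
# random linear classifier that stands in for a trained model.
fused = early_fusion(appearance, motion)
W = rng.normal(size=(fused.shape[1], n_classes))
scores_fused = softmax(fused @ W)

# Scores from a third, separately modelled cue (random stand-in).
scores_temporal = softmax(rng.normal(size=(n_videos, n_classes)))

# Level 2: fuse the two score matrices and pick the most likely class.
final = late_fusion(scores_fused, scores_temporal, alpha=0.6)
pred = final.argmax(axis=1)
```

Because both inputs to `late_fusion` are probability rows and the weights sum to one, each row of `final` is again a probability distribution over classes; in practice `alpha` would be tuned on validation data.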


Keywords: Activity recognition, RGB-D videos, Data fusion



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Srijan Das (1)
  • Monique Thonnat (1)
  • Kaustubh Sakhalkar (1)
  • Michal Koperski (1)
  • Francois Bremond (1)
  • Gianpiero Francesca (2)

  1. Inria, Sophia Antipolis, Valbonne, France
  2. Toyota Motor Europe, Zaventem, Belgium
