Recognizing Human Activities in Videos Using Improved Dense Trajectories over LSTM

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 841)


We propose a deep-learning-based technique for classifying actions using Long Short-Term Memory (LSTM) networks. The proposed scheme first learns spatio-temporal features from the video using a 3D extension of Convolutional Neural Networks (CNNs). A Recurrent Neural Network (RNN) is then trained to classify each sequence by considering the temporal evolution of the learned features at each time step. Experimental results on the CMU MoCap, UCF 101, and Hollywood 2 datasets show the efficacy of the proposed approach. We extend the proposed framework with an efficient motion feature so that it can handle significant camera motion. The proposed approach outperforms the existing deep models on each dataset.
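The recurrent classification stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-time-step feature vectors are random placeholders standing in for 3D-CNN outputs, the weights are untrained, and every name here (`init_params`, `lstm_step`, `classify_sequence`) and dimension is hypothetical. Classifying from the final hidden state is also an assumption; the abstract does not say how the per-step outputs are combined.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(feat_dim, hidden_dim, num_classes, seed=0):
    """Random (untrained) weights for one LSTM layer plus a linear classifier."""
    rng = random.Random(seed)
    params = {}
    for gate in ("i", "f", "o", "g"):  # input, forget, output gates; candidate
        params["W_" + gate] = rand_matrix(hidden_dim, feat_dim + hidden_dim, rng)
        params["b_" + gate] = [0.0] * hidden_dim
    params["W_out"] = rand_matrix(num_classes, hidden_dim, rng)
    params["b_out"] = [0.0] * num_classes
    return params


def gate(params, name, z, activation):
    pre = matvec(params["W_" + name], z)
    return [activation(a + b) for a, b in zip(pre, params["b_" + name])]


def lstm_step(params, x, h, c):
    """One standard LSTM update for feature vector x, given state (h, c)."""
    z = x + h  # concatenate the input with the previous hidden state
    i = gate(params, "i", z, sigmoid)
    f = gate(params, "f", z, sigmoid)
    o = gate(params, "o", z, sigmoid)
    g = gate(params, "g", z, math.tanh)
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_sequence(params, features):
    """Run the LSTM over all time steps and classify from the last hidden state."""
    hidden_dim = len(params["b_i"])
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in features:
        h, c = lstm_step(params, x, h, c)
    scores = [a + b for a, b in zip(matvec(params["W_out"], h), params["b_out"])]
    return softmax(scores)


# Placeholder input: 5 time steps of 8-dimensional features (assumed sizes).
rng = random.Random(1)
clip_features = [[rng.random() for _ in range(8)] for _ in range(5)]
params = init_params(feat_dim=8, hidden_dim=4, num_classes=3)
probs = classify_sequence(params, clip_features)
```

Here `probs` holds one probability per action class. In a trained system the weights would be learned by backpropagation through time, and the placeholder features would come from the 3D convolutional stage.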


Dense trajectories, LSTM, CNN, RNN, Human activities



The authors wish to acknowledge the generous financial support provided by the Science and Engineering Research Board (SERB) of the Department of Science and Technology (DST), the Government of India, for conducting this research work. The financial support was provided through the project numbered ECR/2016/000652.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Indian Institute of Information Technology SriCity, Sri City, India
