Computational Visual Media, Volume 2, Issue 1, pp 97–106

Saliency guided local and global descriptors for effective action recognition

Open Access
Research Article

Abstract

This paper presents a novel framework for human action recognition based on salient object detection and a new combination of local and global descriptors. We first detect salient objects in video frames and extract features only from those objects, using a simple strategy to identify and process only the frames that contain salient objects. Processing salient objects instead of entire frames not only makes the algorithm more efficient but, more importantly, also suppresses interference from background pixels. From the selected regions we extract a local descriptor, 3D-SIFT, and a global descriptor, histograms of oriented optical flow (HOOF). The resulting saliency guided 3D-SIFT–HOOF (SGSH) feature is used with a multi-class support vector machine (SVM) classifier for human action recognition. Experiments on the standard KTH and UCF-Sports action benchmarks show that our method outperforms competing state-of-the-art spatiotemporal feature-based human action recognition methods.
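To make the pipeline concrete, below is a minimal NumPy sketch of two of the stages the abstract describes: selecting frames by salient-pixel content and computing a HOOF-style global descriptor. The helper names, the 1% frame-selection threshold, and the 30-bin histogram are illustrative assumptions, not the paper's settings; the salient-object detector, the 3D-SIFT local descriptor, and the BoVW encoding are stood in by placeholders, and the histogram here omits the vertical-axis folding used in the original HOOF formulation for mirror invariance.

```python
# Minimal sketch of the SGSH-style pipeline described in the abstract.
# Only the frame-selection rule and a simplified HOOF descriptor are
# spelled out; saliency detection, 3D-SIFT, and BoVW are stand-ins.
import numpy as np

def hoof(flow_u, flow_v, n_bins=30, eps=1e-8):
    """Magnitude-weighted histogram of optical-flow orientations.

    Simplified HOOF: the original formulation additionally folds
    orientations about the vertical axis for left/right mirror invariance.
    """
    mag = np.hypot(flow_u, flow_v)
    ang = np.arctan2(flow_v, flow_u)  # orientations in [-pi, pi]
    bins = np.floor((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + eps)  # L1-normalise to a distribution

def select_salient_frames(saliency_masks, min_fraction=0.01):
    """Keep frames whose salient-pixel fraction exceeds a threshold.

    `saliency_masks` is a (T, H, W) boolean array from any salient-object
    detector; the threshold is an assumed stand-in for the paper's rule.
    """
    fractions = saliency_masks.reshape(len(saliency_masks), -1).mean(axis=1)
    return np.flatnonzero(fractions >= min_fraction)

# Toy demo with random flow fields and masks, standing in for a real video.
rng = np.random.default_rng(0)
T, H, W = 8, 64, 64
flow = rng.normal(size=(T, H, W, 2))        # dense optical flow per frame
masks = rng.random((T, H, W)) < 0.02        # fake salient-object masks

keep = select_salient_frames(masks)
descriptors = [hoof(flow[t, ..., 0], flow[t, ..., 1]) for t in keep]
# In the full method, each frame's HOOF (global) would be concatenated
# with BoVW-encoded 3D-SIFT (local) features before SVM training.
print(len(keep), "frames kept;", descriptors[0].shape[0], "HOOF bins each")
```

L1-normalising the histogram turns it into a distribution over flow orientations, so descriptors remain comparable across frames with very different overall motion energy.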

Keywords

action recognition · saliency detection · local and global descriptors · bag of visual words (BoVWs) · classification


Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. School of Computer Science and Informatics, Cardiff University, Cardiff, UK
  2. Department of Computer Science, School of Science, Kerbala University, Kerbala, Iraq
