
Exploiting stereoscopic disparity for augmenting human activity recognition performance

Published in: Multimedia Tools and Applications

Abstract

This work investigates several ways to exploit scene depth information, implicitly available through the modality of stereoscopic disparity in 3D videos, with the purpose of augmenting performance in the problem of recognizing complex human activities in natural settings. The standard state-of-the-art activity recognition pipeline consists of three consecutive stages: video description, video representation and video classification. Multimodal, depth-aware modifications to standard methods are proposed and studied, both at the video description and at the video representation stage, that indirectly incorporate scene geometry information derived from stereo disparity. At the description level, this is achieved by suitably manipulating video interest points based on disparity data. At the representation level, each video is represented by multiple vectors corresponding to different disparity zones, resulting in multiple activity descriptions defined by disparity characteristics. In both cases, a scene segmentation is thus implicitly implemented, based on the distance of each imaged object from the camera during video acquisition. The investigated approaches are flexible and can cooperate with any monocular low-level feature descriptor. They are evaluated on a publicly available activity recognition dataset of unconstrained stereoscopic 3D videos, consisting of excerpts from Hollywood movies, and compared both against competing depth-aware approaches and against a state-of-the-art monocular algorithm. Quantitative evaluation reveals that some of the examined approaches achieve state-of-the-art performance.
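To make the representation-level idea concrete, the sketch below shows how per-disparity-zone bag-of-words histograms might be assembled: each interest point is quantized against a visual codebook and binned into a disparity zone, yielding one representation vector per zone. This is a minimal illustration, not the authors' implementation; the function name zone_histograms, the variable zone_edges, and the choice of hard vector quantization are all assumptions made for clarity.

```python
import numpy as np

def zone_histograms(descriptors, disparities, codebook, zone_edges):
    """Hypothetical sketch: one bag-of-words histogram per disparity zone.

    descriptors : (N, D) local space-time descriptors (e.g., around interest points)
    disparities : (N,)   stereo disparity sampled at each interest point
    codebook    : (K, D) visual vocabulary (e.g., learned via k-means)
    zone_edges  : sorted disparity boundaries splitting the range into zones
    """
    # Assign each descriptor to its nearest codeword (hard vector quantization).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)

    # Assign each interest point to a disparity zone: this is the implicit
    # depth-based scene segmentation described in the abstract.
    zones = np.digitize(disparities, zone_edges)

    # Accumulate one histogram per zone; the set of per-zone vectors forms the
    # multi-vector, depth-aware video representation.
    n_zones, K = len(zone_edges) + 1, codebook.shape[0]
    hists = np.zeros((n_zones, K))
    for w, z in zip(words, zones):
        hists[z, w] += 1

    # L1-normalize each zone histogram, guarding against empty zones.
    norms = hists.sum(axis=1, keepdims=True)
    return hists / np.maximum(norms, 1)
```

Because the zoning operates on interest points rather than on any particular descriptor, the same scheme can in principle wrap around any monocular low-level feature descriptor, as the abstract notes.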




Acknowledgment

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 287674 (3DTVS). This publication reflects only the authors' views. The European Union is not liable for any use that may be made of the information contained therein.

Author information

Correspondence to Ioannis Mademlis.


About this article


Cite this article

Mademlis, I., Iosifidis, A., Tefas, A. et al. Exploiting stereoscopic disparity for augmenting human activity recognition performance. Multimed Tools Appl 75, 11641–11660 (2016). https://doi.org/10.1007/s11042-015-2719-x

