Abstract
This work investigates several ways to exploit scene depth information, implicitly available through stereoscopic disparity in 3D videos, in order to improve performance in recognizing complex human activities in natural settings. The standard state-of-the-art activity recognition pipeline consists of the consecutive stages of video description, video representation, and video classification. Multimodal, depth-aware modifications to standard methods are proposed and studied, both at the video description level and at the video representation level, which indirectly incorporate scene geometry information derived from stereo disparity. At the description level, this is achieved by suitably manipulating video interest points based on disparity data. At the representation level, each video is represented by multiple vectors corresponding to different disparity zones, yielding multiple activity descriptions defined by disparity characteristics. In both cases, an implicit scene segmentation is thus performed, based on the distance of each imaged object from the camera during video acquisition. The investigated approaches are flexible and can cooperate with any monocular low-level feature descriptor. They are evaluated on a publicly available activity recognition dataset of unconstrained stereoscopic 3D videos, consisting of excerpts from Hollywood movies, and are compared both against competing depth-aware approaches and against a state-of-the-art monocular algorithm. Quantitative evaluation reveals that some of the examined approaches achieve state-of-the-art performance.
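The representation-level idea described above — splitting a video's local features into disparity zones and building one representation vector per zone — can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the quantile-based zone boundaries, and the hard nearest-word assignment are illustrative assumptions for a generic bag-of-words setup:

```python
import numpy as np

def disparity_zone_bow(descriptors, disparities, codebook, n_zones=3):
    """Illustrative sketch: partition local video descriptors into
    disparity (depth) zones and build one bag-of-words histogram per
    zone, then concatenate the per-zone histograms.

    descriptors : (N, D) array of local feature descriptors
    disparities : (N,) disparity value at each interest point
    codebook    : (K, D) array of visual-word centers
    """
    # Zone boundaries from disparity quantiles (near/mid/far for n_zones=3);
    # the paper's actual zone definition may differ.
    edges = np.quantile(disparities, np.linspace(0.0, 1.0, n_zones + 1))
    zone_of = np.digitize(disparities, edges[1:-1])  # values in 0..n_zones-1

    # Hard-assign each descriptor to its nearest visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)

    K = codebook.shape[0]
    hist = np.zeros(n_zones * K)
    for z, w in zip(zone_of, words):
        hist[z * K + w] += 1

    # L1-normalize each zone's sub-histogram independently, so zones with
    # few interest points are not dominated by denser ones.
    for z in range(n_zones):
        s = hist[z * K:(z + 1) * K].sum()
        if s > 0:
            hist[z * K:(z + 1) * K] /= s
    return hist
```

The resulting concatenated vector carries one activity description per disparity zone, so a classifier can weigh near-camera (typically actor) features differently from far-away (typically background) ones.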
Acknowledgment
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 287674 (3DTVS). This publication reflects only the author’s views. The European Union is not liable for any use that may be made of the information contained therein.
Mademlis, I., Iosifidis, A., Tefas, A. et al. Exploiting stereoscopic disparity for augmenting human activity recognition performance. Multimed Tools Appl 75, 11641–11660 (2016). https://doi.org/10.1007/s11042-015-2719-x