Multi-view depth-based pairwise feature learning for person-person interaction recognition


Abstract

This paper addresses the problem of recognizing person-person interactions using multi-view data captured by depth cameras. Due to the complex spatio-temporal structure of an interaction between two persons, it is difficult to characterize different classes of person-person interactions for recognition. To handle this difficulty, we divide each person-person interaction into body part interactions, and analyze the person-person interaction using the pairwise features of these body part interactions. We first use two features to represent the relative movement and local physical contact between the body parts of two people, and extract pairwise features to characterize each body part interaction. For each camera view, we propose a regression-based learning approach with a sparsity-inducing regularizer that models each person-person interaction as a combination of pairwise features from a sparse set of body part interactions. To take full advantage of the information in all depth camera views, we further extend the proposed interaction learning model to combine features from multiple views in order to improve recognition performance. Our approach is evaluated on three public activity recognition datasets captured with depth cameras. Experimental results on all three datasets demonstrate the efficacy of the proposed method.
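To make the sparsity-inducing formulation concrete, the sketch below shows one way a group-sparse (ℓ2,1-style) regression can select a small set of body part interactions from grouped pairwise features. The group layout, the squared loss, and the proximal-gradient solver are illustrative assumptions; the paper's exact objective and optimization procedure are not reproduced here.

```python
# Illustrative sketch: group-sparse regression over grouped pairwise
# features, solved by proximal gradient with block soft-thresholding.
# Groups whose L2 norm is driven to zero correspond to body part
# interactions excluded from the model.
import numpy as np

def group_prox(w, groups, t):
    """Block soft-thresholding: shrink each group's L2 norm by t,
    zeroing groups whose norm falls below t (the sparsity mechanism)."""
    out = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        out[idx] = 0.0 if norm <= t else (1.0 - t / norm) * w[idx]
    return out

def fit_group_sparse(X, y, groups, lam=0.1, iters=500):
    """Minimize 0.5*||Xw - y||^2 + lam * sum_g ||w_g||_2."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the loss
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)
        w = group_prox(w - lr * grad, groups, lr * lam)
    return w

# Toy usage: 6 hypothetical body-part-interaction groups of 5 features each.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
groups = [range(g * 5, (g + 1) * 5) for g in range(6)]
w_true = np.zeros(30)
w_true[0:5] = 1.0
w_true[10:15] = -0.5
y = X @ w_true + 0.01 * rng.standard_normal(200)
w_hat = fit_group_sparse(X, y, groups, lam=1.0)
active = [g for g, idx in enumerate(groups) if np.linalg.norm(w_hat[idx]) > 1e-6]
print("selected body-part interactions:", active)
```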


Notes

  1. When \(\|\mathbf{w}_{m}^{\phi_{s}}\|_{2} = 0\), the objective function in (6) is not differentiable. In this case, we can regularize the \(\phi_{s}\)-th diagonal block of \(D_{m}\) as \(\frac{1}{2\sqrt{\|\mathbf{w}_{m}^{\phi_{s}}\|_{2}^{2}+\eta}}\,\mathbf{I}_{\phi_{s}}\), where \(\eta \to 0\).
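This smoothing trick is the standard device behind iteratively reweighted least squares for group-sparse penalties. The sketch below builds a block-diagonal matrix with the regularized blocks described in the note and performs one reweighted ridge-style update; the block layout, λ, and η are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of the note's smoothing trick: each diagonal block of
# the reweighting matrix is I / (2*sqrt(||w_g||_2^2 + eta)), which stays
# finite even when a group's coefficients are exactly zero.
import numpy as np

def reweighting_matrix(w, groups, eta=1e-8):
    """Diagonal entries 1/(2*sqrt(||w_g||_2^2 + eta)) repeated over group g."""
    diag = np.empty_like(w, dtype=float)
    for idx in groups:
        diag[idx] = 1.0 / (2.0 * np.sqrt(np.dot(w[idx], w[idx]) + eta))
    return np.diag(diag)

def irls_step(X, y, w, groups, lam=1.0, eta=1e-8):
    """One reweighted update: with D fixed, minimizing
    0.5*||Xw - y||^2 + lam * w^T D w has the closed-form solution below."""
    D = reweighting_matrix(w, groups, eta)
    return np.linalg.solve(X.T @ X + 2.0 * lam * D, X.T @ y)

# Toy usage: iterating the step drives whole groups of coefficients
# toward zero, matching the group-sparse behavior of the penalty.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 12))
groups = [range(g * 4, (g + 1) * 4) for g in range(3)]
y = X[:, :4] @ np.ones(4)          # only the first group is active
w = np.zeros(12)
for _ in range(50):
    w = irls_step(X, y, w, groups, lam=5.0)
print([round(float(np.linalg.norm(w[g])), 3) for g in groups])
```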


Author information

Corresponding author

Correspondence to Meng Li.


About this article


Cite this article

Li, M., Leung, H. Multi-view depth-based pairwise feature learning for person-person interaction recognition. Multimed Tools Appl 78, 5731–5749 (2019). https://doi.org/10.1007/s11042-018-5738-6
