Multi-view depth-based pairwise feature learning for person-person interaction recognition


Abstract

This paper addresses the problem of recognizing person-person interactions using multi-view data captured by depth cameras. Due to the complex spatio-temporal structure of an interaction between two persons, it is difficult to characterize different classes of person-person interactions for recognition. To handle this difficulty, we divide each person-person interaction into body part interactions, and analyze the person-person interaction using the pairwise features of these body part interactions. We first use two features to represent the relative movement and local physical contact between the body parts of two people, and extract pairwise features to characterize each body part interaction. For each camera view, we propose a regression-based learning approach with a sparsity-inducing regularizer that models each person-person interaction as a combination of pairwise features from a sparse set of body part interactions. To take full advantage of the information in all depth camera views, we further extend the proposed interaction learning model to combine features from multiple views in order to improve recognition performance. Our approach is evaluated on three public activity recognition datasets captured with depth cameras. Experimental results on all three datasets demonstrate the efficacy of the proposed method.
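To make the sparsity-inducing formulation concrete, the sketch below shows one way a group-sparse (ℓ2,1-style) regression can select a small set of body part interactions from grouped pairwise features. The group layout, the squared loss, and the proximal-gradient solver are illustrative assumptions; the paper's exact objective and optimization procedure are not reproduced here.

```python
# Illustrative sketch: group-sparse regression over grouped pairwise
# features, solved by proximal gradient with block soft-thresholding.
# Groups whose L2 norm is driven to zero correspond to body part
# interactions excluded from the model.
import numpy as np

def group_prox(w, groups, t):
    """Block soft-thresholding: shrink each group's L2 norm by t,
    zeroing groups whose norm falls below t (the sparsity mechanism)."""
    out = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        out[idx] = 0.0 if norm <= t else (1.0 - t / norm) * w[idx]
    return out

def fit_group_sparse(X, y, groups, lam=0.1, iters=500):
    """Minimize 0.5*||Xw - y||^2 + lam * sum_g ||w_g||_2."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the loss
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)
        w = group_prox(w - lr * grad, groups, lr * lam)
    return w

# Toy usage: 6 hypothetical body-part-interaction groups of 5 features each.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
groups = [range(g * 5, (g + 1) * 5) for g in range(6)]
w_true = np.zeros(30)
w_true[0:5] = 1.0
w_true[10:15] = -0.5
y = X @ w_true + 0.01 * rng.standard_normal(200)
w_hat = fit_group_sparse(X, y, groups, lam=1.0)
active = [g for g, idx in enumerate(groups) if np.linalg.norm(w_hat[idx]) > 1e-6]
print("selected body-part interactions:", active)
```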


Notes

  1. When \(\|\mathbf{w}_{m}^{\phi_{s}}\|_{2} = 0\), the objective function in (6) is not differentiable. In this case, we can regularize the \(\phi_{s}\)-th diagonal block of \(D_{m}\) as \(\frac{1}{2\sqrt{\|\mathbf{w}_{m}^{\phi_{s}}\|_{2}^{2}+\eta}}\,\mathbf{I}_{\phi_{s}}\), where \(\eta \to 0\).
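This smoothing trick is the standard device behind iteratively reweighted least squares for group-sparse penalties. The sketch below builds a block-diagonal matrix with the regularized blocks described in the note and performs one reweighted ridge-style update; the block layout, λ, and η are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of the note's smoothing trick: each diagonal block of
# the reweighting matrix is I / (2*sqrt(||w_g||_2^2 + eta)), which stays
# finite even when a group's coefficients are exactly zero.
import numpy as np

def reweighting_matrix(w, groups, eta=1e-8):
    """Diagonal entries 1/(2*sqrt(||w_g||_2^2 + eta)) repeated over group g."""
    diag = np.empty_like(w, dtype=float)
    for idx in groups:
        diag[idx] = 1.0 / (2.0 * np.sqrt(np.dot(w[idx], w[idx]) + eta))
    return np.diag(diag)

def irls_step(X, y, w, groups, lam=1.0, eta=1e-8):
    """One reweighted update: with D fixed, minimizing
    0.5*||Xw - y||^2 + lam * w^T D w has the closed-form solution below."""
    D = reweighting_matrix(w, groups, eta)
    return np.linalg.solve(X.T @ X + 2.0 * lam * D, X.T @ y)

# Toy usage: iterating the step drives whole groups of coefficients
# toward zero, matching the group-sparse behavior of the penalty.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 12))
groups = [range(g * 4, (g + 1) * 4) for g in range(3)]
y = X[:, :4] @ np.ones(4)          # only the first group is active
w = np.zeros(12)
for _ in range(50):
    w = irls_step(X, y, w, groups, lam=5.0)
print([round(float(np.linalg.norm(w[g])), 3) for g in groups])
```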


Author information

Corresponding author

Correspondence to Meng Li.


About this article


Cite this article

Li, M., Leung, H. Multi-view depth-based pairwise feature learning for person-person interaction recognition. Multimed Tools Appl 78, 5731–5749 (2019). https://doi.org/10.1007/s11042-018-5738-6
