Spatial-temporal dual-actor CNN for human interaction prediction in video

Published in Multimedia Tools and Applications

Abstract

Predicting the interaction between two humans from a partially observed video is one of the most challenging problems in computer vision, and it has a wide range of applications. This paper presents a new interaction prediction method that detects interactions with high accuracy when only a small percentage of the video has been observed. First, the interacting people are detected; then a dual-actor CNN model is used to recognize the type of interaction between them. The model consists of two CNN branches whose parameters are shared, each extracting deep spatial or temporal features. Long Short-Term Memory (LSTM) networks are trained on top of the spatial and temporal streams to model temporal information, and the two models are finally combined to predict the interaction. The results show that the proposed model yields improvements on standard interaction recognition datasets, including TV Human Interaction, BIT-Interaction, and UT-Interaction.
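To make the described architecture concrete, the sketch below gives one plausible PyTorch-style realization of the pipeline: a single CNN backbone shared between the two detected actors (Siamese weight sharing), an LSTM over the per-frame pair features, and late fusion of a spatial (RGB) stream with a temporal (optical-flow) stream. All class names, layer sizes, and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a dual-actor (Siamese) CNN + LSTM interaction
# predictor, following the pipeline described in the abstract.
# Names, sizes, and the fusion rule are assumptions for illustration.
import torch
import torch.nn as nn

class DualActorStream(nn.Module):
    """One stream (spatial RGB or temporal optical-flow).

    The CNN backbone is shared between the two actors (Siamese weight
    sharing); per-frame features of the pair are concatenated and
    modeled over time with an LSTM.
    """
    def __init__(self, in_channels=3, feat_dim=256, hidden=128, n_classes=8):
        super().__init__()
        self.backbone = nn.Sequential(          # shared between both actors
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, actor1, actor2):
        # actor1, actor2: (batch, time, channels, H, W) crops of each person
        b, t = actor1.shape[:2]
        f1 = self.backbone(actor1.flatten(0, 1)).view(b, t, -1)
        f2 = self.backbone(actor2.flatten(0, 1)).view(b, t, -1)  # same weights
        seq, _ = self.lstm(torch.cat([f1, f2], dim=-1))
        return self.head(seq[:, -1])            # predict from the last time step

# Late fusion of the spatial (RGB) and temporal (optical-flow) streams:
spatial = DualActorStream(in_channels=3)
temporal = DualActorStream(in_channels=2)       # 2-channel flow fields (dx, dy)

def predict(rgb1, rgb2, flow1, flow2):
    return spatial(rgb1, rgb2) + temporal(flow1, flow2)  # summed logits
```

Sharing one backbone across the two actors halves the number of CNN parameters and embeds both people in the same feature space, which is the usual motivation for Siamese designs in pairwise recognition tasks.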

Author information

Corresponding author

Correspondence to Hassan Khotanlou.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Afrasiabi, M., Khotanlou, H. & Gevers, T. Spatial-temporal dual-actor CNN for human interaction prediction in video. Multimed Tools Appl 79, 20019–20038 (2020). https://doi.org/10.1007/s11042-020-08845-2

