Abstract
Predicting the interaction between two people in a video is a challenging problem in computer vision with many applications. This paper presents a new interaction prediction method that achieves high accuracy when only a small percentage of the video has been observed. First, the interacting people are detected; a dual-actor CNN model then recognizes the type of interaction between them. The model consists of two CNNs whose parameters are shared. Each branch of the model extracts deep temporal or spatial features. The spatial and temporal models are trained with Long Short-Term Memory (LSTM) networks to capture temporal information. Finally, the spatial and temporal models are combined to predict the interaction. The results show that the proposed model yields improvements on standard interaction recognition datasets, including TV Human Interaction, BIT Interaction, and UT-Interaction.
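The two structural ideas in the abstract, a branch whose parameters are shared across both actors and a final combination of the spatial and temporal streams, can be illustrated with a toy sketch. This is not the authors' implementation: `SharedBranch` is a hypothetical stand-in that replaces each CNN with a single linear map, the LSTM stage is omitted entirely, and the fusion rule (a convex combination of class probabilities with weight `alpha`) is an assumption, since the abstract does not say how the streams are combined.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SharedBranch:
    """Hypothetical stand-in for one CNN branch (a single linear map)."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                        for _ in range(out_dim)]

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


def dual_actor_features(branch, actor_a, actor_b):
    """Apply the same branch (shared parameters) to both actors and concatenate."""
    return branch(actor_a) + branch(actor_b)


def late_fusion(spatial_probs, temporal_probs, alpha=0.5):
    """Assumed fusion rule: convex combination of the two streams' probabilities."""
    return [alpha * s + (1.0 - alpha) * t
            for s, t in zip(spatial_probs, temporal_probs)]


# Toy inputs: 4-dimensional descriptors for each actor, 3 interaction classes.
spatial_branch = SharedBranch(in_dim=4, out_dim=3, seed=1)
temporal_branch = SharedBranch(in_dim=4, out_dim=3, seed=2)
actor_a = [0.2, 0.5, 0.1, 0.9]
actor_b = [0.7, 0.3, 0.8, 0.4]

spatial_feat = dual_actor_features(spatial_branch, actor_a, actor_b)
temporal_feat = dual_actor_features(temporal_branch, actor_a, actor_b)

# Sum the two actors' class scores per stream, then normalise and fuse.
spatial_probs = softmax([a + b for a, b in zip(spatial_feat[:3], spatial_feat[3:])])
temporal_probs = softmax([a + b for a, b in zip(temporal_feat[:3], temporal_feat[3:])])
fused = late_fusion(spatial_probs, temporal_probs)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because both actors pass through the same `SharedBranch` object, swapping the two actors only swaps the two halves of the concatenated feature vector. That symmetry is the practical effect of parameter sharing.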
Cite this article
Afrasiabi, M., Khotanlou, H. & Gevers, T. Spatial-temporal dual-actor CNN for human interaction prediction in video. Multimed Tools Appl 79, 20019–20038 (2020). https://doi.org/10.1007/s11042-020-08845-2
DOI: https://doi.org/10.1007/s11042-020-08845-2