Spatial-temporal dual-actor CNN for human interaction prediction in video

Published in Multimedia Tools and Applications

Abstract

Predicting the interaction between two humans from a partially observed video is one of the most challenging problems in computer vision, and it has a wide range of applications. This paper presents a new interaction prediction method that detects interactions with high accuracy when only a small percentage of the video has been observed. First, the interacting people are detected; then a dual-actor CNN model is used to recognize the type of interaction between them. The model consists of two CNN branches whose parameters are shared, each extracting deep spatial or temporal features. Long Short-Term Memory (LSTM) networks are trained on top of the spatial and temporal streams to model temporal information, and the two models are finally combined to predict the interaction. The results show that the proposed model yields improvements on standard interaction recognition datasets, including TV Human Interaction, BIT-Interaction, and UT-Interaction.
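To make the described architecture concrete, the sketch below gives one plausible PyTorch-style realization of the pipeline: a single CNN backbone shared between the two detected actors (Siamese weight sharing), an LSTM over the per-frame pair features, and late fusion of a spatial (RGB) stream with a temporal (optical-flow) stream. All class names, layer sizes, and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a dual-actor (Siamese) CNN + LSTM interaction
# predictor, following the pipeline described in the abstract.
# Names, sizes, and the fusion rule are assumptions for illustration.
import torch
import torch.nn as nn

class DualActorStream(nn.Module):
    """One stream (spatial RGB or temporal optical-flow).

    The CNN backbone is shared between the two actors (Siamese weight
    sharing); per-frame features of the pair are concatenated and
    modeled over time with an LSTM.
    """
    def __init__(self, in_channels=3, feat_dim=256, hidden=128, n_classes=8):
        super().__init__()
        self.backbone = nn.Sequential(          # shared between both actors
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, actor1, actor2):
        # actor1, actor2: (batch, time, channels, H, W) crops of each person
        b, t = actor1.shape[:2]
        f1 = self.backbone(actor1.flatten(0, 1)).view(b, t, -1)
        f2 = self.backbone(actor2.flatten(0, 1)).view(b, t, -1)  # same weights
        seq, _ = self.lstm(torch.cat([f1, f2], dim=-1))
        return self.head(seq[:, -1])            # predict from the last time step

# Late fusion of the spatial (RGB) and temporal (optical-flow) streams:
spatial = DualActorStream(in_channels=3)
temporal = DualActorStream(in_channels=2)       # 2-channel flow fields (dx, dy)

def predict(rgb1, rgb2, flow1, flow2):
    return spatial(rgb1, rgb2) + temporal(flow1, flow2)  # summed logits
```

Sharing one backbone across the two actors halves the number of CNN parameters and embeds both people in the same feature space, which is the usual motivation for Siamese designs in pairwise recognition tasks.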

Author information

Corresponding author

Correspondence to Hassan Khotanlou.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Afrasiabi, M., Khotanlou, H. & Gevers, T. Spatial-temporal dual-actor CNN for human interaction prediction in video. Multimed Tools Appl 79, 20019–20038 (2020). https://doi.org/10.1007/s11042-020-08845-2

