Abstract
We introduce a novel spatio-temporal deformable part model for the offline detection of fine-grained interactions in video. A key novelty of the model is that the part detectors for both interacting individuals are combined in a single graph that can contain different combinations of feature descriptors. This allows us to use both body pose and movement to model the coordination between two people in space and time. We evaluate the performance of our approach on novel and existing interaction datasets. When testing only on the target class, we achieve a mean average precision of 0.82. When presented with distractor classes, the additional modelling of the motion of specific body parts significantly reduces the number of confusions. Cross-dataset tests demonstrate that our trained models generalize well to other settings.
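The evaluation metric reported above is mean average precision. As a minimal sketch of how average precision is typically computed from ranked detections (the function name and the interpolation-free formulation are illustrative; the paper's exact evaluation protocol may differ):

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision for one class over ranked detections.

    scores: detector confidences; labels: 1 if the detection matches
    a ground-truth interaction (true positive), 0 otherwise.
    AP here is the mean of the precision values attained at the rank
    of each true positive.
    """
    order = np.argsort(scores)[::-1]              # sort by descending confidence
    labels = np.asarray(labels)[order]
    if labels.sum() == 0:
        return 0.0
    tp = np.cumsum(labels)                        # true positives up to each rank
    precision = tp / np.arange(1, len(labels) + 1)
    return float(precision[labels == 1].mean())
```

Mean average precision is then the mean of this score over all interaction classes; scores near 0.82, as reported, indicate that most high-confidence detections are correct.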
This publication was supported by the Dutch national program COMMIT.
Notes
1. ShakeFive2 is publicly available from https://goo.gl/ObHv36.
Copyright information
© 2016 Springer International Publishing AG
Cite this paper
van Gemeren, C., Poppe, R., Veltkamp, R.C. (2016). Spatio-Temporal Detection of Fine-Grained Dyadic Human Interactions. In: Chetouani, M., Cohn, J., Salah, A. (eds.) Human Behavior Understanding. HBU 2016. Lecture Notes in Computer Science, vol. 9997. Springer, Cham. https://doi.org/10.1007/978-3-319-46843-3_8
DOI: https://doi.org/10.1007/978-3-319-46843-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46842-6
Online ISBN: 978-3-319-46843-3