International Journal of Computer Vision, Volume 126, Issue 2–4, pp 410–429

Joint Estimation of Human Pose and Conversational Groups from Social Scenes

  • Jagannadan Varadarajan
  • Ramanathan Subramanian
  • Samuel Rota Bulò
  • Narendra Ahuja
  • Oswald Lanz
  • Elisa Ricci


Despite many attempts in the last few years, automatic analysis of social scenes captured by wide-angle camera networks remains challenging owing to the low resolution of targets, background clutter, and frequent, persistent occlusions. In this paper, we present a novel framework for jointly estimating (i) the head and body orientations of targets and (ii) conversational groups, called F-formations, from social scenes. In contrast to prior work that has either (a) exploited the limited range of relative head and body orientations to learn both jointly, or (b) employed the mutual head (but not body) pose of interactors to deduce F-formations, we propose a weakly-supervised learning algorithm for joint inference. The algorithm uses body pose as the primary cue for F-formation estimation and adopts an alternating optimization strategy that iteratively refines the F-formation and pose estimates. Extensive experiments on three social datasets demonstrate that joint inference improves on the state of the art.
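The alternating refinement described in the abstract can be illustrated with a toy sketch. This is not the authors' formulation (the keywords point to a convex-optimization approach); it is a hand-rolled heuristic under stated assumptions. Every name here (`vote`, `group_by_votes`, `refine_pose`, `alternate`) and every parameter (`stride`, `radius`, `weight`) is hypothetical. Each person votes for a group centre in front of their body, votes are grouped greedily, and each grouped person's orientation is then nudged toward the group centre before the next round.

```python
import math

def vote(pos, theta, stride=1.0):
    """Point that a person at `pos` with body orientation `theta` votes for
    as a group centre (assumption: a fixed `stride` in front of the body)."""
    return (pos[0] + stride * math.cos(theta), pos[1] + stride * math.sin(theta))

def group_by_votes(votes, radius=1.0):
    """Greedy grouping: a vote joins the first group whose running-mean
    centre lies within `radius`; otherwise it starts a new group."""
    groups, centres = [], []
    for i, v in enumerate(votes):
        for g, c in zip(groups, centres):
            if math.hypot(v[0] - c[0], v[1] - c[1]) <= radius:
                g.append(i)
                n = len(g)
                c[0] += (v[0] - c[0]) / n  # update running mean
                c[1] += (v[1] - c[1]) / n
                break
        else:
            groups.append([i])
            centres.append([v[0], v[1]])
    return groups, centres

def refine_pose(pos, observed, centre, weight=0.5):
    """Blend the observed orientation with the direction toward the group
    centre, averaging on the unit circle to avoid wrap-around at +/- pi."""
    toward = math.atan2(centre[1] - pos[1], centre[0] - pos[0])
    x = (1 - weight) * math.cos(observed) + weight * math.cos(toward)
    y = (1 - weight) * math.sin(observed) + weight * math.sin(toward)
    return math.atan2(y, x)

def alternate(positions, observed, iters=5):
    """Alternate between grouping (poses fixed) and pose refinement (groups fixed)."""
    thetas = list(observed)
    groups = []
    for _ in range(iters):
        votes = [vote(p, t) for p, t in zip(positions, thetas)]
        groups, centres = group_by_votes(votes)
        for g, c in zip(groups, centres):
            if len(g) < 2:
                continue  # singletons keep their current pose
            for i in g:
                thetas[i] = refine_pose(positions[i], observed[i], c)
    return groups, thetas

# Two people facing each other (the first with a noisy pose) and one bystander.
positions = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
observed = [0.4, math.pi, 0.0]
groups, thetas = alternate(positions, observed)
print(groups)  # [[0, 1], [2]]
```

In this toy run the first person's noisy orientation is pulled toward the shared centre, while the bystander's pose is left untouched. A real system would replace both steps with learned pose classifiers and a principled grouping objective.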


Keywords

Head and body pose estimation, F-formation estimation, Semi-supervised learning, Convex optimization, Conversational groups, Video surveillance

Supplementary material

Supplementary material 1: 11263_2017_1026_MOESM1_ESM.pdf (PDF, 1710 KB)



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Advanced Digital Sciences Center, Singapore, Singapore
  2. International Institute of Information Technology, Hyderabad, India
  3. University of Glasgow, Glasgow, UK
  4. Mapillary Research, Graz, Austria
  5. Fondazione Bruno Kessler, Trento, Italy
  6. University of Illinois Urbana-Champaign, Champaign, USA
  7. Department of Engineering, University of Perugia, Perugia, Italy
