International Journal of Computer Vision, Volume 106, Issue 3, pp 282–296

Detecting People Looking at Each Other in Videos

  • M. J. Marin-Jimenez
  • A. Zisserman
  • M. Eichner
  • V. Ferrari

Abstract

The objective of this work is to determine whether people are interacting in TV video by detecting whether they are looking at each other. We determine the temporal period of the interaction and also spatially localize the relevant people. We make the following four contributions: (i) head detection with implicit coarse pose information (front, profile, back); (ii) continuous head pose estimation in unconstrained scenarios (TV video) using Gaussian process regression; (iii) the proposal and evaluation of several methods for assessing whether and when pairs of people are looking at each other in a video shot; and (iv) new ground truth annotation for this task, extending the TV human interactions dataset (Patron-Perez et al. 2010). The performance of the methods is evaluated on this dataset, which consists of 300 video clips extracted from TV shows. Despite the variety and difficulty of this video material, our best method obtains an average precision of 87.6% in a fully automatic manner.

Keywords

Person interactions · Video search · Action recognition · Head pose estimation

Notes

Acknowledgments

We are grateful to the Spanish Ministry (projects TIN2012-32952 and BROCA), and for financial support from the Swiss National Science Foundation and ERC Grant VisRec No. 228180. We also thank the reviewers for their helpful comments.

References

  1. Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  2. Ba, S., & Odobez, J. M. (2005). Evaluation of multiple cue head pose estimation algorithms in natural environements. In Proceedings of the IEEE International Conference on Multimedia and Expo.
  3. Ba, S., & Odobez, J. M. (2009). Recognizing visual focus of attention from head pose in natural meetings. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1), 16–33.
  4. Benfold, B., & Reid, I. (2008). Colour invariant head pose classification in low resolution video. In Proceedings of the British Machine Vision Conference.
  5. Blanz, V., & Vetter, T. (2003). Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1063–1074.
  6. Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In Proceedings of the European Conference on Computer Vision.
  7. Cour, T., Sapp, B., Jordan, C., & Taskar, B. (2009). Learning from ambiguously labeled images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  8. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 886–893).
  9. Dietterich, T. G., Lathrop, R. H., & Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1–2), 31–71.
  10. Everingham, M., & Zisserman, A. (2005). Identifying individuals in video by combining generative and discriminative head models. In Proceedings of the International Conference on Computer Vision.
  11. Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! My name is... Buffy: Automatic naming of characters in TV video. In Proceedings of the British Machine Vision Conference.
  12. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.
  13. Fathi, A., Hodgins, J., & Rehg, J. (2012). Social interactions: A first-person perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  14. Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
  15. Ferrari, V., Tuytelaars, T., & Van Gool, L. (2001). Real-time affine region tracking and coplanar grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  16. Ferrari, V., Marin, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  17. Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2009). Pose search: Retrieving people using their pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  18. Jones, M., & Viola, P. (2003). Fast multi-view face detection. Technical Report TR2003-96, MERL.
  19. Kim, W. H., & Kim, J. N. (2009). An adaptive shot change detection algorithm using an average of absolute difference histogram within extension sliding window. In IEEE International Symposium on Consumer Electronics.
  20. Kläser, A., Marszałek, M., Schmid, C., & Zisserman, A. (2010). Human focused action localization in video. In ECCV International Workshop on Sign, Gesture, Activity (pp. 219–233).
  21. Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  22. Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  23. Marín-Jiménez, M., Zisserman, A., & Ferrari, V. (2011). Here’s looking at you kid. Detecting people looking at each other in videos. In Proceedings of the British Machine Vision Conference.
  24. Marín-Jiménez, M., Pérez de la Blanca, N., & Mendoza, M. (2012). Human action recognition from simple feature pooling. Pattern Analysis and Applications.
  25. Murphy-Chutorian, E., & Trivedi, M. M. (2009). Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 607–626.
  26. Osadchy, M., LeCun, Y., & Miller, M. (2007). Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8, 1197–1215.
  27. Park, S., & Aggarwal, J. (2004). A hierarchical bayesian network for event recognition of human actions and interactions. Association For Computing Machinery Multimedia Systems Journal.
  28. Patron-Perez, A., Marszalek, M., Reid, I., & Zisserman, A. (2010). High Five: Recognising human interactions in TV shows. In Proceedings of the British Machine Vision Conference.
  29. Patron-Perez, A., Marszalek, M., Reid, I., & Zisserman, A. (2012). Structured learning of human interactions in tv shows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12), 2441–2453.
  30. Raptis, M., Kokkinos, I., & Soatto, S. (2012). Discovering discriminative action parts from mid-level video representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  31. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
  32. Sadanand, S., & Corso, J. (2012). Action bank: A high-level representation of activity in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  33. Sapp, B., Toshev, A., & Taskar, B. (2010). Cascaded models for articulated pose estimation. In Proceedings of the European Conference on Computer Vision.
  34. Shi, J., & Tomasi, C. (1994). Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 593–600).
  35. Sim, T., Baker, S., & Bsat, M. (2003). The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), 1615–1618.
  36. Sivic, J., Everingham, M., & Zisserman, A. (2005). Person spotting: Video shot retrieval for face sets. In Proceedings of the ACM International Conference on Image and Video Retrieval.
  37. Sivic, J., Everingham, M., & Zisserman, A. (2009). “Who are you?”: Learning person specific classifiers from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  38. Tang, S., Andriluka, M., & Schiele, B. (2012). Detection and tracking of occluded people. In Proceedings of the British Machine Vision Conference.
  39. Tu, Z. (2005). Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In Proceedings of the International Conference on Computer Vision.
  40. Waltisberg, W., Yao, A., Gall, J., & Van Gool, L. (2010). Variations of a Hough-voting action recognition system. In Proceedings of the International Conference on Pattern Recognition (ICPR) 2010 Contests.
  41. Website. (2005). INRIA person dataset. http://pascal.inrialpes.fr/data/human/.
  42. Website. (2010). Deformable parts model code. http://www.cs.brown.edu/pff/latent/.
  43. Website. (2011a). GPML Matlab code. http://www.gaussianprocess.org/gpml/code/matlab/doc/.
  44. Website. (2011b). LAEO annotations. http://www.robots.ox.ac.uk/vgg/data/laeo/.
  45. Website. (2011c). LAEO project. http://www.robots.ox.ac.uk/vgg/research/laeo/.
  46. Yang, Y., Baker, S., Kannan, A., & Ramanan, D. (2012). Recognizing proxemics in personal photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  47. Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation and landmark localization in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • M. J. Marin-Jimenez (1)
  • A. Zisserman (2)
  • M. Eichner (3)
  • V. Ferrari (4)

  1. Department of Computing and Numerical Analysis, Maimonides Institute for Biomedical Research (IMIBIC), University of Cordoba, Cordoba, Spain
  2. Department of Engineering Science, University of Oxford, Oxford, UK
  3. ETH Zurich, Zurich, Switzerland
  4. School of Informatics, University of Edinburgh, Edinburgh, UK
