Detecting People Looking at Each Other in Videos

International Journal of Computer Vision

Abstract

The objective of this work is to determine whether people are interacting in TV video by detecting whether they are looking at each other. We determine the temporal period of the interaction and also spatially localize the relevant people. We make the following four contributions: (i) head detection with implicit coarse pose information (front, profile, back); (ii) continuous head pose estimation in unconstrained scenarios (TV video) using Gaussian process regression; (iii) several methods, which we propose and evaluate, for assessing whether and when pairs of people are looking at each other in a video shot; and (iv) new ground truth annotation for this task, extending the TV human interactions dataset (Patron-Perez et al. 2010). The performance of the methods is evaluated on this dataset, which consists of 300 video clips extracted from TV shows. Despite the variety and difficulty of this video material, our best method obtains an average precision of 87.6% in a fully automatic manner.
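
To make the pairwise decision in contribution (iii) concrete, the following is a minimal geometric sketch and not the method evaluated in the paper. It assumes each detected head is reduced to a 2D image position and a single yaw angle, with the hypothetical convention that 0 degrees points toward image right and 180 degrees toward image left. The function names and the 30 degree tolerance are illustrative assumptions only.

```python
import math


def gaze_direction(yaw_deg):
    """Unit vector in the image plane for a head yaw angle.

    Assumed convention (hypothetical): 0 degrees looks toward +x
    (image right), 180 degrees looks toward -x (image left).
    """
    yaw = math.radians(yaw_deg)
    return (math.cos(yaw), math.sin(yaw))


def looks_toward(head_xy, yaw_deg, target_xy, max_angle_deg=30.0):
    """True if the gaze from head_xy points within max_angle_deg of target_xy."""
    dx = target_xy[0] - head_xy[0]
    dy = target_xy[1] - head_xy[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return False  # coincident heads: the direction is undefined
    gx, gy = gaze_direction(yaw_deg)
    # Cosine of the angle between the gaze vector and the head-to-head vector,
    # clamped to guard against floating-point drift outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (gx * dx + gy * dy) / norm))
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg


def looking_at_each_other(head_a, yaw_a, head_b, yaw_b, max_angle_deg=30.0):
    """Symmetric test: each head must point toward the other."""
    return (looks_toward(head_a, yaw_a, head_b, max_angle_deg)
            and looks_toward(head_b, yaw_b, head_a, max_angle_deg))
```

For example, a head at (100, 50) facing right and a head at (300, 60) facing left pass the test, while two heads that both face left do not. A per-frame decision of this kind would still need to be aggregated over a shot to answer "whether and when"; that step is outside this sketch.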

References

  • Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Ba, S., & Odobez, J. M. (2005). Evaluation of multiple cue head pose estimation algorithms in natural environements. In Proceedings of the IEEE International Conference on Multimedia and Expo.

  • Ba, S., & Odobez, J. M. (2009). Recognizing visual focus of attention from head pose in natural meetings. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1), 16–33.

  • Benfold, B., & Reid, I. (2008). Colour invariant head pose classification in low resolution video. In Proceedings of the British Machine Vision Conference.

  • Blanz, V., & Vetter, T. (2003). Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1063–1074.

  • Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In Proceedings of the European Conference on Computer Vision.

  • Cour, T., Sapp, B., Jordan, C., & Taskar, B. (2009). Learning from ambiguously labeled images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2 (pp. 886–893).

  • Dietterich, T. G., Lathrop, R. H., & Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1–2), 31–71.

  • Everingham, M., & Zisserman, A. (2005). Identifying individuals in video by combining generative and discriminative head models. In Proceedings of the International Conference on Computer Vision.

  • Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! My name is... Buffy: Automatic naming of characters in TV video. In Proceedings of the British Machine Vision Conference.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fathi, A., Hodgins, J., & Rehg, J. (2012). Social interactions: A first-person perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Ferrari, V., Tuytelaars, T., & Van Gool, L. (2001). Real-time affine region tracking and coplanar grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2009). Pose search: Retrieving people using their pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Jones, M., & Viola, P. (2003). Fast multi-view face detection. Technical Report TR2003-96, MERL.

  • Kim, W. H., & Kim, J. N. (2009). An adaptive shot change detection algorithm using an average of absolute difference histogram within extension sliding window. In IEEE International Symposium on Consumer Electronics.

  • Kläser, A., Marszałek, M., Schmid, C., & Zisserman, A. (2010). Human focused action localization in video. In ECCV International Workshop on Sign, Gesture, Activity (pp. 219–233).

  • Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Marín-Jiménez, M., Zisserman, A., & Ferrari, V. (2011). Here’s looking at you kid. Detecting people looking at each other in videos. In Proceedings of the British Machine Vision Conference.

  • Marín-Jiménez, M., Pérez de la Blanca, N., & Mendoza, M. (2012). Human action recognition from simple feature pooling. Pattern Analysis and Applications.

  • Murphy-Chutorian, E., & Trivedi, M. M. (2009). Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 607–626.

  • Osadchy, M., LeCun, Y., & Miller, M. (2007). Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8, 1197–1215.

  • Park, S., & Aggarwal, J. (2004). A hierarchical Bayesian network for event recognition of human actions and interactions. ACM Multimedia Systems Journal.

  • Patron-Perez, A., Marszalek, M., Reid, I., & Zisserman, A. (2010). High Five: Recognising human interactions in TV shows. In Proceedings of the British Machine Vision Conference.

  • Patron-Perez, A., Marszalek, M., Reid, I., & Zisserman, A. (2012). Structured learning of human interactions in TV shows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12), 2441–2453.

  • Raptis, M., Kokkinos, I., & Soatto, S. (2012). Discovering discriminative action parts from mid-level video representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.

  • Sadanand, S., & Corso, J. (2012). Action bank: A high-level representation of activity in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Sapp, B., Toshev, A., & Taskar, B. (2010). Cascaded models for articulated pose estimation. In Proceedings of the European Conference on Computer Vision.

  • Shi, J., & Tomasi, C. (1994). Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 593–600).

  • Sim, T., Baker, S., & Bsat, M. (2003). The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), 1615–1618.

  • Sivic, J., Everingham, M., & Zisserman, A. (2005). Person spotting: Video shot retrieval for face sets. In Proceedings of the ACM International Conference on Image and Video Retrieval.

  • Sivic, J., Everingham, M., & Zisserman, A. (2009). “Who are you?”: Learning person specific classifiers from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Tang, S., Andriluka, M., & Schiele, B. (2012). Detection and tracking of occluded people. In Proceedings of the British Machine Vision Conference.

  • Tu, Z. (2005). Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In Proceedings of the International Conference on Computer Vision.

  • Waltisberg, W., Yao, A., Gall, J., & Van Gool, L. (2010). Variations of a Hough-voting action recognition system. In Proceedings of the International Conference on Pattern Recognition (ICPR) 2010 Contests.

  • Website. (2005). INRIA person dataset. http://pascal.inrialpes.fr/data/human/.

  • Website. (2010). Deformable parts model code. http://www.cs.brown.edu/pff/latent/.

  • Website. (2011a). GPML Matlab code. http://www.gaussianprocess.org/gpml/code/matlab/doc/.

  • Website. (2011b). LAEO annotations. http://www.robots.ox.ac.uk/vgg/data/laeo/.

  • Website. (2011c). LAEO project. http://www.robots.ox.ac.uk/vgg/research/laeo/.

  • Yang, Y., Baker, S., Kannan, A., & Ramanan, D. (2012). Recognizing proxemics in personal photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation and landmark localization in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Acknowledgments

We are grateful for financial support from the Spanish Ministry (projects TIN2012-32952 and BROCA), the Swiss National Science Foundation, and ERC Grant VisRec No. 228180. We also thank the reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to M. J. Marin-Jimenez.

Appendix: Released Materials

We have released a variety of output from the research that led to this paper:

  (i) the video shot decomposition (Website 2011b) of the TVHID videos;

  (ii) the LAEO annotations (Website 2011b) on TVHID used in our experiments; and

  (iii) the head detector (Website 2011c) trained to deal with different viewpoints.

About this article

Cite this article

Marin-Jimenez, M.J., Zisserman, A., Eichner, M. et al. Detecting People Looking at Each Other in Videos. Int J Comput Vis 106, 282–296 (2014). https://doi.org/10.1007/s11263-013-0655-7

  • DOI: https://doi.org/10.1007/s11263-013-0655-7
