Abstract
The objective of this work is to determine whether people are interacting in TV video by detecting whether they are looking at each other. We determine the temporal period of the interaction and also spatially localize the relevant people. We make the following four contributions: (i) head detection with implicit coarse pose information (front, profile, back); (ii) continuous head pose estimation in unconstrained scenarios (TV video) using Gaussian process regression; (iii) the proposal and evaluation of several methods for assessing whether and when pairs of people are looking at each other in a video shot; and (iv) new ground truth annotation for this task, extending the TV human interactions dataset (Patron-Perez et al. 2010). The performance of the methods is evaluated on this dataset, which consists of 300 video clips extracted from TV shows. Despite the variety and difficulty of this video material, our best method obtains an average precision of 87.6% in a fully automatic manner.
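As an illustration of the reported metric, the following minimal sketch computes non-interpolated average precision over a ranked list of video shots. The function name, the per-shot scores, and the choice of the non-interpolated variant are assumptions made for illustration; the exact evaluation protocol used for the 87.6% figure is not specified in the abstract and may differ.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Non-interpolated average precision for a ranked list.

    scores: hypothetical per-shot confidence that two people are looking
            at each other (higher means more confident).
    labels: ground-truth flags, True when the shot is a positive example.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    # Rank shots from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    num_positives = sum(1 for _, is_positive in ranked if is_positive)
    if num_positives == 0:
        return 0.0  # convention chosen here: no positives gives AP of 0

    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            # Precision measured at the rank of each retrieved positive.
            precision_sum += hits / rank
    return precision_sum / num_positives
```

For example, with scores `[0.9, 0.8, 0.7, 0.6]` and labels `[True, False, True, False]`, the positives are retrieved at ranks 1 and 3, so the result is the mean of 1/1 and 2/3.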
References
Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Ba, S., & Odobez, J. M. (2005). Evaluation of multiple cue head pose estimation algorithms in natural environments. In Proceedings of the IEEE International Conference on Multimedia and Expo.
Ba, S., & Odobez, J. M. (2009). Recognizing visual focus of attention from head pose in natural meetings. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1), 16–33.
Benfold, B., & Reid, I. (2008). Colour invariant head pose classification in low resolution video. In Proceedings of the British Machine Vision Conference.
Blanz, V., & Vetter, T. (2003). Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1063–1074.
Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In Proceedings of the European Conference on Computer Vision.
Cour, T., Sapp, B., Jordan, C., & Taskar, B. (2009). Learning from ambiguously labeled images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (vol. 2, pp. 886–893).
Dietterich, T. G., Lathrop, R. H., & Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1–2), 31–71.
Everingham, M., & Zisserman, A. (2005). Identifying individuals in video by combining generative and discriminative head models. In Proceedings of the International Conference on Computer Vision.
Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! My name is... Buffy: Automatic naming of characters in TV video. In Proceedings of the British Machine Vision Conference.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Fathi, A., Hodgins, J., & Rehg, J. (2012). Social interactions: A first-person perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Ferrari, V., Tuytelaars, T., & Van Gool, L. (2001). Real-time affine region tracking and coplanar grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2009). Pose search: Retrieving people using their pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Jones, M., & Viola, P. (2003). Fast multi-view face detection. Technical Report TR2003-96, MERL.
Kim, W. H., & Kim, J. N. (2009). An adaptive shot change detection algorithm using an average of absolute difference histogram within extension sliding window. In IEEE International Symposium on Consumer Electronics.
Kläser, A., Marszałek, M., Schmid, C., & Zisserman, A. (2010). Human focused action localization in video. In ECCV International Workshop on Sign, Gesture, Activity (pp. 219–233).
Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Marín-Jiménez, M., Zisserman, A., & Ferrari, V. (2011). “Here’s looking at you, kid.” Detecting people looking at each other in videos. In Proceedings of the British Machine Vision Conference.
Marín-Jiménez, M., Pérez de la Blanca, N., & Mendoza, M. (2012). Human action recognition from simple feature pooling. Pattern Analysis and Applications.
Murphy-Chutorian, E., & Trivedi, M. M. (2009). Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 607–626.
Osadchy, M., LeCun, Y., & Miller, M. (2007). Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8, 1197–1215.
Park, S., & Aggarwal, J. (2004). A hierarchical Bayesian network for event recognition of human actions and interactions. Association for Computing Machinery Multimedia Systems Journal.
Patron-Perez, A., Marszalek, M., Reid, I., & Zisserman, A. (2010). High Five: Recognising human interactions in TV shows. In Proceedings of the British Machine Vision Conference.
Patron-Perez, A., Marszalek, M., Reid, I., & Zisserman, A. (2012). Structured learning of human interactions in TV shows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12), 2441–2453.
Raptis, M., Kokkinos, I., & Soatto, S. (2012). Discovering discriminative action parts from mid-level video representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Sadanand, S., & Corso, J. (2012). Action bank: A high-level representation of activity in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Sapp, B., Toshev, A., & Taskar, B. (2010). Cascaded models for articulated pose estimation. In Proceedings of the European Conference on Computer Vision.
Shi, J., & Tomasi, C. (1994). Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 593–600).
Sim, T., Baker, S., & Bsat, M. (2003). The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), 1615–1618.
Sivic, J., Everingham, M., & Zisserman, A. (2005). Person spotting: Video shot retrieval for face sets. In Proceedings of the ACM International Conference on Image and Video Retrieval.
Sivic, J., Everingham, M., & Zisserman, A. (2009). “Who are you?”: Learning person specific classifiers from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Tang, S., Andriluka, M., & Schiele, B. (2012). Detection and tracking of occluded people. In Proceedings of the British Machine Vision Conference.
Tu, Z. (2005). Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In Proceedings of the International Conference on Computer Vision.
Waltisberg, W., Yao, A., Gall, J., & Van Gool, L. (2010). Variations of a Hough-voting action recognition system. In Proceedings of the International Conference on Pattern Recognition (ICPR) 2010 Contests.
Website. (2005). INRIA person dataset. http://pascal.inrialpes.fr/data/human/.
Website. (2010). Deformable parts model code. http://www.cs.brown.edu/pff/latent/.
Website. (2011a). GPML Matlab code. http://www.gaussianprocess.org/gpml/code/matlab/doc/.
Website. (2011b). LAEO annotations. http://www.robots.ox.ac.uk/vgg/data/laeo/.
Website. (2011c). LAEO project. http://www.robots.ox.ac.uk/vgg/research/laeo/.
Yang, Y., Baker, S., Kannan, A., & Ramanan, D. (2012). Recognizing proxemics in personal photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation and landmark localization in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Acknowledgments
We are grateful for financial support from the Spanish Ministry (projects TIN2012-32952 and BROCA), the Swiss National Science Foundation, and ERC Grant VisRec No. 228180. We also thank the reviewers for their helpful comments.
Appendix: Released Materials
We have released a variety of output from the research that led to this paper:
Cite this article
Marin-Jimenez, M.J., Zisserman, A., Eichner, M. et al. Detecting People Looking at Each Other in Videos. Int J Comput Vis 106, 282–296 (2014). https://doi.org/10.1007/s11263-013-0655-7
DOI: https://doi.org/10.1007/s11263-013-0655-7