We Are Family: Joint Pose Estimation of Multiple Persons

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6311)


We present a novel multi-person pose estimation framework, which extends pictorial structures (PS) to explicitly model interactions between people and to estimate their poses jointly. Interactions are modeled as occlusions between people. First, we propose an occlusion probability predictor, based on the location of persons automatically detected in the image, and incorporate the predictions as occlusion priors into our multi-person PS model. Moreover, our model includes an inter-people exclusion penalty, preventing body parts from different people from occupying the same image region. Thanks to these elements, our model has a global view of the scene, resulting in better pose estimates in group photos, where several persons stand nearby and occlude each other. In a comprehensive evaluation on a new, challenging group photo dataset, we demonstrate the benefits of our multi-person model over a state-of-the-art single-person pose estimator which treats each person independently.
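
The inter-people exclusion penalty mentioned above can be illustrated with a small sketch. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: body parts are simplified to axis-aligned boxes, the penalty is the overlap area normalized by the smaller box, and the names `Box`, `overlap_area`, `exclusion_penalty`, and `weight` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned body-part box (an illustrative simplification)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they are disjoint)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(0.0, w) * max(0.0, h)


def exclusion_penalty(people: Sequence[Sequence[Box]], weight: float = 1.0) -> float:
    """Sum the fractional overlap between parts of *different* people.

    Parts belonging to the same person are never compared, so
    self-overlap (e.g. an arm in front of the torso) is not penalized here.
    """
    total = 0.0
    for parts_a, parts_b in combinations(people, 2):
        for a in parts_a:
            for b in parts_b:
                smaller = min(a.area(), b.area())
                if smaller > 0.0:
                    total += overlap_area(a, b) / smaller
    return weight * total
```

In a joint model, a term of this kind would be added to the energy of a candidate multi-person configuration, so that configurations placing two people's parts on the same image region score worse than configurations that keep them apart.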


Keywords (machine-generated, not supplied by the authors): Body Part, Group Photo, Joint Model, Appearance Model, Detection Window

Supplementary material

978-3-642-15549-9_17_MOESM1_ESM.pdf: Electronic Supplementary Material (12,887 KB)



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. Computer Vision Laboratory, ETH Zurich, Switzerland
