2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images

International Journal of Computer Vision

Abstract

We present a technique for estimating the spatial layout of humans in still images, that is, the position of the head, torso and arms. The theme we explore is that once a person is localized using an upper body detector, the search for their body parts can be considerably simplified using weak constraints on position and appearance arising from that detection. Our approach is capable of estimating upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person in the image. People are only required to be upright and seen from the front or the back (not from the side).

We evaluate the stages of our approach experimentally using ground truth layout annotation on a variety of challenging material, such as images from the PASCAL VOC 2008 challenge and video frames from TV shows and feature films.

We also propose and evaluate techniques for searching a video dataset for people in a specific pose. To this end, we develop three new pose descriptors and compare their classification and retrieval performance to two baselines built on state-of-the-art object detection models.

Corresponding author

Correspondence to M. Eichner.

About this article

Cite this article

Eichner, M., Marin-Jimenez, M., Zisserman, A. et al. 2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images. Int J Comput Vis 99, 190–214 (2012). https://doi.org/10.1007/s11263-012-0524-9

  • DOI: https://doi.org/10.1007/s11263-012-0524-9