2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images

Eichner, M.; Marin-Jimenez, M.; Zisserman, A.; Ferrari, V.

doi:10.1007/s11263-012-0524-9

Published: 28 March 2012

Volume 99, pages 190–214, (2012)

International Journal of Computer Vision

M. Eichner¹, M. Marin-Jimenez³, A. Zisserman⁴ & V. Ferrari²

Abstract

We present a technique for estimating the spatial layout of humans in still images, that is, the position of the head, torso and arms. The theme we explore is that once a person is localized using an upper body detector, the search for their body parts can be considerably simplified using weak constraints on position and appearance arising from that detection. Our approach is capable of estimating upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person in the image. People are only required to be upright and seen from the front or the back (not from the side).

We evaluate the stages of our approach experimentally using ground truth layout annotation on a variety of challenging material, such as images from the PASCAL VOC 2008 challenge and video frames from TV shows and feature films.

We also propose and evaluate techniques for searching a video dataset for people in a specific pose. To this end, we develop three new pose descriptors and compare their classification and retrieval performance to two baselines built on state-of-the-art object detection models.

Author information

Authors and Affiliations

ETH Zurich, Zurich, Switzerland
M. Eichner
University of Edinburgh, Edinburgh, UK
V. Ferrari
University of Cordoba, Cordoba, Spain
M. Marin-Jimenez
University of Oxford, Oxford, UK
A. Zisserman

Corresponding author

Correspondence to M. Eichner.

About this article

Cite this article

Eichner, M., Marin-Jimenez, M., Zisserman, A. et al. 2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images. Int J Comput Vis 99, 190–214 (2012). https://doi.org/10.1007/s11263-012-0524-9

Received: 11 August 2010
Accepted: 06 March 2012
Published: 28 March 2012
Issue Date: September 2012
DOI: https://doi.org/10.1007/s11263-012-0524-9

Keywords

Articulated human pose estimation; Search; Retrieval
