Fast Human Pose Detection Using Randomized Hierarchical Cascades of Rejectors

International Journal of Computer Vision

Abstract

This paper addresses human detection and pose estimation from monocular images by formulating it as a classification problem. Our main contribution is a multi-class pose detector that combines the best components of state-of-the-art classifiers: hierarchical trees, cascades of rejectors, and randomized forests. Given a database of images with corresponding human poses, we define a set of classes by discretizing camera viewpoint and pose space. A bottom-up approach is first followed to build a hierarchical tree by recursively clustering and merging the classes at each level. For each branch of this decision tree, we take advantage of the alignment of training images to build a list of potentially discriminative HOG (Histograms of Oriented Gradients) features. We then select the HOG blocks that show the best rejection performance. Finally, we grow an ensemble of cascades by randomly sampling one of these HOG-based rejectors at each branch of the tree. The resulting multi-class classifier is then used to scan images in a sliding-window scheme. One property of our algorithm is that the randomization can be applied on-line at no extra cost, so that each window is classified with a different ensemble of randomized cascades. Compared to other pose classifiers, our approach delivers fast and efficient detection with both fixed and moving cameras. We present results using different publicly available training and testing data sets.
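The randomized cascade idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `threshold_rejector`, `run_cascade` and `classify_window` are hypothetical, a simple threshold on one feature stands in for a learned HOG-block rejector, and counting votes per class is an assumed way of combining the ensemble, which the abstract does not specify.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

# A rejector returns True if the window survives, False if it is rejected.
Rejector = Callable[[Sequence[float]], bool]


@dataclass
class Node:
    """One node of the class hierarchy (hypothetical structure)."""
    label: Optional[int] = None  # class id at a leaf, None for internal nodes
    rejectors: List[Rejector] = field(default_factory=list)  # pool guarding the branch into this node
    children: List["Node"] = field(default_factory=list)


def threshold_rejector(index: int, threshold: float) -> Rejector:
    """Assumed stand-in for a HOG-block rejector: pass if one feature is large enough."""
    return lambda features: features[index] >= threshold


def run_cascade(node: Node, features: Sequence[float],
                rng: random.Random, votes: Dict[int, int]) -> None:
    """Descend the tree, sampling one rejector at random from each branch's pool."""
    if node.rejectors:
        rejector = rng.choice(node.rejectors)
        if not rejector(features):
            return  # window rejected on this branch, stop descending
    if node.label is not None:
        votes[node.label] = votes.get(node.label, 0) + 1
        return
    for child in node.children:
        run_cascade(child, features, rng, votes)


def classify_window(root: Node, features: Sequence[float],
                    n_cascades: int = 10, seed: int = 0) -> Dict[int, int]:
    """Run several randomized cascades on one window and count leaf votes."""
    rng = random.Random(seed)
    votes: Dict[int, int] = {}
    for _ in range(n_cascades):
        run_cascade(root, features, rng, votes)
    return votes


# Toy hierarchy with two pose classes; each branch has a pool of two rejectors.
root = Node(children=[
    Node(label=0, rejectors=[threshold_rejector(0, 0.5), threshold_rejector(0, 0.6)]),
    Node(label=1, rejectors=[threshold_rejector(1, 0.5), threshold_rejector(1, 0.7)]),
])

pose_votes = classify_window(root, [0.9, 0.1])        # window resembling class 0
background_votes = classify_window(root, [0.0, 0.0])  # window rejected everywhere
```

Because a rejector is drawn afresh at each branch on every pass, no fixed ensemble has to be stored; this is one way to read the abstract's remark that randomization can be applied on-line at no extra cost.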

Author information

Corresponding author

Correspondence to Grégory Rogez.

Additional information

Part of this work was conducted while the first author was a research fellow at Oxford Brookes University. This work was partly supported by the EPSRC grant GR/T21790/01(P) and by Sony Entertainment Europe (SCEE). G. Rogez and C. Orrite would like to acknowledge support provided by: “Departamento de Ciencia, Tecnología y Universidad del Gobierno de Aragón”, “Fondo Social Europeo” and “Ministerio de Ciencia e Innovación (TIN2010-20177)”. Prof. Torr is in receipt of a Royal Society Wolfson Research Merit Award.

About this article

Cite this article

Rogez, G., Rihan, J., Orrite-Uruñuela, C. et al. Fast Human Pose Detection Using Randomized Hierarchical Cascades of Rejectors. Int J Comput Vis 99, 25–52 (2012). https://doi.org/10.1007/s11263-012-0516-9

  • DOI: https://doi.org/10.1007/s11263-012-0516-9
