Abstract
This paper addresses human detection and pose estimation from monocular images by formulating the task as a classification problem. Our main contribution is a multi-class pose detector that combines the best components of state-of-the-art classifiers: hierarchical trees, cascades of rejectors, and randomized forests. Given a database of images with corresponding human poses, we define a set of classes by discretizing the camera viewpoint and pose space. We first follow a bottom-up approach to build a hierarchical tree by recursively clustering and merging the classes at each level. For each branch of this decision tree, we take advantage of the alignment of training images to build a list of potentially discriminative HOG (Histograms of Oriented Gradients) features, and then select the HOG blocks that show the best rejection performance. Finally, we grow an ensemble of cascades by randomly sampling one of these HOG-based rejectors at each branch of the tree. The resulting multi-class classifier is then used to scan images in a sliding-window scheme. One property of our algorithm is that the randomization can be applied on-line at no extra cost, so each window is classified with a different ensemble of randomized cascades. Compared with other pose classifiers, our approach gives fast and efficient detection with both fixed and moving cameras. We present results on different publicly available training and testing data sets.
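The idea of sampling one rejector per branch to form a randomized cascade can be illustrated with a toy sketch. This is not the authors' implementation: the `Node` structure, the `threshold_rejector` stand-in for a HOG-block rejector, the vote counting in `classify`, and the four-value feature vector are all assumptions made for illustration, and the tree here is written by hand rather than built by clustering classes.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

# A rejector returns True to keep exploring a branch and False to reject it.
Rejector = Callable[[Sequence[float]], bool]


@dataclass
class Node:
    classes: List[int]          # pose/viewpoint classes grouped under this node
    rejectors: List[Rejector]   # candidate rejectors for the branch entering this node
    children: List["Node"] = field(default_factory=list)


def threshold_rejector(index: int, threshold: float) -> Rejector:
    """Hypothetical stand-in for a HOG-block rejector: pass if one feature is large enough."""
    return lambda x: x[index] >= threshold


def run_cascade(root: Node, x: Sequence[float], rng: random.Random) -> List[int]:
    """One randomized cascade: sample one rejector per branch, prune rejected subtrees."""
    surviving: List[int] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.rejectors and not rng.choice(node.rejectors)(x):
            continue  # branch rejected, so none of its classes can survive
        if node.children:
            stack.extend(node.children)
        else:
            surviving.extend(node.classes)
    return surviving


def classify(root: Node, x: Sequence[float], n_cascades: int,
             seed: Optional[int] = None) -> Dict[int, int]:
    """Run several randomized cascades on one window and count votes per class."""
    rng = random.Random(seed)
    votes: Dict[int, int] = {}
    for _ in range(n_cascades):
        for c in run_cascade(root, x, rng):
            votes[c] = votes.get(c, 0) + 1
    return votes


# Hand-built two-level hierarchy over four classes (illustrative values only).
leaf0 = Node([0], [threshold_rejector(2, 0.3), threshold_rejector(2, 0.5)])
leaf1 = Node([1], [threshold_rejector(3, 0.5)])
leaf2 = Node([2], [threshold_rejector(2, 0.5)])
leaf3 = Node([3], [threshold_rejector(3, 0.5)])
left = Node([0, 1], [threshold_rejector(0, 0.4), threshold_rejector(0, 0.5)], [leaf0, leaf1])
right = Node([2, 3], [threshold_rejector(1, 0.5), threshold_rejector(1, 0.6)], [leaf2, leaf3])
root = Node([0, 1, 2, 3], [], [left, right])

window = [0.9, 0.1, 0.8, 0.2]  # made-up features for one sliding window
print(classify(root, window, n_cascades=10, seed=0))
```

Because the rejector for each branch is drawn at classification time, every call to `run_cascade` can follow a different cascade through the same tree, which is one way to read the statement that randomization is applied on-line. A window whose features fail the first-level rejectors is discarded after very few tests, which is where a cascade saves time on background windows.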
Additional information
Part of this work was conducted while the first author was a research fellow at Oxford Brookes University. This work was partly supported by the EPSRC grant GR/T21790/01(P) and by Sony Entertainment Europe (SCEE). G. Rogez and C. Orrite would like to acknowledge support provided by: “Departamento de Ciencia, Tecnología y Universidad del Gobierno de Aragón”, “Fondo Social Europeo” and “Ministerio de Ciencia e Innovación (TIN2010-20177)”. Prof. Torr is in receipt of a Royal Society Wolfson Research Merit Award.
Cite this article
Rogez, G., Rihan, J., Orrite-Uruñuela, C. et al. Fast Human Pose Detection Using Randomized Hierarchical Cascades of Rejectors. Int J Comput Vis 99, 25–52 (2012). https://doi.org/10.1007/s11263-012-0516-9