Max Margin Learning of Hierarchical Configural Deformable Templates (HCDTs) for Efficient Object Parsing and Pose Estimation

  • Long (Leo) Zhu
  • Yuanhao Chen
  • Chenxi Lin
  • Alan Yuille
Open Access


In this paper we formulate a hierarchical configurable deformable template (HCDT) to model articulated visual objects, such as horses and baseball players, for tasks such as parsing, segmentation, and pose estimation. HCDTs represent an object by an AND/OR graph in which the OR nodes act as switches that enable the graph topology to vary adaptively. This hierarchical representation is compositional, and the node variables represent the positions and properties of subparts of the object. The graph and the node variables are required to obey the summarization principle, which enables an efficient compositional inference algorithm to rapidly estimate the state of the HCDT. We specify the structure of the AND/OR graph of the HCDT by hand and learn the model parameters discriminatively by extending max-margin learning to AND/OR graphs. We illustrate the three main aspects of HCDTs (representation, inference, and learning) on the tasks of segmentation, parsing, and pose (configuration) estimation for horses and humans. We demonstrate that the inference algorithm is fast and that max-margin learning is effective. We show that HCDTs give state-of-the-art results for segmentation and pose estimation when compared to other methods on benchmark datasets.
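
To make the role of the OR nodes concrete, the following is a toy sketch of bottom-up scoring on an AND/OR graph, not the inference algorithm of the paper. It assumes each leaf carries a fixed scalar score, an AND node sums the scores of all its children, and an OR node selects its single best child, which is what lets the topology vary. The `Node` class, the `best_parse` function, the part names, and the scores are all hypothetical; the position variables, the summarization principle, and any pruning are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical AND/OR graph node (illustrative only)."""
    name: str
    kind: str                      # "LEAF", "AND", or "OR"
    score: float = 0.0             # local score, used by leaves only
    children: List["Node"] = field(default_factory=list)


def best_parse(node: Node) -> Tuple[float, List[str]]:
    """Return (best total score, names of the leaves selected)."""
    if node.kind == "LEAF":
        return node.score, [node.name]
    results = [best_parse(child) for child in node.children]
    if node.kind == "AND":
        # An AND node composes all of its children.
        total = sum(s for s, _ in results)
        leaves = [name for _, names in results for name in names]
        return total, leaves
    if node.kind == "OR":
        # An OR node acts as a switch: keep only the best-scoring child.
        return max(results, key=lambda r: r[0])
    raise ValueError(f"unknown node kind: {node.kind}")


# Made-up example: a horse is a head AND one of two leg configurations.
legs = Node("legs", "OR", children=[
    Node("legs_standing", "LEAF", score=1.0),
    Node("legs_running", "LEAF", score=2.5),
])
horse = Node("horse", "AND", children=[
    Node("head", "LEAF", score=0.5),
    legs,
])

score, leaves = best_parse(horse)
print(score, leaves)
```

In this toy graph the OR node switches to the running-legs configuration because it has the higher score, so the selected parse contains the head and the running legs.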


Hierarchy, Shape representation, Object parsing, Segmentation, Structure learning, Max margin



Copyright information

© The Author(s) 2010

Open Access. This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  • Long (Leo) Zhu (1)
  • Yuanhao Chen (2)
  • Chenxi Lin (3)
  • Alan Yuille (4)

  1. Department of Statistics, University of California at Los Angeles, Los Angeles, USA
  2. University of Science and Technology of China, Hefei, P.R. China
  3. Alibaba Group R&D, Hangzhou, P.R. China
  4. Department of Statistics, Psychology and Computer Science, University of California at Los Angeles, Los Angeles, USA
