Abstract
In this paper we formulate a hierarchical configurable deformable template (HCDT) to model articulated visual objects, such as horses and baseball players, for tasks such as parsing, segmentation, and pose estimation. HCDTs represent an object by an AND/OR graph in which the OR nodes act as switches that enable the graph topology to vary adaptively. This hierarchical representation is compositional, and the node variables represent positions and properties of subparts of the object. The graph and the node variables are required to obey the summarization principle, which enables an efficient compositional inference algorithm to rapidly estimate the state of the HCDT. We specify the structure of the AND/OR graph of the HCDT by hand and learn the model parameters discriminatively by extending max-margin learning to AND/OR graphs. We illustrate the three main aspects of HCDTs (representation, inference, and learning) on the tasks of segmentation, parsing, and pose (configuration) estimation for horses and humans. We demonstrate that the inference algorithm is fast and that max-margin learning is effective. We show that HCDTs give state-of-the-art results for segmentation and pose estimation when compared to other methods on benchmark datasets.
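To make the role of the OR nodes concrete, the following is a minimal sketch of bottom-up scoring on a toy AND/OR graph. It is not the paper's inference algorithm: it omits the position and property variables and the summarization principle, and every name in it (`Node`, `best_parse`, the part names and scores) is a hypothetical chosen for illustration. It only shows how an OR node acts as a switch that selects one child, while an AND node composes all of its children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """One node of a toy AND/OR graph (hypothetical structure, not from the paper)."""
    name: str
    kind: str                      # "AND", "OR", or "LEAF"
    children: List["Node"] = field(default_factory=list)
    score: float = 0.0             # local evidence score, used by leaves only


def best_parse(node: Node) -> Tuple[float, Dict[str, str]]:
    """Return the best score below `node` and the child selected at each OR node."""
    if node.kind == "LEAF":
        return node.score, {}
    results = [best_parse(child) for child in node.children]
    if node.kind == "AND":
        # An AND node composes all of its children, so their scores add up.
        total = sum(score for score, _ in results)
        choices: Dict[str, str] = {}
        for _, sub in results:
            choices.update(sub)
        return total, choices
    if node.kind == "OR":
        # An OR node is a switch: keep only the highest-scoring child.
        idx = max(range(len(results)), key=lambda i: results[i][0])
        score, sub = results[idx]
        return score, {node.name: node.children[idx].name, **sub}
    raise ValueError(f"unknown node kind: {node.kind}")


# Toy example: a body is a torso AND a leg; the leg is straight OR bent.
leg = Node("leg", "OR", [
    Node("leg_straight", "LEAF", score=0.25),
    Node("leg_bent", "LEAF", score=0.75),
])
body = Node("body", "AND", [Node("torso", "LEAF", score=1.0), leg])

score, choices = best_parse(body)
print(score, choices)
```

In this toy graph the OR node picks the bent leg because it has the higher score, so the selected topology changes with the evidence while the graph definition stays fixed.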
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Zhu, L., Chen, Y., Lin, C. et al. Max Margin Learning of Hierarchical Configural Deformable Templates (HCDTs) for Efficient Object Parsing and Pose Estimation. Int J Comput Vis 93, 1–21 (2011). https://doi.org/10.1007/s11263-010-0375-1
DOI: https://doi.org/10.1007/s11263-010-0375-1