Object Detection Using Strongly-Supervised Deformable Part Models

  • Hossein Azizpour
  • Ivan Laptev
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7572)


Deformable part-based models [1, 2] achieve state-of-the-art performance for object detection, but rely on heuristic initialization during training because the cost function being optimized is non-convex. This paper investigates the limitations of such initialization and extends earlier methods using additional supervision. We explore strong supervision in terms of annotated object parts and use it to (i) improve model initialization, (ii) optimize model structure, and (iii) handle partial occlusions. Our method is able to deal with sub-optimal and incomplete annotations of object parts and is shown to benefit from semi-supervised learning setups where part-level annotation is provided for only a fraction of the positive examples. Experimental results are reported for the detection of six animal classes in the PASCAL VOC 2007 and 2010 datasets. We demonstrate significant improvements in detection performance compared to the LSVM [1] and the Poselet [3] object detectors.


Object Detection, Minimum Span Tree, Star Model, Object Part, Stochastic Gradient Descent
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


  1. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part based models. PAMI 32, 1627–1645 (2010)
  2. Zhu, L., Chen, Y., Yuille, A., Freeman, W.: Latent hierarchical structural learning for object detection. In: CVPR, pp. 1062–1069 (2010)
  3. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annotations. In: ICCV (2009)
  4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
  5. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2010 (VOC 2010) Results (2010)
  6. Fischler, M., Elschlager, R.: The representation and matching of pictorial structures. TC 22, 67–92 (1973)
  7. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition. IJCV 61, 55–79 (2005)
  8. Ramanan, D.: Learning to parse images of articulated bodies. In: NIPS (2006)
  9. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Pose search: Retrieving people using their pose. In: CVPR (2009)
  10. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR, pp. 1385–1392 (2011)
  11. Yang, W., Wang, Y., Mori, G.: Recognizing human actions from still images with latent poses. In: CVPR, pp. 2030–2037 (2010)
  12. Cristinacce, D., Cootes, T.: Feature detection and tracking with constrained local models. In: BMVC (2006)
  13. Naderi Parizi, S., Oberlin, J., Felzenszwalb, P.: Reconfigurable models for scene recognition. In: CVPR (2012)
  14. Ott, P., Everingham, M.: Shared parts for deformable part-based models. In: CVPR (2011)
  15. Wang, Y., Tran, D., Liao, Z.: Learning hierarchical poselets for human parsing. In: CVPR, pp. 1705–1712 (2011)
  16. Branson, S., Belongie, S., Perona, P.: Strong supervision from weak annotation: Interactive training of deformable part models. In: ICCV (2011)
  17. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. I:886–I:893 (2005)
  18. Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: ICCV, pp. 1470–1477 (2003)
  19. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: a comprehensive study. IJCV 73, 213–238 (2007)
  20. Chen, Y., Zhu, L., Yuille, A.: Active Mask Hierarchies for Object Detection. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 43–56. Springer, Heidelberg (2010)
  21. Parkhi, O., Vedaldi, A., Jawahar, C.V., Zisserman, A.: The truth about cats and dogs. In: ICCV (2011)
  22. Bourdev, L., Maji, S., Brox, T., Malik, J.: Detecting People Using Mutually Consistent Poselet Activations. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part VI. LNCS, vol. 6316, pp. 168–181. Springer, Heidelberg (2010)
  23. Johnson, S., Everingham, M.: Learning effective human pose estimation from inaccurate annotation. In: CVPR (2011)
  24. Sun, M., Savarese, S.: Articulated part-based model for joint object detection and pose estimation. In: ICCV (2011)
  25. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: CVPR (2012)
  26. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
  27. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference (1988)
  28. Girshick, A., Felzenszwalb, P., McAllester, D.: LSVM Release 4 Notes,
  29. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human pose estimation. In: CVPR (2008)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Hossein Azizpour (1)
  • Ivan Laptev (2)
  1. Computer Vision and Active Perception Laboratory (CVAP), KTH, Sweden
  2. WILLOW, Laboratoire d’Informatique de l’Ecole Normale Superieure, INRIA, France
