Robust Object Detection with Interleaved Categorization and Segmentation

  • Bastian Leibe
  • Aleš Leonardis
  • Bernt Schiele


This paper presents a novel method for detecting and localizing objects of a visual category in cluttered real-world scenes. Our approach treats object categorization and figure-ground segmentation as two interleaved processes that collaborate closely towards a common goal. As our work shows, the tight coupling between these two processes allows each to benefit from the other and improves their combined performance.

The core of our approach is a highly flexible learned representation of object shape that combines the information observed on different training examples in a probabilistic extension of the Generalized Hough Transform. The resulting approach can detect categorical objects in novel images and automatically infer a probabilistic segmentation from the recognition result. This segmentation is in turn used to improve recognition by allowing the system to focus its efforts on object pixels and to discard misleading influences from the background. Moreover, knowledge of where in the image a hypothesis draws its support is used in an MDL-based hypothesis verification stage to resolve ambiguities between overlapping hypotheses and to factor out the effects of partial occlusion.
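The voting idea behind a Generalized Hough Transform can be illustrated with a small sketch. It is not the paper's formulation: the data layout, the coarse grid accumulator, the uniform vote weights, and the names `hough_vote`, `best_hypothesis`, and `bin_size` are all assumptions made for illustration. Each image feature matched to a codebook entry casts weighted votes for possible object centers, and the sketch records which features support each candidate, the kind of support information a later verification stage could use.

```python
from collections import defaultdict


def hough_vote(features, codebook, bin_size=10):
    """Accumulate weighted votes for object centers on a coarse grid.

    features: list of (x, y, entry_id), image patches matched to codebook entries.
    codebook: dict mapping entry_id to a list of (dx, dy) offsets from a patch
              to the object center, as might be observed on training examples.
    Both structures are hypothetical and chosen only for this sketch.
    """
    votes = defaultdict(float)    # grid cell -> accumulated vote weight
    support = defaultdict(list)   # grid cell -> indices of contributing features
    for idx, (x, y, entry_id) in enumerate(features):
        offsets = codebook.get(entry_id, [])
        if not offsets:
            continue
        # Assumption: each feature spreads a unit of weight uniformly
        # over the offsets stored for its codebook entry.
        weight = 1.0 / len(offsets)
        for dx, dy in offsets:
            cell = ((x + dx) // bin_size, (y + dy) // bin_size)
            votes[cell] += weight
            support[cell].append(idx)
    return votes, support


def best_hypothesis(votes, support):
    """Return the strongest cell, its score, and the features that voted for it."""
    cell = max(votes, key=votes.get)
    return cell, votes[cell], sorted(set(support[cell]))


# Toy example: two "wheel" patches and one "roof" patch agree on one center,
# while a stray "roof" match in the background votes elsewhere.
codebook = {"wheel": [(30, -10), (-30, -10)], "roof": [(0, 20)]}
features = [(70, 60, "wheel"), (130, 60, "wheel"), (100, 30, "roof"), (300, 200, "roof")]
votes, support = hough_vote(features, codebook)
print(best_hypothesis(votes, support))
```

In this toy setup the three consistent patches reinforce the same cell, while the background match remains an isolated weaker vote. A fixed grid is the simplest possible accumulator; a continuous vote space with mode seeking would avoid its binning artifacts.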

An extensive evaluation on several large data sets shows that the proposed system is applicable to a range of object categories, including both rigid and articulated objects. In addition, its flexible representation allows it to achieve competitive object detection performance even with training sets that are one to two orders of magnitude smaller than those used in comparable systems.


Object categorization, Object detection, Segmentation, Clustering, Hough transform, Hypothesis selection, MDL





Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
  2. Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
  3. Department of Computer Science, TU Darmstadt, Darmstadt, Germany
