Abstract
This paper presents a novel method for detecting and localizing objects of a visual category in cluttered real-world scenes. Our approach treats object categorization and figure-ground segmentation as two interleaved processes that collaborate closely towards a common goal. As we show, the tight coupling between these two processes allows each to benefit from the other and improves their combined performance.
The core of our approach is a highly flexible learned representation for object shape that combines the information observed on different training examples in a probabilistic extension of the Generalized Hough Transform. The resulting approach can detect objects of a category in novel images and automatically infer a probabilistic segmentation from the recognition result. This segmentation is in turn used to improve recognition by allowing the system to focus on object pixels and to discard misleading influences from the background. Moreover, knowledge of where in the image a hypothesis draws its support is used in an MDL-based hypothesis verification stage to resolve ambiguities between overlapping hypotheses and to factor out the effects of partial occlusion.
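To make the voting idea concrete, the following is a minimal sketch of probabilistic Hough voting for an object center, not the authors' implementation. The function names, the data layout, the uniform vote weights, and the use of a coarse grid of bins to find the maximum are all assumptions made for illustration. Each local feature matches one or more codebook entries, each entry stores object-center offsets seen in training, and every (match, offset) pair casts a fractional vote. The sketch also records which features support each cell, which is the kind of bookkeeping a later segmentation or verification step would need.

```python
from collections import defaultdict


def hough_vote(features, codebook, bin_size=10):
    """Accumulate weighted votes for object-center locations.

    features: list of (x, y, matches), where matches is a list of codebook
              entry ids matched by the local feature at (x, y).
    codebook: dict mapping entry id -> list of (dx, dy) offsets from a
              feature to the object center (hypothetical training output).
    Returns (votes, support): vote mass per grid cell, and for each cell the
    list of (feature_index, weight) pairs that contributed to it.
    """
    votes = defaultdict(float)
    support = defaultdict(list)
    for idx, (x, y, matches) in enumerate(features):
        if not matches:
            continue
        # Assumption: a feature spreads its vote uniformly over its matches.
        p_match = 1.0 / len(matches)
        for entry in matches:
            offsets = codebook.get(entry, [])
            if not offsets:
                continue
            # Assumption: each stored offset of an entry is equally likely.
            p_offset = 1.0 / len(offsets)
            for dx, dy in offsets:
                cell = ((x + dx) // bin_size, (y + dy) // bin_size)
                weight = p_match * p_offset
                votes[cell] += weight
                support[cell].append((idx, weight))
    return votes, support


def best_hypothesis(votes):
    """Return the (cell, score) pair with the largest accumulated vote."""
    return max(votes.items(), key=lambda item: item[1])


# Toy data: two "wheel" features and one "roof" feature of the same object.
toy_codebook = {"wheel": [(20, -10), (-20, -10)], "roof": [(0, 15)]}
toy_features = [
    (30, 60, ["wheel"]),
    (70, 60, ["wheel"]),
    (50, 35, ["roof"]),
]
toy_votes, toy_support = hough_vote(toy_features, toy_codebook)
toy_cell, toy_score = best_hypothesis(toy_votes)
```

In the toy data, each wheel feature is ambiguous about which side of the object it lies on, so it splits its vote between two candidate centers. Only the true center collects votes from all three features, which is why it wins. A fixed grid is the simplest way to locate the maximum; a continuous mode search over the votes would avoid the quantization that binning introduces.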
An extensive evaluation on several large data sets shows that the proposed system is applicable to a range of object categories, including both rigid and articulated objects. In addition, its flexible representation allows it to achieve competitive object detection performance with training sets that are one to two orders of magnitude smaller than those used in comparable systems.
Cite this article
Leibe, B., Leonardis, A. & Schiele, B. Robust Object Detection with Interleaved Categorization and Segmentation. Int J Comput Vis 77, 259–289 (2008). https://doi.org/10.1007/s11263-007-0095-3
DOI: https://doi.org/10.1007/s11263-007-0095-3