International Journal of Computer Vision, Volume 100, Issue 3, pp 275–293

Weakly Supervised Localization and Learning with Generic Knowledge

  • Thomas Deselaers
  • Bogdan Alexe
  • Vittorio Ferrari


Learning a new object class from cluttered training images is very challenging when the location of object instances is unknown, i.e., in a weakly supervised setting. Many previous approaches require the objects to cover a large portion of each image. We present a novel approach that can cope with extensive clutter as well as large scale and appearance variations between object instances. To make this possible, we exploit generic knowledge learned beforehand from images of other classes for which location annotation is available. Generic knowledge facilitates learning any new class from weakly supervised images, because it reduces the uncertainty in the location of its object instances. We propose a conditional random field that starts from generic knowledge and then progressively adapts to the new class. Our approach simultaneously localizes object instances while learning an appearance model specific to the class. We demonstrate this on several datasets, including the very challenging PASCAL VOC 2007. Furthermore, our method allows training any state-of-the-art object detector in a weakly supervised fashion, although it would normally require object location annotations.


Keywords: Object detection, Weakly supervised learning, Transfer learning, Conditional random fields



The authors gratefully acknowledge support from the Swiss National Science Foundation.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Thomas Deselaers (1, 2)
  • Bogdan Alexe (1)
  • Vittorio Ferrari (1)

  1. Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
  2. Google, Zurich, Switzerland
