Unsupervised Discovery of Mid-Level Discriminative Patches

  • Saurabh Singh
  • Abhinav Gupta
  • Alexei A. Efros
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7573)


The goal of this paper is to discover a set of discriminative patches that can serve as a fully unsupervised mid-level visual representation. The desired patches need to satisfy two requirements: 1) to be representative, they need to occur frequently enough in the visual world; 2) to be discriminative, they need to be different enough from the rest of the visual world. The patches could correspond to parts, objects, “visual phrases”, etc., but are not restricted to any one of these. We pose this as an unsupervised discriminative clustering problem on a huge dataset of image patches. We use an iterative procedure that alternates between clustering and training discriminative classifiers, while applying careful cross-validation at each step to prevent overfitting. The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that they could be used in place of visual words for many tasks. Furthermore, discriminative patches can be used in a supervised regime, such as scene classification, where they achieve state-of-the-art performance on the MIT Indoor-67 dataset.
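The iterative loop described above (alternate between clustering and training a discriminative classifier per cluster, with cross-validation to keep clusters from overfitting) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: plain 2-D vectors stand in for patch descriptors, a logistic-regression classifier is swapped in for whatever classifier the paper trains, and the cross-validation is assumed to mean splitting both the discovery set and the "natural world" set into two halves and swapping them every round. The names `discover_patches`, `train_linear`, `score`, and the parameters `k`, `m`, `min_size` are hypothetical.

```python
import math
import random


def score(w, x):
    """Linear detector response: w[:-1] . x + w[-1] (last weight is the bias)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kmeans(points, k, iters=10, seed=0):
    """Plain k-means, used only to seed the initial clusters."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return [g for g in groups if g]


def train_linear(pos, neg, epochs=200, lr=0.1, l2=1e-3):
    """Logistic regression by batch gradient descent (assumed stand-in classifier)."""
    dim = len(pos[0])
    w = [0.0] * (dim + 1)
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in data:
            err = sigmoid(score(w, x)) - y
            for i in range(dim):
                grad[i] += err * x[i]
            grad[dim] += err
        for i in range(dim + 1):
            reg = l2 * w[i] if i < dim else 0.0  # do not regularize the bias
            w[i] -= lr * (grad[i] / len(data) + reg)
    return w


def discover_patches(discovery, natural, k=4, m=5, rounds=4, min_size=3, seed=0):
    """Hypothetical sketch: alternate clustering and per-cluster classifier training."""
    rng = random.Random(seed)
    d, n = discovery[:], natural[:]
    rng.shuffle(d)
    rng.shuffle(n)
    # Assumed cross-validation scheme: two disjoint halves of each set.
    d_halves = [d[: len(d) // 2], d[len(d) // 2:]]
    n_halves = [n[: len(n) // 2], n[len(n) // 2:]]
    clusters = kmeans(d_halves[0], k, seed=seed)
    detectors = []
    for r in range(rounds):
        train, held = r % 2, 1 - r % 2
        detectors, new_clusters = [], []
        for members in clusters:
            if len(members) < min_size:
                continue  # prune clusters too small to count as representative
            # Discriminative step: cluster members vs. the "natural world" half.
            w = train_linear(members, n_halves[train])
            detectors.append(w)
            # Clustering step: re-form the cluster from the top-m positive firings
            # on the held-out discovery half, never on the data w was fit to.
            ranked = sorted(d_halves[held], key=lambda x: score(w, x), reverse=True)
            new_clusters.append([x for x in ranked[:m] if score(w, x) > 0])
        clusters = new_clusters
    return detectors
```

Re-forming each cluster only from detections on the held-out half means a classifier is never rewarded for re-finding its own training members, which is one plausible reading of "careful cross-validation at each step". Pruning clusters below `min_size` members is a rough proxy for the representativeness requirement, while training against the natural-world set targets discriminativeness.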





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Saurabh Singh, Carnegie Mellon University, Pittsburgh, USA
  • Abhinav Gupta, Carnegie Mellon University, Pittsburgh, USA
  • Alexei A. Efros, Carnegie Mellon University, Pittsburgh, USA
