International Journal of Computer Vision, Volume 120, Issue 2, pp 111–133

Learning Dictionary of Discriminative Part Detectors for Image Categorization and Cosegmentation

  • Jian Sun
  • Jean Ponce


This paper proposes a novel approach to learning mid-level image models for image categorization and cosegmentation. We represent each image class by a dictionary of part detectors that best discriminate that class from the background. We learn category-specific part detectors in a weakly supervised setting in which the training images are only annotated with category labels without part/object location information. We use a latent SVM model regularized using the \(\ell _{2,1}\) group sparsity norm to learn the part detectors. Starting from a large set of initial parts, the group sparsity regularizer forces the model to jointly select and optimize a set of discriminative part detectors in a max-margin framework. We propose a stochastic version of a proximal algorithm to solve the corresponding optimization problem. We apply the learned part detectors to image classification and cosegmentation, and present extensive comparative experiments with standard benchmarks.
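The selection effect of the \(\ell _{2,1}\) regularizer can be illustrated with its proximal operator, often called group soft-thresholding: each group of weights (here, one group per candidate part detector) is shrunk toward zero, and groups whose norm falls below the threshold are removed entirely. The following is a minimal sketch in plain Python, not the authors' implementation. The function names, the list-of-lists weight layout, the step size `step`, and the regularization weight `lam` are all assumptions made for illustration, and the gradient is taken as given rather than derived from the latent SVM loss.

```python
import math


def group_soft_threshold(groups, threshold):
    """Proximal operator of threshold * sum_g ||w_g||_2 (hypothetical helper).

    `groups` is a list of weight vectors, one per candidate part detector.
    Groups with Euclidean norm <= threshold are set to zero (deselected);
    the others are scaled by (1 - threshold / norm).
    """
    result = []
    for w in groups:
        norm = math.sqrt(sum(x * x for x in w))
        if norm <= threshold:
            # The whole group is zeroed: this part detector is dropped.
            result.append([0.0] * len(w))
        else:
            scale = 1.0 - threshold / norm
            result.append([scale * x for x in w])
    return result


def proximal_step(groups, grads, step, lam):
    """One (stochastic) proximal gradient update, under assumed inputs.

    `grads` is a gradient estimate of the data-fitting loss, for example
    computed on a random mini-batch; how it is obtained is not shown here.
    """
    # Gradient step on the smooth (or subdifferentiable) loss term.
    moved = [
        [x - step * g for x, g in zip(w, gw)]
        for w, gw in zip(groups, grads)
    ]
    # Proximal step on the group-sparsity regularizer.
    return group_soft_threshold(moved, step * lam)


if __name__ == "__main__":
    weights = [[3.0, 4.0], [0.3, 0.4]]
    # The second group has norm 0.5, below the threshold, so it is zeroed.
    print(group_soft_threshold(weights, 1.0))
```

In this sketch a detector survives only if its weight vector stays large enough after each gradient step, which is how a large initial pool of parts can be pruned to a smaller dictionary while the remaining detectors continue to be optimized.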


Discriminative parts · Discriminative learning · Image classification · Image cosegmentation



Jian Sun was supported by NSFC (No. 61472313, 11131006), the 973 program (2013CB329404), NCET-12-0442, and NSFC (No. 61303121). Jean Ponce’s work was supported in part by the European Research Council (VideoWorld project) and the Institut Universitaire de France.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Xi’an Jiaotong University, Xi’an, People’s Republic of China
  2. École Normale Supérieure / PSL Research University, Paris, France
