
Learning Dictionary of Discriminative Part Detectors for Image Categorization and Cosegmentation


This paper proposes a novel approach to learning mid-level image models for image categorization and cosegmentation. We represent each image class by a dictionary of part detectors that best discriminate that class from the background. We learn category-specific part detectors in a weakly supervised setting in which the training images are only annotated with category labels without part/object location information. We use a latent SVM model regularized using the \(\ell _{2,1}\) group sparsity norm to learn the part detectors. Starting from a large set of initial parts, the group sparsity regularizer forces the model to jointly select and optimize a set of discriminative part detectors in a max-margin framework. We propose a stochastic version of a proximal algorithm to solve the corresponding optimization problem. We apply the learned part detectors to image classification and cosegmentation, and present extensive comparative experiments with standard benchmarks.
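The selection effect of the \(\ell _{2,1}\) regularizer can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each candidate part detector is one weight group, replaces the latent SVM with a plain linear hinge loss on a single example, and uses hypothetical names (`group_soft_threshold`, `stochastic_proximal_step`, `x_parts`). The proximal operator of the group norm is the standard group soft-thresholding rule, which sets whole groups exactly to zero and so discards the corresponding detectors.

```python
import math


def group_soft_threshold(groups, tau):
    """Proximal operator of tau * sum_g ||w_g||_2 (the l2,1 norm).

    Each group (here: one candidate part detector's weights) is shrunk
    toward zero; a group whose norm is at most tau becomes exactly zero.
    """
    shrunk = []
    for w in groups:
        norm = math.sqrt(sum(v * v for v in w))
        scale = max(0.0, 1.0 - tau / norm) if norm > 0.0 else 0.0
        shrunk.append([scale * v for v in w])
    return shrunk


def stochastic_proximal_step(groups, x_parts, y, step, lam):
    """One stochastic proximal subgradient step on a single example.

    Assumption: x_parts[g] is the feature vector scored by group g and
    y is +1 or -1. The loss is hinge(1 - y * sum_g <w_g, x_g>); there
    are no latent variables in this simplified version.
    """
    score = sum(
        sum(wi * xi for wi, xi in zip(w, x))
        for w, x in zip(groups, x_parts)
    )
    if y * score < 1.0:
        # Margin violated: the hinge subgradient w.r.t. w_g is -y * x_g.
        groups = [
            [wi + step * y * xi for wi, xi in zip(w, x)]
            for w, x in zip(groups, x_parts)
        ]
    # Proximal step on the regularizer, scaled by the step size.
    return group_soft_threshold(groups, step * lam)
```

In this sketch, a group with a large norm is only shrunk, while a group with a small norm is removed entirely, which is how a large initial set of parts can be pruned while the surviving detectors continue to be optimized.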





  3. In our approach, image correspondence cues can be disabled by setting \(\alpha _m = 0.\)





Jian Sun was supported by NSFC (No. 61472313, 11131006), the 973 program (2013CB329404), NCET-12-0442, and NSFC (No. 61303121). Jean Ponce’s work was supported in part by the European Research Council (VideoWorld project) and the Institut Universitaire de France.

Author information


Corresponding author

Correspondence to Jian Sun.

Additional information

Communicated by Derek Hoiem.

About this article

Cite this article

Sun, J., Ponce, J. Learning Dictionary of Discriminative Part Detectors for Image Categorization and Cosegmentation. Int J Comput Vis 120, 111–133 (2016).


Keywords

  • Discriminative parts
  • Discriminative learning
  • Image classification
  • Image cosegmentation