Joint Dictionary and Classifier Learning for Categorization of Images Using a Max-margin Framework

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8333)


The Bag-of-Visual-Words (BoVW) model is a popular approach for visual recognition. It has been used successfully in many different tasks, and its simplicity and good performance are the main reasons for its popularity. The central aspect of this model, the visual dictionary, is used to build mid-level representations from low-level image descriptors. Classifiers are then trained on these mid-level representations to perform categorization. While most works based on BoVW models have focused on learning a suitable dictionary or on proposing a suitable pooling strategy, little effort has been devoted to exploring and improving the coupling between the dictionary and the top-level classifiers in order to generate more discriminative models. This problem can be highly complex due to the large dictionary size usually needed by these methods. Also, most BoVW-based systems perform multiclass categorization using a one-vs-all strategy, ignoring relevant correlations among classes. To tackle these issues, we propose a novel approach that jointly learns dictionary words and a proper top-level multiclass classifier. We use a max-margin learning framework to minimize a regularized energy formulation, allowing us to propagate label information to guide the commonly unsupervised dictionary learning process. As a result, we produce a dictionary that is more compact and discriminative. We test our method on several popular datasets, where we demonstrate that our joint optimization strategy induces word sharing among the target classes and achieves state-of-the-art performance using far fewer visual words than previous approaches.
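The two ingredients named in the abstract, a visual dictionary that turns low-level descriptors into a mid-level representation and a multiclass max-margin classifier on top of it, can be illustrated with a minimal sketch. This is not the paper's formulation. The hard nearest-word assignment, the Crammer-Singer-style hinge loss, the function names `encode_bovw` and `multiclass_hinge`, and the toy data are all assumptions chosen for illustration. The paper learns the dictionary and the classifier jointly, whereas here both are fixed by hand.

```python
import numpy as np

def encode_bovw(descriptors, dictionary):
    """Hard-assign each descriptor to its nearest visual word and
    return the L1-normalized histogram of word counts."""
    # Squared Euclidean distances, shape (n_descriptors, n_words).
    diffs = descriptors[:, None, :] - dictionary[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=dictionary.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)

def multiclass_hinge(weights, x, label):
    """Generic multiclass hinge loss (Crammer-Singer style): the largest
    margin violation of any wrong class against the true class."""
    scores = weights @ x
    margins = scores - scores[label] + 1.0
    margins[label] = 0.0  # the true class never violates its own margin
    return max(0.0, margins.max())

# Toy data (hypothetical): three 2-D visual words, four local descriptors.
dictionary = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[0.1, -0.2], [9.5, 10.2], [0.3, 0.1], [10.1, 9.9]])

histogram = encode_bovw(descriptors, dictionary)

# One weight row per class. An identity matrix is only a placeholder.
weights = np.eye(3)
loss = multiclass_hinge(weights, histogram, label=0)
```

In a joint scheme such as the one the abstract describes, a loss of this kind would depend on both the classifier weights and the dictionary words, so that label information can influence the dictionary. The sketch above only evaluates the loss for fixed values of both.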


Keywords: Visual Word; Sparse Code; Dictionary Learning; Dictionary Word; Spatial Pyramid Match
These keywords were added by machine and not by the authors.


Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Department of Computer Science, Pontificia Universidad Católica de Chile, Chile
  2. Center for Imaging Science, Johns Hopkins University, USA
