Abstract
The Bag-of-Visual-Words (BoVW) model is a popular approach for visual recognition. It has been used successfully in many different tasks, and its simplicity and good performance are the main reasons for its popularity. The central component of this model, the visual dictionary, is used to build mid-level representations from low-level image descriptors. Classifiers are then trained on these mid-level representations to perform categorization. While most work on BoVW models has focused on learning a suitable dictionary or on proposing a suitable pooling strategy, little effort has been devoted to exploring and improving the coupling between the dictionary and the top-level classifiers in order to generate more discriminative models. This problem can be highly complex because of the large dictionary size these methods usually need. In addition, most BoVW-based systems perform multiclass categorization using a one-vs-all strategy, ignoring relevant correlations among classes. To tackle these issues, we propose a novel approach that jointly learns the dictionary words and a suitable top-level multiclass classifier. We use a max-margin learning framework to minimize a regularized energy formulation, which allows us to propagate label information to guide the commonly unsupervised dictionary learning process. As a result, we produce a dictionary that is more compact and discriminative. We test our method on several popular datasets, where we demonstrate that our joint optimization strategy induces word sharing among the target classes and achieves state-of-the-art performance using far fewer visual words than previous approaches.
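To make the pipeline described in the abstract concrete, the following is a minimal sketch of a conventional BoVW baseline, not the joint method proposed in the paper. It assumes plain k-means for the unsupervised dictionary, hard assignment with normalized histogram pooling for the mid-level representation, and a Crammer-Singer style multiclass hinge loss as one common max-margin alternative to one-vs-all. All function names (`learn_dictionary`, `assign_words`, `encode_image`, `multiclass_hinge_loss`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def assign_words(descriptors, words):
    """Index of the nearest visual word for each low-level descriptor."""
    # Squared Euclidean distances, shape (num_descriptors, num_words).
    d2 = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def learn_dictionary(descriptors, num_words, num_iters=20, seed=0):
    """Unsupervised dictionary via Lloyd's k-means (assumed baseline)."""
    rng = np.random.default_rng(seed)
    # Initialize the words with randomly chosen descriptors.
    idx = rng.choice(len(descriptors), size=num_words, replace=False)
    words = descriptors[idx].astype(float)
    for _ in range(num_iters):
        assign = assign_words(descriptors, words)
        for k in range(num_words):
            members = descriptors[assign == k]
            if len(members) > 0:  # leave empty clusters unchanged
                words[k] = members.mean(axis=0)
    return words


def encode_image(descriptors, words):
    """Mid-level representation: normalized histogram of word counts."""
    counts = np.bincount(assign_words(descriptors, words),
                         minlength=len(words))
    return counts / counts.sum()


def multiclass_hinge_loss(W, x, y):
    """Crammer-Singer multiclass hinge loss for one example.

    W holds one weight row per class, x is the image histogram,
    and y is the index of the true class.
    """
    scores = W @ x
    margins = scores - scores[y] + 1.0
    margins[y] = 0.0  # the true class incurs no margin violation
    return max(0.0, float(margins.max()))
```

In this baseline the dictionary is fixed before the classifier sees any labels. The abstract's proposal differs precisely at that point: the words and the multiclass classifier are optimized jointly, so label information can influence the dictionary.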
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lobel, H., Vidal, R., Mery, D., Soto, A. (2014). Joint Dictionary and Classifier Learning for Categorization of Images Using a Max-margin Framework. In: Klette, R., Rivera, M., Satoh, S. (eds) Image and Video Technology. PSIVT 2013. Lecture Notes in Computer Science, vol 8333. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53842-1_8
DOI: https://doi.org/10.1007/978-3-642-53842-1_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53841-4
Online ISBN: 978-3-642-53842-1
eBook Packages: Computer Science (R0)