International Journal of Computer Vision, Volume 105, Issue 3, pp 222–245

Image Classification with the Fisher Vector: Theory and Practice

  • Jorge Sánchez
  • Florent Perronnin
  • Thomas Mensink
  • Jakob Verbeek


A standard approach to describing an image for classification and retrieval purposes is to extract a set of local patch descriptors, encode them into a high-dimensional vector and pool them into an image-level signature. The most common patch encoding strategy consists of quantizing the local descriptors into a finite set of prototypical elements. This leads to the popular Bag-of-Visual words representation. In this work, we propose to use the Fisher Kernel framework as an alternative patch encoding strategy: we describe patches by their deviation from a “universal” generative Gaussian mixture model. This representation, which we call the Fisher vector (FV), has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization. We report experimental results on five standard datasets (PASCAL VOC 2007, Caltech 256, SUN 397, ILSVRC 2010 and ImageNet10K), with up to 9M images and 10K classes, showing that the FV framework is a state-of-the-art patch encoding technique.
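To make the idea of describing patches by their deviation from a Gaussian mixture model concrete, the following is a minimal NumPy sketch. It assumes a diagonal-covariance GMM and the commonly used gradient statistics with respect to the means and standard deviations, averaged over the descriptors; the abstract does not spell out these formulas, so the function name `fisher_vector`, its parameters, and the averaging are illustrative assumptions rather than the authors' implementation. Any further normalization of the signature and the product-quantization step are not shown.

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """Encode local descriptors against a diagonal-covariance GMM (sketch).

    X: (T, D) local descriptors; weights: (K,) mixture weights;
    means: (K, D); sigmas: (K, D) per-dimension standard deviations.
    Returns a vector of length 2*K*D.
    """
    T, D = X.shape
    # Deviation of each descriptor from each Gaussian, in units of sigma: (T, K, D)
    diff = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    # Log of weight * Gaussian density for each descriptor/component pair: (T, K)
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(diff ** 2, axis=2)
             - np.sum(np.log(sigmas), axis=1)[None, :]
             - 0.5 * D * np.log(2.0 * np.pi))
    # Soft assignments (posteriors), computed in a numerically stable way
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradient statistics w.r.t. means and standard deviations, averaged over T
    g_mu = np.einsum('tk,tkd->kd', gamma, diff)
    g_mu /= T * np.sqrt(weights)[:, None]
    g_sigma = np.einsum('tk,tkd->kd', gamma, diff ** 2 - 1.0)
    g_sigma /= T * np.sqrt(2.0 * weights)[:, None]
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])

# Hypothetical toy setup: 100 random 8-dimensional descriptors, 4 Gaussians
rng = np.random.default_rng(0)
K, D = 4, 8
X = rng.normal(size=(100, D))
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
sigmas = np.ones((K, D))
fv = fisher_vector(X, weights, means, sigmas)
print(fv.shape)
```

In this sketch the signature has 2KD dimensions regardless of how many descriptors the image contains, which is what allows it to be fed directly to a linear classifier.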


Image classification · Large-scale classification · Bag-of-Visual words · Fisher vector · Fisher kernel · Product quantization



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Jorge Sánchez (1)
  • Florent Perronnin (2)
  • Thomas Mensink (3)
  • Jakob Verbeek (4)
  1. CIEM-CONICET, FaMAF, Universidad Nacional de Córdoba, Córdoba, Argentina
  2. Xerox Research Centre Europe, Meylan, France
  3. Intelligent Systems Lab Amsterdam, University of Amsterdam, Amsterdam, The Netherlands
  4. LEAR Team, INRIA Grenoble, Montbonnot, France
