# Image Classification with the Fisher Vector: Theory and Practice

## Abstract

A standard approach to describing an image for classification and retrieval purposes is to extract a set of local patch descriptors, encode them into a high-dimensional vector and pool them into an image-level signature. The most common patch encoding strategy consists of quantizing the local descriptors into a finite set of prototypical elements. This leads to the popular Bag-of-Visual words representation. In this work, we propose to use the Fisher Kernel framework as an alternative patch encoding strategy: we describe patches by their deviation from a “universal” generative Gaussian mixture model. This representation, which we call the Fisher vector (FV), has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization. We report experimental results on five standard datasets (PASCAL VOC 2007, Caltech 256, SUN 397, ILSVRC 2010 and ImageNet10K), with up to 9M images and 10K classes, showing that the FV framework is a state-of-the-art patch encoding technique.
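To make the idea of "deviation from a generative Gaussian mixture model" concrete, the following is a minimal sketch of an FV encoder, not the authors' implementation. It assumes a diagonal-covariance GMM whose parameters are already trained and passed in, and it uses the gradient statistics with respect to the means and variances that are common in the Fisher vector literature. The function name `fisher_vector`, its argument names, and the final power and L2 normalisation steps are assumptions for illustration; the abstract itself does not specify them.

```python
import numpy as np


def fisher_vector(X, weights, means, variances):
    """Encode local descriptors X (N x D) against a diagonal-covariance GMM.

    weights: (K,), means: (K, D), variances: (K, D) are assumed to be
    pre-trained GMM parameters. Returns a vector of length 2 * K * D.
    """
    X = np.atleast_2d(X)
    N, _ = X.shape
    sigma = np.sqrt(variances)

    # Standardised difference of every descriptor to every Gaussian: (N, K, D).
    diff = (X[:, None, :] - means[None, :, :]) / sigma[None, :, :]

    # Log-density of each descriptor under each Gaussian: (N, K).
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_prob = -0.5 * (np.sum(diff ** 2, axis=2) + log_norm[None, :])

    # Soft assignments (posteriors), computed stably in the log domain.
    log_post = np.log(weights)[None, :] + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Pooled deviations from the model w.r.t. means and variances: (K, D) each.
    g_mu = np.einsum("nk,nkd->kd", gamma, diff)
    g_mu /= N * np.sqrt(weights)[:, None]
    g_sigma = np.einsum("nk,nkd->kd", gamma, diff ** 2 - 1.0)
    g_sigma /= N * np.sqrt(2.0 * weights)[:, None]

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])

    # Assumed post-processing: signed square root, then L2 normalisation.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv


# Toy usage with random descriptors and a hand-set 4-component GMM.
rng = np.random.default_rng(0)
K, D = 4, 8
toy_weights = np.full(K, 1.0 / K)
toy_means = rng.normal(size=(K, D))
toy_variances = np.ones((K, D))
toy_descriptors = rng.normal(size=(100, D))

fv = fisher_vector(toy_descriptors, toy_weights, toy_means, toy_variances)
print(fv.shape)  # (64,)
```

The resulting signature has a fixed length of 2KD regardless of how many patches the image contains, which is what allows it to be fed directly to a linear classifier.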

## Keywords

- Image classification
- Large-scale classification
- Bag-of-Visual words
- Fisher vector
- Fisher kernel
- Product quantization