## Abstract

A standard approach to describing an image for classification and retrieval purposes is to extract a set of local patch descriptors, encode them into a high-dimensional vector and pool the result into an image-level signature. The most common patch encoding strategy consists in quantizing the local descriptors into a finite set of prototypical elements, which leads to the popular Bag-of-Visual-Words representation. In this work, we propose to use the Fisher Kernel framework as an alternative patch encoding strategy: we describe patches by their deviation from a “universal” generative Gaussian mixture model. This representation, which we call the Fisher vector (FV), has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization. We report experimental results on five standard datasets—PASCAL VOC 2007, Caltech 256, SUN 397, ILSVRC 2010 and ImageNet10K—with up to 9M images and 10K classes, showing that the FV framework is a state-of-the-art patch encoding technique.

## Notes

See Appendix A.2 in the extended version of Jaakkola and Haussler (1998) which is available at: http://people.csail.mit.edu/tommi/papers/gendisc.ps

Xiao et al. (2010) also report results with one training sample per class. However, a single sample does not provide any way to perform cross-validation, which is why we do not report results in this setting.

Actually, any continuous distribution can be approximated with arbitrary precision by a GMM with isotropic covariance matrices.

Note that since \(q\) takes values in a finite set, we could replace the \(\int \nolimits _q\) by \(\sum \nolimits _q\) in the following equations, but we keep the integral notation for simplicity.

While it is standard practice to report per-class accuracy on this dataset (see Deng et al. 2010; Sánchez and Perronnin 2011), Krizhevsky et al. (2012) and Le et al. (2012) report a per-image accuracy. This results in a more optimistic number since classes which are over-represented in the test data also have more training samples and therefore have (on average) a higher accuracy than classes which are under-represented. This was clarified through personal correspondence with the first authors of Krizhevsky et al. (2012) and Le et al. (2012).

Available at: http://htk.eng.cam.ac.uk/.

## References

Amari, S., & Nagaoka, H. (2000). *Methods of information geometry, translations of mathematical monographs* (Vol. 191). Oxford: Oxford University Press.

Berg, A., Deng, J., & Fei-Fei, L. (2010). *ILSVRC 2010*. Retrieved from http://www.image-net.org/challenges/LSVRC/2010/index.

Bergamo, A., & Torresani, L. (2012). Meta-class features for large-scale object categorization on a budget. In *CVPR*.

Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. *Neural Computation* (Vol. 7).

Bo, L., & Sminchisescu, C. (2009). Efficient match kernels between sets of features for visual recognition. In *NIPS*.

Bo, L., Ren, X., & Fox, D. (2012). Multipath sparse coding using hierarchical matching pursuit. In *NIPS workshop on deep learning*.

Boiman, O., Shechtman, E., & Irani, M. (2008). In defense of nearest-neighbor based image classification. In *CVPR*.

Bottou, L. (2011). *Stochastic gradient descent*. Retrieved from http://leon.bottou.org/projects/sgd.

Bottou, L., & Bousquet, O. (2007). The tradeoffs of large scale learning. In *NIPS*.

Boureau, Y. L., Bach, F., LeCun, Y., & Ponce, J. (2010). Learning mid-level features for recognition. In *CVPR*.

Boureau, Y. L., LeRoux, N., Bach, F., Ponce, J., & LeCun, Y. (2011). Ask the locals: Multi-way local pooling for image recognition. In *ICCV*.

Burrascano, P. (1991). A norm selection criterion for the generalized delta rule. *IEEE Transactions on Neural Networks*, 2(1), 125–130.

Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: An evaluation of recent feature encoding methods. In *BMVC*.

Cinbis, G., Verbeek, J., & Schmid, C. (2012). Image categorization using Fisher kernels of non-iid image models. In *CVPR*.

Clinchant, S., Csurka, G., Perronnin, F., & Renders, J. M. (2007). XRCE's participation to ImagEval. In *ImageEval workshop at CVIR*.

Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In *ECCV SLCV workshop*.

Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In *CVPR*.

Deng, J., Berg, A., Li, K., & Fei-Fei, L. (2010). What does classifying more than 10,000 image categories tell us? In *ECCV*.

Everingham, M., Van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2007). *The PASCAL visual object classes challenge 2007 (VOC2007) results*.

Everingham, M., Van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2008). *The PASCAL visual object classes challenge 2008 (VOC2008) results*.

Everingham, M., Van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. *International Journal of Computer Vision*, 88(2), 303–338.

Farquhar, J., Szedmak, S., Meng, H., & Shawe-Taylor, J. (2005). Improving "bag-of-keypoints" image categorisation. *Technical report*. Southampton: University of Southampton.

Feng, J., Ni, B., Tian, Q., & Yan, S. (2011). Geometric \(\ell _p\)-norm feature pooling for image classification. In *CVPR*.

Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In *ICCV*.

Gray, R., & Neuhoff, D. (1998). Quantization. *IEEE Transactions on Information Theory*, 44(6), 2724–2742.

Griffin, G., Holub, A., & Perona, P. (2007). *Caltech-256 object category dataset*. California Institute of Technology. Retrieved from http://authors.library.caltech.edu/7694.

Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In *CVPR*.

Harzallah, H., Jurie, F., & Schmid, C. (2009). Combining efficient object localization and image classification. In *ICCV*.

Haussler, D. (1999). Convolution kernels on discrete structures. *Technical report*. Santa Cruz: UCSC.

Jaakkola, T., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. In *NIPS*.

Jégou, H., Douze, M., & Schmid, C. (2009). On the burstiness of visual elements. In *CVPR*.

Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In *CVPR*.

Jégou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search. *IEEE Transactions on Pattern Analysis and Machine Intelligence*.

Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 34(9), 1704–1716.

Krapac, J., Verbeek, J., & Jurie, F. (2011). Modeling spatial layout with Fisher vectors for image categorization. In *ICCV*.

Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In *NIPS*.

Kulkarni, N., & Li, B. (2011). Discriminative affine sparse codes for image classification. In *CVPR*.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In *CVPR*.

Le, Q., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G., et al. (2012). Building high-level features using large scale unsupervised learning. In *ICML*.

Lin, Y., Lv, F., Zhu, S., Yu, K., Yang, M., & Cour, T. (2011). Large-scale image classification: Fast feature extraction and SVM training. In *CVPR*.

Liu, Y., & Perronnin, F. (2008). A similarity measure between unordered vector sets with application to image categorization. In *CVPR*.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. *International Journal of Computer Vision*, 60(2), 91–110.

Lyu, S. (2005). Mercer kernels for object recognition with local features. In *CVPR*.

Maji, S., & Berg, A. (2009). Max-margin additive classifiers for detection. In *ICCV*.

Maji, S., Berg, A., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In *CVPR*.

Mensink, T., Verbeek, J., Csurka, G., & Perronnin, F. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In *ECCV*.

Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In *CVPR*.

Perronnin, F., Dance, C., Csurka, G., & Bressan, M. (2006). Adapted vocabularies for generic visual categorization. In *ECCV*.

Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010a). Large-scale image retrieval with compressed Fisher vectors. In *CVPR*.

Perronnin, F., Sánchez, J., & Liu, Y. (2010b). Large-scale image categorization with explicit data embedding. In *CVPR*.

Perronnin, F., Sánchez, J., & Mensink, T. (2010c). Improving the Fisher kernel for large-scale image classification. In *ECCV*.

Perronnin, F., Akata, Z., Harchaoui, Z., & Schmid, C. (2012). Towards good practice in large-scale learning for image classification. In *CVPR*.

Sabin, M., & Gray, R. (1984). Product code vector quantizers for waveform and voice coding. *IEEE Transactions on Acoustics, Speech and Signal Processing*, 32(3), 474–488.

Sánchez, J., & Perronnin, F. (2011). High-dimensional signature compression for large-scale image classification. In *CVPR*.

Sánchez, J., Perronnin, F., & de Campos, T. (2012). Modeling the spatial layout of images beyond spatial pyramids. *Pattern Recognition Letters*, 33(16), 2216–2223.

Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In *ICML*.

Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In *ICCV*.

Smith, N., & Gales, M. (2001). Speech recognition using SVMs. In *NIPS*.

Song, D., & Gupta, A. K. (1997). \(L_p\)-norm uniform distribution. *Proceedings of the American Mathematical Society*, 125, 595–601.

Spruill, M. (2007). Asymptotic distribution of coordinates on high dimensional spheres. *Electronic Communications in Probability* (Vol. 12).

Sreekanth, V., Vedaldi, A., Jawahar, C., & Zisserman, A. (2010). Generalized RBF feature maps for efficient detection. In *BMVC*.

Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). *Statistical analysis of finite mixture distributions*. New York: John Wiley.

Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In *CVPR*.

Uijlings, J., Smeulders, A., & Scha, R. (2009). What is the spatial extent of an object? In *CVPR*.

van de Sande, K., Gevers, T., & Snoek, C. (2010). Evaluating color descriptors for object and scene recognition. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 32(9), 1582–1596.

van Gemert, J., Veenman, C., Smeulders, A., & Geusebroek, J. (2010). Visual word ambiguity. *IEEE Transactions on Pattern Analysis and Machine Intelligence*.

Vedaldi, A., & Zisserman, A. (2010). Efficient additive kernels via explicit feature maps. In *CVPR*.

Vedaldi, A., & Zisserman, A. (2012). Sparse kernel approximations for efficient classification and detection. In *CVPR*.

Wallraven, C., Caputo, B., & Graf, A. (2003). Recognition with local features: The kernel recipe. In *ICCV*.

Wang, G., Hoiem, D., & Forsyth, D. (2009). Learning image similarity from Flickr groups using stochastic intersection kernel machines. In *ICCV*.

Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In *CVPR*.

Winn, J., Criminisi, A., & Minka, T. (2005). Object categorization by learned visual dictionary. In *ICCV*.

Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In *CVPR*.

Yan, S., Zhou, X., Liu, M., Hasegawa-Johnson, M., & Huang, T. (2008). Regression from patch-kernel. In *CVPR*.

Yang, J., Li, Y., Tian, Y., Duan, L., & Gao, W. (2009a). Group sensitive multiple kernel learning for object categorization. In *ICCV*.

Yang, J., Yu, K., Gong, Y., & Huang, T. (2009b). Linear spatial pyramid matching using sparse coding for image classification. In *CVPR*.

Young, S., Evermann, G., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., & Woodland, P. (2002). *The HTK book (version 3.2.1)*. Cambridge: Cambridge University Engineering Department.

Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. *International Journal of Computer Vision*, 73(2), 123–138.

Zhou, Z., Yu, K., Zhang, T., & Huang, T. (2010). Image classification using super-vector coding of local image descriptors. In *ECCV*.


## Appendices

### Appendix 1: An Approximation of the Fisher Information Matrix

In this appendix we show that, under the assumption that the posterior distribution \(\gamma _x(k) = w_k u_k(x) / u_\lambda (x)\) is sharply peaked, the normalization with the FIM takes a diagonal form. Throughout this appendix we assume the data \(x\) to be one-dimensional. The extension to multidimensional data is immediate for the mixtures of Gaussians with diagonal covariance matrices that we are interested in.

Under some mild regularity conditions on \(u_\lambda (x)\), the entries of the FIM can be expressed as:

$$ F_{ij} = - \int _x u_\lambda (x)\, \frac{\partial ^2 \log u_\lambda (x)}{\partial \lambda _i\, \partial \lambda _j} \, dx . $$

First, let us consider the partial derivatives of the posteriors w.r.t. the mean and variance parameters. If we use \(\theta _s\) to denote one such parameter associated with \(u_s(x)\), i.e. mixture component number \(s\), then:

$$ \frac{\partial \gamma _x(k)}{\partial \theta _s} = \Big ( [\![ k=s ]\!] - \gamma _x(s) \Big )\, \gamma _x(k)\, \frac{\partial \log u_s(x)}{\partial \theta _s} , $$

where \([\![\cdot ]\!]\) is the Iverson bracket notation which equals one if the argument is true, and zero otherwise. It is easy to verify that the assumption that the posterior is sharply peaked implies that the partial derivative is approximately zero, since the assumption implies that (i) \(\gamma _x(k)\gamma _x(s)\approx 0\) if \(k\ne s\) and (ii) \(\gamma _x(k) \approx \gamma _x(k)\gamma _x(s)\) if \(k=s\).

From this result and Eqs. (12), (13), and (14), it is then easy to see that second order derivatives are zero if (i) they involve mean or variance parameters corresponding to different mixture components (\(k\ne s\)), or if (ii) they involve a mixing weight parameter and a mean or variance parameter (possibly from the same component).

To see that the cross terms for mean and variance of the same mixture component are zero, we again rely on the observation that \(\partial \gamma _x(k) / \partial \theta _s\approx 0\) to obtain:

$$ \frac{\partial ^2 \log u_\lambda (x)}{\partial \mu _k \, \partial \sigma _k} \approx - \frac{2\, \gamma _x(k)\, (x - \mu _k)}{\sigma _k^3} . $$

Then by integration we obtain:

$$ \frac{2}{\sigma _k^3} \int _x u_\lambda (x)\, \gamma _x(k)\, (x - \mu _k)\, dx = \frac{2 w_k}{\sigma _k^3} \int _x u_k(x)\, (x - \mu _k)\, dx = 0 . $$

We now compute the second order derivatives w.r.t. the means:

$$ \frac{\partial ^2 \log u_\lambda (x)}{\partial \mu _k^2} \approx - \frac{\gamma _x(k)}{\sigma _k^2} . $$

Integration then gives:

$$ F_{\mu _k} \approx \int _x u_\lambda (x)\, \frac{\gamma _x(k)}{\sigma _k^2}\, dx = \frac{w_k}{\sigma _k^2} , $$

and the corresponding entry in \(L_\lambda \) equals \(\sigma _k / \sqrt{w_k}\). This leads to the normalized gradients as presented in (17).

Similarly, for the variance parameters we obtain:

$$ \frac{\partial ^2 \log u_\lambda (x)}{\partial \sigma _k^2} \approx \gamma _x(k) \left( \frac{1}{\sigma _k^2} - \frac{3 (x - \mu _k)^2}{\sigma _k^4} \right) . $$

Integration then gives:

$$ F_{\sigma _k} \approx \int _x u_\lambda (x)\, \gamma _x(k) \left( \frac{3 (x - \mu _k)^2}{\sigma _k^4} - \frac{1}{\sigma _k^2} \right) dx = \frac{2 w_k}{\sigma _k^2} , $$

which leads to a corresponding entry in \(L_\lambda \) of \(\sigma _k / \sqrt{2w_k}\). This leads to the normalized gradients as presented in (18).

Finally, the computation of the normalization coefficients for the mixing weights is somewhat more involved. To compute the second order derivatives involving mixing weight parameters only, we will make use of the partial derivative of the posterior probabilities \(\gamma _x(k)\):

$$ \frac{\partial \gamma _x(k)}{\partial \alpha _s} = \gamma _x(k) \Big ( [\![ k=s ]\!] - \gamma _x(s) \Big ) \approx 0 , $$

where the approximation follows from the same observations as used in (62). Using this approximation, the second order derivatives w.r.t. mixing weights are:

$$ \frac{\partial ^2 \log u_\lambda (x)}{\partial \alpha _k \, \partial \alpha _s} \approx - \frac{\partial w_k}{\partial \alpha _s} = w_k w_s - [\![ k=s ]\!] \, w_k . $$

Since this result is independent of \(x\), the corresponding block of the FIM is simply obtained by collecting the negative second order gradients in matrix form:

$$ F_\alpha = \mathrm{diag }\, (w) - w w^\top , $$

where we used \(w\) and \(\alpha \) to denote the vector of all mixing weights, and mixing weight parameters respectively.

Since the mixing weights sum to one, it is easy to show that this matrix is non-invertible: the constant vector is an eigenvector of this matrix with associated eigenvalue zero. In fact, since there are only \(K-1\) degrees of freedom in the mixing weights, we can fix \(\alpha _K=0\) without loss of generality, and work with a reduced set of \(K-1\) mixing weight parameters. Now, let us make the following definitions: let \(\tilde{\alpha } = (\alpha _1, \ldots , \alpha _{K-1})^T\) denote the vector of the first \(K-1\) mixing weight parameters, let \(G_{\tilde{\alpha }}^x\) denote the gradient vector with respect to these, and \(F_{\tilde{\alpha }}\) the corresponding matrix of second order derivatives. With this definition, \(F_{\tilde{\alpha }}\) is invertible, and using Woodbury's matrix inversion lemma, we can show that

The last form shows that the inner product, normalized by the inverse of the non-diagonal \(K-1\) dimensional square matrix \(F_{\tilde{\alpha }}\), can in fact be obtained as a simple inner product between the normalized version of the \(K\) dimensional gradient vectors as defined in (12), i.e. with entries \(\big (\gamma _x(k)-w_k\big )/\sqrt{w_k}\). This leads to the normalized gradients as presented in (16).

Note also that, if we consider the complete data likelihood:

$$ u_\lambda (x, z) = \prod _{k=1}^K \big ( w_k \, u_k(x) \big )^{z_k} , $$

the Fisher information decomposes as:

$$ F_{c} = F_\lambda + F_{r} , $$

where \(F_{c}, F_\lambda \) and \(F_{r}\) denote the FIM of the complete, marginal (observed) and conditional terms. Using the 1-of-\(K\) formulation for \(z\), it can be shown that \(F_c\) has a diagonal form with entries given by (68), (71) and (76), respectively. Therefore, \(F_r\) can be seen as the amount of “information” lost by not knowing the true mixture component generating each of the \(x\)’s (Titterington et al. 1985). By requiring the distribution of the \(\gamma _x(k)\) to be “sharply peaked” we are making the approximation \(z_k\approx \gamma _x(k)\).

From this derivation we conclude that the assumption of sharply peaked posteriors leads to a diagonal approximation of the FIM, which can therefore be taken into account by a coordinate-wise normalization of the gradient vectors.
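As an illustration of how this coordinate-wise normalization is used in practice, the following NumPy sketch (a hypothetical implementation written for this appendix, not the authors' code) accumulates the three normalized gradient blocks of (16)–(18) from precomputed posteriors, assuming diagonal covariances and a sample-average normalization over the \(T\) descriptors:

```python
import numpy as np

def fisher_vector(X, w, mu, var, gamma):
    """Normalized gradient statistics of a diagonal-covariance GMM.

    X: (T, D) local descriptors, w: (K,) mixing weights,
    mu, var: (K, D) means and variances, gamma: (T, K) posteriors.
    The 1/sqrt(w_k) and 1/sqrt(2 w_k) factors implement the diagonal
    approximation of the FIM derived above.
    """
    T = X.shape[0]
    S0 = gamma.sum(axis=0)        # zero-order statistics, (K,)
    S1 = gamma.T @ X              # first-order statistics, (K, D)
    S2 = gamma.T @ (X ** 2)       # second-order statistics, (K, D)
    sqrt_w = np.sqrt(w)[:, None]
    # gradient w.r.t. mixing weights: (gamma_k - w_k) / sqrt(w_k)
    g_w = (S0 - T * w) / np.sqrt(w)
    # gradient w.r.t. means, normalized by sigma_k / sqrt(w_k)
    g_mu = (S1 - mu * S0[:, None]) / (np.sqrt(var) * sqrt_w)
    # gradient w.r.t. standard deviations, normalized by sigma_k / sqrt(2 w_k)
    g_var = (S2 - 2.0 * mu * S1 + (mu ** 2 - var) * S0[:, None]) \
            / (var * np.sqrt(2.0) * sqrt_w)
    return np.concatenate([g_w, g_mu.ravel(), g_var.ravel()]) / T
```

With \(K\) Gaussians and \(D\)-dimensional descriptors the resulting vector has \(K(2D+1)\) dimensions.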

### Appendix 2: Good Practices for Gaussian Mixture Modeling

We now provide some good practices for Gaussian mixture modeling. For a public GMM implementation and for more details on how to train and test GMMs, we refer the reader to the excellent HMM ToolKit (HTK) (Young et al. 2002).

*Computation in the Log Domain* We first describe how to compute the likelihood (8) and the soft-assignment (15) in practice. Since the low-level descriptors are quite high-dimensional (typically \(D=64\) in our experiments), the likelihood values \(u_k(x)\) for each Gaussian can be extremely small (and can even fall below machine precision with floating point values) because of the \(\exp \) in Eq. (9). Hence, for a stable implementation, it is of utmost importance to *perform all computations in the log domain*. In practice, for a descriptor \(x\), one never computes \(u_k(x)\) but

where the subscript \(d\) denotes the \(d\)-th dimension of a vector. To compute the log-likelihood \(\log u_\lambda (x) = \log \sum \nolimits _{k=1}^K w_k u_k(x)\), one proceeds incrementally by writing \(\log u_\lambda (x) = \log \big ( w_1 u_1(x) + \sum \nolimits _{k=2}^K w_k u_k(x) \big )\) and by using the identity \(\log (a+b) = \log (a) + \log \left( 1+ \exp \left( \log (b) - \log (a)\right) \right) \) to remain in the log domain.

Similarly, to compute the posterior probability (15), one writes \(\gamma _k = \exp \left[ \log (w_k u_k(x)) - \log (u_\lambda (x)) \right] \) to operate in the log domain.
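The log-domain computations above can be sketched as follows (an illustrative NumPy implementation, not the HTK code; the function names and array shapes are assumptions of this sketch):

```python
import numpy as np

def log_components(x, w, mu, var):
    """log(w_k) + log u_k(x) for each Gaussian k, never leaving the log domain.
    x: (D,), w: (K,), mu, var: (K, D) with diagonal covariances."""
    return (np.log(w)
            - 0.5 * np.sum(np.log(2.0 * np.pi * var)
                           + (x - mu) ** 2 / var, axis=1))

def log_likelihood_and_posteriors(x, w, mu, var):
    """Stable computation of log u_lambda(x) and of the posteriors gamma_k."""
    a = log_components(x, w, mu, var)
    m = a.max()                                 # subtract the max before exp
    log_u = m + np.log(np.sum(np.exp(a - m)))   # log-sum-exp
    gamma = np.exp(a - log_u)                   # posteriors, sum to one
    return log_u, gamma
```

Subtracting the maximum before exponentiating is the standard log-sum-exp trick and is equivalent to the incremental \(\log (a+b)\) identity above.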

*Variance Flooring* Because the variance \(\sigma _k^2\) appears inside a \(\log \) and as a denominator in Eq. (79), variances that are too small can lead to instabilities in the Gaussian computations. In our case, this is all the more likely to happen since we extract patches densely and do not discard uniform patches; in our experience, such patches tend to cluster in a mixture component with a tiny variance. To avoid this issue, we use variance flooring: we compute the global covariance matrix over the training set and enforce the variance of each Gaussian to be no smaller than a constant \(\alpha \) times the global variance. HTK suggests the value \(\alpha = 0.01\).
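A minimal sketch of this flooring step, assuming diagonal covariances (the function and argument names are hypothetical):

```python
import numpy as np

def floor_variances(var, X, alpha=0.01):
    """Enforce a per-dimension lower bound on the GMM variances equal to
    alpha times the global variance of the training descriptors X.
    var: (K, D) per-Gaussian diagonal variances, X: (N, D)."""
    global_var = X.var(axis=0)           # diagonal of the global covariance
    return np.maximum(var, alpha * global_var)
```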

*Posterior Thresholding* To reduce the cost of training GMMs as well as the cost of computing FVs, we assume that all the posteriors \(\gamma (k)\) which are below a given threshold \(\theta \) are equal to exactly zero. In practice, we use a rather conservative threshold \(\theta =10^{-4}\); for a GMM with 256 Gaussians, at most 5–10 Gaussians exceed this threshold. After discarding some of the \(\gamma (k)\) values, we renormalize the remaining \(\gamma ^{\prime }\)s to ensure that \(\sum \nolimits _{k=1}^K \gamma (k) = 1\) still holds.

Note that this operation not only reduces the computational cost, it also sparsifies the FV (see Sect. 4.2). Without such posterior thresholding, the FV (or the soft-BOV) would be completely dense.
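The thresholding-and-renormalization step described above can be sketched as follows (illustrative names and shapes):

```python
import numpy as np

def threshold_posteriors(gamma, theta=1e-4):
    """Zero out posteriors below theta, then renormalize so they sum to one.
    gamma: (K,) posteriors of one descriptor."""
    g = np.where(gamma >= theta, gamma, 0.0)
    return g / g.sum()
```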

*Incremental Training* It is well known that ML estimation of a GMM with more than one Gaussian is a non-convex optimization problem. Hence, different initializations might lead to different solutions. While we have never observed a drastic influence of the initialization on the end result, we strongly advise the use of an iterative process as suggested for instance in Young et al. (2002). This procedure consists in starting with a single Gaussian (for which a closed-form ML estimate exists), splitting all Gaussians by slightly perturbing their means, and then re-estimating the GMM parameters with EM. This iterative splitting-training strategy also enables cross-validating and monitoring the influence of the number of Gaussians \(K\) in a consistent way.
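The splitting step of this incremental strategy can be sketched as follows (a hypothetical NumPy sketch; the perturbation size `eps` is an assumption, and EM re-estimation must be run after each split):

```python
import numpy as np

def split_gmm(w, mu, var, eps=0.1):
    """Double K by splitting each Gaussian into two children whose means are
    perturbed by +/- eps standard deviations; each child inherits half of
    the parent's weight.  w: (K,), mu, var: (K, D)."""
    sigma = np.sqrt(var)
    w2 = np.repeat(w / 2.0, 2)                       # (2K,) weights still sum to 1
    mu2 = np.stack([mu - eps * sigma,
                    mu + eps * sigma], axis=1).reshape(-1, mu.shape[1])
    var2 = np.repeat(var, 2, axis=0)                 # children keep the variance
    return w2, mu2, var2
```

Starting from a single Gaussian, repeated calls interleaved with EM passes yield GMMs with 2, 4, 8, ... components, which makes it easy to monitor accuracy as a function of \(K\).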


## About this article

### Cite this article

Sánchez, J., Perronnin, F., Mensink, T. *et al.* Image Classification with the Fisher Vector: Theory and Practice.
*Int J Comput Vis* **105**, 222–245 (2013). https://doi.org/10.1007/s11263-013-0636-x
