# Understanding bag-of-words model: a statistical framework

## Abstract

The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted key point into one of the visual words, and then represent each image by a histogram of the visual words. For this purpose, a clustering algorithm (e.g., K-means) is generally used to generate the visual words. Although a number of studies have shown encouraging results of the bag-of-words representation for object categorization, theoretical studies on the properties of the bag-of-words model are almost untouched, possibly due to the difficulty introduced by using a heuristic clustering process. In this paper, we present a statistical framework which generalizes the bag-of-words representation. In this framework, the visual words are generated by a statistical process rather than by a clustering algorithm, while the empirical performance is competitive with clustering-based methods. A theoretical analysis based on statistical consistency is presented for the proposed framework. Moreover, based on the framework we develop two algorithms which do not rely on clustering, yet achieve competitive performance in object categorization when compared to clustering-based bag-of-words representations.

### Keywords

Object recognition · Bag-of-words model · Rademacher complexity

## 1 Introduction

Inspired by the success of text categorization [6,10], the bag-of-words representation has become one of the most popular methods for representing image content and has been successfully applied to object categorization. In a typical bag-of-words representation, “interesting” local patches are first identified from an image, either by dense sampling [14,25] or by an interest point detector [9]. These local patches, represented by vectors in a high-dimensional space [9], are often referred to as key points.

To efficiently handle these key points, the key idea is to quantize each extracted key point into one of the *visual words*. This vector quantization procedure allows us to represent each image by a histogram of the visual words, which is often referred to as the bag-of-words representation, and consequently converts the object categorization problem into a text categorization problem. A clustering procedure (e.g., K-means) is often applied to group key points from all the training images into a large number of clusters, with the center of each cluster corresponding to a different visual word. Studies [3,20] have shown promising performance of the bag-of-words representation in object categorization. Various methods [7,8,12,13,17,21,25] have been proposed for visual vocabulary construction to improve both the computational efficiency and the classification accuracy of object categorization. However, to the best of our knowledge, there is no theoretical analysis of the statistical properties of vector quantization for object categorization.
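As a concrete illustration of the clustering-based pipeline described above (our own sketch, not the authors' code; the toy K-means and function names are ours), visual words are cluster centers and each image becomes a normalized histogram over them:

```python
import numpy as np

def kmeans(points, m, iters=20, seed=0):
    """A toy K-means: returns m cluster centers, used as the visual words."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=m, replace=False)].copy()
    for _ in range(iters):
        # Assign every key point to its nearest center ...
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each center to the mean of its assigned points.
        for k in range(m):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return centers

def bow_histogram(keypoints, centers):
    """Quantize key points to their nearest visual word; return a normalized histogram."""
    d = np.linalg.norm(keypoints[:, None, :] - centers[None, :, :], axis=2)
    h = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return h / h.sum()
```

The histogram `bow_histogram` returns is exactly the bag-of-words vector that is then fed to a text-style classifier.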

In this paper, we address this gap by generalizing the bag-of-words representation with a statistical framework that views:

1. each visual word as a *quantization function* \(f_k({\user2{x}})\) that is randomly sampled from a class of functions \({\mathcal{F}}\) by an unknown distribution \({\mathcal{P}}_{{\mathcal{F}}},\) and
2. each key point of an image as a random sample from an unknown distribution \(q_i({\user2{x}}).\)

The rest of this paper is organized as follows. Section 2 presents an overview of existing approaches to key point quantization used in object recognition. Section 3 presents a statistical framework that generalizes the classical bag-of-words representation, and two random algorithms for vector quantization based on the proposed framework. We show that both algorithms are statistically consistent in estimating the similarity between two images. The empirical study with object recognition reported in Sect. 4 shows encouraging results for the proposed algorithms for vector quantization, which in turn validates the proposed statistical framework for the bag-of-words representation. Section 5 concludes this work.

## 2 Related work

In object recognition and texture analysis, a number of algorithms have been proposed for key point quantization. Among them, K-means is probably the most popular one. To reduce the high computational cost of K-means, hierarchical K-means is proposed in [13] for more efficient vector quantization. In [25], a supervised learning algorithm is proposed to reduce the visual vocabulary that is initially obtained by K-means into a more descriptive and compact one. Farquhar et al. [5] model the problem as a Gaussian mixture model, where each visual word corresponds to a Gaussian component, and use the maximum a posteriori (MAP) approach to learn the parameters. A method based on mean-shift is proposed in [7] for vector quantization to resolve the problem that K-means tends to ‘starve’ medium-density regions in feature space; there, each key point is allocated to the first visual word similar to it. Moosmann et al. [12] use extremely randomized clustering forests to efficiently generate a highly discriminative coding of visual words. To minimize the loss of information in vector quantization, Lazebnik and Raginsky [8] seek a compressed representation of vectors that preserves the sufficient statistics of features. In [16], images are characterized using a set of category-specific histograms describing whether the content can best be modeled by the universal vocabulary or a category-specific vocabulary. Tuytelaars and Schmid [21] propose a quantization method that discretizes the feature space by a regular lattice. van Gemert et al. [22] use kernel density estimation to avoid the problems of ‘codeword uncertainty’ and ‘codeword plausibility’.

Although many studies have shown encouraging results of the bag-of-words representation for object categorization, none of them provides a statistical consistency analysis, which would reveal the asymptotic behavior of the bag-of-words model for object recognition. Unlike the existing statistical approaches to key point quantization, which are designed to reduce the training error, the proposed framework generalizes the bag-of-words model via statistical expectation, making it possible to analyze the statistical consistency of the bag-of-words model. Finally, we would like to point out that although several randomized approaches [12,14,24] have been proposed for key point quantization, none of them provides a theoretical analysis of statistical consistency. In contrast, we present not only the theoretical results for the two proposed random algorithms for vector quantization, but also the results of an empirical study with object recognition that support the theoretical claims.

## 3 A statistical framework for bag-of-words representation

In this section, we first present a statistical framework for the bag-of-words representation in object categorization, followed by two random algorithms that are derived from the proposed framework. The analysis of statistical consistency is also presented for the two proposed algorithms.

### 3.1 A statistical framework

We consider the bag-of-words representation for images, with each image being represented by a collection of local descriptors. We denote by *N* the number of training images, and by \(X_i = ({\user2{x}}_i^1, \ldots, {\user2{x}}_i^{n_i})\) the collection of key points used to represent image \({\mathcal{I}}_i\) where \({\user2{x}}_i^l \in {\mathcal{X}}, l=1,\ldots, n_i\) is a key point in feature space \({\mathcal{X}}.\) To facilitate statistical analysis, we assume that each key point \({\user2{x}}_i^l\) in *X*_{i} is randomly drawn from an unknown distribution \(q_i({\user2{x}})\) associated with image \({\mathcal{I}}_i.\)

Each quantization function is defined as \(f({\user2{x}}; {\user2{v}}) = I(d({\user2{x}}, {\user2{v}}) \leq \rho),\) where \(I(z)\) outputs 1 when *z* is true, and 0 otherwise. In the above definition, each quantization function \(f({\user2{x}}; {\user2{v}})\) is essentially a ball of radius ρ centered at \({\user2{v}}.\) It outputs 1 when a point \({\user2{x}}\) is within the ball, and 0 if \({\user2{x}}\) is outside the ball. This definition of quantization function is clearly related to vector quantization by data clustering.

Given *m* visual words, or *m* quantization functions \(\{f_k({\user2{x}})\}_{k=1}^m\) that are sampled from \({\mathcal{F}},\) the histogram entry \({\hat{h}}_i^k\) is computed as \({\hat{h}}_i^k = \frac{1}{n_i} \sum_{l=1}^{n_i} f_k({\user2{x}}_i^l).\)
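A minimal sketch of this definition (our own code; the helper names are assumptions, not the authors'): the ball-indicator quantization function and the histogram entry it induces.

```python
import numpy as np

def make_quantization_function(v, rho):
    """f(x; v) = I(d(x, v) <= rho): the indicator of a ball of radius rho centered at v."""
    v = np.asarray(v, dtype=float)
    return lambda x: 1.0 if np.linalg.norm(np.asarray(x, dtype=float) - v) <= rho else 0.0

def histogram(keypoints, fs):
    """hat{h}_i^k = (1/n_i) * sum_l f_k(x_i^l), one entry per quantization function f_k."""
    return np.array([np.mean([fk(x) for x in keypoints]) for fk in fs])
```

Each entry of `histogram` is the fraction of the image's key points falling inside the corresponding ball, i.e. the empirical estimate of \({\mathbb{E}}_i[f_k({\user2{x}})].\)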

### 3.2 Random algorithms for key point quantization and their statistical consistency

We emphasize that the pairwise similarity in Eq. 6, i.e., \(s_{ij} = {\mathbb{E}}_{{\user2{v}}}\left[{\mathbb{E}}_i[f({\user2{x}}; {\user2{v}})]\, {\mathbb{E}}_j[f({\user2{x}}; {\user2{v}})]\right],\) cannot be computed directly. This is because both distributions \(q_i({\user2{x}})\) and \(q({\user2{v}})\) are unknown, which makes it intractable to compute \({\mathbb{E}}_i[\cdot]\) and \({\mathbb{E}}_{{\user2{v}}}[\cdot].\) In real applications, approximations are needed. In this section, we study how these approximations affect the estimation of the pairwise similarity. In particular, given the pairwise similarity estimated from different kinds of approximated distributions, we aim to bound its difference from the underlying true similarity. To simplify our analysis, we assume that each image has at least *n* key points.

In the first approach, *m* visual words are used as prototypes for the key points in all the images. Let \({\user2{v}}_1, \ldots, {\user2{v}}_m\) be the *m* visual words randomly sampled from the key points in all the images. The empirical distribution \({\hat{q}}({\user2{v}})\) is then \({\hat{q}}({\user2{v}}) = \frac{1}{m} \sum_{k=1}^{m} \delta({\user2{v}} - {\user2{v}}_k),\) where δ(·) is the Dirac delta function.
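This sampling step can be sketched as follows (our own illustration, assuming the pairwise similarity is estimated by the averaged inner product of the two ball-indicator histograms; the function names are ours):

```python
import numpy as np

def sample_visual_words(pooled_keypoints, m, seed=0):
    """QEE-style sampling: draw m visual words uniformly from the pooled key points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pooled_keypoints), size=m, replace=False)
    return pooled_keypoints[idx]

def estimated_similarity(kp_i, kp_j, words, rho):
    """Estimate s_ij as (1/m) * sum_k hat{h}_i^k * hat{h}_j^k with ball-indicator histograms."""
    def hist(kp):
        d = np.linalg.norm(kp[:, None, :] - words[None, :, :], axis=2)
        return (d <= rho).mean(axis=0)  # hat{h}^k for each visual word v_k
    return float(np.dot(hist(kp_i), hist(kp_j)) / len(words))
```

No clustering is run: the visual words are a plain random sample, which is precisely what makes the consistency analysis below tractable.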

#### 3.2.1 Empirically estimated density function for \(q_i({\user2{x}})\)

Sampling *m* quantization functions yields the Monte Carlo estimate \({\bar{s}}_{ij} = \frac{1}{m} \sum_{k=1}^{m} h_i^k h_j^k\) of the similarity *s*_{ij} defined in Eq. 6, where \(h_i^k = {\mathbb{E}}_i[f_k({\user2{x}})].\) For the empirical distribution in Eq. 9, the pairwise similarity, denoted by \({\hat{s}}_{ij},\) is computed as \({\hat{s}}_{ij} = \frac{1}{m} \sum_{k=1}^{m} {\hat{h}}_i^k {\hat{h}}_j^k.\)

We first state the McDiarmid inequality [11], which is used throughout our analysis.

**Theorem 1** (McDiarmid Inequality)

*Given independent random variables* \(v_1, v_2, \ldots, v_n, v^{\prime}_i \in V,\) *and a function* \(f : V^n \,\mapsto\, {\mathbb{R}}\) *satisfying*

\(\left|f({\user2{v}}) - f({\user2{v}}^{\prime})\right| \leq c_i,\)

*where* \({\user2{v}} = (v_1, v_2, \ldots, v_n)\) *and* \({\user2{v}}^{\prime} = (v_1, v_2, \ldots, v_{i-1}, v^{\prime}_i, v_{i+1}, \ldots, v_n),\) *then the following statement holds*

\(\Pr\left(\left|f({\user2{v}}) - {\mathbb{E}}[f({\user2{v}})]\right| > \epsilon\right) \leq 2 \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2}\right).\)

Using the McDiarmid inequality, we have the following theorem which bounds \(|{\bar{s}}_{ij} - {s}_{ij}|.\)

**Theorem 2**

*Assume that* \(f_k({\user2{x}}), k = 1, \ldots, m\) *are randomly drawn from class* \({\mathcal{F}}\) *according to an unknown distribution, and that any function in* \({\mathcal{F}}\) *is universally bounded between 0 and 1. With probability* 1 − δ, *the following inequality holds for any two training images* \({\mathcal{I}}_i\) *and* \({\mathcal{I}}_j\):

\(\left|{\bar{s}}_{ij} - s_{ij}\right| \leq \sqrt{\frac{1}{2m} \ln\frac{2}{\delta}}.\)

*Proof* Since every function in \({\mathcal{F}}\) is bounded between 0 and 1, replacing a single quantization function \(f_k\) changes \({\bar{s}}_{ij}\) by at most 1/*m*, i.e., for each *k*, *c*_{k} ≤ 1/*m*. By setting the right-hand side of the McDiarmid inequality to δ and solving for ε, we obtain the stated bound.
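For completeness, the omitted algebra is routine: with the bounded-difference constants \(c_k \leq 1/m\), Theorem 1 gives

```latex
\Pr\left(\left|\bar{s}_{ij} - s_{ij}\right| > \epsilon\right)
  \;\le\; 2\exp\!\left(\frac{-2\epsilon^{2}}{\sum_{k=1}^{m} (1/m)^{2}}\right)
  \;=\; 2\exp\!\left(-2m\epsilon^{2}\right),
\qquad
2e^{-2m\epsilon^{2}} = \delta
  \;\Longrightarrow\;
  \epsilon = \sqrt{\tfrac{1}{2m}\ln\tfrac{2}{\delta}}.
```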

The above theorem indicates that, if we have the true distribution \(q_i({\user2{x}})\) of each image \({\mathcal{I}}_i,\) with a large number of sampled quantization functions \(f_k({\user2{x}}),\) we have a very good chance to recover the true similarity *s*_{ij} with a small error. The next theorem bounds \(|{\hat{s}}_{ij} - s_{ij}|.\)

**Theorem 3**

*Assume that each image has at least n randomly sampled key points, and that* \(f_k({\user2{x}}), k = 1, \ldots, m\) *are randomly drawn from an unknown distribution over class* \({\mathcal{F}}.\) *With probability* 1 − δ, *the following inequality is satisfied for any two images* \({\mathcal{I}}_i\) *and* \({\mathcal{I}}_j\)

*Proof*

*Remark*

Theorem 3 reveals an interesting relationship between the estimation error \(|s_{ij} - {\hat{s}}_{ij}|\) and the number of quantization functions (or the number of visual words). The upper bound in Theorem 3 consists of two terms: the first term decreases at a rate of \(O(1/\sqrt{m}),\) while the second term increases at a rate of \(O(\ln m).\) When the number of visual words *m* is small, the first term dominates the upper bound, and therefore increasing *m* will reduce the difference \(|{\hat{s}}_{ij} - s_{ij}|.\) As *m* becomes significantly larger than *n*, the second term will dominate the upper bound, and therefore increasing *m* will lead to a larger \(|{\hat{s}}_{ij} - s_{ij}|.\) This result is consistent with empirical observations on the size of the visual vocabulary: a large vocabulary tends to perform well in object categorization, but too many visual words can deteriorate the classification accuracy.

Finally, we emphasize that although the idea of vector quantization by randomly sampled centers was already discussed in [7,24], to the best of our knowledge, this is the first work that presents its statistical consistency analysis.

#### 3.2.2 Kernel density function estimation for \(q_i({\user2{x}})\)

Here, *B* controls the functional norm of \(q({\user2{x}})\) in the reproducing kernel Hilbert space \({\mathcal{H}}_\kappa.\) An example of \(\kappa({\user2{x}}, {\user2{x}}^{\prime})\) is the RBF function, i.e., \(\kappa({\user2{x}}, {\user2{x}}^{\prime}) \propto \exp(-\lambda d({\user2{x}}, {\user2{x}}^{\prime})^2),\) where \(d({\user2{x}}, {\user2{x}}^{\prime}) = \|{\user2{x}} - {\user2{x}}^{\prime}\|_2.\) Then, the distribution \(q_i({\user2{x}})\) is approximated by a kernel density estimate \(\tilde{q}_i({\user2{x}})\) defined as

\(\tilde{q}_i({\user2{x}}) = \sum_{l=1}^{n_i} \alpha_i^l\, \kappa({\user2{x}}, {\user2{x}}_i^l),\)

where α_{i}^{l} (1 ≤ *l* ≤ *n*_{i}) are the combination weights that satisfy (i) α_{i}^{l} ≥ 0, (ii) \(\sum_{l=1}^{n_i} \alpha_i^l = 1,\) and (iii) \(\alpha_i^{\top} K_i \alpha_i \leq B^2,\) where \(K_i = [\kappa({\user2{x}}_i^l, {\user2{x}}_i^{l'})]_{n_i \times n_i}.\)
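As an illustrative sketch (our own code, not the authors'), the kernel density estimate with the uniform weights α_{i}^{l} = 1/*n*_{i} turns the hard ball indicator into a soft RBF weight; `sigma` here plays the role of the kernel bandwidth:

```python
import numpy as np

def qke_histogram_entry(keypoints, v, sigma):
    """Soft histogram entry for visual word v under tilde{q}_i with alpha_i^l = 1/n_i:
    the average RBF kernel weight kappa(x, v) over the image's key points."""
    kp = np.asarray(keypoints, dtype=float)
    d2 = np.sum((kp - np.asarray(v, dtype=float)) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * sigma ** 2))))
```

Key points near the visual word contribute weights close to 1, distant ones close to 0, so no key point is forced into a single word.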

To bound the difference between the kernel-based similarity estimate and *s*_{ij}, we follow the analysis of [19] by viewing \({\mathbb{E}}_i [f({\user2{x}})]\, {\mathbb{E}}_j[f({\user2{x}})]\) as a mapping, denoted by \(g : {\mathcal{F}}\, \mapsto\, {\mathbb{R}}_+,\) i.e., \(g(f; q_i, q_j) = {\mathbb{E}}_i [f({\user2{x}})]\, {\mathbb{E}}_j[f({\user2{x}})].\) The class of such mappings *g*, denoted by \({\mathcal{G}},\) is analyzed using the *Rademacher complexity* [2]:

**Definition 1** (*Rademacher complexity*)

Suppose *x*_{1}, ..., *x*_{n} are sampled i.i.d. from a set \(\mathcal {X}.\) Let \({\mathcal{F}}\) be a class of functions mapping from \(\mathcal {X}\) to \({\mathbb{R}}.\) The Rademacher complexity of \({\mathcal{F}}\) is defined as

\(R_n({\mathcal{F}}) = {\mathbb{E}}_{{\user2{x}}, \sigma}\left[\sup_{f \in {\mathcal{F}}} \left|\frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right|\right],\)

where σ_{i} are independent uniform ±1-valued random variables.
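The expectation over σ in this definition can be made concrete with a small Monte Carlo estimate (our own illustration; `trials` and the function names are assumptions):

```python
import numpy as np

def rademacher_complexity(function_class, xs, trials=2000, seed=0):
    """Monte Carlo estimate of R_n(F) = E_sigma[ sup_f | (2/n) * sum_i sigma_i f(x_i) | ]."""
    rng = np.random.default_rng(seed)
    n = len(xs)
    vals = np.array([[f(x) for x in xs] for f in function_class])  # one row per f in F
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)  # independent uniform +/-1 variables
        total += np.abs(vals @ sigma).max() * 2.0 / n
    return total / trials
```

A richer function class can correlate with more sign patterns σ, so this quantity grows with the capacity of \({\mathcal{F}}\), which is exactly why it appears in the generalization bounds below.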

Assuming at least *n* key points are randomly sampled from each image, we have the following lemma that bounds the complexity of the class \({\mathcal{G}}:\)

**Lemma 1**

*The Rademacher complexity of function class*\({\mathcal{G}},\)

*denoted by*\(R_m({\mathcal{G}}),\)

*is bounded as*

*where* \(C_{\kappa} = \max_{{\user2{x}}, {\user2{z}}}{\sqrt{\kappa({\user2{x}}, {\user2{z}})}}.\)

*Proof* Given *F* = {*f*_{1}, ..., *f*_{m}}, according to the definition, we have

From [2], we have the following lemmas:

**Lemma 2** (Theorem 12 in [2])

*For* 1 ≤ *q* < ∞, *let* \({\mathcal{L}} = \{|f-h|^q : f \in {\mathcal{F}}\},\) *where h is a fixed function and* \(\|f - h\|_{\infty}\) *is uniformly bounded. We have*

**Lemma 3** (Theorem 8 in [2])

*With probability* 1 − δ, *the following inequality holds*

*where* φ(*x*, *y*) *is the loss function, n is the number of samples, and* \(\phi \circ {\mathcal{F}} = \{(x,y) \,\mapsto\, \phi(y,f(x)) - \phi(y,0):f \in {\mathcal{F}}\}.\)

Based on the above lemmas, we have the following theorem.

**Theorem 4**

*Assume that the density functions* \(q_i({\user2{x}}), q_j({\user2{x}}) \in {\mathcal{F}}_D,\) *and let* \(\tilde{q}_i({\user2{x}}), \tilde{q}_j({\user2{x}}) \in {\mathcal{F}}_D\) *be the density functions estimated from n sampled key points. Then, with probability* 1 − δ, *the following inequality holds*

*Proof* Since 0 ≤ *g*(*f*; *q*_{i}, *q*_{j}) ≤ 1, using the results in Lemmas 1 and 2, we have

In our empirical study, we will use the RBF kernel for \(\kappa({\user2{x}}, {\user2{x}}^{\prime}),\) with α_{i}^{l} = 1/*n*_{i}. The corollary below shows the bound for this choice of kernel density estimation.

**Corollary 5**

*When the kernel function is* \(\kappa({\user2{x}}, {\user2{x}}^{\prime}) = {(1/(2\pi\sigma^2))}^{d/2} \exp(-\|{\user2{x}} - {\user2{x}}^{\prime}\|_2^2/(2\sigma^2))\) *and* α_{i}^{l} = 1/*n*_{i}, *the bound in Theorem 4 becomes*

*Remark*

Theorem 4 bounds the expected difference between the similarity estimated by the kernel density function and the true similarity. Similar to Theorem 3, this bound also consists of a term decreasing at a rate of \(O(1/\sqrt{m})\) and a term increasing at a rate of \(O(\ln m).\) Moreover, to minimize this expected difference, we need to minimize the empirical difference between the similarity estimated by the kernel density function and the similarity estimated by the empirical density function. If \(\kappa({\user2{x}},{\user2{v}})\) decreases exponentially as \(d({\user2{x}}, {\user2{v}})\) increases, as for the Gaussian kernel, then \(\theta({\user2{x}}, {\user2{v}})\) is close to 1 when \(d({\user2{x}}, {\user2{v}}) \leq \rho\) and close to 0 when \(d({\user2{x}}, {\user2{v}}) > \rho.\) In this circumstance, setting α_{i}^{l} = 1/*n*_{i} for all 1 ≤ *l* ≤ *n*_{i} is a good choice for the approximation and is also very efficient, since we do not need to learn α.

Note that although the idea of kernel density estimation was already proposed in some studies [e.g., 22], to the best of our knowledge, this is the first work that reveals the statistical consistency of kernel density estimation for the bag-of-words representation.

## 4 Empirical study

We refer to the algorithm based on the empirically estimated density function as “Quantization via Empirical Estimation”, or *QEE* for short, and to the algorithm based on kernel density estimation as “Quantization via Kernel Estimation”, or *QKE* for short. Note that since both vector quantization algorithms do not rely on a clustering algorithm to identify visual words, they are in general computationally more efficient. In addition, both algorithms have error bounds that decrease at the rate of \(O(1/\sqrt{m})\) when the number of key points *n* is large, indicating that they are robust to the number of visual words *m*. We emphasize that although similar random algorithms for vector quantization have been discussed in [5,14,17,22,24], the purpose of this empirical study is to verify that

1. simple random algorithms deliver performance of object recognition similar to the clustering-based algorithm, and
2. the random algorithms are robust to the number of visual words, as predicted by the statistical consistency analysis.

Two data sets are used in our study: PASCAL VOC Challenge 2006 data set [4] and Graz02 data set [15]. *PASCAL06* contains 5,304 images from 10 classes. We randomly select 100 images for training and 500 for testing. The *Graz02* data set contains 365 bike images, 420 car images, 311 people images and 380 background images. We randomly select 100 images from each class for training, and use the remaining for testing. By using a relatively small number of examples for training, we are able to examine the sensitivity of a vector quantization algorithm to the number of visual words. On average 1,000 key points are extracted from each image, and each key point is represented by the SIFT local descriptor [23]. For *PASCAL06* data set, the binary classification performance for each object class is measured by the area under the ROC curve (AUC). For *Graz02* data set, the binary classification performance for each object class is measured by the accuracy. Results averaged over ten random trials are reported.

First, we observe that the proposed algorithms for vector quantization yield comparable, if not better, performance than the K-means clustering algorithm. This confirms that the proposed statistical framework for key point quantization is effective. Second, we observe that the clustering-based approach to vector quantization tends to perform worse, sometimes significantly, when the number of visual words is large. We attribute this instability to the fact that K-means requires each key point to belong to exactly one visual word. If the number of clusters is inappropriate, for example, too large compared to the number of instances, two relevant key points may be separated into different clusters even though both are very near the cluster boundary, leading to a poor estimation of the pairwise similarity. The problem of “hard assignment” was also observed in [17,22]. In contrast, for the proposed algorithms, we observe a rather stable improvement as the number of visual words increases, consistent with our statistical consistency analysis.

## 5 Conclusion


The bag-of-words representation is a popular approach to object categorization. Despite its success, few studies are devoted to the theoretical analysis of the bag-of-words representation. In this work, we present a statistical framework for key point quantization that generalizes the bag-of-words model by statistical expectation. We present two random algorithms for vector quantization where the visual words are generated by a statistical process rather than by a clustering algorithm. A theoretical analysis of their statistical consistency is presented. We also verify the efficacy and the robustness of the proposed framework by applying it to object recognition. In the future, we plan to examine the dependence of the proposed algorithms on the threshold ρ, and to extend QKE to weighted kernel density estimation.

## Notes

### Acknowledgments

We want to thank the reviewers for helpful comments and suggestions. This research is partially supported by the National Fundamental Research Program of China (2010CB327903), the Jiangsu 333 High-Level Talent Cultivation Program and the National Science Foundation (IIS-0643494). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

### References

- 1. Abramowitz M, Stegun IA (eds) (1972) Handbook of mathematical functions with formulas, graphs, and mathematical tables. Dover, New York
- 2. Bartlett PL, Mendelson S (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J Mach Learn Res 3:463–482
- 3. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: ECCV workshop on statistical learning in computer vision, Prague, Czech Republic
- 4. Everingham M, Zisserman A, Williams CKI, Van Gool L (2006) The PASCAL visual object classes challenge 2006 (VOC2006) results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
- 5. Farquhar J, Szedmak S, Meng H, Shawe-Taylor J (2005) Improving “bag-of-keypoints” image categorisation. Technical report, University of Southampton
- 6. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning, Chemnitz, Germany, pp 137–142
- 7. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, pp 604–610
- 8. Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebooks by information loss minimization. IEEE Trans Pattern Anal Mach Intell 31(7):1294–1309
- 9. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
- 10. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI workshop on learning for text categorization, Madison, WI
- 11. McDiarmid C (1989) On the method of bounded differences. In: Surveys in combinatorics 1989, pp 148–188
- 12. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. In: Schölkopf B, Platt J, Hoffman T (eds) Advances in neural information processing systems, vol 19. MIT Press, Cambridge, pp 985–992
- 13. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, New York, NY, pp 2161–2168
- 14. Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 490–503
- 15. Opelt A, Pinz A, Fussenegger M, Auer P (2006) Generic object recognition with boosting. IEEE Trans Pattern Anal Mach Intell 28(3):416–431
- 16. Perronnin F, Dance C, Csurka G, Bressian M (2006) Adapted vocabularies for generic visual categorization. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 464–475
- 17. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, Anchorage, AK
- 18. Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
- 19. Shawe-Taylor J, Dolia A (2007) A framework for probability density estimation. In: Proceedings of the 11th international conference on artificial intelligence and statistics, San Juan, Puerto Rico, pp 468–475
- 20. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proceedings of the 9th IEEE international conference on computer vision, Nice, France, pp 1470–1477
- 21. Tuytelaars T, Schmid C (2007) Vector quantizing feature space with a regular lattice. In: Proceedings of the 11th IEEE international conference on computer vision, Rio de Janeiro, Brazil, pp 1–8
- 22. van Gemert JC, Geusebroek J-M, Veenman CJ, Smeulders AWM (2008) Kernel codebooks for scene categorization. In: Proceedings of the 10th European conference on computer vision, Marseille, France, pp 696–709
- 23. Vedaldi A, Fulkerson B (2008) VLFeat: an open and portable library of computer vision algorithms. http://www.vlfeat.org/
- 24. Viitaniemi V, Laaksonen J (2008) Experiments on selection of codebooks for local image feature histograms. In: Proceedings of the 10th international conference series on visual information systems, Salerno, Italy, pp 126–137
- 25. Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, pp 1800–1807