Understanding bag-of-words model: a statistical framework

Abstract

The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted key point into one of the visual words, and then represent each image by a histogram of the visual words. For this purpose, a clustering algorithm (e.g., K-means) is generally used for generating the visual words. Although a number of studies have shown encouraging results of the bag-of-words representation for object categorization, theoretical studies on the properties of the bag-of-words model are almost untouched, possibly due to the difficulty introduced by using a heuristic clustering process. In this paper, we present a statistical framework which generalizes the bag-of-words representation. In this framework, the visual words are generated by a statistical process rather than by a clustering algorithm, while the empirical performance remains competitive with clustering-based methods. A theoretical analysis based on statistical consistency is presented for the proposed framework. Moreover, based on the framework we develop two algorithms which do not rely on clustering, yet achieve competitive performance in object categorization when compared to clustering-based bag-of-words representations.

Keywords

Object recognition · Bag-of-words model · Rademacher complexity

1 Introduction

Inspired by the success of text categorization [6,10], the bag-of-words representation has become one of the most popular methods for representing image content and has been successfully applied to object categorization. In a typical bag-of-words representation, “interesting” local patches are first identified from an image, either by dense sampling [14,25] or by an interest point detector [9]. These local patches, represented by vectors in a high-dimensional space [9], are often referred to as the key points.

To efficiently handle these key points, the key idea is to quantize each extracted key point into one of the visual words. This vector quantization procedure allows us to represent each image by a histogram of the visual words, often referred to as the bag-of-words representation, and consequently converts the object categorization problem into a text categorization problem. A clustering procedure (e.g., K-means) is often applied to group key points from all the training images into a large number of clusters, with the center of each cluster corresponding to a different visual word. Studies [3,20] have shown promising performance of the bag-of-words representation in object categorization, and various methods [7,8,12,13,17,21,25] have been proposed for visual vocabulary construction to improve both the computational efficiency and the classification accuracy of object categorization. However, to the best of our knowledge, there is no theoretical analysis of the statistical properties of vector quantization for object categorization.
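
For concreteness, the sketch below implements the clustering-based pipeline just described. It is a minimal illustration rather than the exact procedure of any cited work: scikit-learn's KMeans stands in for the generic clustering step, and all function and variable names are our own.

```python
# A minimal sketch of the clustering-based bag-of-words pipeline.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_keypoints, m):
    """Cluster the pooled key points of all training images into m visual words."""
    return KMeans(n_clusters=m, n_init=10, random_state=0).fit(all_keypoints)

def bow_histogram(keypoints, vocabulary, m):
    """Quantize each key point to its nearest visual word, then histogram the counts."""
    words = vocabulary.predict(keypoints)               # nearest-center index per key point
    counts = np.bincount(words, minlength=m).astype(float)
    return counts / len(keypoints)                      # normalized histogram of visual words
```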

In this paper, we present a statistical framework which generalizes the bag-of-words representation and aim to provide a theoretical understanding for vector quantization and its effect on object categorization from the viewpoint of statistical consistency. In particular, we view
  1. each visual word as a quantization function \(f_k({\user2{x}})\) that is randomly sampled from a class of functions \({\mathcal{F}}\) according to an unknown distribution \({\mathcal{P}}_{{\mathcal{F}}},\) and

  2. each key point of an image as a random sample from an unknown distribution \(q_i({\user2{x}}).\)

The above statistical description of key points and visual words allows us to interpret the similarity between two images in bag-of-words representation, the key quantity in object categorization, as an empirical expectation over the distributions \(q_i({\user2{x}})\) and \({\mathcal{P}}_{{\mathcal{F}}}.\) Based on the proposed statistical framework, we present two random algorithms for vector quantization, one based on the empirical distribution and the other based on kernel density estimation. We show that both random algorithms for vector quantization are statistically consistent in estimating the similarity between two images. Our empirical study with object recognition also verifies that the two proposed algorithms (I) yield recognition accuracy that is comparable to the clustering based bag-of-words representation, and (II) are resilient to the number of visual words when the number of training examples is limited. The success of the two simple algorithms validates the proposed statistical framework for vector quantization.

The rest of this paper is organized as follows. Section 2 gives an overview of existing approaches for key point quantization used in object recognition. Section 3 presents a statistical framework that generalizes the classical bag-of-words representation, together with two random algorithms for vector quantization derived from the proposed framework; we show that both algorithms are statistically consistent in estimating the similarity between two images. The empirical study with object recognition reported in Sect. 4 shows encouraging results for the proposed algorithms, which in return validates the proposed statistical framework for the bag-of-words representation. Section 5 concludes this work.

2 Related work

In object recognition and texture analysis, a number of algorithms have been proposed for key point quantization. Among them, K-means is probably the most popular one. To reduce the high computational cost of K-means, hierarchical K-means is proposed in [13] for more efficient vector quantization. In [25], a supervised learning algorithm is proposed to reduce the visual vocabulary that is initially obtained by K-means into a more descriptive and compact one. Farquhar et al. [5] model the problem as a Gaussian mixture model, where each visual word corresponds to a Gaussian component, and use the maximum a posteriori (MAP) approach to learn the parameters. A method based on mean-shift is proposed in [7] for vector quantization, to resolve the problem that K-means tends to ‘starve’ medium-density regions in feature space; there, each key point is allocated to the first visual word similar to it. Moosmann et al. [12] use extremely randomized clustering forests to efficiently generate a highly discriminative coding of visual words. To minimize the loss of information in vector quantization, Lazebnik and Raginsky [8] seek a compressed representation of vectors that preserves the sufficient statistics of features. In [16], images are characterized by a set of category-specific histograms describing whether the content can best be modeled by the universal vocabulary or a specific vocabulary. Tuytelaars and Schmid [21] propose a quantization method that discretizes the feature space by a regular lattice. van Gemert et al. [22] use kernel density estimation to avoid the problems of ‘codeword uncertainty’ and ‘codeword plausibility’.

Although many studies have shown encouraging results of the bag-of-words representation for object categorization, none of them provides a statistical consistency analysis, which reveals the asymptotic behavior of the bag-of-words model for object recognition. Unlike the existing statistical approaches for key point quantization that are designed to reduce the training error, the proposed framework generalizes the bag-of-words model by statistical expectation, making it possible to analyze the statistical consistency of the bag-of-words model. Finally, we would like to point out that although several randomized approaches [12,14,24] have been proposed for key point quantization, none of them provides a theoretical analysis of statistical consistency. In contrast, we present not only the theoretical results for the two proposed random algorithms for vector quantization, but also the results of an empirical study with object recognition that support the theoretical claims.

3 A statistical framework for bag-of-words representation

In this section, we first present a statistical framework for the bag-of-words representation in object categorization, followed by two random algorithms that are derived from the proposed framework. The analysis of statistical consistency is also presented for the two proposed algorithms.

3.1 A statistical framework

We consider the bag-of-words representation for images, with each image being represented by a collection of local descriptors. We denote by N the number of training images, and by \(X_i = ({\user2{x}}_i^1, \ldots, {\user2{x}}_i^{n_i})\) the collection of key points used to represent image \({\mathcal{I}}_i\) where \({\user2{x}}_i^l \in {\mathcal{X}}, l=1,\ldots, n_i\) is a key point in feature space \({\mathcal{X}}.\) To facilitate statistical analysis, we assume that each key point \({\user2{x}}_i^l\) in Xi is randomly drawn from an unknown distribution \(q_i({\user2{x}})\) associated with image \({\mathcal{I}}_i.\)

The key idea of the bag-of-words representation is to quantize each key point into one of the visual words that are often derived by clustering. We generalize this idea of quantization by viewing the mapping to a visual word \({\user2{v}}_k \in {\mathcal{X}}\) as a quantization function \(f_k({\user2{x}}): {\mathcal{X}}\,\mapsto\,[0, 1].\) Due to the uncertainty in constructing the vocabulary, we assume that the quantization function \(f_k({\user2{x}})\) is randomly drawn from a class of functions, denoted by \({\mathcal{F}},\) via an unknown distribution \({\mathcal{P}}_{{\mathcal{F}}}.\) To capture the behavior of quantization, we design the function class \({\mathcal{F}}\) as follows
$$ {\mathcal{F}} = \{f({\user2{x}} ;{\user2{v}}) | f({\user2{x}};{\user2{v}}) = I(\|{\user2{x}} - {\user2{v}}\| \leq \rho), {\user2{v}} \in {\mathcal{X}}\} $$
(1)
where the indicator function I(z) outputs 1 when z is true, and 0 otherwise. In the above definition, each quantization function \(f({\user2{x}}; {\user2{v}})\) is essentially a ball of radius ρ centered at \({\user2{v}}:\) it outputs 1 when a point \({\user2{x}}\) is within the ball, and 0 when \({\user2{x}}\) is outside the ball. This definition of quantization function is clearly related to vector quantization by data clustering.
Based on the above statistical interpretation of key points and quantization functions, we can now provide a statistical description for the histogram of visual words, which is the key to the bag-of-words representation. Let \({\hat{h}}_i^k\) denote the normalized number of key points in image \({\mathcal{I}}_i\) that are mapped to visual word \({\user2{v}}_k.\) Given m visual words, or m quantization functions \(\{f_k({\user2{x}})\}_{k=1}^m\) sampled from \({\mathcal{F}},\) \({\hat{h}}_i^k\) is computed as
$$ {\hat{h}}_i^k = {\frac{1}{n_i}} \sum_{l=1}^{n_i} f_k\left({\user2{x}}_i^l\right) ={\hat{{\mathbb{E}}}}_i[f_k({\user2{x}})] $$
(2)
where \({\hat{{\mathbb{E}}}}_i[f_k({\user2{x}})]\) stands for the empirical expectation of function \(f_k({\user2{x}})\) based on the samples \({\user2{x}}_i^1, \ldots, {\user2{x}}_i^{n_i}.\) We can generalize the above computation by replacing the empirical expectation \({\hat{{\mathbb{E}}}}_i[f_k({\user2{x}})]\) with an expectation over the true distribution \(q_i({\user2{x}}),\) i.e.,
$$ h_i^k = {\mathbb{E}}_i[f_k({\user2{x}})] = \int d\,{\user2{x}} q_i({\user2{x}}) f_k({\user2{x}}). $$
(3)
The bag-of-words representation for image \({\mathcal{I}}_i\) is expressed by vector \({\user2{h}}_i = (h^1_i , \ldots , h_i^m ).\)
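
To make the construction concrete, the following sketch evaluates the ball-indicator quantization function of Eq. 1 and the empirical histogram of Eq. 2. It is a toy illustration; the names and the vectorized layout are our own choices.

```python
# A sketch of Eqs. 1 and 2: ball-indicator quantization functions and the
# empirical histogram of visual words for one image.
import numpy as np

def quantize(x, v, rho):
    """Eq. 1: f(x; v) = I(||x - v|| <= rho), a ball of radius rho centered at v."""
    return float(np.linalg.norm(x - v) <= rho)

def empirical_histogram(X_i, centers, rho):
    """Eq. 2: h_i^k = (1/n_i) sum_l f_k(x_i^l), computed for all m centers at once."""
    # distances between the n_i key points (rows) and the m centers (columns)
    dist = np.linalg.norm(X_i[:, None, :] - centers[None, :, :], axis=2)
    return (dist <= rho).mean(axis=0)        # the vector (h_i^1, ..., h_i^m)
```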
In the next step, we analyze the pairwise similarity between two images. It is important to note that the pairwise similarity plays a critical role in any pattern classification problem, including object categorization. According to learning theory [18], it is the pairwise similarity, not the vector representation of images, that determines the classification performance. Using the vector representations \({\user2{h}}_i\) and \({\user2{h}}_j,\) the similarity between two images \({\mathcal{I}}_i\) and \({\mathcal{I}}_j,\) denoted by \({\bar{s}}_{ij},\) is computed as
$$ {\bar{s}}_{ij} = {\frac{1}{m}} {\user2{h}}_i^T {\user2{h}}_j = {\frac{1}{m}}\sum_{k=1}^m {\mathbb{E}}_i [f_k({\user2{x}})]{\mathbb{E}}_j[f_k({\user2{x}})] $$
(4)
Similar to the previous analysis, the summation in the above expression can be viewed as an empirical expectation over the sampled quantization functions \(f_k({\user2{x}}), k=1,\ldots, m.\) We thus generalize the definition of pairwise similarity in Eq. 4 by replacing the empirical expectation with the true expectation, and obtain the true similarity between two images \({\mathcal{I}}_i\) and \({\mathcal{I}}_j\) as
$$ s_{ij} = {\mathbb{E}}_{f \sim {\mathcal{P}}_{{\mathcal{F}}}} \left[ {\mathbb{E}}_i[f({\user2{x}})]{\mathbb{E}}_j[f({\user2{x}})] \right] $$
(5)
According to the definition in Eq. 1, each quantization function is parameterized by a center \({\user2{v}}.\) Thus, to define \({\mathcal{P}}_{{\mathcal{F}}},\) it suffices to define a distribution for the center \({\user2{v}},\) denoted by \(q({\user2{v}}).\) Then Eq. 5 can be expressed as
$$ s_{ij} = {\mathbb{E}}_{{\user2{v}}} \left[ {\mathbb{E}}_i[f({\user2{x}})]{\mathbb{E}}_j[f({\user2{x}})] \right] .$$
(6)

3.2 Random algorithms for key point quantization and their statistical consistency

We emphasize that the pairwise similarity in Eq. 6 cannot be computed directly. This is because both distributions \(q_i({\user2{x}})\) and \(q({\user2{v}})\) are unknown, which makes it intractable to compute \({\mathbb{E}}_i[\cdot]\) and \({\mathbb{E}}_{{\user2{v}}}[\cdot].\) In real applications, approximations are needed. In this section, we study how these approximations affect the estimation of pairwise similarity. In particular, given the pairwise similarity estimated from different kinds of approximated distributions, we aim to bound its difference from the underlying true similarity. To simplify our analysis, we assume that each image has at least n key points.

By assuming that the key points in all the images are sampled from \(q({\user2{v}}),\) we have an empirical distribution for \(q({\user2{v}}),\) i.e.,
$$ \hat{q}({\user2{v}}) = {\frac{1}{\sum_{i=1}^N n_i}}\sum_{i=1}^N \sum_{l=1}^{n_i} \delta\left({\user2{v}} - {\user2{x}}_i^l\right)$$
(7)
where \(\delta({\user2{x}})\) is a Dirac delta function satisfying \(\int \delta({\user2{x}})\,d{\user2{x}} = 1\) and \(\delta({\user2{x}}) = 0\) for \({\user2{x}} \neq {\mathbf{0}}.\) Direct estimation of pairwise similarities using the above empirical distribution is computationally expensive, because the number of key points in all images can be very large. In the bag-of-words model, m visual words are used as prototypes for the key points in all the images. Let \({\user2{v}}_1, \ldots, {\user2{v}}_m\) be the m visual words randomly sampled from the key points of all the images. The resulting empirical distribution \({\hat{q}}({\user2{v}})\) is
$$ {\hat{q}}({\user2{v}}) = {\frac{1}{m}}\sum_{k=1}^m \delta({\user2{v}} - {\user2{v}}_k) $$
(8)
In the next step, we aim to approximate the unknown distribution \(q_i({\user2{x}})\) in two different ways, and show the statistical consistency for each approximation.

3.2.1 Empirically estimated density function for \(q_i({\user2{x}})\)

First we approximate \(q_i({\user2{x}})\) by the empirical distribution \({\hat{q}}_i({\user2{x}})\) defined as follows
$${\hat{q}}_i({\user2{x}}) = {\frac{1}{n_i}} \sum_{l=1}^{n_i} \delta({\user2{x}} - {\user2{x}}_i^l) $$
(9)
Given the approximations for the distributions \(q_i({\user2{x}})\) and \(q({\user2{v}}),\) we can now compute an approximation of the pairwise similarity sij defined in Eq. 6. Using Eq. 9, the pairwise similarity, denoted by \({\hat{s}}_{ij},\) is computed as
$$ \begin{aligned} {\hat{s}}_{ij} & = {\hat{{\mathbb{E}}}}_{{\user2{v}}} \left[{\hat{{\mathbb{E}}}}_i[f({\user2{x}})] {\hat{{\mathbb{E}}}}_j[f({\user2{x}})]\right] \\ &= {\frac{1}{m}} \sum_{k=1}^m \left({\frac{1} {n_i}}\sum_{l=1}^{n_i} I\left(\left\|{\user2{x}}_i^l - {\user2{v}}_k\right\| \leq \rho \right) \right) \left({\frac{1}{n_j}}\sum_{l=1}^{n_j} I(\left\|{\user2{x}}_j^l - {\user2{v}}_k\right\| \leq \rho ) \right)\\ \end{aligned} $$
(10)
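
The quantity in Eq. 10 is straightforward to compute. The sketch below combines it with the sampling of visual words from the pooled key points (Eq. 8); it is a minimal illustration, with all function and variable names our own.

```python
# A sketch of quantization via empirical estimation (QEE).
import numpy as np

def sample_visual_words(images, m, rng=None):
    """Eq. 8: draw m centers uniformly from the pooled key points of all images."""
    rng = rng or np.random.default_rng(0)
    pool = np.vstack(images)                          # all key points of all images
    return pool[rng.choice(len(pool), size=m, replace=False)]

def qee_similarity(X_i, X_j, centers, rho):
    """Eq. 10: similarity estimated from the empirical distributions."""
    h_i = (np.linalg.norm(X_i[:, None] - centers[None], axis=2) <= rho).mean(axis=0)
    h_j = (np.linalg.norm(X_j[:, None] - centers[None], axis=2) <= rho).mean(axis=0)
    return float(h_i @ h_j) / len(centers)            # (1/m) sum_k h_i^k h_j^k
```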
To show the statistical consistency of \({\hat{s}}_{ij},\) we need to bound \(|s_{ij} - {\hat{s}}_{ij}|.\) Since two approximate distributions are used in our estimation, we divide our analysis into two steps. First, we measure \(|{\bar{s}}_{ij} - s_{ij}|,\) i.e., the difference in similarity caused by the approximate distribution for \({\mathcal{P}}_{{\mathcal{F}}}.\) Next, we measure \(|{\hat{s}}_{ij} - {\bar{s}}_{ij}|,\) i.e., the difference caused by the approximate distribution for \(q_i({\user2{x}}).\) The overall difference \(|s_{ij}- {\hat{s}}_{ij}|\) is bounded by the sum of the two differences.

We first state the McDiarmid inequality [11], which is used throughout our analysis.

Theorem 1 (McDiarmid Inequality)

Given independent random variables \(v_1, v_2, \ldots, v_n,\) \(v_i \in V,\) and a function \(f : V^n \,\mapsto\, {\mathbb{R}}\) satisfying
$$ \sup\limits_{v_1, v_2, \ldots, v_n, v^{\prime}_i \in V} |f({\user2{v}}) - f({\user2{v}}^{\prime})| \leq c_i $$
(11)
where \({\user2{v}} = (v_1, v_2, \ldots, v_n)\) and \({\user2{v}}^{\prime} = (v_1, v_2, \ldots, v_{i-1}, v^{\prime}_i, v_{i+1}, \ldots, v_n),\) the following statement holds
$$ \Pr\left(|f({\user2{v}}) - {\mathbb{E}}(f({\user2{v}}))| \geq \epsilon \right) \leq 2 \exp \left( -{\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}} \right) $$
(12)

Using the McDiarmid inequality, we have the following theorem which bounds \(|{\bar{s}}_{ij} - {s}_{ij}|.\)

Theorem 2

Assume that \(f_k({\user2{x}}), k = 1, \ldots ,m,\) are randomly drawn from the class \({\mathcal{F}}\) according to an unknown distribution, and further assume that any function in \({\mathcal{F}}\) is uniformly bounded between 0 and 1. Then, with probability 1 − δ, the following inequality holds for any two training images \({\mathcal{I}}_i\) and \({\mathcal{I}}_j\)
$$ |{\bar{s}}_{ij} - {s}_{ij}| \leq \sqrt{{\frac{1}{2m}} \ln {\frac{2}{\delta}}} $$
(13)

Proof

For any \(f \in {\mathcal{F}},\) we have \(0 \leq {\mathbb{E}}_i [f({\user2{x}})]{\mathbb{E}}_j[f({\user2{x}})] \leq 1.\) Viewing \({\bar{s}}_{ij}\) as a function of the m sampled quantization functions, changing any single \(f_k\) changes \({\bar{s}}_{ij}\) by at most 1/m; thus, for any k, \(c_k \leq 1/m\) in Theorem 1. By setting
$$ \delta = 2 \exp\left(-2m\epsilon^2\right), \quad \hbox{or}\quad \epsilon = {\sqrt{{\frac{1}{2m}} \ln {\frac{2}{\delta}}}},$$
(14)
we have \(\Pr (|\bar{s}_{ij} - {s}_{ij}| \leq \epsilon) \geq 1 - \delta.\)

The above theorem indicates that, if we had the true distribution \(q_i({\user2{x}})\) of each image \({\mathcal{I}}_i,\) then with a large number of sampled quantization functions \(f_k({\user2{x}})\) we would have a very good chance of recovering the true similarity sij with a small error. The next theorem bounds \(|{\hat{s}}_{ij} - s_{ij}|.\)
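
As a toy sanity check of Theorem 2 (not part of the original analysis), one can fix two synthetic 1-D "images", treat a large finite sample as the support of \(q({\user2{v}}),\) and verify that the deviation \(|{\bar{s}}_{ij} - s_{ij}|\) exceeds the bound of Eq. 13 in at most a δ fraction of trials. All constants below are arbitrary illustrative choices.

```python
# Monte Carlo check of the bound in Theorem 2 (Eq. 13), as an illustration only.
import numpy as np

rng = np.random.default_rng(0)
rho, m, delta, trials = 0.3, 200, 0.05, 500

X_i = rng.normal(0.0, 1.0, 500)          # key points of two synthetic 1-D "images"
X_j = rng.normal(0.5, 1.0, 500)
V = rng.normal(0.0, 1.5, 5000)           # support of q(v): centers are drawn from V

def h(X, centers):                       # E_i[f(x; v)] evaluated for each center v
    return (np.abs(X[:, None] - centers[None, :]) <= rho).mean(axis=0)

s_true = float((h(X_i, V) * h(X_j, V)).mean())   # exact s_ij under q(v) uniform on V
bound = np.sqrt(np.log(2 / delta) / (2 * m))     # the bound of Eq. 13

violations = 0
for _ in range(trials):
    V_m = rng.choice(V, size=m)                  # m i.i.d. quantization functions
    s_bar = float((h(X_i, V_m) * h(X_j, V_m)).mean())
    violations += abs(s_bar - s_true) > bound
print(violations / trials, "<=", delta)          # the violation rate stays below delta
```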

Theorem 3

Assume that each image has at least n randomly sampled key points, and that \(f_k({\user2{x}}), k = 1, \ldots, m,\) are randomly drawn from an unknown distribution over the class \({\mathcal{F}}.\) Then, with probability 1 − δ, the following inequality is satisfied for any two images \({\mathcal{I}}_i\) and \({\mathcal{I}}_j\)
$$ |{\hat{s}}_{ij} - {s}_{ij}| \leq {\sqrt{{\frac{1}{2m}}\ln{\frac{2} {\delta}}}} + 2{\sqrt{{\frac{1}{2n}}\ln{\frac{4m^2}{\delta}}}}$$
(15)

Proof

We first need to bound the difference between \({\hat{{\mathbb{E}}}}_i [f_k({\user2{x}})]\) and \({\mathbb{E}}_i [f_k({\user2{x}})].\) Since \(0 \leq f({\user2{x}}) \leq 1\) for any \(f \in {\mathcal{F}},\) using the McDiarmid inequality, we have
$$ \Pr \left(\left|\hat{{\mathbb{E}}}_i\left[f_k({\user2{x}})\right] - {{\mathbb{E}}}_i\left[f_k({\user2{x}})\right]\right| \geq \epsilon\right) \leq 2\exp(-2n\epsilon^2) $$
(16)
By setting
$$ 2 \exp (-2n \epsilon^2) = {\frac{\delta}{2m^2}}, \quad \hbox{or} \quad\epsilon = {\sqrt{{\frac{1}{2n}}\ln\left({\frac{4m^2}{\delta}}\right)}}$$
with probability 1 − δ/2, we have \(|{\hat{{\mathbb{E}}}}_i[f_k({\user2{x}})] - {{\mathbb{E}}}_i[f_k({\user2{x}})]| \leq \epsilon\) and \(|{\hat{{\mathbb{E}}}}_j[f_k({\user2{x}})] -{{\mathbb{E}}}_j[f_k({\user2{x}})]| \leq \epsilon\) for all \(\{f_k({\user2{x}})\}^m_{k=1}\) simultaneously, by the union bound. As a result, with probability 1 − δ/2, for any two images \({\mathcal{I}}_i\) and \({\mathcal{I}}_j,\) we have
$$ \begin{aligned} |{\hat{s}}_{ij} - {\bar{s}}_{ij}| &\leq {\frac{1}{m}} \sum\limits_{k=1}^{m} | {\hat{{\mathbb{E}}}}_i^k {\hat{{\mathbb{E}}}}_j^k - {{\mathbb{E}}}_i^k{{\mathbb{E}}}_j^k| \\ &\leq {\frac{1}{m}} \sum\limits_{k=1}^{m} | ({\hat{{\mathbb{E}}}}_i^k - {\mathbb{E}}_i^k){\hat{{\mathbb{E}}}}_j^k| + |{{\mathbb{E}}}_i^k({\hat{{\mathbb{E}}}}_j^k - {{\mathbb{E}}}_j^k)| \\ &\leq {\frac{1}{m}} \sum\limits_{k=1}^{m} | {\hat{{\mathbb{E}}}}_i^k - {\mathbb{E}}_i^k| + |{\hat{{\mathbb{E}}}}_j^k - {{\mathbb{E}}}_j^k| \\ &\leq 2 \epsilon = 2 {\sqrt{{\frac{1}{2n}}\ln\left({\frac{4m^2}{\delta}}\right)}}\\ \end{aligned} $$
(17)
where \({\hat{{\mathbb{E}}}}_i^k\) stands for \({\hat{{\mathbb{E}}}}_i[f_k({\user2{x}})]\) for simplicity. According to Theorem 2, with probability 1 − δ/2, we have
$$ |{\bar{s}}_{ij} - {s}_{ij}| \leq {\sqrt{{\frac{1}{2m}}\ln{\frac{2} {\delta}}}}$$
(18)
Combining Eqs. 17 and 18, we obtain the result in the theorem: with probability 1 − δ, the following inequality is satisfied
$$ |\hat{s}_{ij} - {s}_{ij}| \leq {\sqrt{{\frac{1}{2m}}\ln{\frac{2} {\delta}}} }+ 2{\sqrt{{\frac{1}{2n}}\ln{\frac{4m^2}{\delta}}}}$$
(19)

Remark

Theorem 3 reveals an interesting relationship between the estimation error \(|s_{ij} - {\hat{s}}_{ij}|\) and the number of quantization functions (or the number of visual words). The upper bound in Theorem 3 consists of two terms: the first term decreases at a rate of \(O(1/\sqrt{m}),\) while the second term increases at a rate of \(O(\sqrt{\ln m}).\) When the number of visual words m is small, the first term dominates the upper bound, and therefore increasing m reduces the difference \(|{\hat{s}}_{ij} - s_{ij}|.\) As m becomes significantly larger than n, the second term dominates the upper bound, and increasing m further leads to a larger \(|{\hat{s}}_{ij} - s_{ij}|.\) This result appears to be consistent with the observations on the size of the visual vocabulary: a large vocabulary tends to perform well in object categorization, but too many visual words can deteriorate the classification accuracy.
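
The tradeoff can be made concrete by evaluating the two terms of Eq. 15 numerically. The values of n and δ below are arbitrary illustrative choices; note how the sum first shrinks with m and eventually turns upward once the second term dominates.

```python
# Numeric illustration of the two terms in the bound of Theorem 3 (Eq. 15).
import numpy as np

n, delta = 1000, 0.05
for m in (10, 100, 1000, 10_000, 100_000):
    t1 = np.sqrt(np.log(2 / delta) / (2 * m))              # decreasing term
    t2 = 2 * np.sqrt(np.log(4 * m**2 / delta) / (2 * n))   # slowly increasing term
    print(f"m={m:>6}  {t1:.3f} + {t2:.3f} = {t1 + t2:.3f}")
```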

Finally, we emphasize that although the idea of vector quantization by randomly sampled centers was already discussed in [7,24], to the best of our knowledge, this is the first work that presents its statistical consistency analysis.

3.2.2 Kernel density function estimation for \(q_i({\user2{x}})\)

In this section, we approximate \(q_i({\user2{x}})\) by kernel density estimation. To this end, we assume that the density function \(q_i({\user2{x}})\) belongs to a family of smooth functions \({\mathcal{F}}_D\) defined as follows
$$ {\mathcal{F}}_D = \left\{ q({\user2{x}}) : {\mathcal{X}}\, \mapsto\, {\mathbb{R}}_+ \left| \left\langle q({\user2{x}}),q({\user2{x}}) \right\rangle_{{\mathcal{H}}_\kappa} \right.\leq B^2, \int q({\user2{x}})\,d{\user2{x}} = 1 \right\} $$
(20)
where \(\kappa({\user2{x}}, {\user2{x}}^{\prime}): {\mathcal{X}}\times{\mathcal{X}}\,\mapsto\, {\mathbb{R}}_+\) is a local kernel function with \(\int \kappa({\user2{x}}, {\user2{x}}^{\prime})\, d{\user2{x}}^{\prime} = 1,\) and B controls the functional norm of \(q({\user2{x}})\) in the reproducing kernel Hilbert space \({\mathcal{H}}_\kappa.\) An example of \(\kappa({\user2{x}}, {\user2{x}}^{\prime})\) is the RBF function, i.e., \(\kappa({\user2{x}}, {\user2{x}}^{\prime}) \propto \exp(-\lambda d({\user2{x}}, {\user2{x}}^{\prime})^2),\) where \(d({\user2{x}}, {\user2{x}}^{\prime}) = \|{\user2{x}} - {\user2{x}}^{\prime}\|_2.\) The distribution \(q_i({\user2{x}})\) is then approximated by a kernel density estimate \(\tilde{q}_i({\user2{x}})\) defined as follows
$$ {\tilde{q}}_i({\user2{x}}) = \sum_{l=1}^{n_i}\alpha_i^l\kappa\left({\user2{x}},{\user2{x}}_i^l\right), $$
(21)
where \(\alpha_i^l (1 \leq l \leq n_i)\) are the combination weights that satisfy (i) \(\alpha_i^l \geq 0,\) (ii) \(\sum_{l=1}^{n_i} \alpha_i^l = 1,\) and (iii) \(\alpha_i^T K_i \alpha_i \leq B^2,\) where \(K_i = [\kappa({\user2{x}}_i^l, {\user2{x}}_i^{l'})]_{n_i \times n_i}.\)
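
As an illustration, the sketch below evaluates the kernel density estimate of Eq. 21 with the normalized Gaussian kernel used later in Corollary 5 and the uniform weights \(\alpha_i^l = 1/n_i.\) The function name is our own.

```python
# A sketch of the kernel density estimate of Eq. 21.
import numpy as np

def kde(x, X_i, sigma):
    """q_i-tilde(x) with kappa a normalized Gaussian kernel and alpha_i^l = 1/n_i."""
    d = X_i.shape[1]
    norm = (2.0 * np.pi * sigma**2) ** (d / 2.0)           # Gaussian normalizer
    sq = ((X_i - x) ** 2).sum(axis=1)                      # squared distances to x
    return float(np.exp(-sq / (2.0 * sigma**2)).mean() / norm)
```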
Using the kernel density estimate of Eq. 21, we approximate the pairwise similarity as follows
$$ \tilde{s}_{ij} = {\hat{{\mathbb{E}}}}_{{\user2{v}}}\left[{\tilde{{\mathbb{E}}}}_i [f({\user2{x}})]{\tilde{{\mathbb{E}}}}_j[f({\user2{x}})]\right] = {\frac{1}{m}}\sum_{k=1}^m \left(\sum_{l=1}^{n_i} \alpha_i^l \theta\left({\user2{x}}_i^l, {\user2{v}}_k\right) \right) \left( \sum_{l=1}^{n_j} \alpha_j^l\theta\left( {\user2{x}}_j^l, {\user2{v}}_k \right) \right) $$
(22)
where function \(\theta({\user2{x}}, {\user2{v}})\) is defined as
$$ \theta({\user2{x}}, {\user2{v}}) = \int d{\user2{z}}\, I\left(d({\user2{z}}, {\user2{v}}) \leq \rho\right) \kappa({\user2{x}}, {\user2{z}}) $$
(23)
To bound the difference between \(\tilde{s}_{ij}\) and sij, we follow the analysis in [19] and view \({\mathbb{E}}_i [f({\user2{x}})] {\mathbb{E}}_j[f({\user2{x}})]\) as a mapping, denoted by \(g : {\mathcal{F}}\, \mapsto\, {\mathbb{R}}_+,\) i.e.,
$$ g(f; q_i, q_j) = {\mathbb{E}}_i [f({\user2{x}})]{\mathbb{E}}_j [f({\user2{x}})] $$
(24)
The domain for function g, denoted by \({\mathcal{G}},\) is defined as
$$ {\mathcal{G}} = \left\{g:{\mathcal{F}}\, \mapsto\, {\mathbb{R}}_+ \left|\ \exists q_i, q_j \in {\mathcal{F}}_D \hbox{ s.t. } g(f) = {\mathbb{E}}_i[f({\user2{x}})]{\mathbb{E}}_j[f({\user2{x}})]\right.\right\} $$
(25)
To bound the complexity of a class of functions, we introduce the concept of Rademacher complexity [2]:

Definition 1 (Rademacher complexity)

Suppose \(x_1, \ldots, x_n\) are i.i.d. samples from a set \({\mathcal{X}}.\) Let \({\mathcal{F}}\) be a class of functions mapping from \({\mathcal{X}}\) to \({\mathbb{R}}.\) The Rademacher complexity of \({\mathcal{F}}\) is defined as
$$ R_n({\mathcal{F}}) = {\mathbb{E}}_{x_1,\ldots,x_n,\sigma} \left(\sup_{f \in {\mathcal{F}}} {\frac{2}{n}}\sum_{i=1}^{n}\sigma_i f(x_i)\right) $$
(26)
where the \(\sigma_i\) are independent uniform ±1-valued random variables.
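
For intuition, the Rademacher complexity of a finite function class can be estimated by Monte Carlo: draw the σ vector repeatedly and average the supremum, which here reduces to a maximum over the class. The sketch below estimates the empirical (conditional-on-sample) version of Eq. 26 and is an illustration only, with names of our own choosing.

```python
# Monte Carlo estimate of the empirical Rademacher complexity (Eq. 26)
# for a finite function class evaluated on a fixed sample.
import numpy as np

def rademacher_complexity(F_values, n_draws=1000, rng=None):
    """F_values[j, i] = f_j(x_i): one row per function, one column per sample."""
    rng = rng or np.random.default_rng(0)
    n = F_values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)        # uniform +-1 variables
        total += (2.0 / n) * np.max(F_values @ sigma)  # sup over the (finite) class
    return total / n_draws
```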

Assuming that at least n key points are randomly sampled from each image, we have the following lemma that bounds the complexity of the domain \({\mathcal{G}}:\)

Lemma 1

The Rademacher complexity of function class\({\mathcal{G}},\)denoted by\(R_m({\mathcal{G}}),\)is bounded as
$$ R_m({\mathcal{G}}) \leq 2B C_{\kappa}{\frac{{\mathbb{E}}_{f}\left[\int d{\user2{x}} |f({\user2{x}})| \right]}{\sqrt{m}}} $$
(27)
where \(C_{\kappa} = \max_{{\user2{x}}, {\user2{z}}}{\sqrt{\kappa({\user2{x}}, {\user2{z}})}}.\)

Proof

Denote \(F = \{f_1, \ldots, f_m\}.\) According to the definition, we have
$$ \begin{aligned} R_m({\mathcal{G}}) &= {\mathbb{E}}_{\sigma,F}\left[\sup\limits_{g \in {\mathcal{G}}}{\frac{2}{m}}\sum\limits_{k=1}^{m}\sigma_k g(f_k)\right] \\ &= {\mathbb{E}}_{F} \left[ \left.{\mathbb{E}}_{\sigma}\left[\sup\limits_{g \in {\mathcal{G}}}{\frac{2}{m}}\sum\limits_{k=1}^{m}\sigma_k g(f_k) \right| F\right] \right] \\ &= {\mathbb{E}}_{F} \left[ {\mathbb{E}}_{\sigma}\left[\left.\sup\limits_{q_i,q_j\in {\mathcal{F}}_D}{\frac{2}{m}}\sum\limits_{k=1}^{m}\sigma_k {\mathbb{E}}_i[f_k]{\mathbb{E}}_j[f_k] \right| F \right] \right] \\ &\leq {\mathbb{E}}_{F} \left[ {\mathbb{E}}_{\sigma}\left[\left.\sup\limits_{\|\varvec{\omega}_i\| \leq B}{\frac{2}{m}}\sum\limits_{k=1}^{m}\sigma_k {\mathbb{E}}_i[f_k]\right| F \right] \right] \\ &= {\frac{2}{m}} {\mathbb{E}}_{F} \left[ {\mathbb{E}}_{\sigma} \left[\left.\sup\limits_{\|\varvec{\omega}_i\| \leq B} \left\langle \varvec{\omega}_i, \sum\limits_{k=1}^{m} \sigma_k \Upphi_k\right\rangle\right| F \right] \right] \\ &\hbox{where }\Upphi_k = \left(\langle \phi_1(\cdot), f_k(\cdot)\rangle, \langle \phi_2(\cdot), f_k(\cdot)\rangle, \ldots \right)\hbox{ and }\phi_k(x)\hbox{ is an eigen function} \\ &\hbox{of }\kappa(x, x^{\prime}) \\ &\leq {\frac{2B}{m}} {\mathbb{E}}_{F} \left[ {\mathbb{E}}_{\sigma}\left[ \left.\left\| \sum\limits_{k=1}^{m} \sigma_k \Upphi_k \right\| \right| F \right] \right] \\ &= {\frac{2B}{m}} {\mathbb{E}}_{F} \left[ \left.{\mathbb{E}}_{\sigma}\left[ \left( \sum\limits_{k,t} \sigma_k \sigma_t \left\langle \Upphi_k,\Upphi_t \right\rangle \right)^{\frac{1}{2}} \right| F \right] \right] \\ &\leq {\frac{2B}{m}} {\mathbb{E}}_{F} \left[ \left( \sum\limits_{k,t} {\mathbb{E}}_{\sigma}\left[\left. \sigma_k \sigma_t \left\langle \Upphi_k,\Upphi_t({\user2{x}}) \right\rangle \right|F\right]\right)^ {\frac{1}{2}} \right] \\ &= {\frac{2B}{m}} {\mathbb{E}}_{F} \left[ \left( \sum\limits_{k} {\mathbb{E}}_{\sigma}\left[ \left.\sigma_k^2 \left\langle \Upphi_k,\Upphi_k \right\rangle \right| F \right] \right)^{\frac{1}{2}} \right] \\ &= {\frac{2B}{m}} {\mathbb{E}}_{F} \left[ \left( \sum\limits_{k} \left\langle \Upphi_k,\Upphi_k \right\rangle \right)^{\frac{1}{2}} \right]\\ &= {\frac{2B}{m}} {\mathbb{E}}_{F} \left[ \left( \sum\limits_{k} \int d{\user2{z}}\,d{\user2{x}} f_k({\user2{x}}) f_k({\user2{z}}) \kappa({\user2{x}}, {\user2{z}})\right)^{\frac{1}{2}} \right] \\ &\leq 2B C_{\kappa}{\frac{{\mathbb{E}}_{f}\left[\int d{\user2{x}} |f({\user2{x}})| \right]} {\sqrt{m}}} \\ \end{aligned} $$
(28)
where the first inequality holds because \({\mathbb{E}}_j[f_k] \leq 1,\) the second inequality follows from the Cauchy–Schwarz inequality, and the third and fourth inequalities follow from Jensen’s inequality. The last equality follows from
$$ \langle \Upphi_k,\Upphi_k \rangle = \sum_{i} \left\langle \phi_i(\cdot), f_k(\cdot) \right\rangle^2 = \int dz\,dx f_k({\user2{x}}) f_k({\user2{z}}) \kappa({\user2{x}}, {\user2{z}}) $$
(29)

From [2], we have the following lemmas:

Lemma 2 (Theorem 12 in [2])

For 1 ≤ q < ∞, let \({\mathcal{L}} = \{|f-h|^q : f \in {\mathcal{F}}\},\) where h is a function for which \(\|f - h\|_{\infty}\) is uniformly bounded over \(f \in {\mathcal{F}}.\) We have
$$ R_n({\mathcal{L}}) \leq 2q\|f - h\|_{\infty} \left(R_n({\mathcal{F}}) + {\frac{\|h\|_{\infty}}{\sqrt{n}}}\right) $$
(30)

Lemma 3 (Theorem 8 in [2])

With probability 1 − δ, the following inequality holds
$$ {\mathbb{E}}\phi(Y, f(X)) \leq \hat{{\mathbb{E}}}_n\phi(Y, f(X)) + R_n(\phi \circ {\mathcal{F}}) + {\sqrt{{\frac{8\ln(2/\delta)}{n}}} } $$
(31)
where \(\phi(y, f(x))\) is the loss function, n is the number of samples, and \(\phi \circ {\mathcal{F}} = \{(x,y) \,\mapsto\, \phi(y,f(x)) - \phi(y,0):f \in {\mathcal{F}}\}.\)

Based on the above lemmas, we have the following theorem.

Theorem 4

Assume that the density functions \(q_i({\user2{x}}), q_j({\user2{x}}) \in {\mathcal{F}}_D.\) Let \(\tilde{q}_i({\user2{x}}), \tilde{q}_j({\user2{x}}) \in {\mathcal{F}}_D\) be the density functions estimated from n sampled key points. Then, with probability 1 − δ, the following inequality holds
$$ \begin{aligned} &{\mathbb{E}}_f[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|] \leq\hat{\mathbb{E}}_f [|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; \hat{q}_i,\hat{q}_j)|]\\ &\quad+2\left(2B C_{\kappa}{\frac{{\mathbb{E}}_{f}\left[\int d{\user2{x}} |f({\user2{x}})|\right]}{\sqrt{m}}}+{\frac{1}{\sqrt{m}}}\right)+{\sqrt{\frac{\ln(8/\delta)}{2m}}}+2{\sqrt{\frac{\ln(8m^2/\delta)}{2n}}} \end{aligned} $$
(32)

Proof

From Lemma 3, with probability 1 − δ/2, we have
$$ \begin{aligned} &{\mathbb{E}}_f [|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|] \\ &\quad\leq\hat{{\mathbb{E}}}_f [|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|]+R_m(|{\mathcal{G}}-g(f;q_i,q_j)|)+ {\sqrt{{\frac{8\ln(4/\delta)}{m}}}} \end{aligned} $$
(33)
Since \(0 \leq g(f; q_i, q_j) \leq 1,\) using the results in Lemmas 1 and 2, we have
$$ \begin{aligned} R_m\left(|{\mathcal{G}} - g(f; q_i, q_j)|\right) &\leq 2\left(R_m({\mathcal{G}}) + {\frac{1}{\sqrt{m}}}\right) \\ &\leq 2 \left( 2B C_{\kappa}{\frac{{\mathbb{E}}_{f}\left[\int d{\user2{x}} |f({\user2{x}})|\right]}{\sqrt{m}}} + {\frac{1}{\sqrt{m}}}\right) \end{aligned} $$
(34)
Hence, with probability 1 − δ/2, the following inequality holds
$$ \begin{aligned} &{\mathbb{E}}_f [|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|] \leq\hat{{\mathbb{E}}}_f[|g(f;\tilde{q}_i, \tilde{q}_j)-g(f;q_i,q_j)|]\\ &\quad + 2 \left( 2B C_{\kappa}{\frac{{\mathbb{E}}_{f}\left[\int d{\user2{x}}|f({\user2{x}})|\right]}{\sqrt{m}}} + {\frac{1} {\sqrt{m}}}\right) +{ \sqrt{{\frac{8\ln(4/\delta)}{m}}}} \end{aligned} $$
(35)
Next, we aim to bound \({\hat{{\mathbb{E}}}}_f [|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|].\) Note that
$$ \begin{aligned} &{\hat{{\mathbb{E}}}}_f [|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|] = {\frac{1}{m}}\sum_{k=1}^{m} |g(f_k;\tilde{q}_i,\tilde{q}_j)-g(f_k; q_i, q_j)|\\ & \quad\leq{\frac{1}{m}}\sum_{k=1}^{m}\left(|g(f_k;\tilde{q}_i, \tilde{q}_j)-g(f_k;\hat{q}_i,\hat{q}_j)|+|g(f_k;\hat{q}_i, \hat{q}_j)-g(f_k;q_i,q_j)|\right)\\ \end{aligned} $$
(36)
Using the same logic as in the proof of Theorem 3, we have, with probability 1 − δ/2,
$$ {\frac{1}{m}} \sum_{k=1}^{m} |g(f_k; {\hat{q}}_i, {\hat{q}}_j) - g(f_k; q_i, q_j)| \leq {\sqrt{{\frac{\ln(8/\delta)}{2m}}}} + 2{\sqrt{{\frac{\ln(8m^2/\delta)}{2n}}}} $$
(37)
From the above results, we have, with probability 1 − δ/2, the following inequality holds
$$ \begin{aligned} &{\frac{1}{m}} \sum_{k=1}^{m} |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; q_i, q_j)| \\ & \quad\leq {\frac{1}{m}} \sum_{k=1}^{m} |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; \hat{q}_i, \hat{q}_j)| +{\sqrt{{\frac{\ln(8/\delta)}{2m}}}} + 2{\sqrt{{\frac{\ln(8m^2/\delta)}{2n}}}} \end{aligned} $$
(38)
Combining the above results, we have that, with probability 1 − δ, the following inequality holds
$$ \begin{aligned} &{\mathbb{E}}_f[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|]\\ & \quad\leq\hat{{\mathbb{E}}}_f [|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; \hat{q}_i,\hat{q}_j)|]+2\left(2B C_{\kappa}{\frac{{\mathbb{E}}_{f}\left[\int d{\user2{x}} |f({\user2{x}})|\right]}{\sqrt{m}}} +{\frac{1}{\sqrt{m}}}\right)\\ &\qquad\quad+{\sqrt{{\frac{\ln(8/\delta)}{2m}}} } +2{\sqrt{{\frac{\ln(8m^2/\delta)}{2n}}}} \end{aligned} $$
(39)

In our empirical study, we will use the RBF kernel for \(\kappa({\user2{x}}, {\user2{x}}^{\prime})\) with \(\alpha_i^l = 1/n_i.\) The corollary below gives the bound for this choice of kernel density estimation.

Corollary 5

When the kernel function \(\kappa({\user2{x}}, {\user2{x}}^{\prime}) = {(1/(2\pi\sigma^2))}^{d/2} \exp(-\|{\user2{x}} - {\user2{x}}^{\prime}\|_2^2/(2\sigma^2))\) and \(\alpha_i^l = 1/n_i,\) the bound in Theorem 4 becomes
$$ \begin{aligned} &{{\mathbb{E}}_f[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|] \leq {\left(1/(2\pi\sigma^2)\right)}^{d/2}\left(1 - \exp\left(-\rho^2/(2\sigma^2)\right)\right)}\\ &\quad + 2{\frac{2{\mathbb{E}}_{f}\left[\int d{\user2{x}} |f({\user2{x}})|\right]/{\sqrt{n_i}} + 1}{\sqrt{m}}} +{\sqrt{{\frac{\ln(8/\delta)}{2m}}}} +2{\sqrt{\frac{\ln(8m^2/\delta)}{2n}}}\\ \end{aligned} $$
(40)

Remark

Theorem 4 bounds the true expectation of the difference between the similarity estimated by the kernel density function and the true similarity. Similar to Theorem 3, this bound consists of a term decreasing at a rate of \(O(1/\sqrt{m})\) and a term increasing at a rate of \(O(\sqrt{\ln m}).\) Moreover, we can see that in order to minimize the true expectation of this difference, we need to minimize the empirical expectation of the difference between the similarity estimated by the kernel density function and the similarity estimated by the empirical density function. If \(\kappa({\user2{x}},{\user2{v}})\) decreases exponentially as \(d({\user2{x}}, {\user2{v}})\) increases, as is the case for the Gaussian kernel, then \(\theta({\user2{x}}, {\user2{v}})\) is close to 1 when \(d({\user2{x}}, {\user2{v}}) \leq \rho\) and close to 0 when \(d({\user2{x}}, {\user2{v}}) > \rho.\) In such circumstances, setting \(\alpha_i^l = 1/n_i\) for all 1 ≤ l ≤ ni is a good choice for the approximation, and it is also very efficient since we do not need to learn α.

Note that although the idea of kernel density estimation was already proposed in some studies [e.g., 22], to the best of our knowledge, this is the first work that reveals the statistical consistency of kernel density estimation for the bag-of-words representation.

4 Empirical study

In this empirical study, we aim to verify the proposed framework and the related analysis. To this end, based on the discussion in Sect. 3.2, we present two random algorithms for vector quantization, shown in Algorithm 1. We refer to the algorithm based on the empirical distribution as “Quantization via Empirical Estimation”, or QEE for short, and to the algorithm based on kernel density estimation as “Quantization via Kernel Estimation”, or QKE for short. Note that since neither algorithm relies on clustering to identify the visual words, both are in general computationally more efficient. In addition, both algorithms have error bounds that decrease at the rate of \(O(1/\sqrt{m})\) when the number of key points n is large, indicating that they are robust to the number of visual words m. We emphasize that although similar random algorithms for vector quantization have been discussed in [5,14,17,22,24], the purpose of this empirical study is to verify that
  • simple random algorithms deliver performance in object recognition similar to that of the clustering-based algorithm, and

  • the random algorithms are robust to the number of visual words, as predicted by the statistical consistency analysis.

Finally, in the implementation of QKE, to calculate the θ function efficiently, we approximate it following [1] as
$$ \theta \approx {\frac{2(\tilde{d}-\tilde{\rho})^2-1}{4{\sqrt{\pi}} (\tilde{d}-\tilde{\rho})^3\exp\left((\tilde{d}-\tilde{\rho})^2\right)}} -{\frac{2(\tilde{d}+\tilde{\rho})^2-1}{4{\sqrt{\pi}} (\tilde{d}+\tilde{\rho})^3\exp\left((\tilde{d}+\tilde{\rho})^2\right)}} $$
(41)
where \(\tilde{d}=d/\sigma,\) \(\tilde{\rho}=\rho/\sigma,\) and σ is the width of the Gaussian kernel.
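
A direct transcription of this approximation, together with the QKE similarity of Eq. 22 under uniform weights, is sketched below. Note that Eq. 41 is an asymptotic expansion and degenerates when \(\tilde{d}\) is close to \(\tilde{\rho}\) (the denominator vanishes), so a practical implementation would clip the result to [0, 1]; the names are our own.

```python
# A sketch of the theta approximation (Eq. 41) and the QKE similarity (Eq. 22).
import numpy as np

def theta_approx(d, rho, sigma):
    """Eq. 41; only accurate when |d - rho| is well away from 0 (t**3 in the
    denominator), so in practice the output should also be clipped to [0, 1]."""
    def term(t):
        return (2.0 * t**2 - 1.0) / (4.0 * np.sqrt(np.pi) * t**3 * np.exp(t**2))
    dt, rt = d / sigma, rho / sigma
    return term(dt - rt) - term(dt + rt)

def qke_similarity(X_i, X_j, centers, rho, sigma):
    """Eq. 22 with uniform weights alpha_i^l = 1/n_i."""
    d_i = np.linalg.norm(X_i[:, None] - centers[None], axis=2)   # (n_i, m) distances
    d_j = np.linalg.norm(X_j[:, None] - centers[None], axis=2)
    t_i = theta_approx(d_i, rho, sigma).mean(axis=0)             # per-center averages
    t_j = theta_approx(d_j, rho, sigma).mean(axis=0)
    return float(t_i @ t_j) / len(centers)
```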

Two data sets are used in our study: the PASCAL VOC Challenge 2006 data set [4] and the Graz02 data set [15]. PASCAL06 contains 5,304 images from 10 classes. We randomly select 100 images for training and 500 for testing. The Graz02 data set contains 365 bike images, 420 car images, 311 people images, and 380 background images. We randomly select 100 images from each class for training, and use the remaining images for testing. By using a relatively small number of training examples, we are able to examine the sensitivity of a vector quantization algorithm to the number of visual words. On average, 1,000 key points are extracted from each image, and each key point is represented by the SIFT local descriptor [23]. For the PASCAL06 data set, the binary classification performance for each object class is measured by the area under the ROC curve (AUC); for the Graz02 data set, it is measured by classification accuracy. Results averaged over ten random trials are reported.

We compare three vector quantization methods: K-means, QEE and QKE. Note that we do not include more advanced algorithms for vector quantization in our study, because the objective of this study is to validate the proposed statistical framework for the bag-of-words representation and the analysis of statistical consistency. The threshold ρ used by the quantization functions \(f({\user2{x}})\) is set to \(\rho = 0.5 \times {\bar{d}},\) where \(\bar{d}\) is the average distance between all the key points and the randomly selected centers. An RBF kernel is used in QKE, with the kernel width σ set to \(0.75{\bar{d}}\) based on our experience. A binary linear SVM is used for each classification problem. To examine the sensitivity to the number of visual words, for both data sets we vary the number of visual words from 10 to 10,000, as shown in Figs. 1 and 2.
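
A sketch of this protocol is given below, assuming the histograms have already been computed by QEE or QKE; the helper names are hypothetical, and scikit-learn's LinearSVC stands in for the binary linear SVM.

```python
# A sketch of the experimental setup: rho = 0.5 * d_bar, sigma = 0.75 * d_bar,
# followed by a binary linear SVM on the bag-of-words histograms.
import numpy as np
from sklearn.svm import LinearSVC

def average_distance(keypoints, centers):
    """d_bar: the average distance between key points and the selected centers."""
    return float(np.linalg.norm(keypoints[:, None] - centers[None], axis=2).mean())

def fit_classifier(histograms, labels):
    """histograms: (N, m) bag-of-words features; labels: binary class labels."""
    return LinearSVC(C=1.0).fit(histograms, labels)
```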
Fig. 1 Comparison of different quantization methods with a varying number of visual words on PASCAL06

Fig. 2 Comparison of different quantization methods with a varying number of visual words on Graz02

First, we observe that the proposed algorithms for vector quantization yield performance comparable to, if not better than, that of the K-means clustering algorithm. This confirms that the proposed statistical framework for key point quantization is effective. Second, we observe that the clustering-based approach for vector quantization tends to perform worse, sometimes significantly so, when the number of visual words is large. We attribute this instability to the fact that K-means requires each key point to belong to exactly one visual word. If the number of clusters is inappropriate, for example too large compared to the number of instances, two related key points may be separated into different clusters even though both are very near the cluster boundary, which leads to a poor estimation of the pairwise similarity. This problem of “hard assignment” was also observed in [17,22]. In contrast, the proposed algorithms show a rather stable improvement as the number of visual words increases, consistent with our statistical consistency analysis.

5 Conclusion

The bag-of-words representation is a popular approach to object categorization. Despite its success, few studies have been devoted to the theoretical analysis of the bag-of-words representation. In this work, we present a statistical framework for key point quantization that generalizes the bag-of-words model by statistical expectation. We present two random algorithms for vector quantization in which the visual words are generated by a statistical process rather than by a clustering algorithm, and give a theoretical analysis of their statistical consistency. We also verify the efficacy and the robustness of the proposed framework by applying it to object recognition. In the future, we plan to examine the dependence of the proposed algorithms on the threshold ρ, and to extend QKE to weighted kernel density estimation.

Acknowledgments

We want to thank the reviewers for helpful comments and suggestions. This research is partially supported by the National Fundamental Research Program of China (2010CB327903), the Jiangsu 333 High-Level Talent Cultivation Program and the National Science Foundation (IIS-0643494). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

  1. Abramowitz M, Stegun IA (eds) (1972) Handbook of mathematical functions with formulas, graphs, and mathematical tables. Dover, New York
  2. Bartlett PL, Mendelson S (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J Mach Learn Res 3:463–482
  3. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: ECCV workshop on statistical learning in computer vision, Prague, Czech Republic
  4. Everingham M, Zisserman A, Williams CKI, Van Gool L (2006) The PASCAL visual object classes challenge 2006 (VOC2006) results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
  5. Farquhar J, Szedmak S, Meng H, Shawe-Taylor J (2005) Improving “bag-of-keypoints” image categorisation. Technical report, University of Southampton
  6. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning, Chemnitz, Germany, pp 137–142
  7. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, pp 604–610
  8. Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebooks by information loss minimization. IEEE Trans Pattern Anal Mach Intell 31(7):1294–1309
  9. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
  10. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI workshop on learning for text categorization, Madison, WI
  11. McDiarmid C (1989) On the method of bounded differences. In: Surveys in combinatorics 1989, pp 148–188
  12. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. In: Schölkopf B, Platt J, Hoffman T (eds) Advances in neural information processing systems, vol 19. MIT Press, Cambridge, pp 985–992
  13. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, New York, NY, pp 2161–2168
  14. Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 490–503
  15. Opelt A, Pinz A, Fussenegger M, Auer P (2006) Generic object recognition with boosting. IEEE Trans Pattern Anal Mach Intell 28(3):416–431
  16. Perronnin F, Dance C, Csurka G, Bressan M (2006) Adapted vocabularies for generic visual categorization. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 464–475
  17. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, Anchorage, AK
  18. Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
  19. Shawe-Taylor J, Dolia A (2007) A framework for probability density estimation. In: Proceedings of the 11th international conference on artificial intelligence and statistics, San Juan, Puerto Rico, pp 468–475
  20. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proceedings of the 9th IEEE international conference on computer vision, Nice, France, pp 1470–1477
  21. Tuytelaars T, Schmid C (2007) Vector quantizing feature space with a regular lattice. In: Proceedings of the 11th IEEE international conference on computer vision, Rio de Janeiro, Brazil, pp 1–8
  22. van Gemert JC, Geusebroek J-M, Veenman CJ, Smeulders AWM (2008) Kernel codebooks for scene categorization. In: Proceedings of the 10th European conference on computer vision, Marseille, France, pp 696–709
  23. Vedaldi A, Fulkerson B (2008) VLFeat: an open and portable library of computer vision algorithms. http://www.vlfeat.org/
  24. Viitaniemi V, Laaksonen J (2008) Experiments on selection of codebooks for local image feature histograms. In: Proceedings of the 10th international conference series on visual information systems, Salerno, Italy, pp 126–137
  25. Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, pp 1800–1807

Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  2. Department of Computer Science & Engineering, Michigan State University, East Lansing, USA
