1 Introduction and background

Indexing sets of points in high-dimensional spaces in sub-linear time is one of the long-standing open problems in multimedia databases and multimedia information retrieval. In its most general form, the problem is unsolved and, quite possibly, unsolvable. While a formal proof exists only for specific models such as pivot-based indices [53], one could paraphrase Charles A. Rogers by saying that “most mathematicians believe and all engineers know” that in high-dimensional feature spaces no indexing scheme is significantly better than a linear scan when answering the most common types of queries: nearest neighbor and range.

If we don’t know how to index efficiently, it is not for lack of ingenuity in devising indexing schemes but, frustratingly, because of the characteristics of the spaces in which we execute the search.

In this paper, we present an efficient method for probably approximately correct nearest-neighbor determination in feature spaces of medium to high dimensionality. The algorithm can be implemented on top of any indexing method with sub-linear search time in low-dimensional feature spaces. The efficiency of the method depends on characteristics of the variance distribution of the features, which we verify experimentally to be satisfied by most features of practical interest.

Consider the following general setting. We have a data base D composed of elements of a space Ω to which the query also belongs. The domain is a metric space, equipped with a distance function ρ. We also assume that Ω is equipped with a Borel measure μ such that μ(Ω) = 1, and that the data set D is drawn uniformly with respect to μ [52]. The triple (Ω,ρ,μ) is known as a space with metric and measure (mm-space). Note that, in practical applications, it is often the data base D that determines μ: μ is whatever measure makes D uniformly distributed. Also note that the set D itself can be seen as a discrete mm-space with distance function ρ|D and the normalized counting measure μ(A) = |A|/|D| for \(A\subseteq {D}\).

One important characteristic of an mm-space is its contraction function α. The idea is the following. Take a set \(A\subseteq {\Omega }\) with μ(A) = 1/2 and make A a bit bigger by taking all elements at a distance at most 𝜖 from A. How much of Ω is taken up by the new set, and how much of Ω is left untouched? That is, for \(A\subseteq {\Omega }\), let

$$ A_{\epsilon} = \left\{ x : \inf_{a\in{A}} \{ \rho(x,a) \} \le \epsilon \right\} $$
(1)

with A0 = A; then

$$ \alpha(\epsilon) = 1 - \inf\left\{ \mu(A_{\epsilon}) : A\subseteq{\Omega}, \mu(A)\ge\frac{1}{2} \right\} $$
(2)

That is, α(𝜖) is the largest fraction of the space (according to the measure μ) that is left out after we take a set of measure at least 1/2 and grow it by 𝜖. Note that α(0) = 1/2.

Let Ωn be a family of mm-spaces indexed by an integer parameter n. The spaces form a Lévy family if for all 𝜖 > 0, \(\lim _{n\rightarrow \infty }\alpha _{n}(\epsilon )=0\) [37].

Many of the headaches in the design of a good multimedia indexing algorithm derive from the fact that most classes of feature spaces used in practical applications are, indeed, Lévy families. For example, \(({\mathbb {R}}^{n},\rho ,\mu _{n})\), where ρ is any Minkowski distance and μn is the Gaussian measure with \(\sigma =1/\sqrt {n}\), is a Lévy family, and so are the unit spheres in \({\mathbb {R}}^{n}\) with the uniform measure [29, 40]. In both cases, if αn(𝜖) is the contraction function of the nth member of the family, one can show that there are c1,c2 > 0 such that

$$ \alpha_{n}(\epsilon) \le c_{1} \exp(-c_{2} n \epsilon^{2} ) $$
(3)
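To get a feel for how quickly this bound decays, consider a purely illustrative numerical example in which the (unspecified) constants are set to c1 = c2 = 1:

$$ n = 1000,\ \epsilon = 0.1:\qquad \alpha_{n}(0.1) \le e^{-1000\cdot 0.01} = e^{-10} \approx 4.5\times 10^{-5} $$

that is, already at moderate dimensionality, enlarging a half-measure set by a small 𝜖 leaves out only an exponentially small fraction of the space.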

The contraction function constrains how “accurate” the functions with which we “sample” the behavior of points in Ω can be. A function \(f:{\Omega }\rightarrow {\mathbb {R}}\) is 1-Lipschitz (f ∈Lip1(Ω)) if, for all x,y ∈Ω

$$ | f(x) - f(y)| \le \rho(x, y) $$
(4)

(see [22]). One can prove the following property:

Theorem 1 ([46])

Let f ∈Lip1(Ω) and let M be its median, that is, the value such that

$$ \mu\left\{\omega : f(\omega)\le{M}\right\}\ge\frac{1}{2}\ \ \ \ \mu\left\{\omega : f(\omega)\ge{M}\right\}\ge\frac{1}{2} $$
(5)

Then, for all 𝜖 > 0,

$$ \mu\left\{\omega : |f(\omega)-M|>\epsilon\right\} < 2 \alpha(\epsilon) $$
(6)

That is: in a highly concentrated space, all Lip1(Ω) functions concentrate around their median.
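Combining (3) with (6), for a Lévy family that satisfies the Gaussian bound above (the constants c1,c2 being those of (3)), every f ∈Lip1(Ω) deviates from its median by more than 𝜖 only on an exponentially small fraction of the space:

$$ \mu\left\{\omega : |f(\omega)-M|>\epsilon\right\} < 2 c_{1} \exp\left(-c_{2} n \epsilon^{2}\right) $$

which is the quantitative form of the statement above.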

One important observation, one which opens up a way of mitigating the effects of concentration, is that often the actual dimensionality of the data is much lower than that of the space in which the data are represented. That is, many data sets in, for example, \({\mathbb {R}}^{n}\) actually lie on—or close to—a manifold M of dimensionality m ≪ n.

This phenomenon is captured by various measures of intrinsic dimensionality; one of the best known is due to Chavez et al. [18] and is defined as

$$ d_{I}({\Omega}) = \frac{m(\rho)^{2}}{2\sigma^{2}(\rho)} $$
(7)

where m(ρ) is the mean value of the distance function ρ on Ω with respect to the measure μ and σ2(ρ) is its variance.
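As an illustration, dI can be estimated from a sample of pairwise distances; the sketch below (the function name, sampling strategy, and use of the Euclidean distance are our own illustrative choices) computes the mean and variance of ρ over random pairs:

```python
import numpy as np

def intrinsic_dimension(X, n_pairs=100_000, rng=None):
    """Estimate the intrinsic dimensionality from sampled pairwise distances.

    X       : (D, N) array, one feature vector per row.
    n_pairs : number of random pairs used for the estimate.
    """
    rng = np.random.default_rng(rng)
    i = rng.integers(0, len(X), size=n_pairs)
    j = rng.integers(0, len(X), size=n_pairs)
    keep = i != j                                    # discard degenerate pairs
    d = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
    return d.mean() ** 2 / (2.0 * d.var())           # squared mean over twice the variance

# Example: a Gaussian cloud living on a 5-dimensional subspace of R^100.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5)) @ rng.normal(size=(5, 100))
print(intrinsic_dimension(X))   # a few units: far below the ambient dimension of 100
```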

The intrinsic dimension may be orders of magnitude smaller than the dimension of the space in which the features are defined. This observation opens up the possibility of alleviating the dimensionality problem, at least if we accept only probably correct results: it is possible, in principle, to approximate the distance between two points of the space using the geodesic distance in a manifold of much smaller dimensionality. There are, however, two problems.

  i)

    Although dI gives a general indication of the dimensionality of a manifold that approximates the distribution of a data base D, it gives no indication of how to identify the manifold, nor does it give its exact dimensionality (it usually underestimates it). Moreover, the numerical characterization of a manifold that contains D points in \({\mathbb {R}}^{N}\) may require up to O(DN) parameters, thus making an index based on it not competitive vis-à-vis sequential scanning, unless the manifold admits a compact representation.

  ii)

    Answering a query using the distance in a sub-manifold results in an approximation and, therefore, in a probability of error. Any algorithm that uses this kind of approximation should allow an estimation of the error probability, and it should provide means to control it, trading off efficiency against error probability.

The first problem is often circumvented by restricting the manifold to be a linear sub-space of the feature space. One can use techniques such as singular value decomposition [25] to find sub-spaces of reduced dimensionality in which most of the variance of the data is concentrated. Computing distances in these sub-spaces approximates the actual distance, so a reduced-dimensionality space can be used to find approximate solutions to a query while avoiding searches in high-dimensional spaces. Note that in this case the linear sub-space will in general be of dimensionality much higher than dI, since the lowest-dimensional manifold is in general non-linear and the linear analysis finds a linear sub-space that contains the manifold.

The algorithm that we present in this paper is a simple variation of this strategy that allows us to solve the second problem and to fine-tune the trade-off between error and performance.

After the presentation of the algorithm in Section 2, the paper is divided into two main parts: a theoretical one (Sections 3, 4, and 5) and an experimental one (Sections 6 and 7). In Section 3 we determine the theoretical error probability of the algorithm, that is, the probability that the algorithm returns the wrong nearest neighbor, as a function of the algorithm parameters. In Section 4 we calculate the expected value of the distance error, that is, of the difference between the distance from the query of the item returned by the algorithm and that of the true nearest neighbor of the query, again as a function of the algorithm parameters. In Section 5 we develop a cost model for the algorithm.

The experimental part begins in Section 6 with validation: we use the SUN data set [63] and various features to determine experimentally the quantities calculated theoretically in Sections 3 and 4, and show that these are a good fit to the actual behavior of the algorithm. In Section 7 we determine experimentally the cost of indexing as a function of the algorithm parameters, both in terms of the number of computations and of the number of block accesses. In the same section we compare the performance of the algorithm with that of other approximate indexing methods, and carry out an analysis of the differences. Section 8 places this paper in the context of related work, and Section 9 draws some conclusions. Details of the experimental methodology and set-up are given in Appendices A and B.

2 Approximate nearest neighbor

The starting point of our method is the idea that we introduced at the end of the previous section, namely, to identify a sub-space in which most of the distance is concentrated, and to use the distance in this sub-space as a basis for the approximate determination of the nearest neighbor.

Let (Ω,ρ,μ) be an mm-space, where \({\Omega }\sim {\mathbb {R}}^{N}\) and ρ is the Euclidean distance on Ω ×Ω. Let \(\mathbf {D}\in {\mathbb {R}}^{N\times {D}}\) be the given data base, with D = [d1|⋯|dD], where di ∈Ω is the i th element of the data base. Assume that μ is a measure such that the di are uniformly distributed with respect to it.

We decompose the matrix D using singular value decomposition as

$$ \mathbf{D} = U{\Sigma}{V^{\prime}} $$
(8)

where \(V\in {\mathbb {R}}^{D\times {N}}\), Σ = diag(λ1,…,λN), with \(\lambda_{i}\ge\lambda_{i+1} > 0\), and where \(U\in {\mathbb {R}}^{N\times {N}}\) is an orthogonal matrix whose columns are a basis of \({\mathbb {R}}^{N}\),

$$ U = [u_{1}|\cdots|u_{N}]. $$
(9)

It can be shown [25] that the transformation \(y=U^{\prime }x\) is a rotation of the space such that the variance of the data base D in the transformed space has the property:

$$ \text{var}\left( u_{i}^{\prime} \mathbf{D}\right) \ge \text{var}\left( u_{i+1}^{\prime} \mathbf{D}\right) $$
(10)

Based on this observation, select a dimensionality M < N and define the matrices

$$ \begin{aligned} \mathbf{M}_{M} &= [u_{1}|\cdots|u_{M}]^{\prime} \\ \mathbf{N}_{M} &= [u_{M+1}|\cdots|u_{N}]^{\prime} \end{aligned} $$
(11)

With these matrices we decompose Ω ≡Ξ×Θ with

$$ \begin{aligned} {\Xi} &= \text{span}(\mathbf{M}_{M}) \sim {\mathbb{R}}^{M} \\ {\Theta} &= \text{span}(\mathbf{N}_{M}) \sim {\mathbb{R}}^{N-M} \end{aligned} $$
(12)

Similarly, we can define the restrictions to Ξ and Θ of ρ: ρΞ and ρΘ, respectively. Given an element x ∈Ω, we can represent it as the pair (y,z) with y = MMx and z = NMx. The decomposition is such that the covariances of D (and of any stochastic variable uniform in μ) under these transformations are approximately diagonal:

$$ \begin{aligned} {\Sigma}_{\Xi} &= \text{cov}(\mathbf{M}_{M}\mathbf{D}) \approx \text{diag}\left( \sigma_{\Xi,1}^{2},\ldots,\sigma_{\Xi,M}^{2}\right) \\ {\Sigma}_{\Theta} &= \text{cov}(\mathbf{N}_{M}\mathbf{D}) \approx \text{diag}\left( \sigma_{\Theta,1}^{2},\ldots,\sigma_{\Theta,N-M}^{2}\right) \end{aligned} $$
(13)

with

$$ \sigma_{\Xi,i}^{2} \ge \sigma_{\Xi,i+1}^{2} \ge \sigma_{\Theta,j}^{2} \ge \sigma_{\Theta,j+1}^{2} $$
(14)

Define

$$ \sigma_{\Xi}^{2} = \sum\limits_{i=1}^{M} \sigma_{\Xi,i}^{2},\ \ \ \ \sigma_{\Theta}^{2} = \sum\limits_{i=1}^{N-M} \sigma_{\Theta,i}^{2} $$
(15)

and

$$ \nu = \frac{\sigma_{\Xi}^{2}}{\sigma_{\Theta}^{2}} $$
(16)

The parameter ν measures how much of the variance is concentrated in the sub-space Ξ relative to Θ. We shall always choose M so that ν > 1. Our algorithm is then as follows:

i):

Given a query point q, find its nearest neighbor in Ξ; let \({\rho _{0}^{2}}\) be the square of the distance from q to its nearest neighbor in Ξ;

ii):

find all the points whose distance from q in Ξ is less than \(\sqrt {{\rho _{0}^{2}}+\alpha }\), where α is a suitable value that we shall discuss in the following; let T be the set of such points;

iii):

scan the set T and, among the elements of T, find the nearest point to q in the complete feature space Ω.

More formally, we can describe the algorithm as in Fig. 1.

Fig. 1

The approximate nearest neighbor algorithm. Here, (Ξ,ρΞ,μΞ) is an mm-subspace of (Ω,ρ,μ) of dimensionality M. \(\mathbf {D}^{\prime }\) is the projection of the data base D on Ξ (computed offline). For the use of the distance in steps 2 and 4 and its square in steps 4 and 5, see the text

We assume that M is small enough that the “arg min” in line 2 and the range query in line 4 can profitably be done using a fast indexing method, while the general feature space has dimensionality such that the “arg min” in line 5 has to be done by scanning all the elements of the set T. Note also the use of ρ2 in line 5: if we assume a Euclidean distance, computing ρ2 is cheaper than computing ρ and does not change the result. On the other hand, indexing algorithms often rely on the triangle inequality, thus requiring in line 2 the use of the distance rather than its square. In the case of a different Minkowski distance Lp, we would use ρp in line 5; in the case of more complex distances, we can use any monotonic function of the distance that makes its computation cheaper.
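To make the steps concrete, the following is a minimal Python sketch of both the offline projection and the query procedure of Fig. 1; the class and function names are ours, NumPy is used for convenience, and a brute-force scan stands in for the fast low-dimensional index that would be used in practice:

```python
import numpy as np

class ApproximateNN:
    def __init__(self, data, M):
        """data: (D, N) array, one item per row; M: dimensionality of the sub-space Xi."""
        self.data = data
        # Offline: SVD of the data base; the rows of Vt are the axes u_i' of (9).
        _, _, Vt = np.linalg.svd(data, full_matrices=False)
        self.M_M = Vt[:M]                      # projection matrix M_M onto Xi
        self.proj = data @ self.M_M.T          # D', the data base projected on Xi

    def query(self, q, alpha):
        qx = q @ self.M_M.T
        # Step 2: nearest neighbor in Xi (brute force here; a K-D tree in practice).
        d2_xi = ((self.proj - qx) ** 2).sum(axis=1)
        rho0_sq = d2_xi.min()
        # Step 4: range query in Xi with squared radius rho_0^2 + alpha -> candidate set T.
        T = np.flatnonzero(d2_xi <= rho0_sq + alpha)
        # Step 5: scan T using the squared distance in the full space Omega.
        d2_full = ((self.data[T] - q) ** 2).sum(axis=1)
        return T[d2_full.argmin()]

# Usage sketch: 10,000 random 512-dimensional items, Xi of dimension M = 20.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 512))
index = ApproximateNN(X, M=20)
print(index.query(rng.normal(size=512), alpha=0.5))
```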

The two important parameters of the algorithm are \(M=\dim ({\Xi })\) and α (the extra distance at which we search in step 4). These parameters determine the execution time and the probability of finding the actual nearest neighbor.

3 Error probability

The algorithm of the previous section finds the true nearest neighbor only if it belongs to the set T, that is, if its distance from the query in Ξ does not exceed the distance of the closest projection in Ξ by more than α. In this section, we develop an approximate expression for the probability that this doesn’t happen, that is, for the probability that the algorithm returns the wrong nearest neighbor. This expression is derived under strong simplifying assumptions but, as we shall see in Section 6, it is a good fit to the data and is therefore very useful as a convenient design tool to determine the values of M and α.

Assume that ρ is Euclidean, μ is Gaussian, and that the query q is drawn uniformly from μ. Step 2 of the algorithm finds the nearest neighbor in Ξ; call it n. Its actual distance from the query is given by

$$ \rho^{2}(q,n) = \rho_{\Xi}^{2}(q,n) + \rho_{\Theta}^{2}(q,n) $$
(17)

Suppose that the algorithm gives the wrong answer. The real nearest neighbor of q is item t, with a distance from q

$$ \rho^{2}(q,t) = \rho_{\Xi}^{2}(q,t) + \rho_{\Theta}^{2}(q,t) < \rho^{2}(q,n) $$
(18)

Had t been in the set T that we build at step 4, the algorithm would have returned it as a solution. Since we are assuming that we get the wrong answer, it must be tT, or

$$ \rho_{\Xi}^{2}(q,t) > \rho_{\Xi}^{2}(q,n) + \alpha $$
(19)

Set

$$ \begin{aligned} {\Delta}_{\Xi} &\stackrel{\triangle}{=} \rho_{\Xi}^{2}(q,t) - \rho_{\Xi}^{2}(q,n) \\ {\Delta}_{\Theta} &\stackrel{\triangle}{=} \rho_{\Theta}^{2}(q,t) - \rho_{\Theta}^{2}(q,n). \end{aligned} $$
(20)

With these definitions, we can write the probability of (18) conditioned on (19) as

$$ \begin{aligned} {\mathbb{P}}_{E} &= {\mathbb{P}}\left\{ \rho^{2}(q,t)<\rho^{2}(q,n) \left| \rho_{\Xi}^{2}(q,t)>\rho_{\Xi}^{2}(q,n) + \alpha \right.\right\} \\ &= {\mathbb{P}}\left\{ \rho_{\Xi}^{2}(q,t)-\rho_{\Xi}^{2}(q,n) < \rho_{\Theta}^{2}(q,n)-\rho_{\Theta}^{2}(q,t) \left| \rho_{\Xi}^{2}(q,t)>\rho_{\Xi}^{2}(q,n)+\alpha\right.\right\}\\ &= {\mathbb{P}}\left\{ {\Delta}_{\Xi} < -{\Delta}_{\Theta} \left| {\Delta}_{\Xi}>\alpha\right.\right\}\\ &= \frac{{\mathbb{P}}\left\{ {\Delta}_{\Xi} < -{\Delta}_{\Theta}, {\Delta}_{\Xi}>\alpha\right\}}{{\mathbb{P}}\left\{{\Delta}_{\Xi}>\alpha\right\}} \\ &\stackrel{\triangle}{=} \frac{P_{1}}{P_{2}} \end{aligned} $$
(21)

If μ is Gaussian and the elements of the data base as well as the query are uniform in μ, then \(\rho _{\Xi }^{2}\) and \(\rho _{\Theta }^{2}\) have χ2 distributions with a number of degrees of freedom that depends on the dimensionality of the respective sub-space [34]. Here, we’ll do our most important approximation: since the variance of the data is predominantly on the first axes of Ξ and Θ, we shall assume that we can adequately approximate the distribution with just two degrees of freedom, that is, that \(\rho _{\Xi }^{2}\) and \(\rho _{\Theta }^{2}\) have exponential distribution:

$$ \begin{aligned} p_{\Xi}(u) &= \frac{1}{2\sigma_{\Xi}^{2}} \exp\left( - \frac{u}{2\sigma_{\Xi}^{2}} \right) \\ p_{\Theta}(u) &= \frac{1}{2\sigma_{\Theta}^{2}} \exp\left( - \frac{u}{2\sigma_{\Theta}^{2}} \right) \end{aligned} $$
(22)

The difference between squares of distances will then have a Laplace distribution:

$$ \begin{aligned} {\Delta}_{\Xi}(u) &= \frac{1}{4\sigma_{\Xi}^{2}} \exp\left( - \frac{|u|}{2\sigma_{\Xi}^{2}} \right) \\ {\Delta}_{\Theta}(u) &= \frac{1}{4\sigma_{\Theta}^{2}} \exp\left( - \frac{|u|}{2\sigma_{\Theta}^{2}} \right) \end{aligned} $$
(23)
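The step from (22) to (23) uses the fact that the difference of two i.i.d. exponential variables with mean 2σ2 has a Laplace distribution with the same scale; a quick NumPy check of this claim (a verification sketch, not part of the method; σ2 is chosen arbitrarily) is:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_sq = 1.5                         # arbitrary value of sigma^2
scale = 2 * sigma_sq
x = rng.exponential(scale, size=1_000_000)
y = rng.exponential(scale, size=1_000_000)
delta = x - y                          # should follow a Laplace law with scale 2*sigma^2

# Compare empirical and theoretical densities at a few points.
for u in (0.0, 1.0, 3.0):
    emp = np.mean(np.abs(delta - u) < 0.05) / 0.1
    theo = np.exp(-abs(u) / scale) / (2 * scale)
    print(f"u={u:3.1f}  empirical {emp:.4f}  Laplace {theo:.4f}")
```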

Under these hypotheses, with reference to the quantities P1 and P2 defined in (21), we have

$$ \begin{aligned} P_{2} &= {\int}_{\alpha}^{\infty}\!\!\!\!{\Delta}_{\Xi}(u) du = \frac{1}{4\sigma_{\Xi}^{2}} {\int}_{\alpha}^{\infty}\!\!\!\!\exp\left( -\frac{u}{2\sigma_{\Xi}^{2}}\right) du \\ &= \frac{1}{2}\exp\left( -\frac{\alpha}{2\sigma_{\Xi}^{2}} \right) \\ &\stackrel{\triangle}{=} \frac{1}{2}\exp\left( -\frac{\zeta}{2} \right) \end{aligned} $$
(24)

Here we have defined \(\zeta \stackrel {\triangle }{=}\alpha /\sigma _{\Xi }^{2}\): ζ is the excess distance at step 4 measured in units of the variance of the data in the subspace Ξ. We can consider it as a distance measured in the “natural” units for the distribution. The value P1 can be determined as:

$$ \begin{aligned} P_{1} &= {\int}_{\alpha}^{\infty}\!\!\!\!{\Delta}_{\Xi}(u){\int}_{-\infty}^{-u}\!\!\!\!{\Delta}_{\Theta}(v) dv du \\ &\stackrel{\dagger}{=}{\int}_{\alpha}^{\infty}\!\!\!\!{\Delta}_{\Xi}(u){\int}_{u}^{\infty}\!\!\!\!{\Delta}_{\Theta}(v) dv du \\ &= \frac{1}{16\sigma_{\Xi}^{2}\sigma_{\Theta}^{2}} {\int}_{\alpha}^{\infty}\!\!\!\!\exp\left( - \frac{u}{2\sigma_{\Xi}^{2}} \right) {\int}_{u}^{\infty}\!\!\!\!\exp\left( - \frac{v}{2\sigma_{\Theta}^{2}} \right) dv du \\ &= \frac{1}{2} \frac{\sigma_{\Theta}^{2}}{\sigma_{\Xi}^{2}+\sigma_{\Theta}^{2}} \exp\left( -\frac{1}{2}\frac{\sigma_{\Xi}^{2}+\sigma_{\Theta}^{2}}{\sigma_{\Theta}^{2}}\frac{\alpha}{\sigma_{\Xi}^{2}} \right) \\ &= \frac{1}{2(1+\nu)} \exp\left( - \frac{1+\nu}{2} \zeta \right) \end{aligned} $$
(25)

where the equality † is due to the symmetry of ΔΘ. Putting this together we get

$$ P_{E} = \frac{P_{1}}{P_{2}} = \frac{1}{1+\nu} \exp\left( -\frac{\nu}{2}\zeta\right) $$
(26)

Consequently, the probability of finding the true nearest neighbor is 1 − PE. These values depend on two parameters: ν and ζ. The first depends on the characteristics of the features and on the size M that we choose for the reduced space Ξ in which we make the initial search; the second is the amount by which we “peek” beyond the distance of the nearest neighbor in Ξ, expressed in units of \(\sigma_{\Xi}^{2}\). In many design situations, we might be interested in doing the opposite: given a desired error probability p, we want to determine the corresponding ζ:

$$ \zeta = \frac{2}{\nu} \log\frac{1}{(\nu+1)p} $$
(27)
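The two relations (26) and (27) are easy to turn into small design helpers; the following sketch simply evaluates the formulas as stated (the function names are ours):

```python
import math

def error_probability(nu, zeta):
    """P_E from (26): probability of returning the wrong nearest neighbor."""
    return math.exp(-nu * zeta / 2) / (1 + nu)

def zeta_for_error(nu, p):
    """Invert (26) as in (27): excess search radius needed for error probability p."""
    return (2 / nu) * math.log(1 / ((nu + 1) * p))

# Example: with nu = 10, an error probability of 5% requires zeta ~ 0.12;
# plugging that zeta back in recovers p = 0.05 (sanity check).
z = zeta_for_error(10, 0.05)
print(z, error_probability(10, z))
```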

4 Average error

Due to the approximation that we are making, with probability PE we do not find the true nearest neighbor t, but an approximate one n, with ρ2(q,n) > ρ2(q,t). This approximation is the more acceptable the closer the item n is to the nearest neighbor, that is, the smaller the difference ρ2(q,n) − ρ2(q,t). In this section we shall determine the probability density of ρ2(q,n) − ρ2(q,t) and its expected value.

Let p(ξ) be the probability density of the distance error. With probability 1 − PE we find the correct nearest neighbor; therefore, with probability 1 − PE, ξ = 0. With probability PE we get the wrong nearest neighbor; therefore, for ξ > 0 we have

$$ p(\xi) = P_{E}\pi(\xi) $$
(28)

Abusing the notation a bit, if X is a continuous random variable, we write \(X\sim {x}\) if X ∈ [x,x + dx]. With this notation, we have

$$ \begin{aligned} \pi(\xi) &= {\mathbb{P}}\left\{\rho^{2}(q,n)-\rho^{2}(q,t)\sim\xi\left|\rho_{\Xi}^{2}(q,n)-\rho_{\Xi}^{2}(q,t)>\zeta\right.\right\} \\ &= {\mathbb{P}}\left\{{\Delta}_{\Xi}+{\Delta}_{\Theta}\sim\xi\left|{\Delta}_{\Xi}>\zeta\right.\right\} \\ &= \frac{{\mathbb{P}}\left\{{\Delta}_{\Xi}+{\Delta}_{\Theta}\sim\xi,{\Delta}_{\Xi}>\zeta\right\}} {{\mathbb{P}}\left\{{\Delta}_{\Xi}>\zeta\right\}} \end{aligned} $$
(29)

Note that we are working in the ξ space, in which

$$ \begin{aligned} {\Delta}_{\Xi}(\xi) &\sim \frac{1}{4} \exp\left( -\frac{|\xi|}{2}\right) \\ {\Delta}_{\Theta}(\xi) &\sim \frac{\nu}{4} \exp\left( -\nu\frac{|\xi|}{2}\right) \end{aligned} $$
(30)

In this space

$$ {\mathbb{P}}\left\{{\Delta}_{\Xi}>\zeta\right\} = \frac{1}{4} {\int}_{\zeta}^{\infty}\!\exp\left( -\frac{\xi}{2}\right) d\xi = \frac{1}{2}\exp\left( -\frac{\zeta}{2}\right) $$
(31)

while

$$ \begin{aligned} {\mathbb{P}}\left\{{\Delta}_{\Xi}+{\Delta}_{\Theta}\sim\xi,{\Delta}_{\Xi}>\zeta\right\} &= {\int}_{\zeta}^{\infty}\!{\Delta}_{\Xi}(u){\Delta}_{\Theta}(\xi-u) du \\ & = \frac{\nu}{16}{\int}_{\zeta}^{\infty}\!\exp\left( -\frac{|u|}{2}\right)\exp\left( -\frac{\nu}{2}|\xi-u|\right) du \end{aligned} $$
(32)

We have to consider two cases:

i):

ξ < ζ, that is, ξu < 0 in the range of integration

$$ \begin{aligned} {\mathbb{P}}\left\{{\Delta}_{\Xi}+{\Delta}_{\Theta}\sim\xi,{\Delta}_{\Xi}>\zeta\right\} &= \frac{\nu}{16}{\int}_{\zeta}^{\infty}\!\exp\left( -\frac{u}{2}\right)\exp\left( -\frac{\nu}{2}(u-\xi)\right) du \\ &= \frac{1}{8}\frac{\nu}{\nu+1}\exp\left( -\frac{\zeta}{2}\right)\exp\left( \frac{\nu}{2}(\xi-\zeta)\right) \end{aligned} $$
(33)
ii):

ξ > ζ, which entails that we have to divide the interval of integration in [ζ,ξ], and \([\xi ,\infty ]\):

$$ \begin{array}{@{}rcl@{}} {\mathbb{P}}\left\{{\Delta}_{\Xi}+{\Delta}_{\Theta}\sim\xi,{\Delta}_{\Xi}>\zeta\right\} &=& \frac{\nu}{16}\left[{\int}_{\zeta}^{\xi}\!\exp\left( -\frac{u}{2}\right)\exp\left( -\frac{\nu}{2}(\xi-u)\right) du\right. \\ &&+ \left.{\int}_{\xi}^{\infty}\!\exp\left( -\frac{u}{2}\right)\exp\left( -\frac{\nu}{2}(u-\xi)\right) du\right] \\ &=& \frac{1}{8} \frac{\nu}{\nu-1}\left[\exp\left( -\frac{\xi}{2}\right)-\exp\left( -\frac{\zeta}{2}\right)\exp\left( -\frac{\nu}{2}(\xi - \zeta)\right)\right] \\ &&+ \frac{1}{8}\frac{\nu}{\nu+1}\exp\left( -\frac{\xi}{2}\right) \end{array} $$
(34)

Yielding

$$ \pi(\xi) =\left\{ \begin{array}{ll} {\frac{1}{4} \frac{\nu}{\nu+1}\exp\left( \nu \frac{\xi-\zeta}{2}\right)} & \xi < \zeta \\ \\ {\frac{1}{4} \frac{\nu}{\nu-1} \left[ \frac{2\nu}{\nu+1} \exp\left( -\frac{\xi-\zeta}{2}\right) - \exp\left( -\nu\frac{\xi-\zeta}{2}\right)\right]} & \xi>\zeta \end{array}\right. $$
(35)

The average error is then

$$ \begin{array}{@{}rcl@{}} {\mathbb{E}}[p(\xi)] &=& P_{E} {\mathbb{E}}[\pi(\xi)] \\ &=& P_{E}{\int}_{0}^{\infty}\!\!\xi\pi(\xi) d\xi \\ &=& P_{E} \frac{1}{4} \frac{\nu}{\nu+1}{\int}_{0}^{\zeta}\!\xi\exp\left( \nu\frac{\xi-\zeta}{2}\right) d\xi \\ &&+ P_{E} \frac{1}{4}\frac{\nu}{\nu-1}\left[ \frac{2\nu}{\nu+1} {\int}_{\zeta}^{\infty}\!\!\xi\exp\left( -\frac{\xi-\zeta}{2}\right) d\xi - {\int}_{\zeta}^{\infty}\!\xi\exp\left( -\nu\frac{\xi-\zeta}{2}\right) d\xi\right] \\ &=& \frac{2}{\nu+1} \left( \frac{\zeta}{2}+1\right)\exp\left( -\nu\frac{\zeta}{2}\right) + \frac{1}{\nu(\nu+1)}\exp\left( -\nu\zeta\right) \end{array} $$
(36)
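As with the error probability, the closed form (36) can be evaluated directly; a minimal helper (the function name is ours) is:

```python
import math

def expected_distance_error(nu, zeta):
    """Expected distance error, evaluated as given in the closed form (36)."""
    return (2.0 / (nu + 1)) * (zeta / 2.0 + 1.0) * math.exp(-nu * zeta / 2.0) \
           + math.exp(-nu * zeta) / (nu * (nu + 1))

# Example: nu = 10, zeta = 0.12 (the value corresponding to a 5% error probability).
print(expected_distance_error(10, 0.12))
```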

5 Theoretical cost model

Whenever we consider the cost of an indexing algorithm, there are at least two magnitudes that we might be interested in measuring: the number of floating point operations (typically, those involved in the computation of distances) and the number of random accesses to disk blocks. The first measure is especially relevant for in-memory data bases, while for data bases stored on disk the second greatly dominates the first (with standard equipment, the cost of a disk access can be of the order of \(10^{5}\)–\(10^{6}\) times that of an in-memory computation). Both measures have their application area, and both should be considered when doing an a priori evaluation of an algorithm outside of the context of a specific application.

In our algorithm, the cost in terms of disk accesses depends essentially on the cost model of the index on which we build. Developing a theoretical cost model for these indices goes beyond the scope of this paper, so we shall consider this cost mainly in the experimental part, in Section 7. Here, we shall develop the computational cost model of the algorithm.

There are three steps of the algorithm in which distances are measured: the nearest neighbor query of step 2, the range query used to build the set T at step 4, and the limited-scope nearest neighbor query of step 5. The first two queries are executed on the reduced-dimensionality space Ξ, of dimensionality M; the last one on the whole space Ω of dimensionality N.

Not all these computations have the same cost. In step 5 we may compute the square of the distance, with a cost of N multiplications. Steps 2 and 4, on the other hand, are executed using an indexing method, and many such methods rely on the triangle inequality, therefore requiring the computation of a square root. A simple iterative algorithm like the following, used to compute \(\sqrt {y_{0}}\),

$$ \begin{array}{@{}rcl@{}} &&y \leftarrow x*x-y_{0}\\ &&x \leftarrow x - y/(2*x) \end{array} $$

can compute the square root with a precision of \(10^{-8}\) for \(y_{0} < 10^{6}\) in less than 10 iterations, that is, using fewer than 50 multiplications, so the cost of a distance computation in an M-dimensional space can be estimated as I + M, where I is of the order of 50.
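The following Python transcription is a minimal sketch of this iteration (the bit-level starting guess and the fixed iteration count are our own choices); once the initial estimate is within a factor of two of the root, quadratic convergence reaches the stated precision in a handful of steps:

```python
import math

def newton_sqrt(y0, iterations=6):
    """Approximate sqrt(y0) with the iteration  y <- x*x - y0;  x <- x - y/(2*x)."""
    if y0 == 0.0:
        return 0.0
    _, e = math.frexp(y0)                 # y0 = m * 2**e with 0.5 <= m < 1
    x = math.ldexp(1.0, (e + 1) // 2)     # 2**ceil(e/2): within a factor of two of sqrt(y0)
    for _ in range(iterations):
        y = x * x - y0
        x = x - y / (2 * x)
    return x

print(newton_sqrt(2.0), math.sqrt(2.0))   # the two values agree to well below 1e-8
```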

For a data base of D elements in an M-dimensional space, let S(D,M) be the number of distance computations of the indexed search. We assume that the costs of nearest neighbor and range searches are of the same order of magnitude, a reasonable assumption for most indexing methods. Note that S(D,M) = O(D) for large M, while in general \(S(D,M)=O(\log {D})\) for small M. With these definitions, the average cost of one search can be estimated as

$$ C(D,N,M) \sim (I+M)S(D,M)+N\cdot {\mathbb{E}}[|T|] $$
(37)

In order to determine the cost, we need the expected value of |T|, \(\tau \stackrel {\triangle }{=}{\mathbb {E}}(|T|)\). Consider x ∈ D; in step 2 of the algorithm we determine the nearest neighbor in the Ξ space, n. Then x ∈ T if \(\rho _{\Xi }^{2}(q,x)-\rho _{\Xi }^{2}(q,n)<\alpha \). The probability of this happening is

$$ \begin{aligned} &~~~\phantom{\mathbb{P}}{\mathbb{P}}\left\{\rho_{\Xi}^{2}(q,x)-\rho_{\Xi}^{2}(q,n)<\alpha \left| \rho_{\Xi}^{2}(q,x)-\rho_{\Xi}^{2}(q,n) > 0 \right.\right\}\\ &= {\mathbb{P}}\left\{{\Delta}_{\Xi}<\alpha \left| {\Delta}_{\Xi} > 0 \right.\right\} \\ &= \frac{{\mathbb{P}}\left\{{\Delta}_{\Xi}<\alpha, {\Delta}_{\Xi} > 0 \right\} } {{\mathbb{P}}\left\{{\Delta}_{\Xi} > 0 \right\}} \\ &= 2\cdot \frac{1}{4\sigma_{\Xi}^{2}}{\int}_{0}^{\alpha}\!\!\!\!\exp\left( -\frac{x}{2\sigma_{\Xi}^{2}}\right) dx \\ &= 1 - \exp\left( -\frac{\alpha}{2\sigma_{\Xi}^{2}}\right) \\ &= 1 - \exp\left( -\frac{\zeta}{2}\right) \\ &\stackrel{\triangle}{=} \upbeta(\zeta) \end{aligned} $$
(38)

The probability that, out of the D elements in the data base, k are in T is therefore

$$ P_{\zeta}(k) = \binom{D}{k} \upbeta(\zeta)^{k}(1-\upbeta(\zeta))^{D-k} \approx \frac{(D\upbeta(\zeta))^{k}}{k!}\exp(-D\upbeta(\zeta)) $$
(39)

where the Poisson approximation is valid for large D. Then

$$ \tau(\zeta) = {\mathbb{E}}[|T|] = \sum\limits_{k} k P_{\zeta}(k) = D\upbeta(\zeta) = D\left( 1-\exp\left( -\frac{\zeta}{2}\right)\right) $$
(40)

If we are interested in the complexity for a prescribed error probability p, we plug in eq. (27) obtaining:

$$ \begin{aligned} \tau(p) &= D\left( 1 - \exp\left( - \frac{1}{\nu} \log\!\frac{1}{(1+\nu)p}\right)\right) \\ &= D\left[1 - \left( (1+\nu)p\right)^{\frac{1}{\nu}}\right] \end{aligned} $$
(41)
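The cost model (37), together with (40), is easy to evaluate numerically; the sketch below does so under the illustrative assumptions that the low-dimensional index behaves logarithmically (S(D,M) ≈ c log2 D, with c = 10) and that I = 50, neither of which is prescribed by the model itself:

```python
import math

I_SQRT = 50          # cost of one square root, in multiplications (assumption from the text)

def tau(D, zeta):
    """Expected size of the candidate set T, eq. (40)."""
    return D * (1.0 - math.exp(-zeta / 2.0))

def search_cost(D, N, M, zeta, S=lambda D, M: 10 * math.log2(D)):
    """Average cost of one search, eq. (37); S(D, M) is a stand-in for the index's own cost."""
    return (I_SQRT + M) * S(D, M) + N * tau(D, zeta)

# Example: D = 100,000 items, N = 512, M = 20, zeta = 0.01.
print(search_cost(100_000, 512, 20, 0.01))
```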

6 Validation

In this section, we validate the model of the previous sections by comparing its predictions with the measurements obtained from an instrumented implementation of the algorithm.

We validate the error and cost models using the data from the SUN data set [63], containing 130,000 images, more than enough to obtain statistically significant results. The main reason for using this data set is that it comes with a variety of pre-computed features of different sizes and statistical characteristics (see Table 1), allowing us to validate the model under a variety of conditions using features computed in a standard way, thus increasing replicability.

Table 1 The individual features of the SUN data set, with their individual and total sizes

The validation was carried out on all 12 features but, in order to simplify the presentation, we shall present the data relative to only two of them, representative of the range of sizes and performance obtained: Geo Texton, and Tiny Image. The former is one of the features with the best fit to the theoretical model, the latter is the one with the worst fit.

Our main hypothesis, the one on which the algorithm is based, is that the variance along each axis decreases rapidly with the axis index after SVD has been applied to the data base. Figure 2a shows the results for the features of the SUN data base, which confirm our hypothesis.

Fig. 2

In (a) above, the behavior of the variance on each axis of the feature space, transformed using SVD, for four features of the SUN data base. The abscissa is the axis number (the axes are ordered by decreasing eigenvalue) and the value reported is \({\sigma _{n}^{2}}/{\sigma _{1}^{2}}\), where \({\sigma _{n}^{2}}\) is the variance on the n th axis. Because of the normalization, the value for n = 1 is 1 for all features. In (b), below, the corresponding values of \(\nu =\sigma _{\Xi }^{2}/\sigma _{\Theta }^{2}\) for increasing values of the dimensionality of Ξ

The rate at which the variance decreases depends on the feature, the slowest descent being that of HoG, the fastest that of Geo Texton. The corresponding values of ν are shown in Fig. 2b. Note that for M ≥ 5 the condition ν > 1 is met for all features with the exception of HoG.

Figure 3 compares the sampled distribution of ρ2 with the theoretical model (22) for Geo Texton and Tiny Image.

Fig. 3

The fit between the model (22) and the sampled distribution of the square of the distance between points (ρ2) for two feature spaces of the SUN data set: Geo Texton (size 512), and Tiny Image (size 3,072). The sampled distributions were estimated using a sample of 15,000 images

The model captures reasonably well the general trend of the distribution for Geo Texton, but it is quite off at low and medium distances for Tiny Image. Note, however, that in our model we never use the distribution of ρ2 directly; what we do use is the distribution of the difference between the squares of two distances, which we assume to follow a Laplace distribution.

The sampled distribution of Δ = ρ2(q,x) − ρ2(q,y) as well as the theoretical model (23) are shown in Fig. 4. The fit is in this case much better than that of ρ2, although it still presents discrepancies for Tiny Image.

Fig. 4

The fit between the model (23) and the sampled distribution of the distance difference Δ = ρ2(q,x) − ρ2(q,y) for two feature spaces of the SUN data set: Geo Texton, and Tiny Image

Figures 5 and 6 show the sampled probability of success (actually finding the nearest neighbor) versus the theoretical model as a function of ν (and therefore of \(\dim ({\Xi })\)) for various values of ζ and for the features Geo Texton and Tiny Image. Note that for Geo Texton, ν grows very rapidly, so that one can have a high probability of success with spaces Ξ of very low dimensionality. In the case of Tiny Image, ν grows slowly, as can be expected given the slower drop in variance shown in Fig. 2.

Fig. 5

Sampled probability of success compared with the theoretical model for Geo Texton. The graphs show the model as a function of ν (and therefore as a function of \(M=\dim ({\Xi })\)) for various values of ζ

Fig. 6

Sampled probability of success compared with the theoretical model for Tiny Image. The graphs show the model as a function of ν (and therefore as a function of \(M=\dim ({\Xi })\)) for various values of ζ

Figures 7 and 8 show the average error in the distance between the true nearest neighbor and the approximate one for Geo Texton and Tiny Image. The sampled data are compared with the theoretical model, which is in both cases a good fit to the data. Note that in the case of Tiny Image the values of ν that we get for given values of \(\dim ({\Xi })\) are about one order of magnitude smaller than those of Geo Texton and, correspondingly, the values of \({\mathbb {E}}({\Delta }\zeta )\) are about one order of magnitude larger.

Fig. 7

Sampled average error with respect to the distance of the true nearest neighbor, compared with the theoretical model, for Geo Texton. The graphs show the model as a function of ν (and therefore as a function of M) for various values of ζ

Fig. 8

Sampled average error with respect to the distance of the true nearest neighbor, compared with the theoretical model, for Tiny Image. The graphs show the model as a function of ν (and therefore as a function of M) for various values of ζ

7 Cost

In order to make some indicative measurements of the cost of approximate indexing, we need to select an indexing method on which to base our algorithm. This method will play a double role. On the one hand, it will serve as the reference against which we compare: we store the whole vectors in the index and determine the cost of the search. We shall call this the full indexing method, and it will be the yardstick against which we measure the cost of our algorithm. On the other hand, we shall use it as the indexing method needed for the initial searches in the space Ξ.

In our tests, we use K-D trees [11] for this purpose: they are an easy-to-implement, well-established and well-known method whose performance is in line with that of most partition-based methods. The use of other indices, such as M-trees [20], would not appreciably impact the performance in terms of operations, and would have only a very minor effect on disk accesses. The results found in the following would remain valid, and the comparisons would change slightly in favor of our algorithm.

As we mentioned in Section 5, there are basically two ways of measuring the cost of a search algorithm. If the algorithm is executed in central memory, then one usually isolates the most expensive operation (in our case, the computation of the distance) and counts how many times it is executed. If the algorithm is executed on disk, then the cost of accessing a random sector overshadows all other costs, and one usually counts disk accesses. In this section, we shall determine the performance of our algorithm using both measures. The methodological basis of our measurements is laid out in Appendix A.

7.1 Computational Cost

In this section, we use the cost model (37). For the exact search, carried out on the whole feature vector of dimensionality N, there is no T list, so τ = 0 and the cost is (I + N)S(D,N). In the tests, we measure S(D,N) during the search in the K-D tree by suitably instrumenting the algorithm. For approximate searches with \(\dim ({\Xi })=M\), (37) applies. The value of S(D,M) depends on the number of points that fit in a node of the K-D tree, that is, since a node is stored in a disk cluster, on the size of a cluster. We assume that the disk has clusters of B = 25,000 bytes and that each element of the vector is a floating point number represented in f = 4 bytes. So, if we are dealing with vectors of size N, the number of vectors per node will be V = ⌊B/(fN)⌋. Table 2 shows the number of vectors per node for various values of the dimensionality of the feature space. The last two columns correspond to the dimensionalities of Geo Texton and Tiny Image, that is, they give the number of vectors per node (and per cluster) in the full searches.

Table 2 The number of points that fit in a disk cluster (B = 25,000 bytes, f = 4 bytes) for the dimensionalities of the space that we shall use in our tests. The first seven values are dimensions of the reduced space Ξ that we use for approximate search, while the last two columns are the dimensionalities of the full Geo Texton and Tiny Image features, which we use for the full search
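The entries of Table 2 follow directly from V = ⌊B/(fN)⌋ with the stated B = 25,000 and f = 4; as a quick check (the first few dimensionalities below are illustrative choices of our own, the last two are the sizes of the two full features):

```python
B, f = 25_000, 4                              # cluster size and bytes per component, as assumed above
for N in (10, 20, 50, 100, 200, 512, 3072):   # last two: Geo Texton and Tiny Image
    print(N, B // (f * N))                    # vectors per node (and per cluster)
```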

Figure 9 shows the average computational cost (37) as a function of D for the full indexing scheme and for the approximate indexing for various values of M and for ζ = 0.001 (Fig. 9a) and ζ = 0.01 (Fig. 9b).

Fig. 9

Complexity, in number of operations necessary to compute the distances, of the full index and the approximate index for Geo Texton, for various values of M. For the data relative to the different curves, see Table 3. The “bumps” in the curves, evident mainly in the full indices, correspond to changes in the height of the K-D tree

For the sake of clarity, in the figures we show the variance of the full indexing only at a few points, but the difference between full and approximate indexing has been determined to be significant (p < 0.01). The full index, as expected, has a complexity that grows linearly with the size of the data base. The cost of the approximate index also grows more or less linearly, essentially because |T| grows linearly with D, as shown in (40). In both cases, we can obtain a probability of error of less than 5% and an error in the distance of less than 0.01 at a cost an order of magnitude smaller than that of the full index (Table 3).

Table 3 Data relative to the performance curves. For each feature, the curves with the symbol indicated in the first column correspond to the dimensionality of Ξ indicated in the second column. This determines the value of ν and, for given ζ, the values of the probability of error PE and of the average distance error \({\mathbb {E}}({\Delta }\zeta )\)

The situation is similar for Tiny Image (Fig. 10) although, given the slow descent of the per-axis variance and the slow rise of ν (Fig. 2), the error probability and the error in the distance are bigger: in order to get a 7% error rate, we need \(M\sim {200}\). With so large a space, it is likely that the increased performance is due not to the more efficient operation of the K-D tree (on a 200-dimensional space, the performance of partition methods is severely degraded), but to the lower cost of distance computation in the reduced dimensionality. The difference in performance is, in this case as well, roughly one order of magnitude.

Fig. 10

Complexity, in number of operations necessary to compute the distances, of the full index and the approximate index for Tiny Image, for various values of M. For the data relative to the different curves, see Table 3

7.2 Disk access

The cost of distance computations is a good indicator of performance for in-memory data bases but, if the data are stored on disk, the cost of a search is essentially that of the random accesses to clusters. In order to determine the efficiency of our indexing method for disk operations, we assume again that the data are stored in a K-D tree on a disk with clusters of B = 25,000 bytes. The internal nodes are stored in memory while the leaves, which contain the data, are stored on disk. The number of points that can be retrieved with a random page access is given in Table 2. In the case of full indexing, the disk accesses needed to obtain the necessary nodes constitute the whole cost of the search. In the approximate search, we use the K-D tree on the reduced space (and, therefore, with more points per cluster) to build the set T; then we need to access the points of T in the full space in order to find the approximate nearest neighbor. In order to make this access more efficient, we store the full vectors corresponding to the points of a node of the tree in consecutive clusters, so that they can be read with a single random access. Thus, once the points of T have been found in the tree, their full representation can be obtained with a minimum of random accesses to disk.

Figure 11 shows the number of random cluster accesses for Geo Texton, for ζ = 0.001 and ζ = 0.01.

Fig. 11

Complexity, in number of random block accesses on disk, of the full index and the approximate index for Geo Texton, for various values of M. For the data relative to the different curves, see Table 3

Note again that, with the exception of M = 50, the number of disk accesses is about one order of magnitude smaller than that of the full search. Similar considerations hold for Tiny Image, whose results are shown in Fig. 12.

Fig. 12

Complexity, in number of random block accesses on disk, of the full index and the approximate index for Tiny Image, for various values of M. For the data relative to the different curves, see Table 3

All these comparisons assume that the K-D tree for the approximate algorithm is stored on disk. However, the smaller size of the data points may make it feasible to store it in central memory, at least for the smaller values of M, leaving only the full representation on disk. In this case, the cost of the approximate indexing drops dramatically: our tests show that in most cases the approximate nearest neighbor can be found with fewer than two disk accesses for Geo Texton and fewer than five disk accesses for Tiny Image.

7.3 Comparative test

The state of the art in indexing (see Section 8) is huge, and a full comparison between a new model and a representative sample (considering and clarifying all the variables and the different presuppositions involved) could be the subject of a paper in its own right. Here, we present an experimental comparison with two recent methods of a different nature that share some of our presuppositions: Hierarchical Navigable Small World graphs (HNSW, [45]) applied to a sub-space of \({\mathbb {R}}^{N}\) obtained by SVD, and the Inverted Multi-Index (IMI, [5]). As a reference, we use Geo Texton with M = 20 and ζ = 0.01, which gives an error probability PE = 0.02. The parameters of the other methods were adjusted by trial and error so that the measured error probability would be the same PE. The details are available in Appendix B. Note that we do not consider the time necessary to build the index which, in the case of IMI, is dominated by the k-means clustering needed to build the code book, making the method ill-suited for dynamic data bases.

The results for computational and disk access cost are shown in Fig. 13.

Fig. 13

Performance comparison of the method presented here with HNSW and IMI (see text). The method presented here has been used with M = 20 and ζ = 0.01, corresponding to PE = 0.02. The parameters of the other methods have been adjusted by trial and error for each data base size (the parameters vary with the size of the data set for both methods) so as to obtain the same error

The method presented here (labeled “meta” in Fig. 13) outperforms HNSW and has performance similar to that of IMI. In both cases, the main advantage of our method is that we can carry out the initial search with very small vectors, smaller than those the other methods need to obtain comparable performance. This increases the number of points that we can store in a cluster, providing an advantage that the subsequent post-processing (heavier in our method than in the others) does not completely eliminate. It should be noted, however, that these tests have been made without disk cache, and that HNSW, because of its characteristics, will benefit from caching more than the other methods. The system designer should therefore consider carefully the characteristics of the hardware and operating system before deciding which indexing method to use.

8 Related work

Indexing in high-dimensional spaces has been an important topic in multimedia and data bases for the past 25 years, at least since the QBIC [26] search engine showed that query-by-example was a viable technique for retrieval. Times have changed, and many multimedia techniques have been developed, but indexing in high-dimensional spaces remains a crucial topic for which there seems to be no “silver bullet” such as B-trees are for ordered data [9].

One of the earliest multidimensional indices was [15], in which a tree (called the BKT, or Burkhard-Keller Tree) was introduced, suitable for searches in metric spaces with discrete-valued distances. Other early solutions for multi-dimensional spaces were grid files [50] and IBIM (Interpolation-Based Index Maintenance) [16], a hash method for range queries on multi-dimensional data. The idea of designing hash functions that map nearby points to the same disk page has been used ever since. A relatively recent incarnation, Locality Sensitive Hashing (LSH, [56]), adapts it to probably correct indices with a bounded error probability. The general idea of LSH bears some resemblance to ours, but the need to store multiple random projections limits its disk efficiency.

One of the most common classes of indexing methods is based on space partitioning: a tree is built in which each node “covers” a portion of the space, and the children of the node cover parts of the region covered by the parent (possibly overlapping and not covering the whole region covered by the parent). Many of the early methods of this class were originally developed for geographic data, that is, for two-dimensional spaces. The first such structure to gain widespread recognition was the R-Tree [30], in which each node represents a rectangle, and the children of the node represent rectangles included in the parent’s. The children do not partition the parent: depending on the data distribution, areas of the parent without data points may be left uncovered, while there may be overlaps between the rectangles of the children. While the former is an advantage (no need to represent portions of the space in which there are no points), the latter is a burden (when traversing the tree, one might need to traverse several sub-trees in parallel, degrading performance). Over the years, many improvements have been superimposed on the basic R-Tree design, either to improve its heuristics [10], to refine the tree structure [12], or in general to improve performance [31, 36, 55]. One of the problems one encounters when trying to use R-Trees for high-dimensional feature spaces is the expensive representation of the nodes. Each node is a rectangle, and can be represented by fixing two opposite corners. For an N-dimensional space, this requires 2N floating point numbers; for high-dimensional spaces and standard cluster sizes, this means that few internal nodes can fit in a cluster. Some variations palliate this problem: SS-Trees [62] use spheres (N + 1 floating point numbers per node), while G-Trees [39] use space partitions to organize multidimensional data in a B-Tree. K-D Trees [11] split one axis at each node, a split that can be represented by an integer (the axis number) and one floating point number (where the split occurs). A variation of this idea uses random partitions of the feature space [65], while solutions like the Set Compression Tree [3] use a structure similar to the K-D Tree to compress a set of vectors into a compact representation, thus fitting more points in a disk cluster. Almost all the methods proposed store the data on disk, but some solutions have been devised for data in central memory [24].

All these methods assume that the data are elements of a vector space. Other methods make the logically weaker assumption that the data are elements of a metric space, that is, they make no use of the vector components, but index based solely on the distance values between items (note, however, that a set of data for which only distances are known can always be approximately represented as a set of vectors using techniques such as multidimensional scaling [23, 58, 59]). The BKT [15], which we have already mentioned, is a method of this kind. Other indices of this class include the Fixed Query Tree [7] and its fixed-height variation (FHQT, [6, 8]), Fixed Query Arrays [17], which are essentially a compact representation of FHQT, Vantage Point Trees (VPT, [19, 60]), and their generalization, the Excluded Middle Vantage Point Forest [64].

The best known and most successful method in this class is probably the M-tree [20], a tree in which each node contains a representative and a radius representing the region of space it covers. Sub-trees rooted at that node cover regions included in that of the parent. Thanks to a careful use of the properties of the distance function—in particular, the triangle inequality—storage and indexing can be done in the absence of a vector representation of the data. The M-tree has been shown to often outperform partition methods based on vector spaces.

Methods such as the MVP-tree [14] and the ball tree [51] are based on similar principles. These methods represent so far the practical state of the art for partition-based exact search in high dimensional data [47].

Partition methods are probably the most studied and best known, but they are not the only ones. In [33] a structure is proposed in which memory units summarize portions of the data base in a single vector. In [49] a fast algorithm based on projections, and supported by a specially designed data structure, is developed to answer 𝜖-range queries, while in [54] specialized disk allocation methods are studied. More recent methods, such as [41, 42], are based on the creation of suitable one-dimensional indices along randomly chosen dimensions. The indices are created in such a way that elements close to each other are placed relatively close in the index. Alternatively, multidimensional indices of reduced size are created [43, 57].

Much of the so-called “curse of dimensionality” derives from the concentration of the Minkowski distance in high dimensions. In [1] the problem is addressed at its roots: a distance function is defined that remains discriminative in high dimensions. The method presented here would take full advantage of the characteristics of such a distance.

Experimental analysis shows that for most workloads indices stop being competitive vis-à-vis serial scan for dimensions as low as 10–20 [61], and that the very notion of “nearest neighbor” might be meaningless [13], although [38] points out that the situation might not be so dire, as most data sets of interest have an intrinsic dimension much lower than the dimension of the space in which they are represented, and that the dimension that matters for indexing is the intrinsic one. The method that we have presented here is based on the same property.

The difficulty in creating efficient indices for exact nearest neighbor queries has created a great interest in approximate algorithms such as the one we presented here.

The Balanced box-decomposition tree (BBD-Tree [4]) is a structure that can be created in time \(O(ND \log D)\) and that can find an approximate nearest neighbor with a distance from the query at most (1 + 𝜖) times that of the true nearest neighbor in time \(O(C \log D)\), with \(C\le {N}\bigl [1+\frac {6N}{\epsilon }\bigr ]^{N}\). The BBD-Tree guarantees that the (1 + 𝜖) bound is met; in [21] this constraint is relaxed: the algorithm only makes a probabilistic promise: with probability δ it will return a point whose distance from the query is at most (1 + 𝜖) times that of the true nearest neighbor. The method presented here makes a similar claim, but our model allows us to derive the complete probability distribution of the error ζ. A hard limit similar to that of [21] can be derived by thresholding the integral of ζ. A comparison between the two methods is not immediate, but comparing Figs. 9 and 10 with Figure 4 in [21] suggests that the method presented here is more efficient (in terms of operations) for features for which ν grows reasonably rapidly. Other tree-based methods for approximate search include spill-trees [44] as well as variations of K-D trees [48].

A different class of methods is based on hashing, that is, on computing disk addresses directly from the coordinates rather than on partitioning the space [2, 28]. One interesting class is that of Locality Sensitive Hashing (LSH), which we already mentioned [32], based on principles similar to those of linear hashing [15] and IBIM. In their more recent versions, these methods provide theoretical guarantees and performance comparable to those of the method presented here. Given that what we present here is a “meta” method built on top of a nearest neighbor index, an interesting experiment would be to build the present model on top of LSH.

Other methods not based on trees include hashing [2, 28] and decompositions of the space into sub-spaces that can be quantized for efficiency [27].

9 Conclusions

The method that we have presented here can be considered, more than an indexing method, a meta-indexing strategy. It can be used to improve the performance of any indexing method as long as two basic premises are met: (1) the performance of the indexing method degrades as the dimensionality of the feature space increases, and (2) most of the variance of the inter-item distances is reasonably concentrated in a relatively low-dimensional sub-space of the feature space (the space that we have called Ξ). If these conditions are met, the method presented here can piggyback on the index to offer the designer a fine control of the trade-off between efficiency and accuracy.

Several directions of possible development remain open. The first is dimensionality reduction. In our work, we have used SVD, which makes the effectiveness of the reduction dependent on the distribution of the data: the method is very efficient for some features (such as Geo Texton in our examples), less so for others (Tiny Image). However, the Johnson-Lindenstrauss Lemma [35] states that one can embed a data set of D points into a space of dimension \(O(\log D)\) with little distortion of the pair-wise distances. Since for most features \(\log D\ll {N}\), the Lemma opens the possibility of defining sub-spaces Ξ of low dimensionality that may work in a more general setting than those created using SVD. This relates also to the observations that we made in Section 1 on the intrinsic dimension dI. The intrinsic dimension, much like the Johnson-Lindenstrauss Lemma, guarantees that a manifold of limited dimensionality exists that allows a faithful representation of the data, but gives no indication of how to find it.

Furthermore, as we already mentioned (page 3), the numerical characterization of a manifold that contains D points in \({\mathbb {R}}^{N}\) may require up to O(ND) parameters, making an index based on it not competitive. One interesting direction of work is the study of non-linear manifolds with compact representation that can be used in lieu of our linear sub-space Ξ.

Another possible development is the use of an approximate nearest neighbor method for the initial search on Ξ, for example, as we have already mentioned in the previous section, by building it on top of LSH. This, of course, will require a change in the analytical model, as the initial nearest neighbor found on Ξ may itself be affected by an error. This also entails that, cœteris paribus, the probability of error will be greater if the search on Ξ is itself approximate. In order to obtain results comparable to those of the method with exact indexing, one has to increase \(\dim ({\Xi })\) or ζ. The question is whether the increased performance deriving from the use of the approximate indexing will compensate for the need to increase \(\dim ({\Xi })\) and ζ. Only the development of the appropriate model will allow us to answer this question.