Randomized Partition Trees for Nearest Neighbor Search

Algorithmica

Abstract

The \(k\)-d tree was one of the first spatial data structures proposed for nearest neighbor search. Its efficacy is diminished in high-dimensional spaces, but several variants, with randomization and overlapping cells, have proved to be successful in practice. We analyze three such schemes. We show that the probability that they fail to find the nearest neighbor, for any data set and any query point, is directly related to a simple potential function that captures the difficulty of the point configuration. We then bound this potential function in several situations of interest: when the data are drawn from a doubling measure; when the data and query distributions are identical and are supported on a set of bounded doubling dimension; and when the data are documents from a topic model.

References

  1. Ailon, N., Chazelle, B.: The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput. 39, 302–322 (2009)

  2. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1), 117–122 (2008)

  3. Arya, S., Mount, D., Netanyahu, N., Silverman, R., Wu, A.: An optimal algorithm for approximate nearest neighbor searching. J. ACM 45, 891–923 (1998)

  4. Assouad, P.: Plongements lipschitziens dans \({\mathbb{R}}^n\). Bull. Soc. Math. France 111(4), 429–448 (1983)

  5. Bentley, J.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  6. Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: Proceedings of the 23rd International Conference on Machine Learning (2006)

  7. Cayton, L., Dasgupta, S.: A learning framework for nearest-neighbor search. In: Advances in Neural Information Processing Systems (2007)

  8. Clarkson, K.: Nearest neighbor queries in metric spaces. Discret. Comput. Geom. 22, 63–93 (1999)

  9. Clarkson, K.: Nearest-neighbor searching and metric space dimensions. In: Darrell, T., Indyk, P. (eds.) Nearest-Neighbor Methods for Learning and Vision: Theory and Practice. MIT Press, Cambridge (2005)

  10. Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: ACM Symposium on Theory of Computing, pp. 537–546 (2008)

  11. Dasgupta, S., Sinha, K.: Randomized partition trees for exact nearest neighbor search. In: 26th Annual Conference on Learning Theory (2013)

  12. Gupta, A., Krauthgamer, R., Lee, J.R.: Bounded geometries, fractals, and low-distortion embeddings. In: 44th Annual IEEE Symposium on Foundations of Computer Science, pp. 534–543 (2003)

  13. Karger, D., Ruhl, M.: Finding nearest neighbors in growth-restricted metrics. In: ACM Symposium on Theory of Computing, pp. 741–750 (2002)

  14. Kleinberg, J.: Two algorithms for nearest-neighbor search in high dimensions. In: 29th ACM Symposium on Theory of Computing (1997)

  15. Krauthgamer, R., Lee, J.: Navigating nets: simple algorithms for proximity search. In: ACM-SIAM Symposium on Discrete Algorithms (2004)

  16. Liu, T., Moore, A., Gray, A., Yang, K.: An investigation of practical approximate nearest neighbor algorithms. In: Advances in Neural Information Processing Systems (2004)

  17. Maneewongvatana, S., Mount, D.: The analysis of a probabilistic approach to nearest neighbor searching. In: Seventh International Workshop on Algorithms and Data Structures, pp. 276–286 (2001)

  18. McFee, B., Lanckriet, G.: Large-scale music similarity search with spatial trees. In: 12th Conference of the International Society for Music Retrieval (2011)

  19. Stone, C.: Consistent nonparametric regression. Ann. Stat. 5, 595–645 (1977)

Acknowledgments

The authors are grateful to the National Science Foundation for support under grant IIS-1162581, and to the anonymous reviewers for their detailed feedback.

Author information

Corresponding author

Correspondence to Sanjoy Dasgupta.

Additional information

A preliminary abstract of this work appeared in [11].

Appendices

Appendix A: Summation Lemma

Lemma 6

Suppose that for some constants \(A, B > 0\) and \(d_o \ge 1\),

$$\begin{aligned} F(m) \le A \left( \frac{B}{m} \right) ^{1/d_o} \end{aligned}$$

for all integers \(m \ge n_o\). Pick any \(0 < \beta < 1\) and define \(\ell = \log _{1/\beta } (n/n_o)\). Assume for convenience that \(\ell \) is an integer. Then:

$$\begin{aligned} \sum _{i = 0}^\ell F(\beta ^i n) \le \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o} \end{aligned}$$

and, if \(n_o \ge B(A/2)^{d_o}\),

$$\begin{aligned} \sum _{i=0}^\ell F(\beta ^i n) \ln \frac{2e}{F(\beta ^i n)} \le \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o} \left( \frac{1}{1-\beta } \ln \frac{1}{\beta } + \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right) . \end{aligned}$$

Proof

Writing the first series in reverse,

$$\begin{aligned} \sum _{i = 0}^{\ell } F(\beta ^i n) =\sum _{i=0}^\ell F\left( \frac{n_o}{\beta ^i} \right)&\le \sum _{i=0}^\ell A \left( \frac{B \beta ^i}{n_o} \right) ^{1/d_o} \\&= A \left( \frac{B}{n_o} \right) ^{1/d_o} \sum _{i=0}^\ell \beta ^{i/d_o} \\&\le \frac{A}{1-\beta ^{1/d_o}} \left( \frac{B}{n_o} \right) ^{1/d_o} \le \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o}. \end{aligned}$$

The last inequality is obtained by using

$$\begin{aligned} (1-x)^p \ge 1-px \quad \hbox {for } 0 < x < 1,\ p \ge 1 \end{aligned}$$

to get \((1 - (1-\beta )/d_o)^{d_o} \ge \beta \) and thus \(1-\beta ^{1/d_o} \ge (1-\beta )/d_o\).
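The first bound can be sanity-checked numerically. The sketch below is an illustration only: it takes \(F(m) = A (B/m)^{1/d_o}\), the largest function the hypothesis allows, and the function name `first_bound_check` and the parameter values in the usage note are assumptions, not part of the paper.

```python
def first_bound_check(A, B, d_o, n_o, beta, ell):
    """Evaluate both sides of the first bound of Lemma 6.

    Assumption: F(m) = A * (B / m) ** (1 / d_o), i.e. the hypothesis of the
    lemma holds with equality. n is chosen so that ell = log_{1/beta}(n / n_o).
    """
    n = n_o / beta**ell

    def F(m):
        return A * (B / m) ** (1.0 / d_o)

    # Left-hand side: the finite sum over i = 0, ..., ell.
    lhs = sum(F(beta**i * n) for i in range(ell + 1))

    # Intermediate bound from summing the full geometric series.
    scale = A * (B / n_o) ** (1.0 / d_o)
    geometric = scale / (1.0 - beta ** (1.0 / d_o))

    # Final bound, after using 1 - beta^(1/d_o) >= (1 - beta) / d_o.
    rhs = scale * d_o / (1.0 - beta)
    return lhs, geometric, rhs
```

For instance, `first_bound_check(A=1.0, B=1.0, d_o=3, n_o=10, beta=0.5, ell=12)` returns three values in nondecreasing order, matching the chain of inequalities in the proof.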

Now we move on to the second bound. The lower bound on \(n_o\) implies that \(A (B/m)^{1/d_o} \le 2\) for all \(m \ge n_o\). Since \(x \ln (2e/x)\) is increasing when \(x \le 2\), we have

$$\begin{aligned} \sum _{i=0}^\ell F(\beta ^i n) \ln \frac{2e}{F(\beta ^i n)}&\le \sum _{i=0}^\ell A \left( \frac{B}{\beta ^i n} \right) ^{1/d_o} \ln \frac{2e}{A (B/(\beta ^i n))^{1/d_o}}. \end{aligned}$$

The lemma now follows from algebraic manipulations that invoke the first bound as well as the inequality

$$\begin{aligned} \sum _{i = 0}^\ell i A \left( \frac{B\beta ^i}{n_o} \right) ^{1/d_o} \le \frac{A d_o^2}{(1-\beta )^2} \left( \frac{B}{n_o} \right) ^{1/d_o} , \end{aligned}$$

which in turn follows from

$$\begin{aligned} \sum _{i = 0}^\ell i \beta ^{i/d_o}&\le \sum _{i=1}^\infty i \beta ^{i/d_o} =\sum _{i=1}^\infty \sum _{j = i}^\infty \beta ^{j/d_o} =\sum _{i=1}^\infty \frac{\beta ^{i/d_o}}{1-\beta ^{1/d_o}} =\frac{\beta ^{1/d_o}}{(1-\beta ^{1/d_o})^2} \\&\le \frac{d_o^2}{(1-\beta )^2}. \end{aligned}$$
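One way to carry out the algebraic manipulations mentioned above is sketched here; it is a reconstruction, not the authors' own text. Reverse the series as in the first bound and write \(x_i = A (B\beta ^i/n_o)^{1/d_o}\), where \(x_i\) is shorthand introduced only for this sketch. Then

$$\begin{aligned} \ln \frac{2e}{x_i} = \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} + \frac{i}{d_o} \ln \frac{1}{\beta }. \end{aligned}$$

The assumption \(n_o \ge B(A/2)^{d_o}\) gives \(\frac{1}{d_o} \ln \frac{n_o}{B} \ge \ln \frac{A}{2}\), so the sum of the first two terms is at least \(\ln e = 1 > 0\), and the upper bound on \(\sum _i x_i\) can be substituted:

$$\begin{aligned} \sum _{i=0}^\ell x_i \ln \frac{2e}{x_i}&= \left( \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right) \sum _{i=0}^\ell x_i + \frac{1}{d_o} \ln \frac{1}{\beta } \sum _{i=0}^\ell i x_i \\&\le \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o} \left( \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right) + \frac{A d_o}{(1-\beta )^2} \left( \frac{B}{n_o} \right) ^{1/d_o} \ln \frac{1}{\beta }. \end{aligned}$$

Factoring out \(\frac{A d_o}{1-\beta } (B/n_o)^{1/d_o}\) yields the second bound of the lemma.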

Appendix B: Clarkson’s Lemma

Suppose we are given a finite set of points \(S \subset {\mathbb {R}}^d\). How many of these points can have a specific \(x \in S\) as one of their \(\ell \) nearest neighbors? Stone [19] showed that the answer is \(\le \ell \gamma _d\), where \(\gamma _d\) is a constant exponential in \(d\) but independent of \(|S|\) and \(\ell \). This was a key step towards establishing the universal consistency of nearest neighbor classification in Euclidean spaces.

Clarkson [9] extended this result to metric spaces of bounded doubling dimension and to approximate nearest neighbors. Before stating his result, we introduce some notation. For any point \(z \in {\mathbb {R}}^d\), any set \(A \subset {\mathbb {R}}^d\), and any integer \(\ell \ge 1\), let \(\text{ NN }_\ell (z,A)\) denote the \(\ell \)th nearest neighbor of \(z\) in \(A\), breaking ties arbitrarily. For \(\gamma \ge 1\), we say \(x \in A\) is an \((\ell ,\gamma )\)-NN of \(z\) in \(A\) if

$$\begin{aligned} \Vert x - z \Vert \le \gamma \Vert z - \text{ NN }_\ell (z,A) \Vert , \end{aligned}$$

that is, \(x\) is at most \(\gamma \) times further away than \(z\)’s \(\ell \)th nearest neighbor.
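The definition can be made concrete with a brute-force check. This is a minimal sketch for illustration: the function names are hypothetical, points are plain tuples of floats, and it assumes that \(A\) contains at least \(\ell \) points and does not contain \(z\) itself.

```python
import math

def kth_nn_distance(z, A, ell):
    """Distance from z to its ell-th nearest neighbor in A (ell >= 1).

    Assumes len(A) >= ell and that z itself is not an element of A.
    Ties are broken arbitrarily by the sort, as in the definition.
    """
    return sorted(math.dist(z, a) for a in A)[ell - 1]

def is_ell_gamma_nn(x, z, A, ell, gamma):
    """True if x is an (ell, gamma)-NN of z in A, i.e. if
    ||x - z|| <= gamma * ||z - NN_ell(z, A)||."""
    return math.dist(x, z) <= gamma * kth_nn_distance(z, A, ell)
```

With `A = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]` and `z = (0.4, 0.0)`, the point `(1.0, 0.0)` is not a \((1,1)\)-NN of `z` (it is farther than the nearest neighbor), but it is a \((1,2)\)-NN, since its distance is within twice the nearest-neighbor distance.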

Recall also that we define the aspect ratio of a finite set \(S \subset {\mathbb {R}}^d\) to be

$$\begin{aligned} \varDelta (S) =\frac{\max _{x,y \in S} \Vert x-y\Vert }{\min _{x, y \in S, x \ne y} \Vert x-y\Vert }. \end{aligned}$$
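As a small illustration of this definition, the aspect ratio can be computed directly from all pairwise distances. The function name `aspect_ratio` is hypothetical, and the sketch assumes \(S\) holds at least two points, all distinct, so that the minimum distance is positive.

```python
import math
from itertools import combinations

def aspect_ratio(S):
    """Ratio of the largest to the smallest interpoint distance in S.

    Assumes S has at least two points and no duplicates.
    Runs in time quadratic in len(S).
    """
    dists = [math.dist(x, y) for x, y in combinations(S, 2)]
    return max(dists) / min(dists)
```

For the three collinear points `(0.0, 0.0)`, `(1.0, 0.0)`, `(4.0, 0.0)` the largest distance is 4 and the smallest is 1, so the aspect ratio is 4.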

The following is shown in [9, Lemma 5.1].

Lemma 7

Pick any integer \(\ell \ge 1\) and any \(\gamma \ge 1\). If a finite set \(S \subset {\mathbb {R}}^d\) has doubling dimension \(d_o\), then any \(s \in S\) can be an \((\ell ,\gamma )\)-NN of at most \((8\gamma )^{d_o} \ell \log _2 \varDelta (S)\) other points of \(S\).

Proof

Pick any \(s \in S\) and any \(r > 0\). Consider the annulus \(A_r = \{x \in S: r < \Vert x - s\Vert \le 2r\}\). By Lemma 8, applied to the ball of radius \(2r\) around \(s\) with \(\epsilon = r/(2\gamma )\), \(A_r\) can be covered by \(\le (8 \gamma )^{d_o}\) balls of radius \(r/(2\gamma )\). Consider any such ball \(B\): if \(B \cap A_r\) contains \(\ge \ell +1\) points, then each of these points has \(\ell \) neighbors within distance \(r/\gamma \), and thus does not have \(s\) as an \((\ell , \gamma )\)-NN, because \(s\) lies at distance greater than \(r\). Therefore, there are at most \(\ell (8 \gamma )^{d_o}\) points in \(A_r\) that have \(s\) as an \((\ell , \gamma )\)-NN.

We finish by noticing that by the definition of aspect ratio, \(S\) can be covered by \(\log _2 \varDelta (S)\) annuli \(A_r\), with successively doubling radii.
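The quantity bounded by Lemma 7 can be computed by brute force on small inputs. The sketch below is illustrative only: the function names are hypothetical, and it adopts the reading that the neighbors of a point \(z \in S\) are taken in \(S\) with \(z\) itself removed, which is an assumption about the intended convention.

```python
import math

def kth_nn_distance(z, points, ell):
    """Distance from z to its ell-th nearest neighbor among `points`."""
    return sorted(math.dist(z, p) for p in points)[ell - 1]

def count_points_served(s, S, ell, gamma):
    """Count the points z in S, z != s, that have s as an (ell, gamma)-NN.

    Assumption: neighbors of z are searched in S with z removed, and
    S has at least ell + 1 distinct points.
    """
    count = 0
    for z in S:
        if z == s:
            continue
        others = [p for p in S if p != z]
        if math.dist(s, z) <= gamma * kth_nn_distance(z, others, ell):
            count += 1
    return count

# Ten evenly spaced points on a line, written as 1-tuples.
S = [(float(i),) for i in range(10)]
```

On this example the endpoint `(0.0,)` is an exact nearest neighbor (\(\ell = 1\), \(\gamma = 1\)) of one other point, and a \((1, 2)\)-NN of two other points; the count grows with \(\gamma \), as the bound suggests.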

Lemma 8

Suppose \(S \subset {\mathbb {R}}^d\) has doubling dimension \(d_o\). Pick any \(r \ge \epsilon > 0\). If \(B\) is a ball of radius \(r\), then \(S \cap B\) can be covered by \((2r/\epsilon )^{d_o}\) balls of radius \(\epsilon \).

Proof

By the definition of doubling dimension, \(S \cap B\) can be covered by \(2^{d_o}\) balls of radius \(r/2\), and thus \(2^{2d_o}\) balls of radius \(r/4\), and so on. More generally, \(S \cap B\) can be covered by \(2^{\ell d_o}\) balls of radius \(r/2^\ell \) for any integer \(\ell \ge 0\). Now take \(\ell = \lceil \log _2 (r/\epsilon ) \rceil \le \log _2 (2r/\epsilon )\).
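The arithmetic in the final step can be checked with a few lines of code. This sketch only evaluates the quantities in the proof; the function name `covering_levels` and the sample values in the usage note are assumptions chosen for illustration.

```python
import math

def covering_levels(r, eps, d_o):
    """Quantities from the proof of Lemma 8, for r >= eps > 0.

    Returns (ell, num_balls, radius, bound) where
      ell       = ceil(log2(r / eps)) halving steps,
      num_balls = 2^(ell * d_o) balls produced by those steps,
      radius    = r / 2^ell, the radius of each ball (at most eps),
      bound     = (2r / eps)^d_o, the count claimed by the lemma.
    """
    ell = math.ceil(math.log2(r / eps))
    num_balls = 2 ** (ell * d_o)
    radius = r / 2**ell
    bound = (2 * r / eps) ** d_o
    return ell, num_balls, radius, bound
```

With `r = 5`, `eps = 1` and `d_o = 2`, three halving steps are needed, giving 64 balls of radius 0.625, which is within the claimed count of \((2r/\epsilon )^{d_o} = 100\) balls of radius at most \(\epsilon \).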

Cite this article

Dasgupta, S., Sinha, K. Randomized Partition Trees for Nearest Neighbor Search. Algorithmica 72, 237–263 (2015). https://doi.org/10.1007/s00453-014-9885-5

  • DOI: https://doi.org/10.1007/s00453-014-9885-5
