1 Introduction

Nearest neighbor search is one of the most fundamental problems in computational geometry and machine learning. It has been broadly applied in real-world scenarios such as data compression (Gersho and Gray, 2012), speech recognition (Makhoul et al., 1985), and information retrieval (Jegou et al., 2011). As a concrete example, for customers without any shopping history, it is common to look up customers in the database with similar profiles in order to make item recommendations.

There are many early works on (exact) nearest neighbor search, such as the k-d tree and the R-tree (Bentley, 1975; Samet, 1990, 2006). These methods perform very well when the data lie in a low-dimensional space, say three dimensions, but become computationally intractable in high-dimensional spaces (Arya et al., 1995). In fact, an early attempt by Dobkin and Lipton (1976) provided the first algorithm for nearest neighbor search in d-dimensional space, which requires \(O(n^{2^{d+1}})\) preprocessing time (doubly exponential in the dimension d) and \(O(2^d\log n)\) query time. This phenomenon is known as the curse of dimensionality, and to tackle the problem in high dimensions, the notion of approximate nearest neighbor was proposed as a practical alternative (Arya and Mount, 1993). More formally, given any approximation factor \(\epsilon >0\), we say that a point \(\varvec{p}\) is an \(\epsilon \)-nearest neighbor of a given query \(\varvec{q}\) if the ratio of the distance from \(\varvec{p}\) to \(\varvec{q}\) to the distance from \(\varvec{q}\) to its exact nearest neighbor is at most \((1+\epsilon )\).
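As a quick illustration of this definition, the following Python sketch (a minimal check of our own, with illustrative function and variable names not taken from any cited algorithm) tests whether a candidate point satisfies the \(\epsilon \)-nearest-neighbor criterion:

import numpy as np

def is_eps_nearest_neighbor(p, q, data, eps):
    # p is an eps-nearest neighbor of q if its distance to q is within a
    # (1 + eps) factor of the distance from q to its exact nearest neighbor.
    exact = min(np.linalg.norm(q - x) for x in data)
    return np.linalg.norm(q - p) <= (1.0 + eps) * exact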

We consider data that, as in many real-world applications, are perturbed by noise (Abdullah et al., 2014). Formally, the observed data set \(\varvec{P}=\{\varvec{p}_1,\cdots , \varvec{p}_n\}\) is generated from a clean data set \(\varvec{X}=\{\varvec{x}_1,\cdots , \varvec{x}_n\}\) by random noise corruption, that is,

$$\begin{aligned} \varvec{p}_i = \varvec{x}_i + \varvec{t}_i,~ \forall \ i=1, \cdots , n. \end{aligned}$$
(1)

The query \(\varvec{q}\) is the clean query point \(\varvec{y}\) corrupted by the same type of noise, i.e., \(\varvec{q}= \varvec{y}+ \varvec{t}\). Suppose \(\varvec{x}^*\) is the (exact) nearest neighbor of \(\varvec{y}\), that is,

$$\begin{aligned} \left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2} \le 1 ~\text {and} ~\forall \varvec{x}\in \varvec{X}\setminus \{\varvec{x}^*\}, \left\Vert \varvec{y}-\varvec{x} \right\Vert _{2} \ge 1+\epsilon , \end{aligned}$$
(2)

where \(\left\Vert \cdot \right\Vert _{2}\) denotes the \(\ell _2\)-norm. We assume that the noise is bounded, in the sense that \(\max \{\left\Vert \varvec{t}_i \right\Vert _{2}, \left\Vert \varvec{t} \right\Vert _{2}\} \le \epsilon /16\). Although Gaussian noise may seem the most natural assumption, both Gaussian and bounded random variables are sub-Gaussian, so they admit the same tail bound. In this smoothed problem setting, Indyk and Motwani (1998) proposed the celebrated locality-sensitive hashing (LSH) algorithm, which achieves sub-linear query time. Within the locality-sensitive hashing framework, a large body of work has shown that efficient computation is possible (Andoni and Indyk, 2008; Andoni et al., 2014, 2018). Notably, the construction of the hash functions in LSH is independent of the data.
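For concreteness, the noise model of Eqs. (1)–(2) can be simulated as in the following sketch; the particular way of drawing bounded noise (rescaling a Gaussian vector) is an illustrative assumption of ours, not a prescription of the model.

import numpy as np

rng = np.random.default_rng(0)

def bounded_noise(dim, bound, rng):
    # Draw a random direction and rescale so that the l2-norm is at most `bound`.
    t = rng.normal(size=dim)
    return t * min(1.0, bound / np.linalg.norm(t))

n, d, eps = 1000, 64, 0.2
X = rng.normal(size=(n, d))                                   # clean database
y = rng.normal(size=d)                                        # clean query
T = np.stack([bounded_noise(d, eps / 16, rng) for _ in range(n)])
P = X + T                                                     # Eq. (1): p_i = x_i + t_i
q = y + bounded_noise(d, eps / 16, rng)                       # observed query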

At the other end of the spectrum, algorithms that incorporate machine learning techniques to learn the hash functions from the data have attracted considerable interest in recent years (Kulis and Darrell, 2009; Liu et al., 2011; Kong and Li, 2012). For example, spectral graph methods have been widely studied to learn binary codes that preserve the similarity structure of the database (Weiss et al., 2009; Abdullah et al., 2014). Supervised hashing methods learn binary code representations of samples that are correlated with their labels (Shen et al., 2015). Recent works on representation learning using deep neural networks have shown practical value in various tasks, which has motivated a surge of works utilizing convolutional neural networks as hash functions; see, for example, Çakir et al. (2018).

Though the learning-based approaches outperform locality-sensitive-hashing-based methods in many applications (Jegou et al., 2011; Xia et al., 2015), there seems to be a lack of theoretical understanding of the success of many existing algorithms. In this paper, we propose a data-dependent learning algorithm for approximate nearest neighbor search, and we aim to resolve two important technical barriers: (1) approximating the low-dimensional space efficiently; and (2) providing a theoretical guarantee that mutual distances are preserved in the low-dimensional space, that is, if data points are neighbors in the original space, they remain close to each other in the low-dimensional space. Abdullah et al. (2014) provided the first justification for this disparity, directly utilizing principal component analysis with preprocessing time \(O(nd^2+d^3)\). In our algorithm, we learn the projection matrix by leverage-score-based sampling, which is more computationally efficient (Alaoui and Mahoney, 2015; Musco and Musco, 2015; Cohen et al., 2016; Musco and Musco, 2017). In addition, it has been demonstrated that leverage-score-based sampling approaches often give strong provable guarantees for subspace approximation and statistical performance in downstream applications (Alaoui and Mahoney, 2015; Rudi et al., 2015; Gittens and Mahoney, 2016).

1.1 Summary of our contributions

In this work, we present a learning-to-hash algorithm based on ridge leverage scores: it produces a hash function that provably matches the accuracy of principal component analysis methods, and the obtained low-dimensional subspace preserves the geometric structure of the database. The advantage of our method is twofold. First, approximating the low-dimensional space is significantly more efficient than in many existing spectral methods (Weiss et al., 2009; Abdullah et al., 2014), as the sampling technique used for subspace learning operates on s landmark points. The preprocessing, in particular, takes \(O(n\cdot s^2)\) time, where \(s \ll \min (n,d)\) is the number of landmarks. Second, we show that a \((1+\epsilon /4)\)-approximate nearest neighbor of the query can be obtained with high probability.

In terms of empirical results, we evaluate the performance of our algorithm on real-world applications in computer vision and natural language understanding. The experiments are conducted on real-world data sets, including MNIST, the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), the Microsoft Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), the Stanford Question Answering Natural Language Inference Corpus (QNLI) (Rajpurkar et al., 2016), and Glove (Pennington et al., 2014). Our algorithm achieves the best performance across various hash code lengths on all data sets compared with the state-of-the-art algorithms.

1.2 Roadmap

In Sect. 2, we present a more concrete literature review and state its connection to this work. Section 3 presents the main algorithm, and its performance guarantees are given in Sect. 4. A comprehensive empirical study is carried out in Sect. 5. We conclude the paper in Sect. 6. The proof details can be found in the Appendix.

Notation. We use lowercase letters to denote vectors and capital letters for matrices. For a vector \(\varvec{q}\), we denote its \(\ell _2\)-norm by \(\left\Vert \varvec{q} \right\Vert _{2}\). We reserve \(\varvec{P}\in \mathbb {R}^{n\times d}\) for the database with n data points in a d-dimensional feature space. We use \(\varvec{p}_i^{\top }\in \mathbb {R}^{d}\) to denote the i-th row of \(\varvec{P}\), that is, the i-th sample in \(\varvec{P}\). We use two matrix norms, the Frobenius norm and the spectral norm, defined as \(\left\Vert \varvec{P} \right\Vert _{F}=\sqrt{\sum _{i=1}^{d}\sigma _i(\varvec{P})^2}\) and \(\left\Vert \varvec{P} \right\Vert _{2}=\sigma _{1}(\varvec{P})\) respectively, where \(\sigma _i(\varvec{P})\) denotes the i-th singular value of \(\varvec{P}\) in descending order (\(\sigma _1(\varvec{P})\ge \sigma _2(\varvec{P}) \ge \dots \ge \sigma _d(\varvec{P}) \ge 0\)). The distance between a data point \(\varvec{q}\) and a subspace \(\varvec{U}\) is defined as \(d(\varvec{q},\varvec{U}) := \inf _{\varvec{y}\in \varvec{U}}\left\Vert \varvec{q}-\varvec{y} \right\Vert _{2} = \left\Vert \varvec{q}-\varvec{q}_{{\varvec{U}}} \right\Vert _{2}\), where \(\varvec{q}_{{\varvec{U}}}\) is the orthogonal projection of \(\varvec{q}\) onto the subspace \({\varvec{U}}\). When we say a subspace is k-dimensional, we mean its intrinsic dimension is k.
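As an illustration of this notation, the distance \(d(\varvec{q},\varvec{U})\) can be computed from an orthonormal basis of the subspace as in the following sketch (it assumes, as an illustrative convention of ours, that the basis is supplied as the columns of a matrix U):

import numpy as np

def dist_to_subspace(q, U):
    # U has orthonormal columns spanning the subspace; the orthogonal projection
    # of q onto the subspace is q_U = U U^T q, and d(q, U) = ||q - q_U||_2.
    q_U = U @ (U.T @ q)
    return np.linalg.norm(q - q_U)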

2 Related works

The core of nearest neighbor search is to find the data point in the database that is closest to the query, while approximate nearest neighbor search returns data points within \((1+\epsilon )\cdot dist\) of the query, where dist is the distance between the query and its nearest neighbor. In either case, the search is usually performed on a collection of data points; the process of organizing the database into a suitable data structure is called preprocessing, and it is assumed to be independent of the number of queries. Since the straightforward approach is brute-force search, which takes O(n) time even in one-dimensional space, more efficient algorithms construct data structures that make queries efficient in terms of the space and time cost of preprocessing and retrieval. For example, binary search builds a balanced binary tree in \(O(n\log n)\) time and answers a query with at most \(\lfloor \log n \rfloor +1\) comparisons (Knuth, 1973). A plethora of related algorithms have been proposed in the literature, such as k-d trees and R-trees (Bentley, 1975; Samet, 1990; Sellis et al., 1997; Samet, 2006). These approaches are usually based on computational geometry. However, if the number of dimensions exceeds about 20, searching in k-d trees and related structures requires inspecting a large fraction of the database, thereby doing no better than brute-force linear search (Gionis et al., 1999). Therefore, approximate nearest neighbor search has attracted attention for practical problems with high-dimensional data.

Table 1 Summary of state-of-the-art results in terms of space and time bounds for approximate nearest neighbor search, where k is the hash code length, s is the number of landmarks, and d is the feature dimension of a database with n data points

Existing algorithms for approximate nearest neighbor search can be categorized into locality-sensitive hashing families and learning-based hashing, depending on how the data structure is constructed. Indyk and Motwani (1998) introduced the idea of locality-sensitive hashing. Many related works discuss how to choose the parameters L (the number of buckets), \(r_1\) (the radius of the ball centered at \(\varvec{q}\)), and k (the length of the hash code) to achieve a low failure probability. For example, Andoni and Indyk (2008) proposed an algorithm that uses linear random projections to reduce the feature dimension to k (\(k=O(\log n)\)), so that approximate nearest neighbors can be returned in sublinear query time using nearly linear space. Andoni et al. (2014) proposed a data-dependent hashing function with a Johnson-Lindenstrauss dimension reduction procedure and obtained an improved result. Andoni et al. (2018) presented a data structure for general symmetric norms. Very recently, Andoni et al. (2021) showed improved data structures for high-dimensional approximate nearest neighbor search under \(\ell _p\) distances for large values of p and under generalized Hamming distances. The related space and time bounds for the Euclidean distance are summarized in Table 1.

Learning-based hashing has seen a recent surge of interest (Gong and Lazebnik, 2011; Weiss et al., 2012; Erin Liong et al., 2015; Han et al., 2015; Liu et al., 2016). Much of this excitement centers around the discovery that these approaches achieve outstanding performance in real-world applications, such as computer vision (Xia et al., 2015) and information retrieval (Jegou et al., 2011). Some works focus on supervised binary code projection methods (Liu et al., 2014; Shen et al., 2015). For example, sparse projection (SP) introduced sparse projections for binary encoding, minimizing the distortion and adopting variable-splitting techniques in the optimization (Xia et al., 2015). Spectral analysis based unsupervised methods have attracted a lot of attention since labeled data is precious. For example, spectral hashing utilized a subset of thresholded eigenvectors of the graph Laplacian matrix (Weiss et al., 2009). Iterative quantization (ITQ) proposed an efficient way to find the hash code by minimizing the quantization error of mapping the data to the vertices of a zero-centered binary hypercube (Gong and Lazebnik, 2011). Jegou et al. (2011) decomposed the space into a Cartesian product of low-dimensional subspaces, with the hash code composed of the subspace quantization indices. Liu et al. (2011) assumed that the data reside on a low-dimensional manifold and proposed a graph-based hashing method. Isotropic hashing (ISO) found a hash projection function with equal variances across dimensions (Kong and Li, 2012). Multidimensional spectral hashing (MDSH) learned binary codes by reconstructing the affinity between data points rather than computing their distances (Weiss et al., 2012); the algorithm utilizes a spectral relaxation in which the bits are given by thresholded eigenvectors of the affinity matrix. Bilinear projection based binary codes (BPBC) learned similarity-preserving binary codes through compact bilinear projections instead of a single large projection matrix (Gong et al., 2013). Circulant binary embedding (CBE) learned data-dependent circulant projections by minimizing an objective in both the original and Fourier domains (Yu et al., 2014). Scalable graph hashing (SGH) was proposed to approximate the whole graph without explicitly computing the similarity graph matrix, optimizing a sequential learning function to learn compact hash codes in a bit-wise manner (Jiang and Li, 2015). We follow this line of research and propose an inexact spectral analysis for approximate nearest neighbor search. The experimental results demonstrate the superiority of our algorithm compared with the state-of-the-art learning-based hashing approaches mentioned in this section.

3 Main algorithm

In this section, we elaborate on our approach, which consists of two steps: Algorithm 1 samples the landmark points to construct the data structure that enables efficient retrieval, and Algorithm 2 performs approximate nearest neighbor search.

3.1 Overview

Our pipeline consists of learning the hash codes and retrieval, where the primary idea is to find a good embedding of the original data points under which mutual distances are well controlled with overwhelming probability. A straightforward approach is to utilize principal component analysis (PCA). However, finding the exact principal components is computationally slow for large-scale problems. Therefore, we propose to first select a manageable number of landmark points and then run PCA on them. The selection is based on the ridge leverage score, which is a good measure of the importance of data points (Alaoui and Mahoney, 2015).

Definition 1

(Ridge leverage score) For any \(\lambda >0\), the \(\lambda \)-ridge leverage score of the i-th row of \(\varvec{P}\in \mathbb {R}^{n\times d}\) is defined as:

$$\begin{aligned} l_i = \varvec{p}_i(\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I})^{-1}\varvec{p}_i^{\top }, \end{aligned}$$
(3)

where \(\mathbf {I}\) is the \(d\times d\) identity matrix.
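For reference, a direct (and deliberately naive, \(O(nd^2+d^3)\)-time) computation of Definition 1 might look as follows; Algorithm 1 is designed precisely to avoid this exact computation:

import numpy as np

def ridge_leverage_scores(P, lam):
    # Exact lambda-ridge leverage scores l_i = p_i (P^T P + lam I)^{-1} p_i^T,
    # one score per row of P (Definition 1).
    d = P.shape[1]
    M = np.linalg.inv(P.T @ P + lam * np.eye(d))  # d x d regularized Gram inverse
    return np.einsum('ij,jk,ik->i', P, M, P)      # row-wise quadratic forms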

To be more concrete, when constructing the principal components of the training set, our algorithm runs in multiple iterations: in each iteration a fraction of the training data is sampled and some of these points are selected as landmarks. The low-dimensional subspace is learned from the selected landmarks. The algorithm terminates once all training data have been evaluated. When a new query arrives, it is projected onto the learned subspace, through which retrieval is efficient.

Algorithm 1 (pseudocode figure not reproduced here)

3.2 Learning to hash

Algorithm 1 learns a low-dimensional projection matrix \(\varvec{Z}\in \mathbb {R}^{d\times k}\) that can be applied to embed the data. It consists of two major steps: Phase I selects the landmark points, as indicated by the sampling matrix \(\tilde{\varvec{S}}\in \mathbb {R}^{n\times s}\), and Phase II runs PCA on the selected landmark points to return the low-dimensional projection matrix. The algorithm starts by checking whether the problem is large-scale, that is, whether the number of samples in \(\varvec{P}\) is greater than \(192\log (1/\delta )\). If not, we can use PCA to obtain \(\varvec{Z}\) directly; otherwise, the algorithm enters the while loop of Phase I to sample important data points.

The key observation behind our sampling approach is that uniform sampling is practical, but it enjoys theoretical guarantees only under strong regularity or incoherence assumptions on the data (Gittens, 2011). Ridge leverage scores, on the other hand, evaluate the importance of data points and have shown practical impact in downstream applications; however, computing exact ridge leverage scores is often slow. We therefore propose to combine these two widely used schemes.

First, note that we aim to estimate the ridge leverage scores of all data points approximately in an iterative manner, with each data point evaluated only once. To this end, the number of iterations T is initialized as \(O(\log n)\). In each iteration, we randomly draw half of the data points that have not yet been accessed. The iterations terminate when the size of the remaining data is less than \(192\log (1/\delta )\).

In particular, in each iteration, we construct a uniform sampling matrix \(\bar{\varvec{S}}\) by selecting data points uniformly at random with probability 1/2, while \(\tilde{\varvec{S}}\) is the sampling matrix learned from the approximate ridge leverage scores. Each column of \(\tilde{\varvec{S}}\) has one nonzero element, which indicates the index of the selected sample. In each iteration, we uniformly sample a subset \(\mathcal {J}_i\) and approximate the ridge leverage score of the i-th sample as

$$\begin{aligned} \tilde{l}_i=\varvec{p}_i (\varvec{P}^{\top }\tilde{\varvec{S}}\tilde{\varvec{S}}^{\top }\varvec{P}+\lambda \mathbf {I})^{-1}\varvec{p}_i^{\top }. \end{aligned}$$
(4)

Equation (4) is a good approximation of the original ridge leverage score defined in Definition 1. Using the fact that \(\varvec{P}^{\top }\tilde{\varvec{S}}\tilde{\varvec{S}}^{\top }\varvec{P}\preceq \varvec{P}^{\top }\varvec{P}\), \(\tilde{l}_{i}\) is an upper bound on the ridge leverage score \(l_{i}\), i.e.,

$$\begin{aligned} \begin{aligned} \tilde{l}_{i}=\varvec{p}_i(\varvec{P}^{\top }\tilde{\varvec{S}}\tilde{\varvec{S}}^{\top }\varvec{P}+\lambda \mathbf {I})^{-1}\varvec{p}_i^{\top }\ge \varvec{p}_i (\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I})^{-1}\varvec{p}_i^{\top }=l_i. \end{aligned} \end{aligned}$$
(5)

Then we compute the sampling probability of each data point based on the approximate leverage scores as follows:

$$\begin{aligned} \eta _i = \min (1,16\tilde{l}_i\log (\sum _i\tilde{l}_i /\delta )). \end{aligned}$$
(6)

Each data point is selected as a landmark with probability \(\eta _i\). The column of the sampling matrix \(\varvec{S}\) corresponding to a selected landmark is weighted by \(1/\sqrt{\eta _i}\), and \(\varvec{S}\) is assigned to \(\tilde{\varvec{S}}\) as the selected sampling matrix for the current iteration. We then obtain the next data partition \(\mathcal {J}_t\) by uniform sampling. At the end of the algorithm, we output a partition \(\{\mathcal {J}_1, \cdots , \mathcal {J}_T\}\) of the data set.
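The following sketch illustrates one sampling iteration in the spirit of Eqs. (4)–(6); the index and weight bookkeeping is our own simplification and does not reproduce the exact pseudocode of Algorithm 1.

import numpy as np

def sample_landmarks(P, prev_idx, prev_w, candidate_idx, lam, delta, rng):
    # Approximate the ridge leverage scores of the candidate rows using the
    # current weighted landmark set (Eq. 4), convert them to inclusion
    # probabilities (Eq. 6), and draw the new landmarks with their weights.
    d = P.shape[1]
    SP = P[prev_idx] * prev_w[:, None]              # weighted landmark rows, S^T P
    M = np.linalg.inv(SP.T @ SP + lam * np.eye(d))  # (P^T S S^T P + lam I)^{-1}
    cand = P[candidate_idx]
    l_tilde = np.einsum('ij,jk,ik->i', cand, M, cand)                      # Eq. (4)
    eta = np.minimum(1.0, 16.0 * l_tilde * np.log(l_tilde.sum() / delta))  # Eq. (6)
    keep = rng.random(len(candidate_idx)) < eta
    return np.asarray(candidate_idx)[keep], 1.0 / np.sqrt(eta[keep])       # weights 1/sqrt(eta_i)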

Phase II seeks a low-dimensional projection matrix \(\varvec{Z}\) based on the selected landmarks. A straightforward way to learn \(\varvec{Z}\) is to optimize the following objective function:

$$\begin{aligned} \min _{\varvec{Z}} \left\Vert \varvec{P}- \varvec{P}\varvec{Z}\varvec{Z}^{\top } \right\Vert _{F}^2. \end{aligned}$$
(7)

As \(\varvec{Z}\) always lies in the column span of \(\varvec{P}^{\top }\), it can be represented by a matrix \(\varvec{Y}\in \mathbb {R}^{n\times k}\) such that \(\varvec{Z}= \varvec{P}^{\top }\varvec{Y}\). We re-parameterize by writing \(\varvec{Y}=\varvec{K}^{-1/2}\varvec{W}\) where \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\), thus \(\varvec{Z}= \varvec{P}^{\top }\varvec{K}^{-1/2}\varvec{W}\). Recall that Phase I selects s landmarks encoded by \(\varvec{S}\). Let \(\Phi \) be the orthogonal projection onto the row span of \(\varvec{S}^{\top }\varvec{P}\), that is, \(\Phi =\varvec{P}^{\top }\varvec{S}(\varvec{S}^{\top }\varvec{P}\varvec{P}^{\top }\varvec{S})^+\varvec{S}^{\top }\varvec{P}\); we can then approximate the database matrix as \(\tilde{\varvec{P}}{\mathop {=}\limits ^{\text {def}}}\varvec{P}\Phi \). Since \(\Phi \) is an orthogonal projection, \(\Phi \Phi ^{\top }=\Phi ^2=\Phi \), and we can approximate \(\varvec{K}\) by \(\tilde{\varvec{K}}= \tilde{\varvec{P}}\tilde{\varvec{P}}^{\top }=\varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\). The projection matrix then takes the form \(\varvec{Z}=\Phi \varvec{P}^{\top }\tilde{\varvec{K}}^{-1/2}\tilde{\varvec{W}}=\varvec{P}^{\top }\varvec{S}(\varvec{S}^{\top }\varvec{P}\varvec{P}^{\top }\varvec{S})^+\varvec{S}^{\top }\varvec{P}\tilde{\varvec{W}}\), where \(\tilde{\varvec{W}}\) minimizes the following function:

$$\begin{aligned} {{\,\mathrm{tr}\,}}(\tilde{\varvec{K}})-{{\,\mathrm{tr}\,}}(\varvec{W}\varvec{W}^{\top }\tilde{\varvec{K}}\varvec{W}\varvec{W}^{\top }). \end{aligned}$$
(8)

The optimization of (7) is equivalent to minimizing the above objective, which is standard in the literature (Woodruff et al., 2014). Since \(\varvec{W}\) can be taken as the top k eigenvectors of \(\varvec{K}\), we approximate it by performing a singular value decomposition of \(\varvec{P}\varvec{P}^{\top }\varvec{S}\), which yields the matrix \(\Sigma _k\) in Algorithm 1.
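A simplified sketch of Phase II is given below: it runs PCA only on the weighted landmark rows and returns the top-k right singular vectors as \(\varvec{Z}\). This follows the high-level description above; the paper's exact construction, which additionally routes through the Nyström factor \((\varvec{S}^{\top }\varvec{P}\varvec{P}^{\top }\varvec{S})^+\), is not reproduced here.

import numpy as np

def learn_projection(P, landmark_idx, landmark_w, k):
    # PCA on the s weighted landmark rows only: the top-k right singular
    # vectors of S^T P give a d x k projection matrix Z with orthonormal columns.
    SP = P[landmark_idx] * landmark_w[:, None]   # s x d weighted landmark matrix
    _, _, Vt = np.linalg.svd(SP, full_matrices=False)
    return Vt[:k].T                              # Z in R^{d x k}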

3.3 Retrieval

In the retrieval phase, the hash codes of the data points are obtained from the projection matrix \(\varvec{Z}\): the hash code of a data point \(\varvec{p}^{\top }\in \mathbb {R}^{d}\) is \(h(\varvec{p}) = \text {sign}(\varvec{p}^{\top }\varvec{Z})\). The hash code of the query is computed in the same way. The near neighbors of the query are the data points whose hash codes collide with that of the query; neighbors can also be retrieved within a certain Hamming-distance radius. The search procedure is performed on each data subset of \(\{\mathcal {J}_1,\cdots ,\mathcal {J}_T\}\) in parallel.

As shown in Algorithm 2, we set m as the desired number of approximate nearest neighbors to return. First, we compute the hash codes of the data points in \(\varvec{P}\) and of the query via the projection matrix \(\varvec{Z}\). The data points whose codes collide with the query code are considered near neighbors of the query point. Since the data set \(\varvec{P}\) is partitioned into subsets \(\{\mathcal {J}_i\}_{i=1}^{T}\) with \(T =O(\log n)\), the search over the subsets can be carried out simultaneously. The search procedure terminates when the desired number of neighbors has been returned. As shown in Theorem 3, an approximate nearest neighbor is returned from the low-dimensional space with high probability.
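A minimal sketch of the retrieval step is shown below (the parallel search over the partition \(\{\mathcal {J}_1,\cdots ,\mathcal {J}_T\}\) is omitted, and the function names are ours):

import numpy as np

def hash_codes(X, Z):
    # Binary codes h(x) = sign(x^T Z), stored as a boolean matrix.
    return (X @ Z) >= 0

def retrieve(P, q, Z, radius=2, m=10):
    # Return up to m database indices whose codes are within the given Hamming
    # radius of the query code (radius 0 corresponds to exact collisions).
    codes = hash_codes(P, Z)
    q_code = hash_codes(q[None, :], Z)[0]
    hamming = (codes != q_code).sum(axis=1)
    candidates = np.where(hamming <= radius)[0]
    return candidates[np.argsort(hamming[candidates])][:m]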

Algorithm 2 (pseudocode figure not reproduced here)

3.4 Time and memory cost

Phase I in Algorithm 1 performs at most \(T=O(\log n)\) iterations in total; after these iterations, every data point has been assigned to some group. The time cost of the iterative procedure is dominated by the ridge leverage score computation, which takes \(O(ns^2)\) time. Since the number of remaining points is cut in half at each iteration, the total running time is \(O(ns^2+\frac{ns^2}{2}+\frac{ns^2}{4}+\cdots ) =O(ns^2)\). Computing the top k eigenvectors of \(\varvec{P}\varvec{P}^{\top }\varvec{S}\) takes \(O(ns^2)\) time. Since \(\varvec{S}\) has \(O(\frac{k}{\epsilon }\log \frac{k}{\delta \epsilon })\) columns, the eigenvector computation used to obtain the low-dimensional projection matrix \(\varvec{Z}\) can be performed very efficiently. The construction of \(\varvec{Z}\) takes \(O(s^3+s^2)\) time. Hence, the total time complexity of Algorithm 1 is \(O((s^3+ns^2)\cdot \log n)\). Recall that the time cost of spectral analysis is usually polynomial in n or d, for example \(O(nd^2+d^3)\) (Abdullah et al., 2014); our algorithm is clearly more efficient. In terms of memory cost, storing \(\varvec{P}^{\top }\varvec{P}\) requires \(O(d^2)\) extra space, which is used in the ridge leverage score estimation. Searching for the neighbors of a query point as in Algorithm 2 requires storing the binary codes of all training data, with space O(nk) and query time \(O(k\cdot \log n)\).

3.5 Hyper-parameter setting

Algorithm 1 learns a low-dimensional projection matrix \(\varvec{Z}\in \mathbb {R}^{d\times k}\), where d is the feature dimension of the data \(\varvec{P}\) and k is the dimension of the projected space, \(k< d\). The while-loop in Phase I terminates within \(T=O(\log n)\) iterations, as the uniform sampling selects half of the samples from \(\Omega \) at each iteration. We assume that the data \(\varvec{P}\) live in a low-dimensional space and that k is the rank of the data matrix. After projecting the data, we apply the sign function to obtain the hash code, hence k equals the length of the hash code. The parameter k is tuned in the range [0, d]. The input parameters of Algorithm 1 are \(\lambda =\frac{\epsilon }{k}\sum _{i=k+1}^n\sigma _{i}(\varvec{K})\), \(\epsilon \), and \(\delta \), which are used to obtain the sampling matrix \(\varvec{S}\in \mathbb {R}^{n\times s}\); here s is the number of sampled data points, which is of order \(\frac{k}{\epsilon } \log \frac{k}{\delta \epsilon }\). The reason is that \(s\le 2\sum _i \eta _i\) with probability \(1-\delta \) by Lemma 6. If the ridge leverage scores are computed exactly, we can bound \(\sum _i l_i\le \frac{2k}{\epsilon }\), as shown in Lemma 9 of Appendix A. Accordingly, \(\sum _i \eta _i\le 32\frac{k}{\epsilon }\log \frac{k}{\delta \epsilon }\), as designed.

If the number of data points satisfies \(n < 192\log (1/\delta )\), the while-loop is skipped. The threshold of \(192\log (1/\delta )\) samples follows the simplified Chernoff bounds in Mitzenmacher and Upfal (2017). That is, when \(n\ge 192\log (1/\delta )\), we have \({\mathbb E}|\bar{S}|\ge 96\log (1/\delta )\) and hence:

$$\begin{aligned} \Pr (1\le |\bar{S}| \le 0.56n) \ge 1-\delta , \end{aligned}$$
(9)

as long as \(\delta \le 1/32\). The while-loop then continues on the index set \(\Omega \), whose size is at least 1 and at most 0.56n. Accordingly, Theorem 1 holds for every data subset \(\mathcal {J}\) of size between 1 and \(n-1\) with probability \(1-\delta \).

The parameter \(\lambda \) used to approximate the ridge leverage scores is initialized as \( \frac{\epsilon }{k}\sum _{i=k+1}^n \sigma _i(\varvec{P}\varvec{P}^{\top })\). This yields the \((1+2\epsilon )\) relative Frobenius-norm error guarantee between the approximated low-rank subspace and the ground truth. The quantity \(192\log ( 1/\delta )\) is the minimum sample size needed to compute the leverage scores. We assume that the number of samples in \(\varvec{P}\) is larger than \(192\log (1/\delta )\); otherwise the low-rank matrix can be computed by singular value decomposition directly.
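The two quantities above can be instantiated as in the following sketch; note that computing \(\lambda \) from a full singular value decomposition of \(\varvec{P}\), as done here purely for illustration, would defeat the purpose of the algorithm, and in practice the tail spectrum would be estimated.

import numpy as np

def suggested_lambda(P, k, eps):
    # lambda = (eps / k) * sum_{i > k} sigma_i(P P^T); note sigma_i(P P^T) = sigma_i(P)^2.
    sv = np.linalg.svd(P, compute_uv=False)
    return (eps / k) * np.sum(sv[k:] ** 2)

def expected_num_landmarks(k, eps, delta):
    # Order-of-magnitude estimate s = O((k / eps) * log(k / (delta * eps))).
    return int(np.ceil((k / eps) * np.log(k / (delta * eps))))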

4 Performance guarantee

In this section, we use the following notation. Let \(\varvec{S}^{\top }\varvec{P}\) denote the data matrix with the s samples selected by the weighted sampling matrix \(\varvec{S}\) from the database \(\varvec{P}\). We write \(\varvec{K}=\varvec{P}\varvec{P}^\top \in \mathbb {R}^{n\times n}\). Note that the Nyström approximation of \(\varvec{K}\) based on \(\varvec{S}\) is \(\tilde{\varvec{K}}= \varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\).
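For illustration only, the Nyström approximation \(\tilde{\varvec{K}}\) can be formed from the selected landmarks as in the sketch below; materializing the full \(n\times n\) matrix \(\varvec{K}\) is done here just to make the formula explicit and is not how the algorithm operates.

import numpy as np

def nystrom_approx(P, landmark_idx, landmark_w):
    # K_tilde = K S (S^T K S)^+ S^T K with K = P P^T and S the weighted
    # column-selection matrix encoded by (landmark_idx, landmark_w).
    K = P @ P.T
    KS = K[:, landmark_idx] * landmark_w[None, :]   # K S      (n x s)
    SKS = KS[landmark_idx] * landmark_w[:, None]    # S^T K S  (s x s)
    return KS @ np.linalg.pinv(SKS) @ KS.T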

Lemma 1

For any \(\delta \in (0,1/32)\), with probability \((1-3\delta )\), Algorithm 1 returns \(\varvec{S}\) with s columns that satisfies:

$$\begin{aligned} \frac{1}{2}(\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I})\preceq (\varvec{P}^{\top }\varvec{S}\varvec{S}^{\top }\varvec{P}+\lambda \mathbf {I}) \preceq \frac{3}{2}(\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I}). \end{aligned}$$

We remark that Lemma 1 is a direct corollary of Lemma 6 and matrix Bernstein inequality.

Lemma 2

For any \(\delta \in (0,1/32)\), let \(\varvec{S}\in \mathbb {R}^{n\times s}\) be returned by Algorithm 1 with \(s\le 384\cdot \mu \log (\mu /\delta )\), where \(\mu ={{\,\mathrm{tr}\,}}(\varvec{K}(\varvec{K}+\lambda \mathbf {I})^{-1})\) is the effective dimension of \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\) with parameter \(\lambda \). Denote Nyström approximation of \(\varvec{K}\) by \(\tilde{\varvec{K}}= \varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\). With probability \(1-3\delta \), the following holds:

$$\begin{aligned} \tilde{\varvec{K}}\preceq \varvec{K}\preceq \tilde{\varvec{K}}+\lambda \mathbf {I}. \end{aligned}$$

Proof

By Lemma 1, we get

$$\begin{aligned} \frac{1}{2} (\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I})\preceq (\varvec{P}^{\top }\varvec{S}\varvec{S}^{\top }\varvec{P}+\lambda \mathbf {I}) \preceq \frac{3}{2}(\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I}), \end{aligned}$$

for a weighted sampling matrix \(\varvec{S}\). If we remove the weight from \(\varvec{S}\) so that it has all unit entries, by Lemma 5 and Nyström approximation, \(\tilde{\varvec{K}}\) satisfies:

$$\begin{aligned} \tilde{\varvec{K}}\preceq \varvec{K}\preceq \tilde{\varvec{K}}+\lambda \mathbf {I}\end{aligned}$$

as claimed. \(\square \)

Now, we are ready to use Lemma 1 and Lemma 2 to give an efficient method to approximate the principal components of the data matrix \(\varvec{P}\).

Theorem 1

Let \(\varvec{S}\in \mathbb {R}^{n\times s}\) be returned by Algorithm 1 with \(\lambda =\frac{\epsilon }{k}\sum _{i=k+1}^{n}\sigma _i(\varvec{P}\varvec{P}^{\top })\) and \(\delta \in (0,1/8)\), and let \(\varvec{V}\in \mathbb {R}^{d\times k}\) contain the optimal top-k principal components of the data matrix \(\varvec{P}\). From \(\varvec{P}^{\top }\varvec{S}\), we can compute a matrix \(\varvec{X}\in \mathbb {R}^{s\times k}\) such that, if we set \(\varvec{Z}= \varvec{P}^{\top }\varvec{S}\varvec{X}\), then with probability \(1-\delta \):

$$\begin{aligned} \left\Vert \varvec{P}-\varvec{P}\varvec{Z}\varvec{Z}^{\top } \right\Vert _{F}^2\le (1+2\epsilon )\left\Vert \varvec{P}-\varvec{P}\varvec{V}\varvec{V}^{\top } \right\Vert _{F}^2, \end{aligned}$$

with \(s=O(\frac{k}{\epsilon }\log \frac{k}{\delta \epsilon })\).

The proof is presented in Appendix B. In the following, we show that the nearest neighbor can be retrieved from the learned data structure. To this end, we first show that the nearest neighbor of the query remains the same even when corrupted with noise.

Lemma 3

If the query \(\varvec{y}\) and its nearest neighbor \(\varvec{x}^*\) are corrupted with noise \(\varvec{t}\), that is, \(\varvec{q}=\varvec{y}+\varvec{t}\), \(\varvec{p}^*=\varvec{x}^*+\varvec{t}\), the nearest neighbor of \(\varvec{q}\) is \(\varvec{p}^*\).

Proof

Recall that the noise is bounded, that is, \(\left\Vert \varvec{t} \right\Vert _{2} \le \alpha \) and \(\left\Vert \varvec{t}_i \right\Vert _{2} \le \alpha \). Hence, for all \(i=1,\cdots ,n\), we have

$$\begin{aligned} \left\Vert \varvec{p}_i - \varvec{x}_i \right\Vert _{2} = \left\Vert \varvec{t}_i \right\Vert _{2} \le \alpha . \end{aligned}$$

By the triangle inequality,

$$\begin{aligned} \left\Vert \varvec{q}-\varvec{p}^* \right\Vert _{2}&\le \left\Vert \varvec{q}- \varvec{y} \right\Vert _{2} + \left\Vert \varvec{y}- \varvec{x}^* \right\Vert _{2} + \left\Vert \varvec{x}^* - \varvec{p}^* \right\Vert _{2}\\&\le \left\Vert \varvec{y}- \varvec{x}^* \right\Vert _{2} + 2\alpha . \end{aligned}$$

Then, for any other data point in the data set \(\varvec{P}\), that is \(\varvec{p}\in \varvec{P}\) and \(\varvec{p}\ne \varvec{p}^*\), we get

$$\begin{aligned} \left\Vert \varvec{q}- \varvec{p} \right\Vert _{2} \ge \left\Vert \varvec{y}- \varvec{x}^* \right\Vert _{2} + \epsilon - 2\alpha . \end{aligned}$$

Since \(\alpha \le \epsilon /16\), we have \(\epsilon - 2\alpha > 2\alpha \), and therefore \(\left\Vert \varvec{q}- \varvec{p} \right\Vert _{2} > \left\Vert \varvec{y}- \varvec{x}^* \right\Vert _{2} + 2\alpha \ge \left\Vert \varvec{q}- \varvec{p}^* \right\Vert _{2}\). Hence \(\varvec{p}^*\) remains the nearest neighbor of \(\varvec{q}\). \(\square \)

Theorem 2

Let \(\varvec{Z}\in \mathbb {R}^{d\times k}\) be the projection matrix learned by Algorithm 1, \(\tilde{\varvec{U}}\) be the corresponding subspace, then we have:

$$\begin{aligned} \sum _{\varvec{p}\in \varvec{P}} d(\varvec{p},\tilde{\varvec{U}})^2 \le (1+2\epsilon )\sum _{i=k+1}^{n}\sigma _i(\varvec{P}), \end{aligned}$$

where \(\sigma _i\) is the i-th singular value of \(\varvec{P}\).

Proof

Let \(\varvec{V}\in \mathbb {R}^{d\times k}\) contain the projection matrix obtained by singular value decomposition of \(\varvec{P}\) and \(\varvec{U}\) be corresponding k-dimensional subspace. The distance between a data point and subspace can be computed as:

$$\begin{aligned}&\sum _{\varvec{p}\in \varvec{P}} d(\varvec{p},\varvec{U})^2 = \sum _{\varvec{p}\in \varvec{P}} \inf _{\varvec{w}\in \varvec{U}}\left\Vert \varvec{p}-\varvec{w} \right\Vert _{2}^2 = \left\Vert \varvec{P}-\varvec{P}\varvec{V}\varvec{V}^{\top } \right\Vert _{F}^2.\\&\sum _{\varvec{p}\in \varvec{P}} d(\varvec{p},\tilde{\varvec{U}})^2 = \sum _{\varvec{p}\in \varvec{P}} \inf _{\varvec{w}\in {\tilde{\varvec{U}}}}\left\Vert \varvec{p}-\varvec{w} \right\Vert _{2}^2 = \left\Vert \varvec{P}-\varvec{P}\varvec{Z}\varvec{Z}^{\top } \right\Vert _{F}^2. \end{aligned}$$

Combining with Theorem 1, we show that

$$\begin{aligned} \left\Vert \varvec{P}- \varvec{P}\varvec{Z}\varvec{Z}^{\top } \right\Vert _{F}^2 \le (1+2\epsilon )\left\Vert \varvec{P}- \varvec{P}\varvec{V}\varvec{V}^{\top } \right\Vert _{F}^2 \le (1+2\epsilon )\sum _{i=k+1}^{n}\sigma _i(\varvec{P}), \end{aligned}$$

where \(\sigma _i\) is the i-th singular value of \(\varvec{P}\). When k is close to the rank of \(\varvec{P}\), \(\sum _{i=k+1}^{n}\sigma _i\) can be very small. \(\square \)

With Theorem 2, we can readily prove that the similarity among data points is preserved in the projected low-dimensional space, as stated in Lemma 4; the proof is deferred to Appendix C. We then obtain our main result, Theorem 3: the nearest neighbor is returned in the low-dimensional space.

Lemma 4

Suppose the nearest neighbor of \(\varvec{q}\) is \(\varvec{p}^*\) in d-dimensional feature space. In the k-dimensional subspace projected by \(\varvec{Z}\in \mathbb {R}^{d\times k}\) which is learned by Algorithm 1, the nearest neighbor of \(\varvec{q}\) is \(\varvec{p}^*\).

Theorem 3

Algorithm 2 returns data point \(\varvec{p}^*\) from database \(\varvec{P}\) as a \((1+\epsilon /4)\)-approximate nearest neighbor of query point \(\varvec{q}\).

Proof

Recall that each noisy data point \(\varvec{p}\in \varvec{P}\) is perturbed from a clean data point \(\varvec{x}\in \varvec{X}\) by noise \(\varvec{t}\) (\(\varvec{p}=\varvec{x}+\varvec{t}\)), and so is the query \(\varvec{q}\) (\(\varvec{q}=\varvec{y}+\varvec{t}\) with \(\varvec{y}\) the clean query). Let the nearest neighbor of \(\varvec{y}\) be \(\varvec{x}^*\in \varvec{X}\), which corresponds to \(\varvec{p}^*\) in \(\varvec{P}\). We show that \(\varvec{p}^*\) is the nearest neighbor returned for \(\varvec{q}\). Fix \(\varvec{x}\ne \varvec{x}^*\) and use the triangle inequality to write

$$\begin{aligned} \left\Vert \varvec{x}- \varvec{p}_{\tilde{\varvec{U}}} \right\Vert _{2}&\le \left\Vert \varvec{x}-\varvec{p} \right\Vert _{2} + \left\Vert \varvec{p}- \varvec{p}_{\tilde{\varvec{U}}} \right\Vert _{2} \\&\le \left\Vert \varvec{t} \right\Vert _{2} + (1+2\epsilon ) \left\Vert \varvec{p}-\varvec{p}_{\varvec{U}} \right\Vert _{2}\\&\le \alpha + (1+2\epsilon ) \sum _{i=k+1}^{n}\sigma _i \le 3 \alpha . \end{aligned}$$

The third inequality is derived from Theorem 2. Following the proof of Theorem 2, \(\sum _{i=k+1}^{n}\sigma _i\) can be as small as possible and \(\epsilon \in (0,1)\). Here we let \((1+2\epsilon ) \sum _{i=k+1}^{n}\sigma _i \le 2\alpha \) to get the last inequality. Similarly for \(\varvec{x}^*\), we have

$$\begin{aligned} \left\Vert \varvec{x}^* - \varvec{p}^*_{\tilde{\varvec{U}}} \right\Vert _{2} \le 3 \alpha . \end{aligned}$$

Using the triangle inequality, we get

$$\begin{aligned} \left\Vert \varvec{q}- \varvec{p}^*_{\tilde{\varvec{U}}} \right\Vert _{2}&\le \left\Vert \varvec{q}-\varvec{y} \right\Vert _{2}+\left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2}+\left\Vert \varvec{x}^*-\varvec{p}^*_{\tilde{\varvec{U}}} \right\Vert _{2}\\&\le \left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2}+4 \alpha , \end{aligned}$$

and

$$\begin{aligned} \left\Vert \varvec{q}- \varvec{p}_{\tilde{\varvec{U}}} \right\Vert _{2}&\ge \left\Vert \varvec{y}-\varvec{x} \right\Vert _{2} -\left\Vert \varvec{y}-\varvec{q} \right\Vert _{2}-\left\Vert \varvec{p}_{\tilde{\varvec{U}}}-\varvec{x} \right\Vert _{2}\\&\ge \left\Vert \varvec{y}-\varvec{x} \right\Vert _{2}-4 \alpha . \end{aligned}$$

Recalling that \(\alpha = \epsilon /16\), so that \(4\alpha = \epsilon /4\), we can bound \(\left\Vert \varvec{q}- \varvec{p}^*_{\tilde{\varvec{U}}} \right\Vert _{2}\) and \(\left\Vert \varvec{q}- \varvec{p}_{\tilde{\varvec{U}}} \right\Vert _{2}\), which implies

$$\begin{aligned} \frac{ \left\Vert \varvec{q}- \varvec{p}_{\tilde{\varvec{U}}} \right\Vert _{2} }{ \left\Vert \varvec{q}- \varvec{p}^*_{\tilde{\varvec{U}}} \right\Vert _{2} }&= \frac{ \left\Vert \varvec{y}-\varvec{x} \right\Vert _{2} \pm 4\cdot \alpha }{ \left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2} \pm 4\cdot \alpha } = \frac{ \left\Vert \varvec{y}-\varvec{x} \right\Vert _{2} \pm \tfrac{1}{4} \epsilon }{ \left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2} \pm \tfrac{1}{4} \epsilon } \\&\ge \frac{\left\Vert \varvec{y}-\varvec{x} \right\Vert _{2} - \tfrac{1}{4} \epsilon }{\left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2} + \tfrac{1}{4} \epsilon } \ge \frac{ \left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2} + \tfrac{3}{4} \epsilon }{ \left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2} + \tfrac{1}{4} \epsilon } \\&> 1+\tfrac{1}{4} \epsilon . \end{aligned}$$

The last inequality above uses \(\left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2} \le 1\) and \(\epsilon \in (0,1)\). By the Pythagorean theorem (recall that both \(\varvec{p}_{\tilde{\varvec{U}}},\varvec{p}^*_{\tilde{\varvec{U}}}\in \tilde{\varvec{U}}\)),

$$\begin{aligned} \frac{ \left\Vert \varvec{q}_{\tilde{\varvec{U}}} - \varvec{p}_{\tilde{\varvec{U}}} \right\Vert _{2}^2 }{ \left\Vert \varvec{q}_{\tilde{\varvec{U}}} - \varvec{p}^*_{\tilde{\varvec{U}}} \right\Vert _{2}^2 } = \frac{ \left\Vert \varvec{q}- \varvec{p}_{\tilde{\varvec{U}}} \right\Vert _{2}^2 - \left\Vert \varvec{q}- \varvec{q}_{\tilde{\varvec{U}}} \right\Vert _{2}^2 }{ \left\Vert \varvec{q}- \varvec{p}^*_{\tilde{\varvec{U}}} \right\Vert _{2}^2 - \left\Vert \varvec{q}- \varvec{q}_{\tilde{\varvec{U}}} \right\Vert _{2}^2 } > (1+\tfrac{1}{4} \epsilon )^2. \end{aligned}$$

Hence, \(\varvec{p}^*\) is reported as the nearest neighbor of \(\varvec{q}\) in the low-dimensional subspace. \(\square \)

5 Experiments

In this section, we perform experiments on benchmark data sets to demonstrate the effectiveness of our algorithm. First, we describe our experimental settings.

5.1 Experimental setting

5.1.1 Baseline algorithms

We illustrate the effectiveness of our algorithm by comparing it with the celebrated data-independent locality-sensitive hashing (LSH) algorithm (Andoni and Indyk, 2008) and with state-of-the-art data-dependent algorithms, including anchor graph hashing (AGH) (Liu et al., 2011), circulant binary embedding (CBE) (Yu et al., 2014), iterative quantization (ITQ) (Gong and Lazebnik, 2011), isotropic hashing (ISO) (Kong and Li, 2012), multidimensional spectral hashing (MDSH) (Weiss et al., 2012), supervised discrete hashing (SDH) (Shen et al., 2015), scalable graph hashing (SGH) (Jiang and Li, 2015), spectral hashing (SH) (Weiss et al., 2009), sparse projection (SP) (Xia et al., 2015), and bilinear projection based binary codes (BPBC) (Gong et al., 2013). The parameters are set as suggested in the original works. We refer to our algorithm as Inexact Subspace Analysis for approximate Nearest Neighbor Search (ISANNS).

5.1.2 Data sets

We consider data sets from both computer vision and natural language processing. In particular, for the computer vision application, we apply all compared algorithms to the handwritten digit recognition data set MNIST, which consists of 70,000 digit images. We randomly sample 69,000 images for training and use the remaining 1,000 images for testing, where each image is represented as a 784-dimensional vector (i.e., the raw pixels).

Table 2 Statistics of the experimental data sets. #Train and #Test are the sizes of the training and test sets, respectively

For the natural language processing task, we use four data sets from the GLUE (General Language Understanding Evaluation) benchmark (Wang et al., 2019) and Glove (Pennington et al., 2014), a word-representation data set built from Wikipedia entries. The GLUE data sets are the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), the Microsoft Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), and the Stanford Question Answering Natural Language Inference Corpus (QNLI) (Rajpurkar et al., 2016). More specifically, SST-2 consists of movie reviews whose sentiment is either positive or negative. CoLA consists of English sentences from books and journal articles, each labeled as grammatically acceptable or not. MRPC is formed by sentence pairs from online news sources, where the label indicates whether the two sentences are semantically equivalent. QNLI contains pairs of a question and a context sentence, where the label indicates whether the context sentence contains the answer to the question. We compute the representations of sentences and paragraphs with sentence transformers (Reimers and Gurevych, 2019) based on the pretrained STS (Semantic Textual Similarity) model “stsb-roberta-base”. Each example is represented by a 768-dimensional dense vector. The statistics of the data sets are shown in Table 2.

5.1.3 Evaluation metrics

The performance of the methods is evaluated with two common protocols: Hamming distance ranking and hash table lookup. We retrieve the items within Hamming distance 2 and report the corresponding precision, recall, and mean average precision (MAP). We also report the mean average precision of the top 500 retrieved items as well as the time cost.

To compute precision and recall, let k denote the number of retrieved items within Hamming radius 2 and let n denote the total number of relevant items in the database; then

$$\begin{aligned} Precision = \frac{\#\text {relevant seen}}{k}, ~~ Recall = \frac{\#\text {relevant seen}}{n}. \end{aligned}$$
(10)
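For completeness, Eq. (10) corresponds to the following computation over a retrieved set and a relevant set (a minimal sketch with our own function name):

def precision_recall(retrieved, relevant):
    # Eq. (10): precision = #relevant seen / #retrieved, recall = #relevant seen / #relevant.
    retrieved, relevant = set(retrieved), set(relevant)
    seen = len(retrieved & relevant)
    precision = seen / len(retrieved) if retrieved else 0.0
    recall = seen / len(relevant) if relevant else 0.0
    return precision, recall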

We show the performance with various lengths of hash code.

Fig. 1 Performance of precision and recall with the increase of hash code length on the MNIST data set

Table 3 Results in terms of MAP within Hamming radius 2 (the column “MAP”), MAP of the top 500 samples (the column “MAP@500”), and training time (s) on the MNIST data set with hash code lengths of 10 and 16 bits

5.2 Empirical results

Figure 1 shows the precision and recall on the MNIST data set as the hash code length varies. In terms of precision (left panel), our algorithm always outperforms the baselines, especially when the data are encoded with more bits. Perhaps more surprisingly, increasing the code length degrades the performance of the baseline algorithms while improving ours. This demonstrates the effectiveness of our algorithm in the low-dimensional subspace.

Our algorithm outperforms the baseline algorithms in terms of both precision and recall in almost all cases. Table 3 lists the hash table lookup results for 10-bit and 16-bit hash codes on the MNIST data set. We observe that our algorithm dramatically outperforms the compared algorithms. Specifically, with 16-bit hash codes on the MNIST data set, the MAP of our algorithm reaches 0.8843 while all others are below 0.5 within Hamming radius 2. In terms of MAP over the top 500 retrieved data points, our algorithm again shows a significant advantage over the baseline approaches. Our algorithm also enjoys the best time efficiency. As the hash code length increases, learning the hash codes requires more time, but the additional information also improves model performance. The experimental results in Fig. 1 and Table 3 show the advantage of our algorithm in all cases.

Fig. 2 Performance of precision and recall with the increase of hash code length on the GLUE benchmark

Table 4 Results in terms of MAP within Hamming radius 2 on the GLUE benchmark with hash code lengths of 10 and 16 bits
Table 5 Results in terms of MAP of the top 500 retrieved samples on the GLUE benchmark with hash code lengths of 10 and 16 bits
Table 6 Training time cost (s) on the GLUE benchmark with hash code lengths of 10 and 16 bits
Table 7 Recall and training time cost (s) on Glove with a hash code length of 8 bits

Figure 2 shows the precision and recall of the compared algorithms on the GLUE benchmark as the hash code length increases. Tables 4 and 5 show the MAP within Hamming radius 2 and the MAP of the top 500 retrieved samples for 10-bit and 16-bit hash codes. Our algorithm achieves the best performance on all listed GLUE benchmark data sets in almost all cases.

Table 3 also lists the time cost of learning the hash projection matrix for the different methods on the MNIST data set, referred to as “training time”. We report the training time on the GLUE benchmark in Table 6 for hash code lengths of 10 and 16 bits. Our algorithm is efficient because the low-rank projection matrix is computed from the sampled matrix instead of the global data matrix. In terms of query time, the nearest neighbors in the experiments are computed based on the Hamming distance with radius 2, and the dominant cost is the computation of the Hamming distance matrix between the training and test data points. Hence, the query time of the various methods is the same for a given data set, such as 0.71 s for MNIST, 0.73 s for SST-2, 0.11 s for CoLA, 0.13 s for MRPC, and 10 s for QNLI.

Table 7 presents the recall and training time of the compared algorithms on the Glove data set. AGH, SDH, and SH ran into memory issues on the Glove data set, hence their results are not included. The experimental results show the advantage of our algorithm in terms of both recall and training cost. Although SP achieves comparable recall, our algorithm enjoys higher training efficiency.

In a nutshell, the experimental results on the computer vision and natural language understanding tasks demonstrate the practical value of our algorithm.

6 Conclusion

For the approximate nearest neighbor search problem, high-dimensional and large-scale data raise various challenges. In this paper, we have proposed a spectral analysis method for nearest neighbor search that is based on inexact subspace estimation. Given the data set \(\varvec{P}\in \mathbb {R}^{n\times d}\) and a query \(\varvec{q}\), we reduce the feature dimension of the data from d to k with \(k< \log n\). Comparing the time complexity of our method with that of spectral analysis based on principal component analysis, the computational cost of ours is proportional to \(ns^2\) while that of PCA scales with \(nd^2\). We have further provided a theoretical analysis showing that the \((1+\epsilon /4)\)-approximate neighbors retrieved in the low-dimensional space are data points close to the query in the original space. The experimental results have shown the significant improvement of our algorithm over state-of-the-art approaches.