
Sharing hash codes for multiple purposes

  • Wiktor Pronobis
  • Danny Panknin
  • Johannes Kirschnick
  • Vignesh Srinivasan
  • Wojciech Samek
  • Volker Markl
  • Manohar Kaul
  • Klaus-Robert Müller
  • Shinichi Nakajima
Perspectives on data science for advanced statistics

Abstract

Locality sensitive hashing (LSH) is a powerful tool in data science, which enables sublinear-time approximate nearest neighbor search. A variety of hashing schemes have been proposed for different dissimilarity measures. However, hash codes significantly depend on the dissimilarity, which prohibits users from adjusting the dissimilarity at query time. In this paper, we propose multiple purpose LSH (mp-LSH) which shares the hash codes for different dissimilarities. mp-LSH supports L2, cosine, and inner product dissimilarities, and their corresponding weighted sums, where the weights can be adjusted at query time. It also allows us to modify the importance of pre-defined groups of features. Thus, mp-LSH enables us, for example, to retrieve similar items to a query with the user preference taken into account, to find a similar material to a query with some properties (stability, utility, etc.) optimized, and to turn on or off a part of multi-modal information (brightness, color, audio, text, etc.) in image/video retrieval. We theoretically and empirically analyze the performance of three variants of mp-LSH, and demonstrate their usefulness on real-world data sets.

Keywords

Locality sensitive hashing · Approximate near neighbor search · Information retrieval · Collaborative filtering

1 Introduction

Statistics and probability theory have been playing the central role in machine learning, artificial intelligence, and related application fields, e.g., text analytics, computer vision, information retrieval, computational biology, and data mining (Hastie et al. 2001; Bishop 2006). When the data size and the complexity of the statistical model were moderate, typical machine learning problems such as clustering, regression, and classification were solved by (explicitly or implicitly) estimating the probability distribution.

In recent years, in which those research fields have come to be generically called data science, large amounts of data are used to train statistical models of very high complexity. This development arose from the rapid progress of semiconductor devices (CPUs/GPUs, memory, communication devices, etc.), and the breakthrough of deep neural networks, whose complex architectures have been shown to learn highly non-linear fine structure from massive data, further accelerated the demand for large models that can be trained on big data (Hinton 2007; Bengio 2009; Montavon et al. 2012; Krizhevsky et al. 2012; Bengio et al. 2015; Schütt et al. 2017).

The rapid increase of data size has also necessitated new technologies for basic tools in data analysis. Nearest neighbor search (NNS), which is intensively used in data science, is one of them. In retrieval systems and recommender systems, NNS is used to find items which are closest to (or best match) a given query. NN classifiers have been shown to perform comparably to state-of-the-art multi-class classifiers (Torralba et al. 2008), which implies that NNS can well approximate (or reflect) the probability distribution when the number of training samples is sufficiently large. NNS has also been shown to be useful in extreme classification, where the number of classes is extremely large (Tagami 2017).

Since NNS is required to perform on millions to billions of samples within a few seconds in some real-time applications, a naive implementation with linear complexity can be too slow. Thus sublinear methods have become important analysis tools. Locality sensitive hashing (LSH), one of the key technologies for big data analysis, enables approximate nearest neighbor search (ANNS) in sublinear time (Indyk and Motwani 1998; Wang et al. 2014). With LSH functions for a required dissimilarity measure in hand, each data sample is assigned to a hash bucket in the pre-processing stage. At runtime, ANNS can be performed by restricting the search to the samples that lie within the hash bucket, to which the query point is assigned, along with the samples lying in the neighboring buckets. Probability theory provided theoretical guarantees of ANNS performance with LSH (Indyk and Motwani 1998). A variety of LSH schemes have been proposed for different dissimilarity measures, including Jaccard distance (Broder 1997), \(L_p\) distance (Datar et al. 2004), cosine distance (Charikar 2002), Chi-squared distance (Gorisse et al. 2012), distance to a hyperplane (Jain et al. 2010), and inner product dissimilarity (maximum inner product search) (Shrivastava and Li 2014, 2015; Bachrach et al. 2014; Neyshabur and Srebro 2015).
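The bucket-based workflow just described can be sketched in a few lines of Python. This is a hedged illustration only: the hash function `g`, the sample pool, and the distance function below are generic placeholders rather than any particular scheme from this paper, and a practical system would additionally probe neighboring buckets and use several independent hash tables.

```python
import numpy as np
from collections import defaultdict

def build_hash_table(X, g):
    """Pre-processing: put each sample index into the bucket given by its hash code g(x)."""
    table = defaultdict(list)
    for i, x in enumerate(X):
        table[g(x)].append(i)
    return table

def ann_query(q, X, g, table, dist):
    """Runtime: compute exact distances only for the samples sharing the query's bucket."""
    candidates = table.get(g(q), [])
    if not candidates:
        return None  # a real system would also probe neighboring buckets / several tables
    return min(candidates, key=lambda i: dist(q, X[i]))

# toy usage with a random sign-hash as the bucket function (illustrative only)
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 32))                   # 8 concatenated hash bits, 32-dim data
g = lambda v: tuple((A @ v >= 0).astype(np.int8))  # hashable bucket key
X = rng.standard_normal((1000, 32))
table = build_hash_table(X, g)
nn_index = ann_query(rng.standard_normal(32), X, g, table,
                     dist=lambda a, b: np.linalg.norm(a - b))
```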

A drawback of the existing LSH schemes is that each LSH scheme is specialized for each dissimilarity measure. This can limit the flexibility of the use of LSH. For some data collections, the objective can be clearly expressed from the start, for example, text/image/video/speech analysis. In such cases, the dissimilarity measure can be fixed when LSH codes are given to each sample. However, in other cases such as drug discovery, the material genome project, or climate analysis, the ultimate query structure to such data may still not be fully fixed. In other words, measurements, simulations, or observations may be recorded without being able to spell out the full specific purpose (although the general goal, e.g., producing better drugs, finding more potent materials, or detecting anomaly, is clear). Motivated by the latter case, we consider how one can use LSH schemes without defining any specific dissimilarity at the data acquisition and pre-processing phase.

A challenge in developing LSH without defining specific purpose is that the existing LSH schemes, designed for different dissimilarity measures, provide significantly different hash codes. Therefore, a naive realization requires us to prepare the same number of hash tables as the number of possible target dissimilarities, which is not realistic if we need to adjust the importance of multiple criteria. In this paper, we propose three variants of multiple purpose LSH (mp-LSH), which support L2, cosine, and inner product (IP) dissimilarities, and their weighted sums, where the weights can be adjusted at query time.

The first proposed method, called mp-LSH with vector augmentation (mp-LSH-VA), maps the data space into an augmented vector space, so that the squared-L2-distance in the augmented space matches the required dissimilarity measure up to a constant. This scheme can be seen as an extension of recent developments of LSH for maximum IP search (MIPS) (Shrivastava and Li 2014, 2015; Bachrach et al. 2014; Neyshabur and Srebro 2015). The significant difference from the previous methods is that our method is designed to modify the dissimilarity by changing the augmented query vector. We show that mp-LSH-VA is locality sensitive for L2 and IP dissimilarities and their weighted sums. However, its performance for the L2 dissimilarity is significantly inferior to the standard L2-LSH (Datar et al. 2004). In addition, mp-LSH-VA does not support the cosine distance.

Our second proposed method, called mp-LSH with code concatenation (mp-LSH-CC), concatenates the hash codes for L2, cosine, and IP dissimilarities, and constructs a special structure, called the cover tree (Beygelzimer et al. 2006), which enables efficient NNS with the weights for the dissimilarity measures controlled by adjusting the metric in the code space. Although mp-LSH-CC is conceptually simple and its performance is guaranteed by the original LSH scheme for each dissimilarity, it is not memory efficient, which also results in increased query time.

Considering the drawbacks of the aforementioned two variants led us to our final and recommended proposal, called mp-LSH with code augmentation and transformation (mp-LSH-CAT). It supports L2, cosine, and IP dissimilarities by augmenting the hash codes, instead of the original vector. mp-LSH-CAT is memory efficient, since it shares most information over the hash codes for different dissimilarities, so that the augmentation is minimized.

We theoretically and empirically analyze the performance of mp-LSH methods, and demonstrate their usefulness on real-world data sets. Our mp-LSH methods also allow us to modify the importance of pre-defined groups of features. Adjustability of the dissimilarity measure at query time is not only useful in the absence of future analysis plans, but also applicable to multi-criteria searches. The following lists some sample applications of multi-criteria queries in diverse areas:
  1. In recommender systems, suggesting items which are similar to a user-provided query and also match the user's preference.
  2. In material science, finding materials which are similar to a query material and also possess desired properties such as stability, conductivity, and medical utility.
  3. In video retrieval, adjusting the importance of multi-modal information such as brightness, color, audio, and text at query time.
Related Work: After the theoretical relation between the performance of approximate nearest neighbor search and the locality sensitivity of hash functions was established (Indyk and Motwani 1998), many LSH schemes were proposed for different dissimilarity measures, including Jaccard distance (Broder 1997), \(L_p\) distance (Datar et al. 2004), cosine distance (Charikar 2002), Chi-squared distance (Gorisse et al. 2012), distance to a hyperplane (Jain et al. 2010), and inner product dissimilarity (maximum inner product search) (Shrivastava and Li 2014, 2015; Bachrach et al. 2014; Neyshabur and Srebro 2015). They are categorized as data-independent hashing methods, where each sample is given a hash code independently of the other samples (Wang et al. 2014).

On the other hand, data-dependent hashing methods, where the code is optimized for the sample distribution, have recently been intensively developed. Some of those methods learn the sample distribution using unsupervised machine learning tools, e.g., PCA (Matsushita and Wada 2009) and ICA (He et al. 2011), while others additionally use label information through supervised methods, e.g., LDA (Strecha et al. 2012), kernel methods (Liu et al. 2012), and neural networks (Lin et al. 2015). In general, data-dependent methods improve the accuracy of their data-independent counterparts by learning the sample distribution, but they are less flexible, because the hashing procedure can be fixed only after most of the samples have been collected; i.e., they are not suitable for the streaming setting, where each sample should be given a hash code right after it is acquired, without waiting for the whole data collection process to be completed. In this paper, we propose data-independent LSH methods, and therefore, the data-dependent methods are out of scope.

Some hashing methods cope with multi-modal data (Song et al. 2013; Moran and Lavrenko 2015; Xu et al. 2013), most of which, however, are data dependent and do not offer adjustability of the importance weights at query time. To the best of our knowledge, no existing hashing methods can cope with different dissimilarity measures with the weights adjustable at query time.

2 Background

In this section, we briefly overview previous locality sensitive hashing (LSH) techniques.

Assume that we have a sample pool \({\mathcal {X}} \subset {\mathbb {R}}^{L}\) consisting of N samples in L-dimensional space. Given a query \(\varvec{q}\in {\mathbb {R}}^{L}\), nearest neighbor search (NNS) solves the following problem:
$$\begin{aligned} \varvec{x}^* = \mathop {\mathrm {argmin}}_{\varvec{x}\in {\mathcal {X}}} {\mathcal {L}}(\varvec{q}, \varvec{x}), \end{aligned}$$
(1)
where \({\mathcal {L}}(\cdot , \cdot )\) is a dissimilarity measure. A naive approach computes the dissimilarity from the query to all samples, and then chooses the most similar samples, which takes O(N) time. On the other hand, approximate NNS can be performed in sublinear time. We define the following three terms:

Definition 1

(\(S_0\)-near neighbor) For \(S_0 > 0\), \(\varvec{x}\) is called \(S_0\)-near neighbor of \(\varvec{q}\) if \({\mathcal {L}}(\varvec{q}, \varvec{x}) \le S_0\).

Definition 2

(c-approximate nearest neighbor search) Given \(S_0 >0\), \(\delta > 0\), and \(c > 1\), c-approximate nearest neighbor search (c-ANNS) reports some \(cS_0\)-near neighbor of \(\varvec{q}\) with probability \(1 - \delta\) if there exists an \(S_0\)-near neighbor of \(\varvec{q}\) in \({\mathcal {X}}\).

Definition 3

(Locality sensitive hashing) A family \(\mathcal {H}= \{h: {\mathbb {R}}^{L} \rightarrow \mathcal {K}\}\) of functions is called \((S_0, c S_0, p_1, p_2)\)-sensitive for a dissimilarity measure \({\mathcal {L}}: {\mathbb {R}}^{L} \times {\mathbb {R}}^{L} \rightarrow {\mathbb {R}}\) if the following two conditions hold for any \(\varvec{q}, \varvec{x}\in {\mathbb {R}}^{L}\):
$$\begin{aligned}&\bullet \text{ if } {\mathcal {L}}(\varvec{q}, \varvec{x}) \le S_0 \text{ then } \mathbb {P} \left( h(\varvec{q}) = h(\varvec{x}) \right) \ge p_1, \\&\bullet \text{ if } {\mathcal {L}}(\varvec{q}, \varvec{x}) \ge c S_0 \text{ then } \mathbb {P} \left( h(\varvec{q}) = h(\varvec{x}) \right) \le p_2, \end{aligned}$$
where \(\mathbb {P}(\cdot )\) denotes the probability of the event (with respect to the random draw of hash functions).

Note that \(p_1 > p_2\) is required for LSH to be useful. The image \(\mathcal {K}\) of hash functions is typically binary or integer. The following proposition guarantees that locality sensitive hashing (LSH) functions enable c-ANNS in sublinear time.

Proposition 1

Indyk and Motwani (1998) Given a family of \((S_0, cS_0, p_1, p_2)\)-sensitive hash functions, there exists an algorithm for c-ANNS with \(O(N^{\rho } \log N)\) query time and \(O(N^{1 + \rho })\) space, where \(\rho = \frac{\log p_1}{\log p_2} < 1\).

Below, we introduce three LSH families. Let \(\mathcal {N}_L(\varvec{\mu }, \varvec{\varSigma })\) be the L-dimensional Gaussian distribution, \(\mathcal {U}_L(\alpha , \beta )\) be the L-dimensional uniform distribution with support \([\alpha , \beta ]\) in every dimension, and \(\varvec{I}_L\) be the L-dimensional identity matrix. The sign function, \(\mathrm {sign}(\varvec{z}): {\mathbb {R}}^H \mapsto \{-1, 1\}^H\), applies elementwise, giving 1 for \(z_h \ge 0\) and \(-1\) for \(z_h < 0\). Likewise, the floor operator \(\lfloor \cdot \rfloor\) applies elementwise to a vector. We denote by \(\sphericalangle (\cdot , \cdot )\) the angle between two vectors, and by a semicolon the row-wise concatenation of vectors, as in MATLAB.

Proposition 2

(L2-LSH) Datar et al. (2004) For the L2 distance \({\mathcal {L}}_{\mathrm {L2}}(\varvec{q}, \varvec{x}) = \Vert \varvec{q}- \varvec{x}\Vert _{{\tiny 2}}\), the hash function
$$\begin{aligned} h_{\varvec{a}, b}^{\mathrm {L2}} (\varvec{x})&= \left\lfloor R^{-1}(\varvec{a}^{\top }\varvec{x}+ b) \right\rfloor , \end{aligned}$$
(2)
where \(R > 0\) is a fixed real number, \(\varvec{a}\sim \mathcal {N}_L(\varvec{0}, \varvec{I}_L)\), and \(b \sim \mathcal {U}_1(0, R)\), satisfies \(\mathbb {P} (h_{\varvec{a}, b}^{\mathrm {L2}} (\varvec{q}) = h_{\varvec{a}, b}^{\mathrm {L2}} (\varvec{x}) ) = F_R^{\mathrm {L2}}({\mathcal {L}}_{\mathrm {L2}}(\varvec{q}, \varvec{x}))\), where
$$\begin{aligned} F_R^{\mathrm {L2}}(d)&= 1 - 2 \varPhi (-R / d) - \frac{2}{\sqrt{2 \pi } (R/d)} \left( 1 - e^{-(R/d)^2/2} \right) . \end{aligned}$$
Here, \(\varPhi (z) = \int _{-\infty }^{z} \frac{1}{\sqrt{2 \pi }} e^{-\frac{y^2}{2}} dy\) is the standard cumulative Gaussian.
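As an illustration, a minimal numpy sketch of the L2-LSH hash in Eq. (2) might look as follows; the dimensionality L and the bucket width R are arbitrary example values.

```python
import numpy as np

def l2_lsh(x, a, b, R):
    """L2-LSH code from Eq. (2): floor((a^T x + b) / R), with a ~ N_L(0, I_L), b ~ U(0, R)."""
    return int(np.floor((a @ x + b) / R))

# one toy draw of the hash function (L and R are illustration values only)
rng = np.random.default_rng(0)
L, R = 64, 0.5
a = rng.standard_normal(L)
b = rng.uniform(0.0, R)
x = rng.standard_normal(L)
code = l2_lsh(x, a, b, R)
```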

Proposition 3

(sign-LSH) Goemans and Williamson (1995) and Charikar (2002) For the cosine distance \({\mathcal {L}}_{\mathrm {cos}}(\varvec{q}, \varvec{x}) = 1 - \cos \sphericalangle (\varvec{q}, \varvec{x})= 1 - \frac{ \varvec{q}^{\top }\varvec{x}}{\Vert \varvec{q}\Vert _2 \Vert \varvec{x}\Vert _2}\), the hash function:
$$\begin{aligned} h^{\mathrm {sign}}_{\varvec{a}}(\varvec{x})&= \mathrm {sign}(\varvec{a}^{\top }\varvec{x}), \end{aligned}$$
(3)
where \(\varvec{a}\sim \mathcal {N}_L(\varvec{0}, \varvec{I}_L)\) satisfies \(\mathbb {P} \left( h_{\varvec{a}}^{\mathrm {sign}} (\varvec{q}) = h_{\varvec{a}}^{\mathrm {sign}} (\varvec{x}) \right) = F^{\mathrm {sign}}({\mathcal {L}}_{\mathrm {cos}}(\varvec{q}, \varvec{x}))\), where
$$\begin{aligned} F^{\mathrm {sign}}(d)&= 1 - \frac{1}{\pi } \cos ^{-1} (1-d). \end{aligned}$$
(4)
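The collision probability in Eq. (4) is easy to verify numerically. The following hedged sketch estimates the sign-LSH collision rate by Monte Carlo sampling of \(\varvec{a}\) and compares it with \(F^{\mathrm {sign}}\); all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q, x = rng.standard_normal(64), rng.standard_normal(64)

# empirical collision rate of sign-LSH over many independent draws a ~ N_L(0, I_L)
A = rng.standard_normal((100_000, 64))
empirical = np.mean(np.sign(A @ q) == np.sign(A @ x))

# theoretical collision probability F_sign(L_cos(q, x)) = 1 - angle(q, x) / pi from Eq. (4)
cos_qx = (q @ x) / (np.linalg.norm(q) * np.linalg.norm(x))
theoretical = 1.0 - np.arccos(cos_qx) / np.pi
# `empirical` matches `theoretical` up to Monte Carlo error
```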

Proposition 4

(simple-LSH) Neyshabur and Srebro (2015) Assume that the samples and the query are rescaled, so that \(\max _{\varvec{x}\in {\mathcal {X}}} \Vert \varvec{x}\Vert _2 \le 1\), \(\Vert \varvec{q}\Vert _2 \le 1\). For the inner product dissimilarity \({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}, \varvec{x}) = 1 - \varvec{q}^{\top }\varvec{x}\) (with which the NNS problem (1) is called maximum IP search (MIPS)), the asymmetric hash functions
$$\begin{aligned} h^{\mathrm {smp-q}}_{\varvec{a}}(\varvec{q})&= h^{\mathrm {sign}}_{\varvec{a}}(\widetilde{\varvec{q}}) = \mathrm {sign}(\varvec{a}^{\top }\widetilde{\varvec{q}}) \qquad \text{ where }\qquad \widetilde{\varvec{q}} = (\varvec{q}; 0), \end{aligned}$$
(5)
$$\begin{aligned} h^{\mathrm {smp-x}}_{\varvec{a}}(\varvec{x})&= h^{\mathrm {sign}}_{\varvec{a}}(\widetilde{\varvec{x}}) = \mathrm {sign}(\varvec{a}^{\top }\widetilde{\varvec{x}}) \qquad \text{ where }\qquad \widetilde{\varvec{x}} = (\varvec{x}; \sqrt{1 - \Vert \varvec{x}\Vert _2^2} ), \end{aligned}$$
(6)
satisfy \(\mathbb {P} \left( h_{\varvec{a}}^{\mathrm {smp-q}} (\varvec{q}) = h_{\varvec{a}}^{\mathrm {smp-x}} (\varvec{x}) \right) = F^{\mathrm {sign}}({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}, \varvec{x}))\).
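The asymmetry of simple-LSH, i.e., the different augmentations of the query in Eq. (5) and of the sample in Eq. (6), can be sketched as follows. This is a minimal illustration assuming pre-scaled inputs with \(\Vert \varvec{q}\Vert _2 \le 1\) and \(\Vert \varvec{x}\Vert _2 \le 1\); names and sizes are illustrative.

```python
import numpy as np

def simple_lsh_pair(q, x, A):
    """Asymmetric simple-LSH codes from Eqs. (5)-(6): the query is padded with 0, the
    sample with sqrt(1 - ||x||^2), and both go through the same sign-LSH projection."""
    q_tilde = np.append(q, 0.0)
    x_tilde = np.append(x, np.sqrt(max(0.0, 1.0 - x @ x)))
    return np.sign(A @ q_tilde), np.sign(A @ x_tilde)

rng = np.random.default_rng(0)
L, T = 64, 128
A = rng.standard_normal((T, L + 1))                      # one row per hash bit
q = rng.standard_normal(L); q /= np.linalg.norm(q)       # ||q||_2 <= 1
x = rng.standard_normal(L); x /= 2 * np.linalg.norm(x)   # ||x||_2 <= 1
h_q, h_x = simple_lsh_pair(q, x, A)
```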

These three LSH methods above are standard and state-of-the-art (among the data-independent LSH schemes) for each dissimilarity measure. Although all methods involve the same random projection \(\varvec{a}^{\top }\varvec{x}\), the resulting hash codes are significantly different from each other.

3 Proposed methods and theory

In this section, we first define the problem setting. Then, we propose three LSH methods for multiple dissimilarity measures, and conduct a theoretical analysis.

3.1 Problem setting

Similarly to the simple-LSH (Proposition 4), we rescale the samples, so that \(\max _{\varvec{x}\in {\mathcal {X}}} \Vert \varvec{x}\Vert _2 \le 1\). We also assume \(\Vert \varvec{q}\Vert _2 \le 1\).1 Let us assume multi-modal data, where we can separate the feature vectors into G groups, i.e., \(\varvec{q}= (\varvec{q}_1; \ldots ; \varvec{q}_G)\), \(\varvec{x}= (\varvec{x}_1; \ldots ; \varvec{x}_G)\). For example, each group corresponds to monochrome, color, audio, and text features in video retrieval. We also accept multiple queries \(\{\varvec{q}^{(w)}\}_{w=1}^W\) for a single retrieval task. Our goal is to perform ANNS for the following dissimilarity measure, which we call multiple purpose (MP) dissimilarity:
$$\begin{aligned}&{\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) = \sum\limits_{w=1}^W \sum _{g=1}^G \bigg \{ \gamma _g^{(w)} \Vert \varvec{q}_g^{(w)} - \varvec{x}_g\Vert _2^2 \nonumber \\&\qquad \qquad \qquad \qquad + 2 \eta _g^{(w)} \left( 1 - \frac{\varvec{q}_g^{(w) {\top }} \varvec{x}_g}{\Vert \varvec{q}_g^{(w)} \Vert _2 \Vert \varvec{x}_g\Vert _2}\right) + 2 \lambda _g^{(w)}\left( 1 - \varvec{q}_g^{(w) {\top }} \varvec{x}_g\right) \bigg \}, \end{aligned}$$
(7)
where \(\varvec{\gamma }^{(w)}, \varvec{\eta }^{(w)}, \varvec{\lambda }^{(w)} \in {\mathbb {R}}_+^G\) are the feature weights such that \(\sum _{w=1}^W \sum _{g=1}^G (\gamma _g^{(w)}+\eta _g^{(w)}+\lambda _g^{(w)}) = 1\). In the single query case, where \(W=1\), setting \(\varvec{\gamma }= (1/2, 0, 1/2, 0, \ldots , 0), \varvec{\eta }= \varvec{\lambda }= (0, \ldots , 0)\) corresponds to L2-NNS based on the first and the third feature groups, while setting \(\varvec{\gamma }= \varvec{\eta }= (0, \ldots , 0), \varvec{\lambda }= (1/2, 0, 1/2, 0, \ldots , 0)\) corresponds to MIPS on the same feature groups. When we would like to down-weight the importance of the signal amplitude (e.g., brightness of an image) of the gth feature group, we should increase the weight \(\eta _g^{(w)}\) for the cosine distance, and decrease the weight \(\gamma _g^{(w)}\) for the squared-L2-distance. Multiple queries are useful when we mix NNS and MIPS, for which the queries lie in different spaces with the same dimensionality. For example, by setting \(\varvec{\gamma }^{(1)}= \varvec{\lambda }^{(2)} = (1/4, 0, 1/4, 0, \ldots , 0), \varvec{\gamma }^{(2)} = \varvec{\eta }^{(1)}= \varvec{\eta }^{(2)} =\varvec{\lambda }^{(1)}= (0, \ldots , 0)\), we can retrieve items which are close to the item query \(\varvec{q}^{(1)}\) and match the user preference query \(\varvec{q}^{(2)}\). An important requirement for our proposal is that the weights \(\{\varvec{\gamma }^{(w)}, \varvec{\eta }^{(w)}, \varvec{\lambda }^{(w)}\}\) can be adjusted at query time.
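To make the weight mechanics concrete, the following is a hedged numpy sketch of the MP dissimilarity in Eq. (7) together with two single-query settings in the spirit of the examples above; the helper name, the group partitioning, and the dimensionalities are illustrative assumptions only.

```python
import numpy as np

def mp_dissimilarity(Q, x_groups, gammas, etas, lambdas):
    """Eq. (7): Q[w][g] is the g-th group of the w-th query, x_groups[g] that of the sample."""
    total = 0.0
    for w, q_groups in enumerate(Q):
        for g, (qg, xg) in enumerate(zip(q_groups, x_groups)):
            total += gammas[w][g] * np.sum((qg - xg) ** 2)                       # squared-L2 term
            total += 2 * etas[w][g] * (1 - qg @ xg /
                                       (np.linalg.norm(qg) * np.linalg.norm(xg)))  # cosine term
            total += 2 * lambdas[w][g] * (1 - qg @ xg)                           # inner product term
    return total

# single query (W = 1), two feature groups (G = 2), small rescaled toy vectors
rng = np.random.default_rng(0)
x = [rng.standard_normal(8) / 8, rng.standard_normal(8) / 8]
q = [rng.standard_normal(8) / 8, rng.standard_normal(8) / 8]

# pure L2-NNS over both groups: gamma = (1/2, 1/2), eta = lambda = 0
d_l2 = mp_dissimilarity([q], x, gammas=[[0.5, 0.5]], etas=[[0, 0]], lambdas=[[0, 0]])
# pure MIPS over both groups: lambda = (1/2, 1/2), gamma = eta = 0
d_ip = mp_dissimilarity([q], x, gammas=[[0, 0]], etas=[[0, 0]], lambdas=[[0.5, 0.5]])
```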

Our target application is an interactive system, like the demonstration in Sect. 4.3, where the users modify the weights according to the result with the previous weight setting. Optimizing the weights for some meta objective is out of scope of this paper.

3.2 Multiple purpose LSH with vector augmentation (mp-LSH-VA)

Our first method, called multiple purpose LSH with vector augmentation (mp-LSH-VA), is inspired by the research on asymmetric LSHs for MIPS (Shrivastava and Li 2014, 2015; Bachrach et al. 2014; Neyshabur and Srebro 2015), where the query and the samples are augmented with additional entries, so that the squared-L2-distance in the augmented space coincides with the target dissimilarity up to a constant. A significant difference of our proposal from the previous methods is that we design the augmentation, so that we can adjust the dissimilarity measure [i.e., the feature weights \(\{\varvec{\gamma }^{(w)}, \varvec{\lambda }^{(w)} \}\) in Eq. (7)] by modifying the augmented query vector. Since mp-LSH-VA, unfortunately, does not support the cosine distance, we set \(\varvec{\eta }^{(w)} = \varvec{0}\) in this subsection. We define the weighted sum query by
$$\begin{aligned} \overline{\varvec{q}}&=(\overline{\varvec{q}}_1; \cdots ; \overline{\varvec{q}}_G) = \sum\nolimits_{w=1}^W \big ( \phi _1^{(w)} \varvec{q}_1^{(w)}; \cdots ; \phi _G^{(w)} \varvec{q}_G^{(w)} \big ), \\&\quad \quad \quad \quad \text{ where }\quad \phi _g^{(w)} = \gamma _g^{(w)} + \lambda _g^{(w)}. \end{aligned}$$
We augment the queries and the samples as follows:
$$\begin{aligned} \widetilde{\varvec{q}}&= (\overline{\varvec{q}}; \varvec{r}),&\widetilde{\varvec{x}}&= (\varvec{x}; \varvec{y}), \end{aligned}$$
where \(\varvec{r}\in {\mathbb {R}}^M\) is a (vector-valued) function of \(\{\varvec{q}^{(w)}\}\), and \(\varvec{y}\in {\mathbb {R}}^M\) is a function of \(\varvec{x}\). We constrain the augmentation \(\varvec{y}\) for the sample vector so that it satisfies, for a constant \(c_1 \ge 1\):
$$\begin{aligned} \Vert \widetilde{\varvec{x}}\Vert _2&= c_1, \text{ i.e., } \Vert \varvec{y}\Vert _2^2 = c_1^2 - \Vert \varvec{x}\Vert _2^2. \end{aligned}$$
(8)
Under this constraint, the norm of any augmented sample is equal to \(c_1\), which allows us to use sign-LSH (Proposition 3) to perform L2-NNS. The squared-L2-distance between the query and a sample in the augmented space can be expressed as
$$\begin{aligned} \Vert \widetilde{\varvec{q}} - \widetilde{\varvec{x}}\Vert _{{\tiny 2}}^2&= -2\left( \overline{\varvec{q}}^{\top }{\varvec{x}} + \varvec{r}^{\top }\varvec{y}\right) + \text {const.}\end{aligned}$$
(9)
For \(M = 1\), the only choice satisfying Eq. (8) is simple-LSH (with \(r = 0\)), given in Proposition 4. We consider the case \(M \ge 2\), and design \(\varvec{r}\) and \(\varvec{y}\), so that Eq. (9) matches the MP dissimilarity (7).
The augmentation that matches the MP dissimilarity is not unique. Here, we introduce the following easy construction with \(M = G+3\):
$$\begin{aligned} \widetilde{\varvec{q}}&= \Big (\widetilde{\varvec{q}}'; \sqrt{c_2^2 - \Vert \widetilde{\varvec{q}}'\Vert _2^2} \Big ), \quad \widetilde{\varvec{x}} = (\widetilde{\varvec{x}}'; 0) \quad \text { where} \nonumber \\ \widetilde{\varvec{q}}'&= \left ( \underbrace{ \overline{\varvec{q}}_1; \cdots ; \overline{\varvec{q}}_G }_{ \overline{\varvec{q}} \in {\mathbb {R}}^L} \; ; \; \underbrace{\sum\nolimits_{w=1}^W \gamma _1^{(w)}; \cdots ; \sum\nolimits_{w=1}^W \gamma _G^{(w)} ; 0 ; \mu }_{\varvec{r}' \in {\mathbb {R}}^{G+2}} \right ), \nonumber \\ \widetilde{\varvec{x}}'&= \left( \underbrace{ \varvec{x}_1; \cdots ; \varvec{x}_G }_{\varvec{x}\in {\mathbb {R}}^L} \;;\; \underbrace{ - \frac{ \Vert \varvec{x}_1\Vert _2^2}{2}; \cdots ; - \frac{ \Vert \varvec{x}_{G}\Vert _2^2}{2} ; \nu ; \frac{1}{2} }_{\varvec{y}' \in {\mathbb {R}}^{G+2}} \right). \end{aligned}$$
(10)
Here, we defined
$$\begin{aligned} \mu&= - \sum\limits_{w=1}^W \sum _{g=1}^G \gamma _g^{(w)} \Vert \varvec{q}_g^{(w)}\Vert _2^2, \\ \nu&= \sqrt{c_1^2 - \left( \Vert \varvec{x}\Vert _2^2 + \frac{1}{4}\sum _{g=1}^G\Vert \varvec{x}_g\Vert _2^4 + \frac{1}{4}\right) }, \\ c_1^2&= \max _{\varvec{x}\in {\mathcal {X}}} \left( \Vert \varvec{x}\Vert _2^2 + \frac{1}{4}\sum _{g=1}^G\Vert \varvec{x}_g\Vert _2^4 + \frac{1}{4}\right) , \\ c_2^2&= \max _{\varvec{q}} \Vert \widetilde{\varvec{q}}'\Vert _2^2. \end{aligned}$$
With the vector augmentation (10), Eq. (9) matches Eq. (7) up to a constant (see Appendix 1):
$$\begin{aligned} \Vert \widetilde{\varvec{q}} - \widetilde{\varvec{x}}\Vert _{{\tiny 2}}^2 = c_1^2+c_2^2 - 2\widetilde{\varvec{q}}^{\top }\widetilde{\varvec{x}} = {\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) + \text {const.} \end{aligned}$$
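As a sketch of how the augmentation in Eq. (10) could be assembled in practice (the arguments `c1_sq` and `c2_sq` stand for the pool-dependent constants \(c_1^2\) and \(c_2^2\) and must be precomputed over the sample pool and the admissible queries; all names are illustrative assumptions):

```python
import numpy as np

def augment_sample(x_groups, c1_sq):
    """x_tilde from Eq. (10): (x; -||x_1||^2/2; ...; -||x_G||^2/2; nu; 1/2; 0)."""
    x = np.concatenate(x_groups)
    sq = np.array([xg @ xg for xg in x_groups])
    nu = np.sqrt(max(0.0, c1_sq - (x @ x + 0.25 * np.sum(sq ** 2) + 0.25)))
    return np.concatenate([x, -sq / 2.0, [nu, 0.5, 0.0]])

def augment_query(Q, gammas, lambdas, c2_sq):
    """q_tilde from Eq. (10): (weighted-sum query; per-group gamma sums; 0; mu; slack term)."""
    W, G = len(Q), len(Q[0])
    qbar = [sum((gammas[w][g] + lambdas[w][g]) * Q[w][g] for w in range(W)) for g in range(G)]
    gam = [sum(gammas[w][g] for w in range(W)) for g in range(G)]
    mu = -sum(gammas[w][g] * (Q[w][g] @ Q[w][g]) for w in range(W) for g in range(G))
    q_prime = np.concatenate([np.concatenate(qbar), gam, [0.0, mu]])
    slack = np.sqrt(max(0.0, c2_sq - q_prime @ q_prime))
    return np.concatenate([q_prime, [slack]])

# hashing then reduces to sign-LSH in the augmented (L + G + 3)-dimensional space:
# h_q = np.sign(a @ augment_query(...)),  h_x = np.sign(a @ augment_sample(...))
```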
The collision probability, i.e., the probability that the query and the sample are given the same code, can be analytically computed:

Theorem 1

Assume that the samples are rescaled, so that \(\max _{\varvec{x}\in {\mathcal {X}}} \Vert \varvec{x}\Vert _2 \le 1\) and \(\Vert \varvec{q}^{(w)}\Vert _2 \le 1\) for \(w = 1, \ldots , W\). For the MP dissimilarity \({\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x})\), given by Eq. (7), with \(\varvec{\eta }^{(w)} = \varvec{0}\) for \(w = 1, \ldots , W\), the asymmetric hash functions
$$\begin{aligned} h^{\mathrm {VA-q}}_{\varvec{a}}(\{\varvec{q}^{(w)}\})&= h^{\mathrm {sign}}_{\varvec{a}}(\widetilde{\varvec{q}}) = \mathrm {sign}(\varvec{a}^{\top }\widetilde{\varvec{q}}), \\ h^{\mathrm {VA-x}}_{\varvec{a}}(\varvec{x})&= h^{\mathrm {sign}}_{\varvec{a}}(\widetilde{\varvec{x}}) = \mathrm {sign}(\varvec{a}^{\top }\widetilde{\varvec{x}}), \end{aligned}$$
where \(\widetilde{\varvec{q}}\) and \(\widetilde{\varvec{x}}\) are given by Eq. (10), satisfy
$$\begin{aligned} \mathbb {P} \Big (h_{\varvec{a}}^{\mathrm {VA-q}} ( \{\varvec{q}^{(w)}\} ) = h_{\varvec{a}}^{\mathrm {VA-x}} (\varvec{x}) \Big )&= F^{\mathrm {sign}}\left( 1 + \frac{{\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) - 2\Vert \varvec{\lambda }\Vert _{{\tiny 1}}}{2 c_1 c_2} \right) . \end{aligned}$$
(Proof) By construction, it holds that \(\Vert \widetilde{\varvec{x}}\Vert _2 = c_1\) and \(\Vert \widetilde{\varvec{q}}\Vert _2 = c_2\), and simple calculations (see Appendix 1) give \(\widetilde{\varvec{q}}^{\top }\widetilde{\varvec{x}} = \Vert \varvec{\lambda }\Vert _{{\tiny 1}} - \frac{{\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\},\, \varvec{x})}{2}\). Then, applying Proposition 3 immediately proves the theorem. \(\square\)
Fig. 1

Theoretical values \(\rho = \frac{\log p_1}{\log p_2}\) (lower is better), which indicates the LSH performance (see Proposition 1). The horizontal axis indicates c for c-ANNS

Figure 1 depicts the theoretical value of \(\rho = \frac{\log p_1}{\log p_2}\) of mp-LSH-VA, computed using Theorem 1, for different weight settings with \(G = 1\). Note that \(\rho\) determines the quality of LSH for c-ANNS performance (smaller is better; see Proposition 1). For the L2-NNS and MIPS cases, the \(\rho\) values of the standard LSH methods, i.e., L2-LSH (Proposition 2) and simple-LSH (Proposition 4), are also shown for comparison.

Although mp-LSH-VA offers attractive flexibility with an adjustable dissimilarity, Fig. 1 implies that its performance is inferior to the standard methods, especially in the L2-NNS case. The reason might be the overly strong asymmetry between the query and the samples: a query and a sample are far apart in the augmented space, even if they are close to each other in the original space. We can see this from the first G entries of \(\varvec{r}\) and \(\varvec{y}\) in Eq. (10), respectively. Those entries for the query are non-negative, i.e., \(r_m \ge 0\) for \(m = 1, \ldots , G\), while the corresponding entries for the sample are non-positive, i.e., \(y_m \le 0\) for \(m = 1, \ldots , G\). We believe that there is room to improve the performance of mp-LSH-VA, e.g., by adding constants and changing the scales of some augmented entries, which we leave as future work.

In the next subsections, we propose alternative approaches, where codes are as symmetric as possible, and down-weighting is done by changing the metric in the code space. This effectively keeps close points in the original space close in the code space.

3.3 Multiple purpose LSH with code concatenation (mp-LSH-CC)

Let \(\overline{\gamma }_g = \sum _{w=1}^W \gamma _g^{(w)}\), \(\overline{\eta }_g = \sum _{w=1}^W \eta _g^{(w)}\), and \(\overline{\lambda }_g = \sum _{w=1}^W \lambda _g^{(w)}\), and define the metric-wise weighted average queries by \(\overline{\varvec{q}}_g^{\mathrm {L2}} = \frac{\sum _{w=1}^W \gamma _g^{(w)} \varvec{q}_g^{(w)}}{\overline{\gamma }_g}\), \(\overline{\varvec{q}}_g^{\mathrm {cos}} = \sum _{w=1}^W \eta _g^{(w)} \frac{\varvec{q}_g^{(w)}}{\Vert \varvec{q}_g^{(w)}\Vert _2}\), and \(\overline{\varvec{q}}_g^{\mathrm {ip}} = \sum _{w=1}^W \lambda _g^{(w)} \varvec{q}_g^{(w)}\).

Our second proposal, called multiple purpose LSH with code concatenation (mp-LSH-CC), simply concatenates multiple LSH codes, and performs NNS under the following distance metric at query time:
$$\begin{aligned} \mathcal {D}_{\mathrm {CC}}(\{\varvec{q}^{(w)}\}, {\varvec{x}})&= \sum _{g=1}^G \sum _{t=1}^T \Big ( \overline{\gamma }_g R\sqrt{\frac{\pi }{2}} \left| h_{t}^{\mathrm {L2}}(\overline{\varvec{q}}_g^{\mathrm {L2}}) \!-\! h_{t}^{\mathrm {L2}}({\varvec{x}}_g) \right| \nonumber \\ &+ \Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}} \left| h_{t}^{\mathrm {sign}}(\overline{\varvec{q}}_g^{\mathrm {cos}}) - h_{t}^{\mathrm {sign}}({\varvec{x}}_g) \right| \nonumber \\ &+\Vert \overline{\varvec{q}}_g^{\mathrm {ip}}\Vert _{{\tiny 2}} \left| h_{t}^{\mathrm {smp-q}}(\overline{\varvec{q}}_g^{\mathrm {ip}}) - h_{t}^{\mathrm {smp-x}}({\varvec{x}}_g) \right| \Big ) , \end{aligned}$$
(11)
where \(h_{t}^{-}\) denotes the tth independent draw of the corresponding LSH code for \(t = 1, \ldots , T\). The distance (11) is a multi-metric, i.e., a linear combination of metrics (Bustos et al. 2012), in the code space. For a multi-metric, we can use the cover tree (Beygelzimer et al. 2006) for efficient (exact) NNS. Assuming that all adjustable linear weights are upper-bounded by 1, the cover tree expresses the neighboring relation between samples, taking all possible weight settings into account. NNS is conducted by bounding the code metric for a given weight setting. Thus, mp-LSH-CC allows selective exploration of hash buckets, so that we only need to accurately measure the distance to the samples assigned to the hash buckets within a small code distance. The query time complexity of the cover tree is \(O \left(\kappa ^{12} \log N \right)\), where \(\kappa\) is a data-dependent expansion constant (Heinonen 2001). Another good aspect of the cover tree is that it allows dynamic insertion and deletion of samples, and therefore, it lends itself naturally to the streaming setting. Appendix 1 describes further details.
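A hedged sketch of evaluating the code distance in Eq. (11), assuming the L2-, sign-, and simple-LSH codes have already been computed and stored as \(T \times G\) arrays; the dictionary keys and argument names are illustrative, and the cover tree machinery itself is omitted.

```python
import numpy as np

def d_cc(q_codes, x_codes, gbar, cos_norms, ip_norms, R):
    """Eq. (11): per-group weighted L1 distance between the concatenated codes.
    q_codes / x_codes hold (T, G) arrays under the keys 'l2', 'sign' and 'smp';
    gbar, cos_norms, ip_norms are the length-G query-dependent weights."""
    d_l2 = np.abs(q_codes['l2'] - x_codes['l2']).sum(axis=0)      # summed over t, one value per g
    d_sign = np.abs(q_codes['sign'] - x_codes['sign']).sum(axis=0)
    d_smp = np.abs(q_codes['smp'] - x_codes['smp']).sum(axis=0)
    return float(np.sum(gbar * R * np.sqrt(np.pi / 2.0) * d_l2
                        + cos_norms * d_sign + ip_norms * d_smp))
```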

In the pure case for L2, cosine, or IP dissimilarity, the hash code of mp-LSH-CC is equivalent to the base LSH code, and therefore, the performance is guaranteed by Propositions 2, 3, 4, respectively. However, mp-LSH-CC is not optimal in terms of memory consumption and NNS efficiency. This inefficiency comes from the fact that it redundantly stores the same angular (or cosine distance) information into each of the L2-, sign-, and simple-LSH codes. Note that the information of a vector is dominated by its angular components unless the dimensionality L is very small.

3.4 Multiple purpose LSH with code augmentation and transformation (mp-LSH-CAT)

Our third proposal, called multiple purpose LSH with code augmentation and transformation (mp-LSH-CAT), offers significantly less memory requirement and faster NNS than mp-LSH-CC by sharing the angular information for all considered dissimilarity measures. Let
$$\begin{aligned} \overline{\varvec{q}}_g^{\mathrm {L2+ip}} = \sum\nolimits_{w=1}^W (\gamma _g^{(w)} + \lambda _g^{(w)}) \varvec{q}_g^{(w)}. \end{aligned}$$
We essentially use sign-hash functions that we augment with norm information of the data, giving us the following augmented codes:
$$\begin{aligned} \varvec{H}^{\mathrm {CAT-q}}(\{\varvec{q}^{(w)}\})&= \left( \varvec{H}(\overline{\varvec{q}}^{\mathrm {L2+ip}}) ; \varvec{H}(\overline{\varvec{q}}^{\mathrm {cos}}) ; \mathbf 0 _G^\top \right) , \end{aligned}$$
(12)
$$\begin{aligned} \varvec{H}^{\mathrm {CAT-x}}(\varvec{x})&= \Big (\widetilde{\varvec{H}}(\varvec{x}) ; \varvec{H}(\varvec{x}) ; \varvec{j}^{\top }(\varvec{x})\Big ), \end{aligned}$$
(13)
where
$$\begin{aligned} \varvec{H}(\varvec{v})&=\Big (\mathrm {sign} (\varvec{A}_1 \varvec{v}_1), \ldots , \mathrm {sign} (\varvec{A}_G \varvec{v}_G) \Big ), \nonumber \\ \widetilde{\varvec{H}}(\varvec{v})&= \Big (\Vert \varvec{v}_1\Vert _{{\tiny 2}}\mathrm {sign} (\varvec{A}_1 \varvec{v}_1), \ldots , \Vert \varvec{v}_G\Vert _{{\tiny 2}}\mathrm {sign} (\varvec{A}_G \varvec{v}_G) \Big ),\nonumber \\ \varvec{j}(\varvec{v})&= \Big (\Vert \varvec{v}_1\Vert _{{\tiny 2}}^2; \ldots ; \Vert \varvec{v}_G\Vert _{{\tiny 2}}^2\Big ) , \end{aligned}$$
(14)
for a partitioned vector \(\varvec{v}= (\varvec{v}_1, \ldots, \varvec{v}_G)\in {\mathbb {R}}^L\) and \(\mathbf 0 _G = (0; \cdots ; 0) \in {\mathbb {R}}^G\). Here, each entry of \(\varvec{A}= (\varvec{A}_1, \ldots , \varvec{A}_G) \in {\mathbb {R}}^{T \times L}\) follows \(A_{t, l} \sim \mathcal {N}(0, 1^2)\).
For two matrices \(\varvec{H}', \varvec{H}'' \in {\mathbb {R}}^{(2T+1)\times G}\) in the transformed hash code space, we measure the distance with the following multi-metric:
$$\begin{aligned} \mathcal {D}_{\mathrm {CAT}}(\varvec{H}', \varvec{H}'')&= \sum\nolimits_{g=1}^G \bigg ( \alpha _g \sum _{t=1}^{T} \left| H_{t, g}' - {H}_{t, g}'' \right| +\beta _g \sum _{t=T+1}^{2T} \left| H_{t, g}' - {H}_{t, g}'' \right| \nonumber \\ &+ \overline{\gamma }_g \frac{T}{2} \left| H_{2T+1, g}' - {H}_{2T+1, g}'' \right| \bigg ), \end{aligned}$$
(15)
where \(\alpha _g = \Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}\) and \(\beta _g = \Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}}\).
Although the hash codes consist of \((2T+1)G\) entries, we do not need to store all the entries, and computation can be simpler and faster by first computing the total number of collisions in the sign-LSH part (14) for \(g = 1, \ldots , G\):
$$\begin{aligned} \mathcal {C}_g (\varvec{v}', \varvec{v}'')&= \sum\limits_{t=1}^T \mathbb {I}\Big [ \big (\varvec{H}(\varvec{v}') \big )_{t, g} = \big ( \varvec{H}({\varvec{v}''}) \big )_{t, g}\Big ]. \end{aligned}$$
(16)
Note that this computation, which dominates the computation cost for evaluating code distances, can be performed efficiently with bit operations. With the total number of collisions (16), the metric (15) between a query set \(\{\varvec{q}^{(w)}\}\) and a sample \(\varvec{x}\) can be expressed as
$$\begin{aligned} \mathcal {D}_{\mathrm {CAT}}&\Big (\varvec{H}^{\mathrm {CAT-q}}(\{\varvec{q}^{(w)}\}), \varvec{H}^{\mathrm {CAT-x}}(\varvec{x}) \Big ) \nonumber \\ =& \sum\limits_{g=1}^G \bigg (\alpha _g\Big (T + \Vert \varvec{x}_g\Vert _{{\tiny 2}}\big ( T - 2\mathcal {C}_g(\overline{\varvec{q}}^{\mathrm {L2+ip}}, \varvec{x})\big )\Big )\nonumber \\ &\quad \qquad \qquad +2\beta _g\big (T - \mathcal {C}_g(\overline{\varvec{q}}^{\mathrm {cos}}, \varvec{x})\big ) + \overline{\gamma }_g \frac{T}{2}\Vert \varvec{x}_g\Vert _{{\tiny 2}}^2 \bigg ). \end{aligned}$$
(17)
Given a query set, this can be computed from \(\varvec{H}(\varvec{x}) \in {\mathbb {R}}^{T \times G}\) and \(\Vert \varvec{x}_g\Vert _{{\tiny 2}}\) for \(g = 1, \ldots , G\). Therefore, we only need to store the TG sign bits, which are required by sign-LSH alone, and G additional floating-point numbers.
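The storage layout and the bit-operation-based evaluation of Eq. (17) can be sketched as follows; the per-group Gaussian matrices `A_groups`, the argument names, and the assumption that the query codes are built from \(\overline{\varvec{q}}^{\mathrm {L2+ip}}\) and \(\overline{\varvec{q}}^{\mathrm {cos}}\) in the same way are illustrative.

```python
import numpy as np

def cat_store(x_groups, A_groups):
    """Per sample we keep only the T*G packed sign bits plus the G squared norms."""
    bits = [np.packbits(A @ xg >= 0) for A, xg in zip(A_groups, x_groups)]
    norms_sq = np.array([xg @ xg for xg in x_groups])
    return bits, norms_sq

def d_cat(q_bits_l2ip, q_bits_cos, x_bits, x_norms_sq, alpha, beta, gbar, T):
    """Eq. (17), driven by the per-group collision counts of Eq. (16), via bit operations."""
    d = 0.0
    for g in range(len(x_bits)):
        ham_l2ip = np.count_nonzero(np.unpackbits(q_bits_l2ip[g] ^ x_bits[g])[:T])
        ham_cos = np.count_nonzero(np.unpackbits(q_bits_cos[g] ^ x_bits[g])[:T])
        c_l2ip, c_cos = T - ham_l2ip, T - ham_cos          # collision counts C_g
        norm_x = np.sqrt(x_norms_sq[g])
        d += alpha[g] * (T + norm_x * (T - 2 * c_l2ip))    # L2 + IP part
        d += 2.0 * beta[g] * (T - c_cos)                   # cosine part
        d += gbar[g] * (T / 2.0) * x_norms_sq[g]           # norm part
    return d
```

Here `alpha`, `beta`, and `gbar` correspond to \(\alpha_g\), \(\beta_g\), and \(\overline{\gamma}_g\) in Eq. (15); the XOR-and-popcount pattern is what makes the collision counting fast in practice.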
Similar to mp-LSH-CC, we use the cover tree for efficient NNS based on the code distance (15). In the cover tree construction, we set the metric weights to their upper bounds, i.e., \(\alpha _g = \beta _g = \overline{\gamma }_g = 1\), and measure the distance between samples by
$$\begin{aligned} \!\!\!\mathcal {D}_{\mathrm {CAT}}&\Big (\varvec{H}^{\mathrm {CAT-x}}(\varvec{x}') , \varvec{H}^{\mathrm {CAT-x}}(\varvec{x}'') \Big ) \nonumber \\ =& \sum\limits_{g=1}^G \bigg ( \left| \Vert \varvec{x}_g'\Vert _{{\tiny 2}} - \Vert {\varvec{x}}''_g\Vert _{{\tiny 2}} \right| \mathcal {C}_g(\varvec{x}', {\varvec{x}''})\nonumber \\&\quad \qquad \quad + (\Vert \varvec{x}_g'\Vert _{{\tiny 2}} + \Vert {\varvec{x}}''_g\Vert _{{\tiny 2}} + 2)\big (T - \mathcal {C}_g(\varvec{x}', {\varvec{x}''})\big )\nonumber \\ &\quad \qquad \quad +\frac{T}{2} \left| \Vert \varvec{x}_g'\Vert _{{\tiny 2}}^2 - \Vert {\varvec{x}}''_g\Vert _{{\tiny 2}}^2 \right| \bigg ). \end{aligned}$$
(18)
Since the collision probability can be zero, we cannot directly apply the standard LSH theory with the \(\rho\) value guaranteeing the ANNS performance. Instead, we show that the metric (15) of mp-LSH-CAT approximates the MP dissimilarity (7), and the quality of ANNS is guaranteed.

Theorem 2

For \(\varvec{\eta }^{(w)} = \varvec{0}\) for \(w = 1, \ldots , W\), it holds that
$$\begin{aligned} \lim _{T \rightarrow \infty } \frac{\mathcal {D}_{\mathrm {CAT}} }{T} = \frac{1}{2}{\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) + \text {const.} + \text {error}, \end{aligned}$$
with \(|\text {error}| \le 0.2105\left( \Vert \varvec{\lambda }\Vert _{{\tiny 1}} + \Vert \varvec{\gamma }\Vert _{{\tiny 1}}\right) .\) (proof is given in Appendix 1).

Theorem 3

For \(\varvec{\gamma }^{(w)} = \varvec{\lambda }^{(w)} = \varvec{0}\) for \(w = 1, \ldots , W\), it holds that
$$\begin{aligned} \lim _{T \rightarrow \infty } \frac{\mathcal {D}_{\mathrm {CAT}} }{T} = \frac{1}{2}{\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) + \text {const.} + \text {error}, \end{aligned}$$
with \(|\text {error}| \le 0.2105\Vert \varvec{\eta }\Vert _{{\tiny 1}}\). (proof is given in Appendix 1).

Corollary 1

It holds that
$$\begin{aligned} 2\lim _{T \rightarrow \infty } \frac{\mathcal {D}_{\mathrm {CAT}} }{T} = {\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) + \text {const.} + \text {error}, \end{aligned}$$
with
$$\begin{aligned} |\text {error}| \le 0.421. \end{aligned}$$

Note that Corollary 1 does not state that the distance in the code space converges to the multiple purpose dissimilarity even in the asymptotic limit \(T \rightarrow \infty\): there can be a constant worst-case error. However, this constant error is bounded by 0.421, which is one order of magnitude smaller than the range of the MP dissimilarity, which is 4. The following theorem guarantees that ANNS with mp-LSH-CAT succeeds in the pure MIPS case with the specified probability (the proof is given in Appendix 1):

Theorem 4

Let \(S_0 \in (0,2)\), \(cS_0 \in (S_0 + 0.2105,2)\) and set
$$\begin{aligned} T \ge \frac{48}{(t_2-t_1)^2}\log \left(\frac{n}{\varepsilon }\right),\end{aligned}$$
where \(t_2 > t_1\) depend on \(S_0\) and c (see Appendix 1 for details). With probability larger than \(1 - \varepsilon - \left( \frac{\varepsilon }{n}\right) ^\frac{3}{2}\), mp-LSH-CAT guarantees c-ANNS with respect to \({\mathcal {L}}_{\mathrm {ip}}\) (MIPS).

It is straightforward to show the analogue of Theorem 4 for the squared-L2 and cosine distances.

Because of the constant error, the guarantee of Theorem 4 applies only to c such that \(cS_0 \in (S_0 + 0.2105,2)\). In Sect. 4, we empirically show the good performance of mp-LSH-CAT, which suggests that the constant error is not very harmful in practice.

3.5 Memory requirement

For all LSH schemes, one can trade off memory consumption against accuracy by changing the hash bit length T. However, the memory consumption per hash differs heavily between hashing schemes, so that comparing performance at a globally shared T would be inadequate. In this subsection, we derive individual numbers of hashes for each scheme, given a fixed memory budget.

We count the theoretically minimal number of bits required to store the hash code of one data point. The two fundamental components we are confronted with are sign-hashes and discretized reals. A sign-hash can be represented by exactly one bit. For the reals, we choose a resolution such that their discretizations take values in a set of fixed size. The L2-hash function \(h_{\varvec{a}, b}^{\mathrm {L2}} (\varvec{x}) = \left\lfloor R^{-1}(\varvec{a}^{\top }\varvec{x}+ b) \right\rfloor\) is a random variable with potentially infinitely many discrete values. Nevertheless, we can give a realistic upper bound on the number of values the L2-hash essentially takes. Note that \(R^{-1}(\varvec{a}^{\top }\varvec{x})\) follows a \(\mathcal {N}(\mu =0, \sigma = \Vert \varvec{x}\Vert _{{\tiny 2}}/R)\) distribution and \(\Vert \varvec{x}\Vert _{{\tiny 2}} \le 1\). Then, \(\mathbb {P}(|R^{-1}(\varvec{a}^{\top }\varvec{x})| > 4\sigma ) < 10^{-4}\). Therefore, the L2-hash essentially takes one of \(\frac{8}{R}\) discrete values, which can be stored in \(3-\log _2(R)\) bits. Namely, for \(R = 2^{-10} \approx 0.001\), an L2-hash requires 13 bits. We also store the norm part of mp-LSH-CAT using 13 bits.

Denote by \(\text {stor}_{\mathrm {CAT}}(T)\) the required storage of mp-LSH-CAT. Then, \(\text {stor}_{\mathrm {CAT}}(T) = T_\mathrm {CAT} + 13\), which we set as our fixed memory budget for a given \(T_\mathrm {CAT}\). The baselines sign-LSH and simple-LSH, as well as mp-LSH-VA, are pure sign-hashes, which gives them a budget of \(T_{\mathrm {sign}} = T_{\mathrm {smp}} = T_{\mathrm {VA}} = \text {stor}_{\mathrm {CAT}}(T)\) hashes. As discussed above, L2-LSH may take \(T_{\mathrm {L2}} = \frac{\text {stor}_{\mathrm {CAT}}(T)}{13}\) hashes. For mp-LSH-CC, we allocate a third of the budget to each of the three components, giving \(\varvec{T}_{\mathrm {CC}} = (T_{\mathrm {CC}}^{\mathrm {L2}}, T_{\mathrm {CC}}^{\mathrm {sign}}, T_{\mathrm {CC}}^{\mathrm {smp}}) = \text {stor}_{\mathrm {CAT}}(T) \cdot (\frac{1}{39},\frac{1}{3},\frac{1}{3})\). This consideration is used when we compare mp-LSH-CC and mp-LSH-CAT in Sect. 4.2.
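As a small worked example of this budget accounting (plain Python arithmetic; the rounding to integers is our own illustrative choice):

```python
T_cat = 1024
budget = T_cat + 13                  # stor_CAT(T): T sign bits plus a 13-bit norm part

T_sign = T_smp = T_va = budget       # pure sign-hashes cost 1 bit each
T_l2 = budget // 13                  # one L2-hash costs about 13 bits when R = 2**-10

# mp-LSH-CC: one third of the budget per component, with 13 bits per L2-hash
T_cc = tuple(round(budget * f) for f in (1 / 39, 1 / 3, 1 / 3))
# T_cc == (27, 346, 346), the setting used for Table 3 in Sect. 4.2
```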
Fig. 2

Precision–recall curves (higher is better) on MovieLens10M data for K = 5 and T = 256

Fig. 3

Precision–recall curves on NetFlix data for K = 10 and T = 512

Table 1

ANNS results for mp-LSH-CC with \(\varvec{T}_{\mathrm {CC}} = (T_{\mathrm {CC}}^{\mathrm {L2}}, T_{\mathrm {CC}}^{\mathrm {sign}}, T_{\mathrm {CC}}^{\mathrm {smp}})=(1024, 1024, 1024)\)

 

                        Recall@k              Query time (msec)               Storage per sample (bytes)
                        1      5      10      1         5         10
L2                      0.53   0.76   0.82    2633.83   2824.06   2867.00     4344
MIPS                    0.69   0.77   0.82    3243.51   3323.20   3340.36     4344
L2+MIPS (0.5, 0.5)      0.29   0.50   0.60    3553.63   3118.93   3151.44     4344

Table 2

ANNS results with mp-LSH-CAT with \(T_{\mathrm {CAT}}=1024\)

 

                        Recall@k              Query time (msec)            Storage per sample (bytes)
                        1      5      10      1        5        10
L2                      0.52   0.80   0.89    583.85   617.02   626.02      224
MIPS                    0.64   0.76   0.85    593.11   635.72   645.14      224
L2+MIPS (0.5, 0.5)      0.29   0.52   0.62    476.62   505.63   515.77      224

Table 3

ANNS results for mp-LSH-CC with \(\varvec{T}_{\mathrm {CC}} = (T_{\mathrm {CC}}^{\mathrm {L2}}, T_{\mathrm {CC}}^{\mathrm {sign}}, T_{\mathrm {CC}}^{\mathrm {smp}}) = (27, 346, 346)\).

 

                        Recall@k              Query time (msec)               Storage per sample (bytes)
                        1      5      10      1         5         10
L2                      0.35   0.49   0.59    1069.29   1068.97   1074.40     280
MIPS                    0.32   0.56   0.56    363.61    434.49    453.35      280
L2+MIPS (0.5, 0.5)      0.04   0.07   0.08    811.72    839.91    847.35      280

4 Experiment

Here, we conduct an empirical evaluation on several real-world data sets.

4.1 Collaborative filtering

We first evaluate our methods on collaborative filtering data, the MovieLens10M2 and the Netflix datasets (Funk 2006). Following the experiments in Shrivastava and Li (2014, 2015), we applied PureSVD (Cremonesi et al. 2010) to get L-dimensional user and item vectors, where \(L = 150\) for MovieLens and \(L = 300\) for Netflix. We centered the samples so that \(\sum _{\varvec{x}\in {\mathcal {X}}} \varvec{x}= \varvec{0}\), which affects neither the L2-NNS solution nor the MIPS solution.

Regarding the L-dimensional vector as a single feature group (\(G=1\)), we evaluated the performance on L2-NNS (\(W = 1, \gamma = 1, \eta = \lambda = 0\)), MIPS (\(W = 1, \gamma = \eta = 0, \lambda = 1\)), and their weighted sum (\(W = 2, \gamma ^{(1)} = 0.5, \lambda ^{(2)} = 0.5, \gamma ^{(2)} = \lambda ^{(1)} = \eta ^{(1)} = \eta ^{(2)} = 0\)). The queries for L2-NNS were chosen randomly from the items, while the queries for MIPS were chosen from the users. For each query, we found its \(K= 1, 5, 10\) nearest neighbors in terms of the MP dissimilarity (7) by linear search, and used them as the ground truth. We set the hash bit length to \(T = 128, 256, 512\), and ranked the samples (items) based on the Hamming distance for the baseline methods and mp-LSH-VA. For mp-LSH-CC and mp-LSH-CAT, we ranked the samples based on their code distances (11) and (15), respectively. After that, we drew the precision–recall curve, defined as \(\mathrm {Precision} = \frac{\mathrm {relevant seen}}{k}\) and \(\mathrm {Recall} = \frac{\mathrm {relevant seen}}{K}\) for different k, where “relevant seen” is the number of the true K nearest neighbors that are ranked within the top k positions by the LSH methods. Figures 2 and 3 show the results on MovieLens10M for \(K=5\) and \(T = 256\) and on NetFlix for \(K=10\) and \(T = 512\), respectively, where each curve was averaged over 2000 randomly chosen queries.
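For reference, the precision–recall bookkeeping used here can be sketched as follows; the function name and the toy inputs are illustrative only.

```python
import numpy as np

def precision_recall(ranked_ids, true_nn_ids, ks):
    """Precision = relevant_seen / k and Recall = relevant_seen / K at each cutoff k."""
    true_set = set(true_nn_ids)
    K = len(true_set)
    out = []
    for k in ks:
        relevant_seen = len(true_set.intersection(ranked_ids[:k]))
        out.append((relevant_seen / k, relevant_seen / K))
    return out

# toy usage: K = 5 ground-truth neighbors, samples ranked by some code distance
curve = precision_recall(ranked_ids=list(range(1000)),
                         true_nn_ids=[3, 10, 42, 57, 99],
                         ks=[1, 5, 10, 100, 1000])
```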

We observe that mp-LSH-VA performs very poorly for L2-NNS (as badly as simple-LSH, which is not designed for the L2 distance), although it performs reasonably for MIPS. On the other hand, mp-LSH-CC and mp-LSH-CAT perform well in all cases. A similar tendency was observed for other values of K and T. Since the poor performance of mp-LSH-VA was shown both in theory (Fig. 1) and in experiments (Figs. 2, 3), we focus on mp-LSH-CC and mp-LSH-CAT in the subsequent subsections.

4.2 Computation time in query search

Next, we evaluate the query search time and memory consumption of mp-LSH-CC and mp-LSH-CAT on the texmex data set3 (Jégou et al. 2011), which was generated from millions of images by applying the standard SIFT descriptor (Lowe 2004) with \(L=128\). Similarly to Sect. 4.1, we conducted experiments on L2-NNS, MIPS, and their weighted sum with the same setting for the weights \(\varvec{\gamma }, \varvec{\eta }, \varvec{\lambda }\). We constructed the cover tree with all \(N = 10^9\) samples in the ANN_SIFT1B data set, and used all samples in the defined query set as the queries for L2-NNS. We randomly drew the same number of queries for MIPS from the uniform distribution on the set of normalized (\(\Vert \varvec{q}\Vert _2 = 1\)) vectors.

We ran the performance experiment on a machine with 48 cores (4 AMD Opteron™6238 Processors) and 512 GB main memory on Ubuntu 12.04.5 LTS. Tables 1, 2, 3 summarize recall@k, query time, and required memory storage. Here, recall@k is the recall for \(K = 1\) and given k. All reported values are averaged over 100 queries.

We see that mp-LSH-CC (Table 1) and mp-LSH-CAT (Table 2) for \(T = 1024\) perform comparably well in terms of accuracy (see the columns for recall@k). However, mp-LSH-CAT is much faster (see query time) and requires significantly less memory (see storage per sample). Table 3 shows the performance of mp-LSH-CC with equal memory requirement to mp-LSH-CAT for \(T=1024\). More specifically, we use different bit length for each dissimilarity measure, and set them to \(\varvec{T}_{\mathrm {CC}} = (T_{\mathrm {CC}}^{\mathrm {L2}}, T_{\mathrm {CC}}^{\mathrm {sign}}, T_{\mathrm {CC}}^{\mathrm {smp}}) = (27, 346, 346)\), with which the memory budget is shared equally for each dissimilarity measure, according to Sect. 3.5. By comparing Tables 2 and 3, we see that mp-LSH-CC for \(\varvec{T}_{\mathrm {CC}} = (27, 346, 346)\), which uses similar memory storage per sample, gives significantly worse recall@k than mp-LSH-CAT for \(T=1024\).

Thus, we conclude that both mp-LSH-CC and mp-LSH-CAT perform well, but we recommend the latter for the case of limited memory budget, or in applications, where the query search time is crucial.
Fig. 4

Image retrieval results with mixed queries. In both a, b, the top row shows L2 query (left end) and the images retrieved (by ANNS with mp-LSH-CAT for \(T=512\)) according to the L2 dissimilarity (\(\gamma ^{(1)} = 1.0\) and \(\lambda ^{(2)} = 0.0\)), the second row shows MIPS query and the images retrieved according to the IP dissimilarity (\(\gamma ^{(1)} = 0.0\) and \(\lambda ^{(2)} = 1.0\)), and the third row shows the images retrieved according to the mixed dissimilarity for \(\gamma ^{(1)} = 0.6\) and \(\lambda ^{(2)} = 0.4\)

4.3 Demonstration of image retrieval with mixed queries

Finally, we demonstrate the usefulness of our flexible mp-LSH in an image retrieval task on the ILSVRC2012 data set (Russakovsky et al. 2015). We computed a feature vector for each image by concatenating the 4096-dimensional fc7 activations of the trained VGG16 model (Simonyan and Zisserman 2014) with 120-dimensional color features4. Since user preference vectors are not available, we use classifier vectors, i.e., the weights associated with the respective ImageNet classes, as MIPS queries (the entries corresponding to the color features are set to zero). This simulates users who like a particular class of images.

We performed ANNS based on the MP dissimilarity using our mp-LSH-CAT with \(T = 512\) in the sample pool consisting of all \(N \approx 1.2M\) images. In Fig. 4a, each of the three rows consists of the query at the left end and the corresponding top-ranked images. In the first row, the shown black dog image was used as the L2 query \(\varvec{q}^{(1)}\), and similar black dog images were retrieved according to the L2 dissimilarity (\(\gamma ^{(1)} = 1.0\) and \(\lambda ^{(2)} = 0.0\)). In the second row, the VGG16 classifier vector for trench coats was used as the MIPS query \(\varvec{q}^{(2)}\), and images containing trench coats were retrieved according to the MIPS dissimilarity (\(\gamma ^{(1)} = 0.0\) and \(\lambda ^{(2)} = 1.0\)). In the third row, images containing black trench coats were retrieved according to the mixed dissimilarity with \(\gamma ^{(1)} = 0.6\) and \(\lambda ^{(2)} = 0.4\). Figure 4b shows another example with a strawberry L2 query and the ice cream MIPS query. We see that, in both examples, mp-LSH-CAT handles the combined query well: it returns images that are close to the L2 query and relevant to the MIPS query. Other examples can be found through our online demo.5

5 Conclusion

When querying huge amounts of data, it becomes mandatory to increase efficiency; i.e., even linear methods may be too computationally involved. Hashing, in particular locality sensitive hashing (LSH), has become a highly efficient workhorse that can yield answers to queries in sublinear time, such as L2-/cosine-distance nearest neighbor search (NNS) or maximum inner product search (MIPS). While for typical applications the type of query has to be fixed beforehand, it is not uncommon to query with respect to several aspects of the data, and perhaps even to reweight them dynamically at query time. Our paper contributes exactly here, namely by proposing three multiple purpose locality sensitive hashing (mp-LSH) methods which enable L2-/cosine-distance NNS, MIPS, and their weighted sums. A user can now efficiently change the importance weights at query time without recomputing the hash functions. Our paper has placed its focus on proving the feasibility and efficiency of the mp-LSH methods, and on introducing the cover tree concept (which is less commonly applied in the machine learning world) for fast querying over the defined multi-metric space. Finally, we provide a demonstration of the usefulness of our novel technique.

Future studies will extend mp-LSH to include further types of dissimilarity measures, e.g., the distance from a hyperplane (Jain et al. 2010), and further applications with combined queries, e.g., retrieval with one complex multiple purpose query, say, a Pareto front for subsequent decision making. Another future direction would be to analyze the interpretability of NNS systems, specifically recommender systems with a non-linear query mechanism, in terms of the salient features that have led to the query result. This is in line with the research on “explaining learning machines”, i.e., answering the question of which part of the data is responsible for specific decisions made by learning machines (Baehrens et al. 2010; Simonyan et al. 2014; Zeiler and Fergus 2014; Bach et al. 2015; Ribeiro et al. 2016; Montavon et al. 2017, 2018). This question is non-trivial when the learning machines are complex and non-linear. Our mp-LSH enables a complex non-linear query mechanism, and therefore, it would be a useful tool if we could, for example, develop a method which explains why an NNS system with mixed queries recommended a specific set of items, and analyze the dependency on the weight setting.

Footnotes

  1. This assumption is reasonable for L2-NNS if the size of the sample pool is sufficiently large, and the query follows the same distribution as the samples. For MIPS, the norm of the query can be arbitrarily modified, and we set it to \(\Vert \varvec{q}\Vert _2 = 1\).

  2.

  3.

  4. We computed histograms on the central crop of an image (covering 50% of the area) for each RGB color channel with 8 and 32 bins. We normalized the histograms and concatenated them.

  5.


Acknowledgements

This work was supported by the German Research Foundation (GRK 1589/1) by the Federal Ministry of Education and Research (BMBF) under the project Berlin Big Data Center (FKZ 01IS14013A) and the BMBF project ALICE II, Autonomous Learning in Complex Environments (01IB15001B). This work was also supported by the Fraunhofer Society under the MPI-FhG collaboration project (600393).

Compliance with ethics standard

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

  1. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One, 10(7), e0130140.
  2. Bachrach, Y., Finkelstein, Y., Gilad-Bachrach, R., Katzir, L., Koenigstein, N., Nice, N., & Paquet, U. (2014). Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces. In: Proceedings of the 8th ACM Conference on Recommender Systems (RecSys) (pp. 257–264).
  3. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., & Müller, K. R. (2010). How to explain individual classification decisions. Journal of Machine Learning Research, 11, 1803–1831.
  4. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.
  5. Bengio, Y., LeCun, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436.
  6. Beygelzimer, A., Kakade, S., & Langford, J. (2006). Cover trees for nearest neighbor. In: Proceedings of the International Conference on Machine Learning (ICML) (pp. 97–104).
  7. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
  8. Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks, 29, 1157–1166.
  9. Bustos, B., Kreft, S., & Skopal, T. (2012). Adapting metric indexes for searching in multi-metric spaces. Multimedia Tools and Applications, 58(3), 467–496.
  10. Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (pp. 380–388).
  11. Cremonesi, P., Koren, Y., & Turrin, R. (2010). Performance of recommender algorithms on top-n recommendation tasks. In: Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys) (pp. 39–46).
  12. Datar, M., Immorlica, N., Indyk, P., & Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry (SCG) (pp. 253–262).
  13. Funk, S. (2006). Try this at home. http://sifter.org/simon/journal/20061211.html.
  14. Goemans, M. X., & Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6), 1115–1145.
  15. Gorisse, D., Cord, M., & Precioso, F. (2012). Locality-sensitive hashing for chi2 distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2), 402–409.
  16. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Berlin: Springer.
  17. He, J., Chang, S. F., Radhakrishnan, R., & Bauer, C. (2011). Compact hashing with joint optimization of search accuracy and time. In: Proceedings of Computer Vision and Pattern Recognition (CVPR) (pp. 753–760).
  18. Heinonen, J. (2001). Lectures on analysis on metric spaces. Universitext.
  19. Hinton, G. (2007). Learning multiple layers of representation. Trends in Cognitive Sciences, 11, 428–434.
  20. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (pp. 604–613).
  21. Jain, P., Vijayanarasimhan, S., & Grauman, K. (2010). Hashing hyperplane queries to near points with applications to large-scale active learning. In: Advances in Neural Information Processing Systems (NIPS) (Vol. 23).
  22. Jégou, H., Tavenard, R., Douze, M., & Amsaleg, L. (2011). Searching in one billion vectors: Re-rank with source coding. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 861–864).
  23. Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS) (Vol. 25).
  24. Lin, K., Yang, H. F., Hsiao, J. H., & Chen, C. S. (2015). Deep learning of binary hash codes for fast image retrieval. In: Proceedings of the Computer Vision and Pattern Recognition Workshops.
  25. Liu, G., Xu, H., & Yan, S. (2012). Exact subspace segmentation and outlier detection by low-rank representation. In: Proceedings of the Artificial Intelligence and Statistics Conference (AISTATS).
  26. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
  27. Matsushita, Y., & Wada, T. (2009). Principal component hashing: An accelerated approximate nearest neighbor search. In: Proceedings of the Pacific-Rim Symposium on Image and Video Technology (PSIVT) (pp. 374–385).
  28. Montavon, G., Lapuschkin, S., Binder, A., Samek, W., & Müller, K. R. (2017). Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65, 211–222.
  29. Montavon, G., Orr, G., & Müller, K. R. (2012). Neural Networks: Tricks of the Trade. New York: Springer.
  30. Montavon, G., Samek, W., & Müller, K. R. (2018). Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73, 1–15.
  31. Moran, S., & Lavrenko, V. (2015). Regularized cross-modal hashing. In: Proceedings of SIGIR.
  32. Neyshabur, B., & Srebro, N. (2015). On symmetric and asymmetric LSHs for inner product search. In: Proceedings of the International Conference on Machine Learning (ICML) (Vol. 32).
  33. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144).
  34. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y.
  35. Schütt, K., Arbabzadah, F., Chmiela, S., Müller, K. R., & Tkatchenko, A. (2017). Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8, 13890.
  36. Shrivastava, A., & Li, P. (2014). Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In: Advances in Neural Information Processing Systems (NIPS) (Vol. 27).
  37. Shrivastava, A., & Li, P. (2015). Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
  38. Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In: ICLR Workshop 2014.
  39. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
  40. Song, J., Yang, Y., Huang, Z., Shen, H. T., & Luo, J. (2013). Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 15(8), 1997–2008.
  41. Strecha, C., Bronstein, A. M., Bronstein, M. M., & Fua, P. (2012). LDAHash: Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1), 66–78.
  42. Tagami, Y. (2017). AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 455–464).
  43. Torralba, A., Fergus, R., & Freeman, W. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.
  44. Wang, J., Shen, H. T., Song, J., & Ji, J. (2014). Hashing for similarity search: A survey. arXiv:1408.2927v1 [cs.DS].
  45. Xu, S., Wang, S., & Zhang, Y. (2013). Summarizing complex events: A cross-modal solution of storylines extraction and reconstruction. In: Proceedings of EMNLP (pp. 1281–1291).
  46. Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (pp. 818–833).

Copyright information

© Japanese Federation of Statistical Science Associations 2018

Authors and Affiliations

  • Wiktor Pronobis (1)
  • Danny Panknin (1)
  • Johannes Kirschnick (2)
  • Vignesh Srinivasan (3)
  • Wojciech Samek (3, 8)
  • Volker Markl (4, 8)
  • Manohar Kaul (5)
  • Klaus-Robert Müller (1, 6, 7, 8)
  • Shinichi Nakajima (1, 8, 9)

  1. Machine Learning Group, Technische Universität Berlin, Berlin, Germany
  2. Language Technology Lab, DFKI, Berlin, Germany
  3. Fraunhofer Heinrich Hertz Institute, Berlin, Germany
  4. Database Systems and Information Management Group, Technische Universität Berlin, Berlin, Germany
  5. IIT Hyderabad, Telangana, India
  6. Korea University, Seoul, South Korea
  7. Max Planck Society, Saarbrücken, Germany
  8. Berlin Big Data Center, Berlin, Germany
  9. Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan
