Sharing hash codes for multiple purposes

Abstract

Locality sensitive hashing (LSH) is a powerful tool in data science, which enables sublinear-time approximate nearest neighbor search. A variety of hashing schemes have been proposed for different dissimilarity measures. However, hash codes significantly depend on the dissimilarity, which prohibits users from adjusting the dissimilarity at query time. In this paper, we propose multiple purpose LSH (mp-LSH) which shares the hash codes for different dissimilarities. mp-LSH supports L2, cosine, and inner product dissimilarities, and their corresponding weighted sums, where the weights can be adjusted at query time. It also allows us to modify the importance of pre-defined groups of features. Thus, mp-LSH enables us, for example, to retrieve similar items to a query with the user preference taken into account, to find a similar material to a query with some properties (stability, utility, etc.) optimized, and to turn on or off a part of multi-modal information (brightness, color, audio, text, etc.) in image/video retrieval. We theoretically and empirically analyze the performance of three variants of mp-LSH, and demonstrate their usefulness on real-world data sets.

Notes

  1. This assumption is reasonable for L2-NNS if the size of the sample pool is sufficiently large, and the query follows the same distribution as the samples. For MIPS, the norm of the query can be arbitrarily modified, and we set it to \(\Vert \varvec{q}\Vert _2 = 1\).

  2. http://www.grouplens.org/.

  3. http://corpus-texmex.irisa.fr/.

  4. We computed histograms on the central crop of an image (covering 50% of the area) for each RGB color channel with 8 and 32 bins. We normalized the histograms and concatenated them.

  5. http://bbdcdemo.bbdc.tu-berlin.de/

References

  • Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One, 10(7), e0130140.

  • Bachrach, Y., Finkelstein, Y., Gilad-Bachrach, R., Katzir, L., Koenigstein, N., Nice, N., & Paquet, U. (2014). Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces. In: Proceedings of the 8th ACM conference on recommender systems (RecSys) (pp. 257–264).

  • Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., & Müller, K. R. (2010). How to explain individual classification decisions. Journal of Machine Learning Research, 11, 1803–1831.

  • Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.

  • Bengio, Y., LeCun, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436.

  • Beygelzimer, A., Kakade, S., & Langford, J. (2006). Cover trees for nearest neighbor. In: Proceedings of International Conference on Machine Learning (pp. 97–104).

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

  • Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks, 29, 1157–1166.

  • Bustos, B., Kreft, S., & Skopal, T. (2012). Adapting metric indexes for searching in multi-metric spaces. Multimedia Tools and Applications, 58(3), 467–496.

  • Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (pp. 380–388).

  • Cremonesi, P., Koren, Y., & Turrin, R. (2010). Performance of recommender algorithms on top-n recommendation tasks. In: Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys) (pp. 39–46).

  • Datar, M., Immorlica, N., Indyk, P., & Mirrokn, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry (SCG) (pp. 253–262).

  • Funk, S. (2006). Try this at home. http://sifter.org/simon/journal/20061211.html.

  • Goemans, M. X., & Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6), 1115–1145.

  • Gorisse, D., Cord, M., & Precioso, F. (2012). Locality-sensitive hashing for chi2 distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2), 402–409.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Berlin: Springer.

  • He, J., Chang, S. F., Radhakrishnan, R., & Bauer, C. (2011). Compact hashing with joint optimization of search accuracy and time. In: Proceedings of Computer Vision and Pattern Recognition (CVPR) (pp. 753–760).

  • Heinonen, J. (2001). Lectures on Analysis on Metric Spaces. Universitext. New York: Springer.

  • Hinton, G. (2007). Learning multiple layers of representation. Trends in Cognitive Sciences, 11, 428–434.

  • Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC) (pp. 604–613).

  • Jain, P., Vijayanarasimhan, S., & Grauman, K. (2010). Hashing hyperplane queries to near points with applications to large-scale active learning. In: Advances in Neural Information Processing Systems (NIPS) (Vol. 23).

  • Jégou, H., Tavenard, R., Douze, M., & Amsaleg, L. (2011). Searching in one billion vectors: re-rank with source coding. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 861–864).

  • Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS) (Vol. 25).

  • Lin, K., Yang, H. F., Hsiao, J. H., & Chen, C. S. (2015). Deep learning of binary hash codes for fast image retrieval. In: Proceedings of Computer Vision and Pattern Recognition Workshops.

  • Liu, G., Xu, H., & Yan, S. (2012). Exact subspace segmentation and outlier detection by low-rank representation. In: Proceedings of Artificial Intelligence and Statistics Conference (AISTATS).

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Matsushita, Y., & Wada, T. (2009). Principal component hashing: An accelerated approximate nearest neighbor search. In: Proceedings of Pacific-Rim Symposium on Image and Video Technology (PSIVT) (pp. 374–385).

  • Montavon, G., Lapuschkin, S., Binder, A., Samek, W., & Müller, K. R. (2017). Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65, 211–222.

  • Montavon, G., Orr, G., & Müller, K. R. (2012). Neural Networks: Tricks of the Trade. New York: Springer.

  • Montavon, G., Samek, W., & Müller, K. R. (2018). Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73, 1–15.

  • Moran, S., & Lavrenko, V. (2015). Regularized cross-modal hashing. In: Proceedings of SIGIR.

  • Neyshabur, B., & Srebro, N. (2015). On symmetric and asymmetric LSHs for inner product search. In: Proceedings of the International Conference on Machine Learning (ICML) (Vol. 32).

  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144).

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y.

  • Schütt, K., Arbabzadah, F., Chmiela, S., Müller, K. R., & Tkatchenko, A. (2017). Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8, 13890.

  • Shrivastava, A., & Li, P. (2014). Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In: Advances in Neural Information Processing Systems (NIPS) (Vol. 27).

  • Shrivastava, A., & Li, P. (2015). Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In: Proceedings of UAI.

  • Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In: ICLR Workshop 2014.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.

  • Song, J., Yang, Y., Huang, Z., Shen, H. T., & Luo, J. (2013). Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 15(8), 1997–2008.

  • Strecha, C., Bronstein, A. M., Bronstein, M. M., & Fua, P. (2012). LDA hash: Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1), 66–78.

  • Tagami, Y. (2017). AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 455–464).

  • Torralba, A., Fergus, R., & Freeman, W. (2008). 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.

  • Wang, J., Shen, H. T., Song, J., & Ji, J. (2014). Hashing for similarity search: a survey. arXiv:1408.2927v1 [cs.DS].

  • Xu, S., Wang, S., & Zhang, Y. (2013). Summarizing complex events: a cross-modal solution of storylines extraction and reconstruction. In: Proceedings of EMNLP (pp. 1281–1291).

  • Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In: Proceedings of European Conference on Computer Vision (pp. 818–833).

Acknowledgements

This work was supported by the German Research Foundation (GRK 1589/1), by the Federal Ministry of Education and Research (BMBF) under the project Berlin Big Data Center (FKZ 01IS14013A), and by the BMBF project ALICE II, Autonomous Learning in Complex Environments (01IB15001B). This work was also supported by the Fraunhofer Society under the MPI-FhG collaboration project (600393).

Author information

Corresponding author

Correspondence to Klaus-Robert Müller.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Wiktor Pronobis, Danny Panknin, and Johannes Kirschnick contributed equally.

Appendices

A Derivation of the inner product in the proof of Theorem 1

The inner product between the augmented vectors \(\widetilde{\varvec{q}}\) and \(\widetilde{\varvec{x}}\), defined in Eq. (10), is given by

$$\begin{aligned} \widetilde{\varvec{q}}^\top \widetilde{\varvec{x}}&= \sum\limits_{w=1}^W \left( \sum _{g=1}^G (\gamma _g^{(w)} + \lambda _g^{(w)})\varvec{q}_g^{(w){\top }} \varvec{x}_g - \frac{1}{2}\sum _{g=1}^G \gamma _g^{(w)} \left( \Vert \varvec{q}_g^{(w)}\Vert _2^2 + \Vert \varvec{x}_g\Vert _2^2 \right) \right) \\&= - \frac{1}{2} \sum _{w=1}^W \sum _{g=1}^G \Bigg ( -2\lambda _g^{(w)} \varvec{q}_g^{(w){\top }} \varvec{x}_g + \gamma _g^{(w)} \underbrace{\left( (\Vert \varvec{q}_g^{(w)}\Vert _2^2 + \Vert \varvec{x}_g\Vert _2^2) - 2\varvec{q}_g^{(w){\top }} \varvec{x}_g \right) }_{\Vert \varvec{q}_g^{(w)} - \varvec{x}_g \Vert _2^2} \Bigg ) \\&= \Vert \varvec{\lambda }\Vert _{{\tiny 1}} - \frac{{\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)} \}, \varvec{x})}{2}. \end{aligned}$$
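
The identity can also be checked numerically. The following minimal sketch (Python/NumPy) draws random group vectors and non-negative weights, evaluates the expanded form of \(\widetilde{\varvec{q}}^\top \widetilde{\varvec{x}}\) from the first line above, and compares it with \(\Vert \varvec{\lambda }\Vert _1 - {\mathcal {L}}_{\mathrm {mp}}/2\) for \(\varvec{\eta } = \varvec{0}\); the shapes and variable names are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
W, G, d = 2, 3, 5                      # number of query vectors, groups, dimensions per group

q = rng.normal(size=(W, G, d))         # q_g^{(w)}
x = rng.normal(size=(G, d))            # x_g
gam = rng.uniform(size=(W, G))         # gamma_g^{(w)}
lam = rng.uniform(size=(W, G))         # lambda_g^{(w)}

# Expanded inner product of the augmented vectors (first line of the derivation)
lhs = sum((gam[w, g] + lam[w, g]) * q[w, g] @ x[g]
          - 0.5 * gam[w, g] * (q[w, g] @ q[w, g] + x[g] @ x[g])
          for w in range(W) for g in range(G))

# ||lambda||_1 - L_mp / 2 with eta = 0 (cf. Appendix C for L_mp in this case)
L_mp = sum(gam[w, g] * np.sum((q[w, g] - x[g]) ** 2)
           + 2 * lam[w, g] * (1 - q[w, g] @ x[g])
           for w in range(W) for g in range(G))
rhs = lam.sum() - L_mp / 2

assert np.isclose(lhs, rhs)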

B Lemma: inner product approximation

For \(\varvec{q}, \varvec{x}\in {\mathbb {R}}^L\), let

$$\begin{aligned} d_T(\varvec{q}, \varvec{x}) = \frac{1}{T}\sum _{t=1}^{T}\left| \varvec{H}(\varvec{q})_{t1} - \widetilde{\varvec{H}}(\varvec{x})_{t1} \right| \end{aligned}$$

with expectation

$$\begin{aligned} d(\varvec{q}, \varvec{x}) = \mathbb {E} d_T(\varvec{q}, \varvec{x}) = \mathbb {E}\left| \varvec{H}(\varvec{q})_{11} - \widetilde{\varvec{H}}(\varvec{x})_{11} \right| \end{aligned}$$

and define

$$\begin{aligned} L(\varvec{q}, \varvec{x}) = 1 - \frac{\varvec{q}^{\top }\varvec{x}}{\Vert \varvec{q}\Vert _{{\tiny 2}}}. \end{aligned}$$

Lemma 1

The following statements hold:

  1. (a):

    It holds that

    $$\begin{aligned} d(\varvec{q}, \varvec{x}) = 1 - \Vert \varvec{x}\Vert _{{\tiny 2}}\left(1 - \frac{2}{\pi }\sphericalangle (\varvec{q}, \varvec{x})\right) \end{aligned}$$
  2. (b):

    For \(\mathcal {E}_{x} = 0.2105\Vert \varvec{x}\Vert _{{\tiny 2}}\), it is

    $$\begin{aligned} |L(\varvec{q}, \varvec{x}) - d(\varvec{q}, \varvec{x})| \le \mathcal {E}_{x} \end{aligned}$$
    (19)
  3. (c):

    Let \(b(\varvec{q}, \varvec{x}) = 1 - \frac{2}{\pi }\frac{\varvec{q}^{\top }\varvec{x}}{\Vert \varvec{q}\Vert _{{\tiny 2}}}\), then for \(L(\varvec{q}, \varvec{x}) \le 1\), it is

    $$\begin{aligned} L(\varvec{q}, \varvec{x}) \le d(\varvec{q}, \varvec{x}) \le b(\varvec{q}, \varvec{x}) \le 1 \end{aligned}$$

    and for \(L(\varvec{q}, \varvec{x}) \ge 1\), it is

    $$\begin{aligned} L(\varvec{q}, \varvec{x}) \ge d(\varvec{q}, \varvec{x}) \ge b(\varvec{q}, \varvec{x}) \ge 1 \end{aligned}$$
  4. (d):

    It holds that

    $$\begin{aligned} |L(\varvec{q}, \varvec{x}) - d(\varvec{q}, \varvec{x})| \le \min \{\left(1 - \frac{2}{\pi }\right)|L(\varvec{q}, \varvec{x}) - 1|, \mathcal {E}_{x}\} \end{aligned}$$

    and for \(s_{\varvec{x}} = 0.58\Vert \varvec{x}\Vert _{{\tiny 2}}\), if \(|L(\varvec{q}, \varvec{x}) - 1| \le s_{\varvec{x}}\), it is

    $$\begin{aligned} (1 - \frac{2}{\pi })|L(\varvec{q}, \varvec{x}) - 1| \le \mathcal {E}_{x}. \end{aligned}$$

Proof (a):

Defining \(p_{col} = 1 - \frac{1}{\pi }\sphericalangle (\varvec{q},\varvec{x})\), we have

$$\begin{aligned} \mathbb {E} \left| \varvec{H}(\varvec{q})_{11} - \widetilde{\varvec{H}}(\varvec{x})_{11} \right|&= \big (1 - \Vert \varvec{x}\Vert _{{\tiny 2}}\big )p_{col} + \big (1 + \Vert \varvec{x}\Vert _{{\tiny 2}}\big )\big (1 - p_{col}\big )\\&= 1 - \Vert \varvec{x}\Vert _{{\tiny 2}}\big (2p_{col}-1\big ) = 1 - \Vert \varvec{x}\Vert _{{\tiny 2}}\left(1 - \frac{2}{\pi }\sphericalangle (\varvec{q}, \varvec{x})\right). \end{aligned}$$

Proof (b):

$$\begin{aligned} |L(\varvec{q}, \varvec{x}) - d(\varvec{q}, \varvec{x})|&= \Vert \varvec{x}\Vert _{{\tiny 2}}\left| \frac{\varvec{q}^{\top }\varvec{x}}{\Vert \varvec{q}\Vert _{{\tiny 2}}\Vert \varvec{x}\Vert _{{\tiny 2}}} - 1 + \frac{2}{\pi }\sphericalangle (\varvec{q}, \varvec{x})\right|\\&\le \Vert \varvec{x}\Vert _{{\tiny 2}} \max _{z \in [-1,1]} \left| z - 1 + \frac{2}{\pi }\arccos (z) \right|. \end{aligned}$$

For \(z^* = \sqrt{1 - \frac{4}{\pi ^2}}\), we obtain the maximum

$$\begin{aligned} \mathcal {E}_{x} = \Vert \varvec{x}\Vert _{{\tiny 2}}\left| z^* - 1 + \frac{2}{\pi }\arccos (z^*)\right| \approx 0.2105\Vert \varvec{x}\Vert _{{\tiny 2}}. \end{aligned}$$

Proof (c):

We treat the case \(L(\varvec{q}, \varvec{x}) \le 1\), noting that the other case is analogous due to symmetry. Observe that \(\frac{\varvec{q}^{\top }\varvec{x}}{\Vert \varvec{q}\Vert _{{\tiny 2}}} \ge 0\), providing

$$\begin{aligned} b(\varvec{q}, \varvec{x}) = 1 - \frac{2}{\pi }\frac{\varvec{q}^{\top }\varvec{x}}{\Vert \varvec{q}\Vert _{{\tiny 2}}} \le 1. \end{aligned}$$

As \(\arccos\) is a concave function on [0, 1], it is

$$\begin{aligned} \arccos (z)&= \arccos (0(1-z) + 1(z)) \\&\ge (1-z)\arccos (0) + z\arccos (1) = \frac{\pi }{2}(1-z). \end{aligned}$$

Define \(z = \frac{\varvec{q}^{\top }\varvec{x}}{\Vert \varvec{q}\Vert _{{\tiny 2}}\Vert \varvec{x}\Vert _{{\tiny 2}}}\). Then, we have

$$\begin{aligned} d(\varvec{q}, \varvec{x}) - L(\varvec{q}, \varvec{x}) = \Vert \varvec{x}\Vert _{{\tiny 2}}\left( z - 1 + \frac{2}{\pi }\arccos (z)\right) \ge 0, \end{aligned}$$

from which \(L(\varvec{q}, \varvec{x}) \le d(\varvec{q}, \varvec{x})\) follows. Noting that

$$\begin{aligned} \max _{z\in [0,1]} \frac{d \arccos }{d z}(z) = \max _{z\in [0,1]} \frac{-1}{\sqrt{1 - z^2}} = -1, \end{aligned}$$

and \(\arccos (0) = \frac{\pi }{2}\), it is

$$\begin{aligned} \arccos (z) - \arccos (0) = \int _0^z \frac{d \arccos }{d z}(t)\, dt \le -\int _0^z dt = -z, \end{aligned}$$

such that

$$\begin{aligned} \arccos (z) \le \frac{\pi }{2} - z. \end{aligned}$$

Therefore, it is

$$\begin{aligned} b(\varvec{q}, \varvec{x}) - d(\varvec{q}, \varvec{x}) = \Vert \varvec{x}\Vert _{{\tiny 2}}\left( 1 - \frac{2}{\pi }z - \frac{2}{\pi }\arccos (z)\right) \ge 0 \end{aligned}$$

assuring \(d(\varvec{q}, \varvec{x}) \le b(\varvec{q}, \varvec{x})\).

Proof (d):

The inequality follows from (b) and (c). Letting

$$\begin{aligned} s_{\varvec{x}} = \frac{\mathcal {E}_{x}}{1 - \frac{2}{\pi }} \approx 0.58\Vert \varvec{x}\Vert _{{\tiny 2}}, \end{aligned}$$

the first bound is tighter than \(\mathcal {E}_{x}\), if \(|L(\varvec{q}, \varvec{x}) - 1| \le s_{\varvec{x}}\). \(\square\)

Note that \(d_T(\varvec{q}, \varvec{x}) \rightarrow d(\varvec{q}, \varvec{x})\) as \(T\rightarrow \infty\). Therefore, all statements are also valid, replacing \(d(\varvec{q}, \varvec{x})\) by \(d_T(\varvec{q}, \varvec{x})\) with T large enough.
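
Statement (a) can also be verified by simulation. The sketch below assumes a sign-random-projection form of the codes that is consistent with the collision probability \(p_{col}\) used in the proof, namely \(\varvec{H}(\varvec{q})_{t1} = \mathrm{sign}(\varvec{a}_t^\top \varvec{q})\) and \(\widetilde{\varvec{H}}(\varvec{x})_{t1} = \Vert \varvec{x}\Vert _2\, \mathrm{sign}(\varvec{a}_t^\top \varvec{x})\) with \(\varvec{a}_t \sim \mathcal {N}(\varvec{0}, \varvec{I})\); these concrete definitions are an assumption for illustration, not a restatement of Eq. (10).

import numpy as np

rng = np.random.default_rng(1)
L, T = 8, 200_000

q = rng.normal(size=L); q /= np.linalg.norm(q)          # query with ||q||_2 = 1
x = rng.normal(size=L); x *= 0.7 / np.linalg.norm(x)    # sample with ||x||_2 = 0.7

A = rng.normal(size=(T, L))                             # projection directions a_t
H_q = np.sign(A @ q)                                    # assumed H(q)_{t1}
H_x = np.linalg.norm(x) * np.sign(A @ x)                # assumed H~(x)_{t1}

d_T = np.mean(np.abs(H_q - H_x))                        # empirical d_T(q, x)

angle = np.arccos(q @ x / (np.linalg.norm(q) * np.linalg.norm(x)))
d_exact = 1 - np.linalg.norm(x) * (1 - 2 / np.pi * angle)   # statement (a)

print(d_T, d_exact)   # the two values agree up to Monte Carlo error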

C Proof of Theorem 2

For \(\varvec{\eta }^{(w)} = \varvec{0}\) for \(w = 1, \ldots , W\), we have

$$\begin{aligned} {\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) = \sum _{w=1}^W \sum _{g=1}^G \gamma _g^{(w)} \Vert \varvec{q}_g^{(w)} - \varvec{x}_g\Vert _{{\tiny 2}}^2 + 2\lambda _g^{(w)} \left( 1 - \varvec{q}_g^{(w) {\top }} \varvec{x}_g\right) . \end{aligned}$$

Recall that \(\overline{\varvec{q}}_g^{\mathrm {L2+ip}} = \sum _{w=1}^W (\gamma _g^{(w)} + \lambda _g^{(w)}) \varvec{q}_g^{(w)}\). Therefore

$$\begin{aligned}&\frac{1}{T}\mathcal {D}_{\mathrm {CAT}} \Big (\varvec{H}^{\mathrm {CAT-q}}(\{\varvec{q}^{(w)}\}), \varvec{H}^{\mathrm {CAT-x}}(\varvec{x}) \Big ) \\&\qquad \qquad \qquad = \sum _{g=1}^G\bigg ( \frac{\overline{\gamma }_g}{2}\Vert \varvec{x}_g\Vert _{{\tiny 2}}^2 + \Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}\Big (1 + \Vert \varvec{x}_g\Vert _{{\tiny 2}}\big (1 - \frac{2}{T}\mathcal {C}_g(\overline{\varvec{q}}^{\mathrm {L2+ip}}, \varvec{x})\big )\Big ) \bigg ). \end{aligned}$$

We use that

$$\begin{aligned}&1 - \frac{2}{T}\mathcal {C}_g(\overline{\varvec{q}}^{\mathrm {L2+ip}}, \varvec{x}) = -1 + \frac{1}{T} \sum _{t=1}^{T} \left| \varvec{H}\left( \varvec{x}\right) _{tg} - \varvec{H}(\overline{\varvec{q}}^{\mathrm {L2+ip}})_{tg} \right| = -1 + d_T\left( \overline{\varvec{q}}_g^{\mathrm {L2+ip}}, \frac{\varvec{x}_g}{\Vert \varvec{x}_g\Vert _{{\tiny 2}}}\right) \\&\overset{T\rightarrow \infty }{\rightarrow } -1 + d\left( \overline{\varvec{q}}_g^{\mathrm {L2+ip}}, \frac{\varvec{x}_g}{\Vert \varvec{x}_g\Vert _{{\tiny 2}}}\right) \overset{(19)}{=} -1 + L\left( \overline{\varvec{q}}_g^{\mathrm {L2+ip}}, \frac{\varvec{x}_g}{\Vert \varvec{x}_g\Vert _{{\tiny 2}}}\right) + e_g = - \frac{\varvec{x}_g^{\top }\overline{\varvec{q}}_g^{\mathrm {L2+ip}}}{\Vert \varvec{x}_g\Vert _{{\tiny 2}}\Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}} + e_g,\\ \end{aligned}$$

where \(|e_g| \le \mathcal {E}_{1}\), such that

$$\begin{aligned} \frac{1}{T}\mathcal {D}_{\mathrm {CAT}}&\Big (\varvec{H}^{\mathrm {CAT-q}}(\{\varvec{q}^{(w)}\}), \varvec{H}^{\mathrm {CAT-x}}(\varvec{x}) \Big )\\&= \sum _{g=1}^G\bigg ( \frac{\overline{\gamma }_g}{2}\Vert \varvec{x}_g\Vert _{{\tiny 2}}^2 + \Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}\Big (1 - \frac{\varvec{x}_g^{\top }\overline{\varvec{q}}_g^{\mathrm {L2+ip}}}{\Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}} + \Vert \varvec{x}_g\Vert _{{\tiny 2}}e_g\Big )\bigg )\\&= \sum _{g=1}^G\bigg ( \frac{\overline{\gamma }_g}{2}\Vert \varvec{x}_g\Vert _{{\tiny 2}}^2 - \varvec{x}_g^{\top }\overline{\varvec{q}}_g^{\mathrm {L2+ip}} \bigg ) + \sum _{g=1}^G \Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}} + \underbrace{\sum _{g=1}^G\Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}\Vert \varvec{x}_g\Vert _{{\tiny 2}}e_g}_{\text {error}}\\&= \frac{1}{2}\sum _{g=1}^G \sum _{w=1}^W\left[ \gamma _g^{(w)} \Vert \varvec{q}_g^{(w)} - \varvec{x}_g\Vert _{{\tiny 2}}^2 + 2\lambda _g^{(w)} (1 - \varvec{x}_g^{\top }\varvec{q}_g^{(w)})\right] \\&\qquad \qquad \qquad + \underbrace{\sum _{g=1}^G\left( \Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}} - \frac{1}{2}\sum _{w=1}^W \left( \gamma _g^{(w)}\Vert \varvec{q}_g^{(w)}\Vert _{{\tiny 2}}^2 + 2\lambda _g^{(w)}\right) \right) }_{\text {const}} + \text { error}\\&= \frac{1}{2}{\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) + \text { const } + \text {error}. \end{aligned}$$

We can bound the error-term by

$$\begin{aligned} |\text {error}|&\le \max _{g \in \{1,\ldots ,G\}} |e_g| \sum _{g=1}^G\Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}\Vert \varvec{x}_g\Vert _{{\tiny 2}}\\&\le \mathcal {E}_{1}\left\| \left( \Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}\right) _g\right\| _{{\tiny 2}}\Vert \varvec{x}\Vert _{{\tiny 2}} \le \mathcal {E}_{1}\left\| \left( \Vert \overline{\varvec{q}}_g^{\mathrm {L2+ip}}\Vert _{{\tiny 2}}\right) _g\right\| _{{\tiny 1}}\\&\le \mathcal {E}_{1}\sum _{g=1}^G\sum _{w=1}^W (\gamma _g^{(w)} + \lambda _g^{(w)})\Vert \varvec{q}_g^{(w)}\Vert _{{\tiny 2}} \le \mathcal {E}_{1}\left( \Vert \varvec{\lambda }\Vert _{{\tiny 1}} + \Vert \varvec{\gamma }\Vert _{{\tiny 1}}\right) . \end{aligned}$$

\(\square\)

D Proof of Theorem 3

For \(\varvec{\gamma }^{(w)} = \varvec{\lambda }^{(w)} = \varvec{0}\) for \(w = 1, \ldots , W\), we have

$$\begin{aligned} {\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) = 2\sum _{w=1}^W\sum _{g=1}^G \eta _g^{(w)} \left( 1 - \frac{\varvec{q}_g^{(w) {\top }} \varvec{x}_g}{\Vert \varvec{q}_g^{(w)}\Vert _{{\tiny 2}}\Vert \varvec{x}_g\Vert _{{\tiny 2}}}\right) . \end{aligned}$$

Recall that \(\overline{\varvec{q}}_g^{\mathrm {cos}} = \sum _{w=1}^W \eta _g^{(w)} \frac{\varvec{q}_g^{(w)}}{\Vert \varvec{q}_g^{(w)}\Vert _{{\tiny 2}}}\). Therefore

$$\begin{aligned} \frac{1}{T}\mathcal {D}_{\mathrm {CAT}}&\left(\varvec{H}^{\mathrm {CAT-q}}(\{\varvec{q}^{(w)}\}), \varvec{H}^{\mathrm {CAT-x}}(\varvec{x}) \right)\\&= \sum _{g=1}^G 2\Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}}\left(1 - \frac{1}{T}\mathcal {C}_g(\overline{\varvec{q}}^{\mathrm {cos}}, \varvec{x})\right)\\&\overset{(19)}{\rightarrow } \sum _{g=1}^G \Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}}\left(1 - \frac{\varvec{x}_g^{\top }\overline{\varvec{q}}_g^{\mathrm {cos}}}{\Vert \varvec{x}_g\Vert _{{\tiny 2}}\Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}}} + e_g\right)\\&= - \sum _{g=1}^G\sum _{w=1}^W \eta _g^{(w)} \frac{\varvec{x}_g^{\top }\varvec{q}_g^{(w)}}{\Vert \varvec{x}_g\Vert _{{\tiny 2}}\Vert \varvec{q}_g^{(w)}\Vert _{{\tiny 2}}} +\sum _{g=1}^G \Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}} + \underbrace{\sum _{g=1}^Ge_g\Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}}}_{\text {error}}\\&= \sum _{g=1}^G\sum _{w=1}^W \eta _g^{(w)} \left( 1 - \frac{\varvec{x}_g^{\top }\varvec{q}_g^{(w)}}{\Vert \varvec{x}_g\Vert _{{\tiny 2}}\Vert \varvec{q}_g^{(w)}\Vert _{{\tiny 2}}}\right) +\underbrace{\sum _{g=1}^G \left( \Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}} - \sum _{w=1}^W \eta _g^{(w)}\right) }_{\text {const}} + \text { error}\\&= \frac{1}{2}{\mathcal {L}}_{\mathrm {mp}}(\{\varvec{q}^{(w)}\}, \varvec{x}) + \text {const} + \text {error}, \end{aligned}$$

where

$$\begin{aligned} |\text {error}|&\le \max _{g \in \{1,\ldots ,G\}} |e_g| \sum _{g=1}^G\Vert \overline{\varvec{q}}_g^{\mathrm {cos}}\Vert _{{\tiny 2}}\\&\le \mathcal {E}_{1}\sum _{g=1}^G\sum _{w=1}^W \eta _g^{(w)}\left\| \left. \varvec{q}_g^{(w)} \bigg / \Vert \varvec{q}_g^{(w)}\Vert _{{\tiny 2}}\right. \right\| _{{\tiny 2}} = \mathcal {E}_{1}\Vert \varvec{\eta }\Vert _{{\tiny 1}}. \end{aligned}$$

\(\square\)

E Proof of Theorem 4

Without loss of generality we prove the theorem for the plain MIPS case with \(G = 1\), \(W = 1\) and \(\lambda = 1\). Then \(\alpha = 1\) and the measure simplifies to

$$\begin{aligned} \mathcal {D}_{\mathrm {CAT}} \Big (\varvec{H}^{\mathrm {CAT-q}}(\{\varvec{q}^{(w)}\}), \varvec{H}^{\mathrm {CAT-x}}(\varvec{x}) \Big ) = Td_T(\varvec{q}^{\mathrm {ip}},\varvec{x}). \end{aligned}$$

For \(\mathcal {C}_1(\varvec{q}^{\mathrm {ip}}, \varvec{x})\) with \(\mu = \mathbb {E}\mathcal {C}_1(\varvec{q}^{\mathrm {ip}}, \varvec{x}) = T(1 - \frac{1}{\pi }\sphericalangle (\varvec{x}, \varvec{q}^{\mathrm {ip}}))\) and \(0< \delta _1 < 1\), \(\delta _2 > 0\), we use the following Chernoff bounds:

$$\begin{aligned} \mathbb {P}\big (\mathcal {C}_1(\varvec{q}^{\mathrm {ip}}, \varvec{x}) \le (1-\delta _1)\mu \big )&\le \exp \left\{ -\frac{\mu }{2}\delta _1^2\right\} \end{aligned}$$
(20)
$$\begin{aligned} \mathbb {P}\big (\mathcal {C}_1(\varvec{q}^{\mathrm {ip}}, \varvec{x}) \ge (1+\delta _2)\mu \big )&\le \exp \left\{ -\frac{\mu }{3}\min \{\delta _2, \delta _2^2\}\right\} . \end{aligned}$$
(21)

The approximate nearest neighbor problem with \(r > 0\) and \(c > 1\) is defined as follows: if there exists an \(\varvec{x}^*\) with \({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}^*) \le r\), then we return an \(\widetilde{\varvec{x}}\) with \({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}}, \widetilde{\varvec{x}}) < cr\). For \(cr > r + \mathcal {E}_{1}\), we can set T logarithmically dependent on the dataset size to solve the approximate nearest neighbor problem for \({\mathcal {L}}_{\mathrm {ip}}\), using \(d_T\) with constant success probability: for this, we require a viable t that fulfills

$$\begin{aligned}&{\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x})> cr \Rightarrow d(\varvec{q}^{\mathrm {ip}},\varvec{x}) > t\text { and}\\&{\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le r \Rightarrow d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le t. \end{aligned}$$

Namely, set \(t = \frac{t_1 + t_2}{2}\), where

$$\begin{aligned}&t_1 = {\left\{ \begin{array}{ll} r + \mathcal {E}_{1},&{}r\le 1-s_{1}\\ 1 - \frac{2(1-r)}{\pi },&{} r\in (1-s_{1},1)\\ r,&{}r\ge 1 \end{array}\right. }\\ \text {and } &t_2 = {\left\{ \begin{array}{ll} cr,&{}cr\le 1\\ 1 + \frac{2(cr-1)}{\pi },&{} cr\in (1,1+s_{1})\\ cr - \mathcal {E}_{1},&{}cr\ge 1 + s_{1} \end{array}\right. }. \end{aligned}$$

In any case, it is \(t_2 > t_1\):

First, note that \(t_1\) and \(t_2\) are strictly monotone increasing in r and cr, respectively. It therefore suffices to show \(\underline{t_2} \ge t_1\) for the lower bound \(\underline{t_2}\) based on \(\underline{cr} = r + \mathcal {E}_{1}\).

(Case \(r\le 1-s_{1}\)): it is \(t_1 = r + \mathcal {E}_{1}\) and \(\underline{t_2} = \underline{cr}\), where

$$\begin{aligned} t_1 = r + \mathcal {E}_{1} = \underline{cr} = \underline{t_2} \end{aligned}$$

(Case \(r\in (1-s_{1},1-\mathcal {E}_{1}]\)): it is \(t_1 = 1 - \frac{2}{\pi }(1-r)\) and \(\underline{t_2} = \underline{cr}\), such that

$$\begin{aligned}&t_1 = 1 - \frac{2}{\pi }(1-r) \le r + \mathcal {E}_{1} = \underline{cr} = \underline{t_2}\\ \Leftrightarrow&\left(1 - \frac{2}{\pi }\right)(1-r) \le \mathcal {E}_{1} \Leftrightarrow (1-r) \le s_1 \Leftrightarrow r \ge 1 - s_1 \end{aligned}$$

(Case \(r\in (1-\mathcal {E}_{1},1]\)): it is \(t_1 = 1 - \frac{2}{\pi }(1-r)\) and \(\underline{t_2} = 1 + \frac{2}{\pi }(\underline{cr} - 1)\) with \(\underline{cr} > 1\), such that

$$\begin{aligned} t_1 = 1 - \frac{2}{\pi }(1-r) \le 1 \le 1 + \frac{2}{\pi }(\underline{cr} - 1) = \underline{t_2} \end{aligned}$$

(Case \(r\in (1,1+s_1-\mathcal {E}_{1}]\)): It is \(t_1 = r\) and \(\underline{t_2} = 1 + \frac{2}{\pi }(\underline{cr} - 1)\) such that

$$\begin{aligned}&t_1 = r \le 1 + \frac{2}{\pi }(r + \mathcal {E}_{1} - 1) = 1 + \frac{2}{\pi }(\underline{cr} - 1) = \underline{t_2}\\ \Leftrightarrow&\left(1-\frac{2}{\pi }\right)r \le \left(1-\frac{2}{\pi }\right) - \left(1-\frac{2}{\pi }\right)\mathcal {E}_{1} + \mathcal {E}_{1}\\ \Leftrightarrow&r \le 1 + s_1 - \mathcal {E}_{1} \end{aligned}$$

(Case \(r > 1+s_1-\mathcal {E}_{1}\)): it is \(t_1 = r\) and \(\underline{t_2} = \underline{cr} - \mathcal {E}_{1}\), where

$$\begin{aligned} t_1 = r = \underline{cr} - \mathcal {E}_{1} = \underline{t_2} \end{aligned}$$

\(\square\)
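
Before continuing, the case analysis above can be sanity-checked numerically. The following sketch computes \(t_1\) and \(t_2\) with the constants of Lemma 1 for \(\Vert \varvec{x}\Vert _2 = 1\) (evaluated exactly rather than rounded) and confirms \(t_2 > t_1\) on a grid of \(r\) and \(cr > r + \mathcal {E}_{1}\); it is an illustration only, not part of the proof.

import numpy as np

# Constants of Lemma 1 for ||x||_2 = 1, computed exactly
z_star = np.sqrt(1 - 4 / np.pi**2)
E1 = abs(z_star - 1 + 2 / np.pi * np.arccos(z_star))   # ~ 0.2105
s1 = E1 / (1 - 2 / np.pi)                              # ~ 0.58

def t1(r):
    if r <= 1 - s1:
        return r + E1
    if r < 1:
        return 1 - 2 * (1 - r) / np.pi
    return r

def t2(cr):
    if cr <= 1:
        return cr
    if cr < 1 + s1:
        return 1 + 2 * (cr - 1) / np.pi
    return cr - E1

# t_2 > t_1 whenever cr > r + E_1, as claimed
for r in np.linspace(0.0, 3.0, 301):
    for cr in np.linspace(r + E1 + 1e-6, r + 2.0, 50):
        assert t2(cr) > t1(r), (r, cr)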

Now, define

$$\begin{aligned} \delta = \left| \frac{t - d(\varvec{q}^{\mathrm {ip}},\varvec{x})}{1+\Vert \varvec{x}\Vert _{{\tiny 2}}-d(\varvec{q}^{\mathrm {ip}},\varvec{x})}\right| = \left| T\frac{t - d(\varvec{q}^{\mathrm {ip}},\varvec{x})}{2\Vert \varvec{x}\Vert _{{\tiny 2}}\mu }\right| . \end{aligned}$$

For \({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le r\), we can lower bound the probability of \(d_T(\varvec{q}^{\mathrm {ip}},\varvec{x})\) not exceeding the specified threshold:

$$\begin{aligned} \mathbb {P}\big (d_T(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le t \big )&= \mathbb {P}\big ( \mathcal {C}(\varvec{q}^{\mathrm {ip}}, \varvec{x}) \ge (1-\delta )\mu \big )\\&= 1 - \mathbb {P}\big ( \mathcal {C}(\varvec{q}^{\mathrm {ip}}, \varvec{x}) \le (1-\delta )\mu \big ) \\&\overset{(20)}{\ge } 1 - \exp \left\{ -\frac{\mu }{2}\delta ^2\right\} . \end{aligned}$$

We can show \(d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le t_1\), using Lemma 1, (c) and (d):

(Case \(r\le 1-s_{1}\)):

$$\begin{aligned} d(\varvec{q}^{\mathrm {ip}},\varvec{x}) - {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le \mathcal {E}_{1} \Rightarrow d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le r + \mathcal {E}_{1} \end{aligned}$$

(Case \(r \in (1-s_{1},1)\)):

$$\begin{aligned}&d(\varvec{q}^{\mathrm {ip}},\varvec{x}) - {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le \left(1-\frac{2}{\pi }\right)(1 - {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}))\\ \Rightarrow \,&d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le 1 - \frac{2}{\pi }(1 - {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x})) \le 1 - \frac{2}{\pi }(1 - r) = t_1 \end{aligned}$$

(Case \(r \ge 1\)): For \({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le 1\) it is \(d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le 1\). Else \(d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x})\), such that

$$\begin{aligned} d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le \max \{1, {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x})\} \le r = t_1 \end{aligned}$$

Thus, we can bound

$$\begin{aligned} \delta \overset{d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le t_1 < t}{\ge } \frac{T(t - t_1)}{2\Vert \varvec{x}\Vert _{{\tiny 2}}\mu } \overset{\Vert \varvec{x}\Vert _{{\tiny 2}} \le 1}{\ge } \frac{T(t - t_1)}{2\mu } = \frac{T(t_2-t_1)}{4\mu } \end{aligned}$$

and

$$\begin{aligned} \delta ^2\mu \ge \frac{T^2(t_2-t_1)^2}{16\mu } \overset{\mu \le T}{\ge } \frac{T(t_2-t_1)^2}{16}, \end{aligned}$$

such that

$$\begin{aligned} \mathbb {P}\big ( d_T(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le t \big ) \ge 1 - \exp \left\{ -\frac{(t_2-t_1)^2}{32}T\right\} . \end{aligned}$$

For \({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) > cr\), we can upper bound the probability of \(d_T(\varvec{q}^{\mathrm {ip}},\varvec{x})\) dropping below the specified threshold:

$$\begin{aligned} \mathbb {P}\big ( d_T(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le t \big ) = \mathbb {P}\big ( \mathcal {C}(\varvec{q}^{\mathrm {ip}}, \varvec{x}) \ge (1+\delta )\mu \big )\\ \overset{(21)}{\le } \exp \left\{ -\frac{\mu }{3}\min \{\delta , \delta ^2\}\right\} . \end{aligned}$$

We can show \(d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \ge t_2\), using Lemma 1, (c) and (d):

(Case \(cr\le 1\)): for \({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) \ge 1\) it is \(d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \ge 1\). Else \(d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \ge {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x})\), such that

$$\begin{aligned} d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \ge \min \{1, {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x})\} \ge cr = t_2 \end{aligned}$$

(Case \(cr\in (1, 1+s_1)\)):

$$\begin{aligned}&{\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) - d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le \left(1-\frac{2}{\pi }\right)({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) - 1)\\ \Rightarrow \,&d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \ge 1 + \frac{2}{\pi }({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) - 1) \ge 1 + \frac{2}{\pi }(cr - 1) = t_2 \end{aligned}$$

(Case \(cr \ge 1+s_1\)):

$$\begin{aligned} {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}},\varvec{x}) - d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le \mathcal {E}_{1} \Rightarrow d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \ge cr - \mathcal {E}_{1} = t_2. \end{aligned}$$

Thus, we can bound

$$\begin{aligned} \delta \overset{d(\varvec{q}^{\mathrm {ip}},\varvec{x}) \ge t_2 > t}{\ge } \frac{T(t_2 - t)}{2\Vert \varvec{x}\Vert _{{\tiny 2}}\mu } \overset{\Vert \varvec{x}\Vert _{{\tiny 2}} \le 1}{\ge } \frac{T(t_2 - t)}{2\mu } = \frac{T(t_2 - t_1)}{4\mu }, \end{aligned}$$

such that

$$\begin{aligned} \mathbb {P}\big ( d_T(\varvec{q}^{\mathrm {ip}},\varvec{x}) \le t \big )&\le \exp \left\{ -\min \left\{ \frac{T(t_2 - t_1)}{12}, \frac{T^2(t_2 - t_1)^2}{48\mu }\right\} \right\} \\&\overset{\mu \le T}{\le } \exp \left\{ -\min \left\{ \frac{T(t_2 - t_1)}{12}, \frac{T(t_2 - t_1)^2}{48}\right\} \right\} \\&= \exp \left\{ -\frac{T}{3}\min \left\{ \frac{t_2 - t_1}{4}, \left( \frac{t_2 - t_1}{4}\right) ^2\right\} \right\} \\&\overset{\frac{t_2 - t_1}{4} < 1}{=} \exp \left\{ -\frac{T}{3}\left( \frac{t_2 - t_1}{4}\right) ^2\right\} = \exp \left\{ -\frac{(t_2 - t_1)^2}{48}T\right\} . \end{aligned}$$

Now, define the events

$$\begin{aligned}&E_1(\varvec{q}^{\mathrm {ip}},\varvec{x}): \text { either } {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}}, \varvec{x}) > r \text { or } d_T(\varvec{q}^{\mathrm {ip}}, \varvec{x}) \le t,\end{aligned}$$
(22)
$$\begin{aligned}&E_2(\varvec{q}^{\mathrm {ip}}):\forall \varvec{x}\in X:{\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}}, \varvec{x})> cr \Rightarrow d_T(\varvec{q}^{\mathrm {ip}}, \varvec{x}) > t. \end{aligned}$$
(23)

Assume that there exists \(\varvec{x}^*\) with \({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}}, \varvec{x}^*) \le r\). Then the algorithm is successful if both \(E_1(\varvec{q}^{\mathrm {ip}}, \varvec{x}^*)\) and \(E_2(\varvec{q}^{\mathrm {ip}})\) hold simultaneously. Let \(T \ge \frac{48}{(t_2-t_1)^2}\log (\frac{n}{\varepsilon })\). It is

$$\begin{aligned} \mathbb {P}\big (E_2(\varvec{q}^{\mathrm {ip}})\big )&= 1 - \mathbb {P}\big (\exists \varvec{x}\in X: {\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}}, \varvec{x}) > cr, d_T(\varvec{q}^{\mathrm {ip}}, \varvec{x}) \le t\big )\\&\ge 1 - \sum _{\varvec{x}\in X} \mathbb {P}\big ({\mathcal {L}}_{\mathrm {ip}}(\varvec{q}^{\mathrm {ip}}, \varvec{x}) > cr, d_T(\varvec{q}^{\mathrm {ip}}, \varvec{x}) \le t\big )\\&\ge 1 - n\exp \left\{ -\frac{(t_2-t_1)^2}{48}T\right\} \ge 1 - \varepsilon . \end{aligned}$$

In addition, it holds that

$$\begin{aligned} \mathbb {P}\big (E_1(\varvec{q}^{\mathrm {ip}}, \varvec{x}^*)\big ) \ge 1 - \left( \frac{\varepsilon }{n}\right) ^\frac{3}{2}. \end{aligned}$$

Therefore, the probability that the algorithm performs approximate nearest neighbor search correctly is at least

$$\begin{aligned} \mathbb {P}\big (E_2(\varvec{q}^{\mathrm {ip}}), E_1(\varvec{q}^{\mathrm {ip}}, \varvec{x}^*)\big )&\ge 1 - \mathbb {P}\big (\lnot E_2(\varvec{q}^{\mathrm {ip}})\big ) - \mathbb {P}\big (\lnot E_1(\varvec{q}^{\mathrm {ip}}, \varvec{x}^*)\big ) \ge 1 - \varepsilon - \left( \frac{\varepsilon }{n}\right) ^\frac{3}{2}. \end{aligned}$$

F Details of cover tree

Here, we detail how to selectively explore the hash buckets with the code dissimilarity measure in non-decreasing order. The difficulty is that the dissimilarity \(\mathcal {D}\) is a linear combination of metrics whose weights are selected at query time. Such a metric is referred to as a dynamic metric function or a multi-metric (Bustos et al. 2012). We use a tree data structure, called the cover tree (Beygelzimer et al. 2006), to index the metric space.

We begin the description of the cover tree by introducing the expansion constant and the base of the expansion constant.

Expansion constant (\(\kappa\)) (Heinonen 2001): the smallest value \(\kappa \ge \psi\) such that every ball in the dataset \({\mathcal {X}}\) can be covered by \(\kappa\) balls in \({\mathcal {X}}\) of \(1/\psi\) times the radius. Here, \(\psi\) is the base of the expansion constant.

Data structure: Given a set of data points \({\mathcal {X}}\), the cover tree \(\mathcal {T}\) is a leveled tree in which each level is associated with an integer label \(i\) that decreases as the tree is descended. For ease of explanation, let \(B_{\psi ^i}(\varvec{x})\) denote a closed ball centered at point \(\varvec{x}\) with radius \(\psi ^i\), i.e., \(B_{\psi ^i}(\varvec{x}) = \{ p \in {\mathcal {X}}: \mathcal {D}(p,\varvec{x}) \le \psi ^i\}\). At every level \(i\) of \(\mathcal {T}\) (except the root), we create a union of possibly overlapping closed balls with radius \(\psi ^i\) that covers (or contains) all the data points in \({\mathcal {X}}\). The centers of this covering set of balls are stored in nodes at level \(i\) of \(\mathcal {T}\). Let \(\mathcal {C}_i\) denote the set of nodes at level \(i\). The cover tree \(\mathcal {T}\) obeys the following three invariants at all levels:

  1. (Nesting) \(\mathcal {C}_i\subset \mathcal {C}_{i-1}\). Once a point \(\varvec{x}\in {\mathcal {X}}\) is in a node in \(\mathcal {C}_i\), then it also appears in all its successor nodes.

  2. (Covering) For every \(\varvec{x}' \in \mathcal {C}_{i-1}\), there exists an \(\varvec{x}\in \mathcal {C}_{i}\) such that \(\varvec{x}'\) lies inside \(B_{\psi ^i}(\varvec{x})\), and exactly one such \(\varvec{x}\) is a parent of \(\varvec{x}'\).

  3. (Separation) For all \(\varvec{x}_1,\varvec{x}_2 \in \mathcal {C}_{i}\), \(\varvec{x}_1\) lies outside \(B_{\psi ^i}(\varvec{x}_2)\) and \(\varvec{x}_2\) lies outside \(B_{\psi ^i}(\varvec{x}_1)\).

This structure has a space bound of O(N), where N is the number of samples.

Construction: We use the batch construction method (Beygelzimer et al. 2006), where the cover tree \(\mathcal {T}\) is built in a top–down fashion. Initially, we pick a data point \(\varvec{x}^{(0)}\) and an integer s, such that the closed ball \(B_{\psi ^{s}} (\varvec{x}^{(0)})\) is the tightest fit that covers the entire dataset \({\mathcal {X}}\).

This point \(\varvec{x}^{(0)}\) is placed in a single node, called the root of the tree \(\mathcal {T}\). We denote the root node as \(\mathcal {C}_{i}\) (where \(i= s\)). To generate the set \(\mathcal {C}_{i-1}\) of child nodes for \(\mathcal {C}_i\), we greedily pick a set of points (including point \(\varvec{x}^{(0)}\) from \(\mathcal {C}_i\) to satisfy the Nesting invariant) and generate closed balls of radius \(\psi ^{i-1}\) centered on them, in such a way that: (a) all center points lie inside \(B_{\psi ^i}(\varvec{x}^{(0)})\) (Covering invariant), (b) no center point lies inside another ball of radius \(\psi ^{i-1}\) at level \(i-1\) (Separation invariant), and (c) the union of these closed balls covers the entire dataset \({\mathcal {X}}\). These chosen center points form the set of nodes \(\mathcal {C}_{i-1}\). Child nodes are recursively generated from each node in \(\mathcal {C}_{i-1}\), until each data point in \({\mathcal {X}}\) is the center of a closed ball and resides in a leaf node of \(\mathcal {T}\).

Note that, while we construct our cover tree, we use our distance function \(\mathcal {D}\) with all the weights set to 1.0, which upper bounds all subsequent distance metrics that depend on the queries. The construction time complexity is \(O(\kappa ^{12} N \ln N)\).
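
For concreteness, the following is a minimal sketch of such a top–down batch construction, assuming distinct points and storing only element IDs as described below; it illustrates the three invariants rather than the optimized algorithm of Beygelzimer et al. (2006), and the class and function names are illustrative only.

import math

class Node:
    """A cover tree node: only the element ID is stored, not the original vector."""
    def __init__(self, point_id, level):
        self.point_id = point_id
        self.level = level          # the ball of radius psi**level around this point covers its subtree
        self.children = []

def build_cover_tree(points, dist, psi=1.2):
    # Assumes distinct points; duplicates would never separate and recursion would not stop.
    ids = list(range(len(points)))
    root_id = ids[0]
    max_d = max(dist(points[root_id], points[i]) for i in ids)
    s = math.ceil(math.log(max(max_d, 1e-12), psi))    # tightest radius psi**s covering all points
    root = Node(root_id, s)
    _expand(root, ids, points, dist, psi)
    return root

def _expand(node, ids, points, dist, psi):
    if len(ids) <= 1:
        return
    radius = psi ** (node.level - 1)
    # Greedy centers: the node's own point first (Nesting), then any point not yet
    # covered by a chosen center (Separation); together they cover all of ids (Covering).
    centers = [node.point_id]
    for i in ids:
        if all(dist(points[i], points[c]) > radius for c in centers):
            centers.append(i)
    buckets = {c: [] for c in centers}
    for i in ids:
        nearest = min(centers, key=lambda c: dist(points[i], points[c]))
        buckets[nearest].append(i)
    for c in centers:
        child = Node(c, node.level - 1)
        node.children.append(child)
        _expand(child, buckets[c], points, dist, psi)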

To achieve a more compact cover tree, we store only element identification numbers (IDs) in the cover tree, and not the original vectors. Furthermore, we store the hash bits in compressed bit-sets, which reduces the storage per code, compared to a naive implementation, down to T bits.
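
For illustration, a code of T bits can be packed into \(\lceil T/8\rceil\) bytes rather than one byte (or more) per bit; a minimal NumPy sketch, not the paper's implementation:

import numpy as np

T = 256
rng = np.random.default_rng(2)
bits = rng.integers(0, 2, size=T, dtype=np.uint8)      # one T-bit hash code

packed = np.packbits(bits)                             # ceil(T/8) = 32 bytes instead of 256
restored = np.unpackbits(packed)[:T]
assert np.array_equal(bits, restored)

# Hamming distance between two packed codes via XOR and a bit count
other = np.packbits(rng.integers(0, 2, size=T, dtype=np.uint8))
hamming = int(np.count_nonzero(np.unpackbits(packed ^ other)))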

Querying: The nearest neighbor query in a cover tree is illustrated in Algorithm 1. The search for the nearest neighbor begins at the root of the cover tree and descends levelwise. On each descent, we build a candidate set \(\mathcal {C}\) (Line 3), which holds all the child nodes (center points of our closed balls). We then prune away centers (nodes) in \(\mathcal {C}\) (Line 4) that cannot possibly lead to a nearest neighbor of the query point \(\varvec{q}\) if we descended into them.

The pruning mechanism is predicated on a proven result in Beygelzimer et al. (2006) which states that for any point \(\varvec{x}\in \mathcal {C}_{i-1}\), the distance between \(\varvec{x}\) and any descendant \(\varvec{x}'\) is upper bounded by \(\psi ^i\). Therefore, on Line 4, the \(\min _{\varvec{x}' \in \mathcal {C}}\mathcal {D}(\varvec{q}, \varvec{x}')\) term on the right-hand side of the inequality computes the shortest distance from any center point to the query point \(\varvec{q}\). Any center point whose distance from \(\varvec{q}\) exceeds \(\min _{\varvec{x}' \in \mathcal {C}}\mathcal {D}(\varvec{q}, \varvec{x}') + \psi ^i\) cannot possibly have a descendant that can replace the current closest center point to \(\varvec{q}\), and hence can safely be pruned. We add an additional check (lines 5–6) to speed up the search by not always descending to the leaf node. The time complexity of querying the cover tree is \(O(\kappa ^{12} \ln N)\).
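
A minimal sketch of this level-wise descent with pruning follows; it reuses the Node fields assumed in the construction sketch above and the descendant bound \(\psi ^i\) quoted from Beygelzimer et al. (2006). Algorithm 1 itself is not reproduced here, and the early-exit check of lines 5–6 is omitted.

def nearest_neighbor(root, q, points, dist, psi=1.2):
    cover = [root]                                   # current cover set
    best_id = root.point_id
    best_d = dist(q, points[root.point_id])
    while any(node.children for node in cover):
        # Candidate set C: children of the current cover set; leaves are carried along.
        candidates = []
        for node in cover:
            candidates.extend(node.children if node.children else [node])
        scored = [(dist(q, points[c.point_id]), c) for c in candidates]
        d_min, c_min = min(scored, key=lambda pair: pair[0])
        if d_min < best_d:
            best_id, best_d = c_min.point_id, d_min
        # Prune centers whose descendants (within psi**(level+1) of the center)
        # cannot beat the closest center found in C.
        cover = [c for (d, c) in scored if d <= d_min + psi ** (c.level + 1)]
    return best_id, best_d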

Effect of multi-metric distance while querying: It is important to note that minimizing overlap between the closed balls on higher levels (i.e., closer to the root) of the cover tree can allow us to effectively prune a very large portion of the search space and compute the nearest neighbor faster.

Recall that the cover tree is constructed with all the weights of our distance function \(\mathcal {D}\) set to 1.0. During querying, we allow \(\mathcal {D}\) to be a linear combination of metrics whose weights lie in the range [0, 1], which means that the distance metric \(\mathcal {D}\) used during querying always under-estimates the construction-time distances. During querying, the cover tree’s structure is still intact and all the invariant properties remain satisfied. The main difference occurs on Line 4 with the \(\min _{\varvec{x}' \in \mathcal {C}}\mathcal {D}(\varvec{q}, \varvec{x}')\) term, which is the shortest distance from a center point to the query \(\varvec{q}\) (using the new distance metric). Interestingly, this new distance gets even smaller, thus reducing our search radius (i.e., \(\min _{\varvec{x}' \in \mathcal {C}}\mathcal {D}(\varvec{q}, \varvec{x}') + \psi ^i\)) centered at \(\varvec{q}\), which in turn implies that at every level we manage to prune more center points, as the overlap between the closed balls is also reduced.
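
As a small illustration of this weighting scheme (the weighted-sum form and the per-group Hamming dissimilarity used here are assumptions for the example, not the exact definition of \(\mathcal {D}\)):

import numpy as np

def multi_metric(codes_a, codes_b, weights):
    """Weighted sum of per-group dissimilarities; construction uses weights of 1.0,
    query time may lower individual weights to values in [0, 1]."""
    return sum(w * np.count_nonzero(a != b)
               for w, a, b in zip(weights, codes_a, codes_b))

G, T = 3, 64
rng = np.random.default_rng(4)
codes_a = [rng.integers(0, 2, T) for _ in range(G)]    # one code per feature group
codes_b = [rng.integers(0, 2, T) for _ in range(G)]

d_build = multi_metric(codes_a, codes_b, [1.0] * G)        # metric used to build the tree
d_query = multi_metric(codes_a, codes_b, [1.0, 0.3, 0.0])  # query-time weights
assert d_query <= d_build                                  # the query metric never over-estimates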

Streaming: The cover tree lends itself naturally to the setting where nearest neighbor computations have to be performed on a stream of data points. This is because the cover tree allows dynamic insertion and deletion of points. The time complexity for both these operations is \(O(\kappa ^{6} \ln N)\), which is faster than querying.

Parameter choice: In our implementation for the experiments, we set the base of the expansion constant to \(\psi = 1.2\), which we empirically found to work best on the texmex dataset.


Cite this article

Pronobis, W., Panknin, D., Kirschnick, J. et al. Sharing hash codes for multiple purposes. Jpn J Stat Data Sci 1, 215–246 (2018). https://doi.org/10.1007/s42081-018-0010-x
