Engineering Efficient and Effective Non-metric Space Library

  • Leonid Boytsov
  • Bilegsaikhan Naidan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8199)

Abstract

We present a new similarity search library and discuss a variety of design and performance issues related to its development. We take the position that engineering is as important as algorithm design, and we pursue the goal of producing realistic benchmarks. To this end, we pay attention to various performance aspects and utilize modern hardware, which provides a high degree of parallelization. Because we focus on realistic measurements, the performance of a method should not be measured merely by the number of distance computations performed: other costs, such as computing a cheaper distance function that approximates the original one, are often substantial. The paper includes preliminary experimental results that support this point of view. Rather than looking for the best method, we want to ensure that the library implements competitive baselines, which can be useful for future work.
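
The measurement point made in the abstract can be illustrated with a small sketch. It is not code from the library: the filter-and-refine search, the `cheap_proxy` function, and all names and parameters below are hypothetical, chosen only to show how a count of original-distance computations can hide the cost of a cheaper approximating function, while wall-clock time includes it.

```python
import math
import random
import time


class CountingDistance:
    """Wraps a distance function and counts how many times it is called."""

    def __init__(self, func):
        self.func = func
        self.calls = 0

    def __call__(self, x, y):
        self.calls += 1
        return self.func(x, y)


def kl_divergence(p, q):
    """KL divergence, a Bregman divergence (non-symmetric, hence non-metric)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def cheap_proxy(p, q, prefix=4):
    """Hypothetical cheap approximation: squared L2 on the first coordinates."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p[:prefix], q[:prefix]))


def brute_force(data, query, original):
    """Evaluate the original distance against every data point."""
    return min(range(len(data)), key=lambda i: original(data[i], query))


def filter_and_refine(data, query, original, proxy, n_candidates):
    """Rank all points by the proxy, then refine only the best candidates."""
    ranked = sorted(range(len(data)), key=lambda i: proxy(data[i], query))
    shortlist = ranked[:n_candidates]
    return min(shortlist, key=lambda i: original(data[i], query))


def random_distribution(rng, dim):
    """Random strictly positive vector normalized to sum to one."""
    raw = [rng.random() + 1e-3 for _ in range(dim)]
    total = sum(raw)
    return [v / total for v in raw]


def compare(n, dim, n_candidates, seed=0):
    """Report call counts and wall-clock time for both search strategies."""
    rng = random.Random(seed)
    data = [random_distribution(rng, dim) for _ in range(n)]
    query = random_distribution(rng, dim)

    brute_dist = CountingDistance(kl_divergence)
    start = time.perf_counter()
    brute_force(data, query, brute_dist)
    brute_seconds = time.perf_counter() - start

    refine_dist = CountingDistance(kl_divergence)
    proxy_dist = CountingDistance(cheap_proxy)
    start = time.perf_counter()
    filter_and_refine(data, query, refine_dist, proxy_dist, n_candidates)
    refine_seconds = time.perf_counter() - start

    return {
        "brute_original_calls": brute_dist.calls,
        "brute_seconds": brute_seconds,
        "refine_original_calls": refine_dist.calls,
        "refine_proxy_calls": proxy_dist.calls,
        "refine_seconds": refine_seconds,
    }


if __name__ == "__main__":
    for name, value in compare(n=20000, dim=32, n_candidates=200).items():
        print(f"{name}: {value}")
```

In this sketch the filter-and-refine search performs only `n_candidates` original-distance computations, yet it still calls the proxy once per data point and sorts the whole data set. A benchmark that reports only the first number would overstate the saving, whereas the measured time reflects all of the work.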

Keywords

benchmarks, (non)-metric spaces, Bregman divergences

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Leonid Boytsov (1)
  • Bilegsaikhan Naidan (2)
  1. Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA
  2. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
