
Extreme pivots: a pivot selection strategy for faster metric search

  • Guillermo Ruiz
  • Edgar Chavez
  • Ubaldo Ruiz
  • Eric S. Tellez
Regular Paper

Abstract

This manuscript presents the extreme pivots (EP) metric index, a data structure that speeds up exact proximity searching in the metric space model. For the EP, we designed an automatic rule that selects the best pivots for a dataset under limited memory resources. The net effect is that our approach solves queries efficiently with a small memory footprint and without a prohibitive construction time. In contrast with related structures, this performance is achieved automatically: rather than tuning the index's parameters directly, we apply optimization techniques to a model of the index. This contribution studies the EP model in depth. In practical terms, a user only needs to provide the available memory and a sample of the query distribution as parameters. The resulting index is built quickly and offers a good trade-off among memory usage, preprocessing, and search time. We provide an extensive experimental comparison with state-of-the-art searching methods. We also carefully compare the performance of metric indexes in several scenarios: first with synthetic data, to characterize performance as a function of the intrinsic dimension and the size of the database, and then on different real-world datasets, with excellent results.
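
For readers unfamiliar with pivot-based indexes, the following minimal sketch illustrates the generic pivot-table filtering idea that such indexes build on. It is not the EP index and does not reproduce its pivot selection rule or its optimization model. The function names (`build_pivot_table`, `range_search`), the Euclidean distance, and the choice of the first few objects as pivots are assumptions made purely for illustration. The sketch relies only on the triangle inequality: for any pivot p, |d(q, p) - d(u, p)| <= d(q, u), so an object u can be discarded without evaluating d(q, u) whenever that lower bound exceeds the search radius.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance; any metric could be substituted here."""
    return math.dist(a, b)


def build_pivot_table(db, pivots, dist):
    """Precompute d(u, p) for every object u and every pivot p."""
    return [[dist(obj, p) for p in pivots] for obj in db]


def range_search(db, pivots, table, dist, query, radius):
    """Exact range search with pivot filtering (illustrative only).

    Returns the objects within `radius` of `query` and the number of
    database objects whose distance to the query was actually evaluated.
    """
    q_to_pivots = [dist(query, p) for p in pivots]
    results = []
    evaluated = 0
    for obj, row in zip(db, table):
        # Triangle inequality: |d(q, p) - d(u, p)| is a lower bound of d(q, u).
        # If the bound exceeds the radius for any pivot, u cannot be a result.
        if any(abs(dq - du) > radius for dq, du in zip(q_to_pivots, row)):
            continue
        evaluated += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, evaluated


if __name__ == "__main__":
    random.seed(0)
    points = [(random.random(), random.random()) for _ in range(1000)]
    pivots = points[:8]  # hypothetical choice: first objects as pivots
    table = build_pivot_table(points, pivots, euclidean)
    found, evaluated = range_search(
        points, pivots, table, euclidean, (0.5, 0.5), 0.05
    )
    print(len(found), "results;", evaluated, "of", len(points), "objects evaluated")
```

In this sketch the table stores one distance per object and pivot, so memory grows with the number of pivots, while more pivots tend to discard more objects. This is the kind of trade-off between memory usage and search time that the abstract refers to.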

Keywords

  • Nearest neighbors search
  • Pivot-based metric indexes
  • Extreme pivots

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. CONACyT-CentroGeo Aguascalientes, Aguascalientes, Mexico
  2. CICESE, Ensenada, Mexico
  3. CONACyT-CICESE, Ensenada, Mexico
  4. CONACyT - INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Aguascalientes, Mexico
