Pivot Selection Strategies for Permutation-Based Similarity Search

  • Giuseppe Amato
  • Andrea Esuli
  • Fabrizio Falchi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8199)

Abstract

Recently, permutation-based indexes have attracted interest in the area of similarity search. The basic idea of permutation-based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or reference objects). Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query. This, of course, assumes that similar objects are represented by similar permutations of the pivots.
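
The idea can be illustrated with a minimal Python sketch. It assumes the common construction in which an object's permutation lists the pivots from closest to farthest, and it compares permutations with the Spearman footrule distance; neither choice is stated in this abstract. The names `permutation`, `spearman_footrule`, and the toy 2-D data are hypothetical and only serve the illustration.

```python
import math

def permutation(obj, pivots, dist):
    """Return pivot indices ordered from closest to farthest from obj.

    Ties are broken by pivot index so the result is deterministic.
    """
    return sorted(range(len(pivots)), key=lambda i: (dist(obj, pivots[i]), i))

def spearman_footrule(perm_a, perm_b):
    """Sum, over all pivots, of the absolute difference between the pivot's
    position in perm_a and its position in perm_b."""
    pos_b = {pivot: rank for rank, pivot in enumerate(perm_b)}
    return sum(abs(rank - pos_b[pivot]) for rank, pivot in enumerate(perm_a))

# Toy data (assumed for illustration): three pivots and three objects in the plane.
pivots = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(1.0, 0.5), (9.0, 1.0), (1.0, 9.0)]
query = (2.0, 1.0)

query_perm = permutation(query, pivots, math.dist)

# Rank data objects by how similar their permutation is to the query's.
ranked = sorted(
    data,
    key=lambda o: spearman_footrule(permutation(o, pivots, math.dist), query_perm),
)
print(query_perm)  # [0, 1, 2]
print(ranked[0])   # (1.0, 0.5)
```

In this toy case the object closest to the query also has the same permutation as the query, which is the assumption that permutation-based search relies on. A real index would avoid scanning all permutations, which this sketch does not attempt.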

In the context of permutation-based indexing, most authors propose to select pivots randomly from the data set, given that traditional pivot selection strategies do not yield better performance. However, to the best of our knowledge, no rigorous comparison has been performed yet. In this paper, we compare five pivot selection strategies on three permutation-based similarity access methods. One of them is a novel strategy that we propose, designed specifically for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested strategies. Second, no strategy is universally the best for all permutation-based access methods; rather, different strategies are optimal for different methods.
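
To make the notion of a pivot selection strategy concrete, the following Python sketch contrasts random selection, which the abstract names as the usual baseline, with farthest-first traversal. Farthest-first traversal is used here only as a well-known example of a non-random strategy; the abstract does not say which five strategies were compared, so it should not be read as one of them. The function names and the `seed` and `start` parameters are hypothetical.

```python
import math
import random

def random_pivots(data, k, seed=0):
    """Baseline: pick k distinct objects uniformly at random from the data set."""
    rng = random.Random(seed)
    return rng.sample(data, k)

def farthest_first_pivots(data, k, dist, start=0):
    """Illustrative alternative: greedily add the object that is farthest
    from the pivots chosen so far (farthest-first traversal)."""
    pivots = [data[start]]
    # nearest[i] is the distance from data[i] to its closest chosen pivot.
    nearest = [dist(obj, pivots[0]) for obj in data]
    while len(pivots) < k:
        idx = max(range(len(data)), key=nearest.__getitem__)
        pivots.append(data[idx])
        nearest = [min(n, dist(obj, data[idx])) for n, obj in zip(nearest, data)]
    return pivots

# Toy data (assumed for illustration): four points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
spread = farthest_first_pivots(points, 3, math.dist)
sampled = random_pivots(points, 3)
print(spread)  # [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
```

Either function returns a list of pivots that can be used to generate permutations; comparing strategies then amounts to measuring search quality and cost for each resulting pivot set.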

Keywords

permutation-based; pivot; metric space; similarity search; inverted files; content-based image retrieval

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Giuseppe Amato (1)
  • Andrea Esuli (1)
  • Fabrizio Falchi (1)
  1. Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo”, Pisa, Italy
