On Index-Free Similarity Search in Metric Spaces

  • Tomáš Skopal
  • Benjamin Bustos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5690)

Abstract

Metric access methods (MAMs) serve as a tool for speeding up similarity queries. However, all MAMs developed so far are index-based; they need to build an index on a given database. The indexing itself is either static (the whole database is indexed at once) or dynamic (insertions/deletions are supported), but a preprocessing step is always needed. In this paper, we propose D-file, the first MAM that requires no indexing at all. This feature is especially beneficial in domains such as data mining and streaming databases, where data is produced much more intensively than it is queried, so that indexing becomes the bottleneck of the entire production/querying scheme. The idea of D-file is to extend the trivial sequential file (in effect, an abstraction over the original database) with a so-called D-cache. The D-cache is a main-memory structure that keeps track of the distance computations spent on processing all similarity queries so far (within a runtime session). Based on the distances stored in the D-cache, the D-file can cheaply determine lower bounds of some distances without explicitly computing the distances themselves, which results in faster queries. Our experimental evaluation shows that the query efficiency of D-file is comparable to that of index-based state-of-the-art MAMs, but at zero indexing cost.
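
The lower-bounding idea in the abstract can be illustrated with a minimal Python sketch. It assumes the bound is derived from the triangle inequality, as is standard in metric spaces: if d(q, p) is known for a past query p and d(p, o) is cached, then |d(q, p) - d(p, o)| is a lower bound on d(q, o). The names `DCache`, `lower_bound`, and `range_query` are hypothetical and chosen for illustration only; this is not the authors' implementation, and the sketch ignores any memory limit on the cache.

```python
import math


class DCache:
    """Hypothetical in-memory cache of distances d(past_query, object)."""

    def __init__(self):
        self._dist = {}  # (query_id, object_id) -> distance

    def store(self, query_id, object_id, distance):
        self._dist[(query_id, object_id)] = distance

    def lower_bound(self, pivots, object_id):
        """Best triangle-inequality lower bound on d(new_query, object).

        pivots maps a past query id to d(new_query, past_query).
        """
        best = 0.0
        for pid, d_q_p in pivots.items():
            d_p_o = self._dist.get((pid, object_id))
            if d_p_o is not None:
                best = max(best, abs(d_q_p - d_p_o))
        return best


def range_query(db, dist, query, radius, cache, query_id, past_queries):
    """Sequential scan that skips objects whose lower bound exceeds radius.

    Returns the matching object ids and the number of distances computed
    against database objects (distances to past queries are not counted).
    """
    pivots = {pid: dist(query, p) for pid, p in past_queries.items()}
    result, computed = [], 0
    for oid, obj in db.items():
        if cache.lower_bound(pivots, oid) > radius:
            continue  # filtered out without computing d(query, obj)
        d = dist(query, obj)
        computed += 1
        cache.store(query_id, oid, d)
        if d <= radius:
            result.append(oid)
    past_queries[query_id] = query  # this query can serve as a pivot later
    return result, computed


# Toy data: four 2-D points under the Euclidean metric.
db = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (10.0, 0.0), 3: (10.0, 1.0)}
cache, past = DCache(), {}
r1, c1 = range_query(db, math.dist, (0.0, 0.0), 1.5, cache, "q1", past)
r2, c2 = range_query(db, math.dist, (0.5, 0.0), 1.0, cache, "q2", past)
```

In this toy run the first query has nothing cached and computes all four distances, while the second query reuses the distances recorded for the first and computes only two. The distances from a new query to past queries are themselves a cost, so a sketch like this only pays off when they are few relative to the objects they filter out.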

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Tomáš Skopal (1)
  • Benjamin Bustos (2)
  1. Department of Software Engineering, FMP, Charles University in Prague, Prague, Czech Republic
  2. Department of Computer Science, University of Chile, Santiago, Chile
