Measuring the Difficulty of Distance-Based Indexing

  • Matthew Skala
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3772)

Abstract

Data structures for similarity search are commonly evaluated on data in vector spaces, but distance-based data structures are also applicable to non-vector spaces with no natural concept of dimensionality. The intrinsic dimensionality statistic of Chávez and Navarro provides a way to compare the performance of similarity indexing and search algorithms across different spaces, and predict the performance of index data structures on non-vector spaces by relating them to equivalent vector spaces. We characterise its asymptotic behaviour, and give experimental results to calibrate these comparisons.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Arnold, B.C., Balakrishnan, N., Nagaraja, H.N.: A First Course in Order Statistics. Wiley series in probability and mathematical statistics. John Wiley & Sons, Inc., New York (1992)Google Scholar
  2. 2.
    Baeza-Yates, R.A., Cunto, W., Manber, U., Wu, S.: Proximity matching using fixed-queries trees. In: Crochemore, M., Gusfield, D. (eds.) CPM 1994. LNCS, vol. 807, pp. 198–212. Springer, Heidelberg (1994)Google Scholar
  3. 3.
    Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-tree: An efficient and robust access method for points and rectangles. In: SIGMOD (International Conference on Management of Data), pp. 322–331 (1990)Google Scholar
  4. 4.
    Berchtold, S., Böhm, C., Kriegel, H.P.: The pyramid-tree: Breaking the curse of dimensionality. In: SIGMOD (International Conference on Management of Data), pp. 142–153 (1998)Google Scholar
  5. 5.
    Bozkaya, T., Ozsoyoglu, M.: Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems 24, 361–404 (1999)CrossRefGoogle Scholar
  6. 6.
    Chávez, E., Navarro, G.: Measuring the dimensionality of general metric spaces. Technical Report TR/DCC-00-1, Department of Computer Science, University of Chile (2000), Submitted, Online, ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/metricmodel.ps.gz
  7. 7.
    Damiani, E., De Capitani di Vimercati, S., Paraboschi, S., Samarai, P.: An open digest-based technique for spam detection. In: 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA (2004)Google Scholar
  8. 8.
    Fisher, R.A., Tippett, L.H.C.: Limiting forms of the frequency distribution of the largest or smallest member of a sample. Proceedings of the Cambridge Philosophical Society 24, 180–190 (1928)MATHCrossRefGoogle Scholar
  9. 9.
    Galambos, J.: The Asymptotic Theory of Extreme Order Statistics, 2nd edn. Robert E. Krieger Publishing Company, Malabar (1987)MATHGoogle Scholar
  10. 10.
    Grumbach, S., Tahi, F.: A new challenge for compression algorithms: Genetic sequences. Journal of Information Processing and Management 30, 875–886 (1994)MATHCrossRefGoogle Scholar
  11. 11.
    Guttman, A.: R-trees: a dynamic index structure for spatial searching. SIGMOD Record (ACM Special Interest Group on Management of Data) 14, 47–57 (1984)Google Scholar
  12. 12.
    Hall, P.: On the rate of convergence of normal extremes. Journal of Applied Probability 16, 433–439 (1979)MATHCrossRefMathSciNetGoogle Scholar
  13. 13.
    Katayama, N., Satoh, S.: The SR-tree: an index structure for high-dimensional nearest neighbor queries. In: SIGMOD (International Conference on Management of Data), pp. 369–380 (1997)Google Scholar
  14. 14.
    Li, M., Badger, J.H., Xin, C., Kwong, S., Kearney, P., Zhang, H.: An information based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 149–154 (2001)CrossRefGoogle Scholar
  15. 15.
    Norwood, C.: cmeclax: Digest:Nilsimsa 0.06. Computer software (2002), Online, http://search.cpan.org/~vipul/Digest-Nilsimsa-0.06/
  16. 16.
    Sahinalp, S.C., Tasan, M., Macker, J., Ozsoyoglu, Z.M.: Distance based indexing for string proximity search. In: ICDE (International Conference on Data Engineering). IEEE Computer Society, Los Alamitos (2003)Google Scholar
  17. 17.
    Sellis, T.K., Roussopoulos, N., Faloutsos, C.: The R+-tree: A dynamic index for multi-dimensional objects. In: VLDB 1987 (International Conference on Very Large Data Bases), pp. 507–518. Morgan Kaufmann, San Francisco (1987)Google Scholar
  18. 18.
    SpamArchive.org: Donate your spam to science. Web site (2005), Online, http://www.spamarchive.org/
  19. 19.
    Uhlmann, J.K.: Satisfying general proximity/similarity queries with metric trees. Information Processing Letters 40, 175–179 (1991)MATHCrossRefGoogle Scholar
  20. 20.
    Yianilos, P.N.: Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA (Symposium on Discrete Algorithms), pp. 311–321. SIAM, Philadelphia (1993)Google Scholar
  21. 21.
    Yianilos, P.N.: Excluded middle vantage point forests for nearest neighbour search. In: Goodrich, M.T., McGeoch, C.C. (eds.) ALENEX 1999. LNCS, vol. 1619. Springer, Heidelberg (1999)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Matthew Skala
    • 1
  1. 1.University of WaterlooWaterlooCanada

Personalised recommendations