The VLDB Journal, Volume 17, Issue 5, pp 1231–1251

Reference-based indexing for metric spaces with costly distance measures

  • Jayendra Venkateswaran
  • Tamer Kahveci
  • Christopher Jermaine
  • Deepak Lachwani
Regular Paper

Abstract

We consider the problem of similarity search in databases with costly metric distance measures. Given limited main memory, our goal is to develop a reference-based index that reduces the number of comparisons in order to answer a query. The idea in reference-based indexing is to select a small set of reference objects that serve as a surrogate for the other objects in the database. We consider novel strategies for selection of references and assigning references to database objects. For dynamic databases with frequent updates, we propose two incremental versions of the selection algorithm. Our experimental results show that our selection and assignment methods far outperform competing methods.
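The surrogate idea can be sketched with a generic reference-based range query. This is an illustration only, under stated assumptions: it is not the selection or assignment strategy proposed in the paper. The function names (`build_index`, `range_query`), the use of edit distance as the costly metric, and the choice of the first two database strings as references are all hypothetical. The sketch precomputes the distance from every object to every reference, then uses the triangle inequality to bound the query-to-object distance so that many expensive comparisons can be skipped.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_index(db: Sequence[str], refs: Sequence[str],
                dist: Callable[[str, str], int]) -> List[List[int]]:
    """Precompute the distance from each database object to each reference."""
    return [[dist(obj, ref) for ref in refs] for obj in db]


def range_query(query: str, radius: int, db: Sequence[str],
                refs: Sequence[str], index: List[List[int]],
                dist: Callable[[str, str], int]) -> Tuple[List[str], int]:
    """Return objects within `radius` of `query` and the number of distance
    computations performed at query time. Assumes `refs` is non-empty and
    `dist` is a metric (the triangle inequality must hold)."""
    q_to_ref = [dist(query, ref) for ref in refs]
    computed = len(refs)  # one comparison per reference
    results = []
    for obj, obj_to_ref in zip(db, index):
        # Triangle inequality: |d(q,r) - d(o,r)| <= d(q,o) <= d(q,r) + d(o,r)
        lower = max(abs(qr - orf) for qr, orf in zip(q_to_ref, obj_to_ref))
        upper = min(qr + orf for qr, orf in zip(q_to_ref, obj_to_ref))
        if lower > radius:
            continue  # pruned: cannot be within the radius
        if upper <= radius:
            results.append(obj)  # accepted without computing d(q,o)
            continue
        computed += 1  # bounds are inconclusive, so compute the real distance
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed


# Hypothetical toy database; references are simply the first two strings.
db = ["ACGT", "ACGA", "TTTT", "GGGG", "ACGTT", "TGCA"]
refs = db[:2]
index = build_index(db, refs, edit_distance)
matches, n_comparisons = range_query("ACGT", 1, db, refs, index, edit_distance)
```

How many comparisons the bounds save depends heavily on which references are chosen and how they are assigned to objects, which is the question the paper addresses.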

Keywords

Reference-indexing, Metric measures, Edit distance, Earth mover’s distance

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Jayendra Venkateswaran (1)
  • Tamer Kahveci (1)
  • Christopher Jermaine (1)
  • Deepak Lachwani (2)

  1. Computer and Information Science and Engineering, University of Florida, Gainesville, USA
  2. Google Inc., Mountain View, USA
