Efficient Similarity Search for Hierarchical Data in Large Databases

  • Karin Kailing
  • Hans-Peter Kriegel
  • Stefan Schönauer
  • Thomas Seidl
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2992)

Abstract

Structured and semi-structured object representations are becoming increasingly important for modern database applications. Examples of such data are hierarchical structures such as chemical compounds, XML data, or image data. As a key feature, database systems have to support the search for similar objects, where it is important to take into account both the structure and the content features of the objects. A successful approach is to use the edit distance for tree-structured data. As computing this measure is NP-complete for unordered trees, constrained edit distances have been successfully applied to trees. While yielding good results, they are still computationally complex and therefore of limited benefit for searching in large databases. In this paper, we propose a filter-and-refinement architecture to overcome this problem. We present a set of new filter methods for structural and for content-based information in tree-structured data, as well as ways to flexibly combine different filter criteria. The efficiency of our methods, which results from the good selectivity of the filters, is demonstrated in extensive experiments with real-world applications.
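The filter-and-refinement idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the filters proposed in the paper: as a stand-in, it assumes strings instead of trees, the Levenshtein distance as the expensive exact measure, and the difference in length as a cheap lower-bounding filter. The names `range_query`, `length_filter`, and `levenshtein` are hypothetical. The only property the scheme relies on is that the filter distance never exceeds the exact distance, so pruning cannot cause false dismissals.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def range_query(
    query: T,
    database: Iterable[T],
    eps: float,
    filter_dist: Callable[[T, T], float],
    exact_dist: Callable[[T, T], float],
) -> List[T]:
    """Return all objects whose exact distance to `query` is at most `eps`.

    Assumption: filter_dist(a, b) <= exact_dist(a, b) for all a, b.
    Under that assumption no true result is discarded by the filter step.
    """
    results = []
    for obj in database:
        # Filter step: a cheap lower bound rules out most candidates.
        if filter_dist(query, obj) > eps:
            continue
        # Refinement step: the expensive distance is computed only for survivors.
        if exact_dist(query, obj) <= eps:
            results.append(obj)
    return results


def levenshtein(a: str, b: str) -> int:
    """Unit-cost string edit distance (stand-in for an expensive tree distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def length_filter(a: str, b: str) -> int:
    """Lower bound: each insertion or deletion changes the length by exactly one."""
    return abs(len(a) - len(b))


if __name__ == "__main__":
    db = ["free", "trees", "forest", "t"]
    print(range_query("tree", db, 1, length_filter, levenshtein))
```

In this toy run the filter discards "forest" and "t" without computing their edit distance, so the expensive measure is evaluated for only two of the four objects. The better the selectivity of the filter, the fewer exact computations remain.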

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Karin Kailing (1)
  • Hans-Peter Kriegel (1)
  • Stefan Schönauer (1)
  • Thomas Seidl (2)
  1. Institute for Computer Science, University of Munich
  2. Department of Computer Science IX, RWTH Aachen University
