The VLDB Journal

Volume 21, Issue 4, pp 463–488

Windowed pq-grams for approximate joins of data-centric XML

  • Nikolaus Augsten
  • Michael Böhlen
  • Curtis Dyreson
  • Johann Gamper
Regular Paper

Abstract

In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly differently in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. At the core of our solution are windowed pq-grams, which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-grams distance between two trees is the number of pq-grams that appear in the decomposition of only one of the two trees. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real-world data confirm the analytic results and show the effectiveness and efficiency of our technique.
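The distance described in the abstract can be illustrated with a minimal sketch. It assumes that each tree has already been decomposed into windowed pq-grams and that each pq-gram is serialized as a string, so that grams are compared by plain string equality. Treating the decompositions as bags (multisets) is an assumption made here, the function name `bag_symmetric_difference` is hypothetical, and the sample grams are made up for illustration. The sketch does not implement the sort, extend, and decompose steps.

```python
from collections import Counter


def bag_symmetric_difference(grams_a, grams_b):
    """Count the grams that occur in only one of the two bags."""
    bag_a, bag_b = Counter(grams_a), Counter(grams_b)
    # Counter subtraction keeps only positive counts, so each term holds
    # the grams that one side has in excess of the other.
    only_a = bag_a - bag_b
    only_b = bag_b - bag_a
    return sum(only_a.values()) + sum(only_b.values())


# Hypothetical serialized grams; the label strings are illustrative only.
tree_1 = ["*.a.b.c", "*.a.c.d", "a.b.*.*"]
tree_2 = ["*.a.b.c", "*.a.c.e", "a.b.*.*"]
print(bag_symmetric_difference(tree_1, tree_2))
```

Because the grams are plain strings, the shared grams could equally be found with an equality join on a string column, which is the property the abstract relies on to avoid comparing every pair of trees directly.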

Keywords

  • Hierarchical data
  • XML
  • Unordered tree
  • Tree distance
  • Similarity join
  • Approximate matching
  • pq-Grams

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Nikolaus Augsten (1)
  • Michael Böhlen (2)
  • Curtis Dyreson (3)
  • Johann Gamper (1)

  1. Free University of Bozen-Bolzano, Bolzano, Italy
  2. University of Zürich, Zurich, Switzerland
  3. Utah State University, Logan, USA
