Windowed pq-grams for approximate joins of data-centric XML

  • Regular Paper
  • The VLDB Journal

Abstract

In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly differently in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. The core of our solution is the windowed pq-gram, a small subtree of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-gram distance between two trees is the number of pq-grams that appear in the decomposition of only one of the two trees. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real-world data confirm the analytic results and show the effectiveness and efficiency of our technique.
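The three-step process and the distance described in the abstract can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the construction from the paper: the tree encoding `(label, children)`, the dummy label `*`, the wrap-around window, the fixed base size q = 2, and the names `sort_tree`, `profile`, and `distance` are all hypothetical choices made for this example, and the exact shape of a windowed pq-gram in the paper may differ.

```python
from collections import Counter

DUMMY = "*"  # hypothetical dummy label used for padding in this sketch


def sort_tree(tree):
    """Step 1: return a copy of (label, children) with siblings sorted."""
    label, children = tree
    return (label, tuple(sorted(sort_tree(c) for c in children)))


def profile(tree, p=2, w=3):
    """Steps 2 and 3: pad with dummies and collect a bag of simplified grams.

    Each gram is a stem (the node and its ancestors, p labels in total)
    followed by a base of two child labels taken from a window of size w
    over the sorted children. Assumes p >= 1 and w >= 2.
    """
    grams = Counter()

    def visit(node, ancestors):
        label, children = node
        stem = (ancestors + (label,))[-p:]
        if not children:
            # A leaf contributes one gram whose base consists of dummies.
            grams[stem + (DUMMY, DUMMY)] += 1
            return
        labels = [c[0] for c in children]
        # Extend short child lists with dummy nodes up to the window size.
        labels += [DUMMY] * (w - len(labels))
        n = len(labels)
        for i in range(n):
            # Pair the first node of each window with the others (wrapping).
            for j in range(1, w):
                grams[stem + (labels[i], labels[(i + j) % n])] += 1
        for child in children:
            visit(child, stem)

    visit(sort_tree(tree), (DUMMY,) * p)
    return grams


def distance(t1, t2, p=2, w=3):
    """Number of grams that occur in only one of the two bags."""
    a, b = profile(t1, p, w), profile(t2, p, w)
    return sum(((a - b) + (b - a)).values())


book = ("book", [("title", []), ("author", []), ("year", [])])
shuffled = ("book", [("year", []), ("title", []), ("author", [])])
changed = ("book", [("title", []), ("author", []), ("isbn", [])])

print(distance(book, shuffled))  # sibling order is ignored
print(distance(book, changed))   # a changed label yields a positive distance
```

Because the siblings are sorted before decomposition, two trees that differ only in subelement order produce the same bag of grams. Serializing each gram to a string would allow matching grams to be found with an ordinary equality join, which is the idea the abstract refers to.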

Author information

Correspondence to Nikolaus Augsten.

About this article

Cite this article

Augsten, N., Böhlen, M., Dyreson, C. et al. Windowed pq-grams for approximate joins of data-centric XML. The VLDB Journal 21, 463–488 (2012). https://doi.org/10.1007/s00778-011-0254-6

  • DOI: https://doi.org/10.1007/s00778-011-0254-6