Windowed pq-grams for approximate joins of data-centric XML

Augsten, Nikolaus; Böhlen, Michael; Dyreson, Curtis; Gamper, Johann

doi:10.1007/s00778-011-0254-6

Regular Paper
Published: 22 September 2011

The VLDB Journal, volume 21, pages 463–488 (2012)

Abstract

In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly differently in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. At the core of our solution are windowed pq-grams, which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-gram distance between two trees is the number of pq-grams that appear in the decomposition of only one of the two trees. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real-world data confirm the analytic results and show the effectiveness and efficiency of our technique.
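
The three-step process and the distance described in the abstract can be illustrated with a minimal Python sketch. It is a simplified stand-in and not the paper's exact definition of windowed pq-grams: the tree encoding `(label, children)`, the fixed two-node stem (parent and node), the two-node base, the window size `w`, the dummy label `*`, and all function names are assumptions made for illustration only.

```python
from collections import Counter

DUMMY = "*"  # assumed label for dummy nodes


def sort_tree(tree):
    """Step 1: recursively sort the children of every node.

    A tree is a (label, children) tuple; siblings are ordered by label
    (ties are broken by their subtrees), so sibling order in the input
    no longer matters.
    """
    label, children = tree
    return (label, sorted(sort_tree(c) for c in children))


def grams(tree, w=3, parent=DUMMY):
    """Steps 2 and 3: pad with dummy nodes and emit grams as strings.

    Simplified shape (an assumption, not the paper's definition):
    stem = (parent, node), base = two children that fall into the same
    window of w consecutive siblings, with wrap-around. Requires w >= 2
    and an already sorted tree.
    """
    label, children = tree
    labels = [c[0] for c in children]
    if not labels:
        # A leaf gets an all-dummy base.
        yield "/".join((parent, label, DUMMY, DUMMY))
    else:
        # Pad with dummies so that at least one full window exists.
        labels += [DUMMY] * max(0, w - len(labels))
        n = len(labels)
        for i in range(n):
            for j in range(1, w):
                yield "/".join((parent, label, labels[i], labels[(i + j) % n]))
    for child in children:
        yield from grams(child, w, label)


def gram_distance(t1, t2, w=3):
    """Count the grams that occur in only one of the two decompositions.

    Decompositions are bags (multisets), so duplicates are counted.
    """
    b1 = Counter(grams(sort_tree(t1), w))
    b2 = Counter(grams(sort_tree(t2), w))
    return sum(((b1 - b2) + (b2 - b1)).values())


# Hypothetical example records.
a = ("person", [("name", []), ("email", []), ("phone", [])])
b = ("person", [("phone", []), ("name", []), ("email", [])])  # same data, permuted
c = ("person", [("name", []), ("email", []), ("fax", [])])    # one label differs

print(gram_distance(a, b))  # 0: sibling order is ignored
print(gram_distance(a, c))  # 10: grams involving phone or fax differ
```

Because every gram in this sketch is serialized as a plain string, the bags could be stored in a relation keyed by tree identifier and candidate pairs found by string equality, which is the kind of equality join the abstract refers to. The distance is zero for `a` and `b` even though they are different orderings of the same record, consistent with a pseudo-metric.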

Author information

Authors and Affiliations

Free University of Bozen-Bolzano, Bolzano, Italy
Nikolaus Augsten & Johann Gamper
University of Zürich, Zurich, Switzerland
Michael Böhlen
Utah State University, Logan, UT, USA
Curtis Dyreson

Corresponding author

Correspondence to Nikolaus Augsten.

About this article

Cite this article

Augsten, N., Böhlen, M., Dyreson, C. et al. Windowed pq-grams for approximate joins of data-centric XML. The VLDB Journal 21, 463–488 (2012). https://doi.org/10.1007/s00778-011-0254-6

Received: 22 October 2010
Revised: 11 August 2011
Accepted: 30 August 2011
Published: 22 September 2011
Issue Date: August 2012
DOI: https://doi.org/10.1007/s00778-011-0254-6
