pq-Hash: An Efficient Method for Approximate XML Joins

Li, Fei; Wang, Hongzhi; Hao, Liang; Li, Jianzhong; Gao, Hong

doi:10.1007/978-3-642-16720-1_13

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6185)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Approximate matching between large tree sets is used in many applications, such as data integration and XML de-duplication. However, most existing methods suffer from low efficiency and thus do not scale to large tree sets.

pq-gram is a widely used method that produces high-quality matches. In this paper, we propose pq-hash as an improvement to pq-gram. As the basis of pq-hash, we develop a randomized data structure, the pq-array, which represents large trees as small fixed-size arrays. Sort-merge and hash join techniques are applied to these pq-arrays to avoid nested-loop joins. Theoretical analysis and experimental results show that pq-hash retains high join quality while achieving much higher efficiency than pq-gram.
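The abstract does not spell out how a pq-array is built, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's construction. It extracts pq-grams from a tree, compresses the resulting set into a fixed-size array with ordinary min-wise hashing (our assumption for the "randomized data structure"), and joins two tree sets through a hash index on array positions instead of a nested loop. The names `pq_grams`, `pq_array`, `estimate_similarity`, and `hash_join`, the tree encoding `(label, [children])`, and the defaults `p=2`, `q=3`, `size=16` are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


def pq_grams(tree, p=2, q=3):
    """Return the list of pq-grams of a tree given as (label, [children])."""
    grams = []

    def visit(node, stem):
        label, children = node
        stem = (stem + (label,))[-p:]       # anchor plus its p-1 ancestors
        base = ("*",) * q                   # window over q consecutive children
        if not children:
            grams.append(stem + base)
            return
        for child in children:
            base = base[1:] + (child[0],)
            grams.append(stem + base)
            visit(child, stem)
        for _ in range(q - 1):              # slide the window past the last child
            base = base[1:] + ("*",)
            grams.append(stem + base)

    visit(tree, ("*",) * p)
    return grams


def _h(seed, gram):
    """Deterministic 64-bit hash of a pq-gram under a given seed."""
    key = "\x1f".join(gram)
    data = f"{seed}|{key}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def pq_array(tree, size=16, p=2, q=3):
    """Assumed construction: a min-hash signature of the set of pq-grams."""
    grams = set(pq_grams(tree, p, q))
    return tuple(min(_h(i, g) for g in grams) for i in range(size))


def estimate_similarity(a, b):
    """Fraction of matching positions; estimates the Jaccard similarity of the gram sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def hash_join(left, right, threshold, size=16):
    """Join two dicts {id: tree}; only trees sharing an array entry are compared."""
    index = defaultdict(list)
    for rid, tree in right.items():
        for pos, val in enumerate(pq_array(tree, size)):
            index[(pos, val)].append(rid)
    result = []
    for lid, tree in left.items():
        counts = defaultdict(int)
        for pos, val in enumerate(pq_array(tree, size)):
            for rid in index.get((pos, val), ()):
                counts[rid] += 1
        for rid, c in counts.items():
            if c / size >= threshold:
                result.append((lid, rid, c / size))
    return result
```

Because every tree is reduced to `size` integers, the cost of comparing two trees no longer depends on how large they are, and the index lookup means pairs with no array entry in common are never examined. Note that this sketch treats the pq-gram profile as a set rather than a bag, which is a simplification.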

Supported by the National Science Foundation of China (No. 60703012, 60773063), the NSFC-RGC of China (No. 60831160525), the National Grant of Fundamental Research 973 Program of China (No. 2006CB303000), the National Grant of High Technology 863 Program of China (No. 2009AA01Z149), the Key Program of the National Natural Science Foundation of China (No. 60933001), the National Postdoctor Foundation of China (No. 20090450126), and the Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).

Author information

Authors and Affiliations

The School of Computer Science and Technology, Harbin Institute of Technology, China
Fei Li, Hongzhi Wang, Liang Hao, Jianzhong Li & Hong Gao

Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD, Australia
Heng Tao Shen
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
David R. Cheriton School of Computer Science, University of Waterloo, Canada
M. Tamer Özsu
Peking University, China
Lei Zou
Renmin University of China, China
Jiaheng Lu
National University of Singapore, Singapore
Tok-Wang Ling
Northeastern University, 110004, Shenyang, China
Ge Yu
College of Computer Science, Zhejiang University, 310027, Hangzhou, P.R. China
Yi Zhuang
University of Melbourne, Australia
Jie Shao

About this paper

Cite this paper

Li, F., Wang, H., Hao, L., Li, J., Gao, H. (2010). pq-Hash: An Efficient Method for Approximate XML Joins. In: Shen, H.T., et al. Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6185. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16720-1_13

DOI: https://doi.org/10.1007/978-3-642-16720-1_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16719-5
Online ISBN: 978-3-642-16720-1
eBook Packages: Computer Science, Computer Science (R0)