pq-Hash: An Efficient Method for Approximate XML Joins

  • Fei Li
  • Hongzhi Wang
  • Liang Hao
  • Jianzhong Li
  • Hong Gao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6185)

Abstract

Approximate matching between large tree sets is widely used in applications such as data integration and XML de-duplication. However, most existing methods suffer from low efficiency and therefore do not scale to large tree sets.

pq-gram is a widely used method that yields high-quality matches. In this paper, we propose pq-hash as an improvement on pq-gram. As the basis of pq-hash, we develop a randomized data structure, the pq-array, with which large trees are represented as small fixed-size arrays. Sort-merge and hash join techniques are then applied to these pq-arrays to avoid a nested-loop join. Theoretical analysis and experimental results show that pq-hash retains high join quality while achieving much higher efficiency than pq-gram.
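The abstract does not specify how a pq-array is constructed, so the following is only a minimal sketch of the general idea under stated assumptions: each tree is assumed to be already reduced to a set of string tokens (standing in for pq-gram label tuples), the fixed-size array is assumed to be a min-wise hashing signature, and candidate pairs are found through a hash index rather than a nested loop. The names `signature`, `match_fraction`, and `hash_join`, the token format, and the choice of BLAKE2 as the hash are all hypothetical and are not taken from the paper.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    """Seeded 64-bit hash of a token (BLAKE2 is an assumption, not the paper's choice)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(tokens, k=64):
    """Fixed-size array: the minimum hash of the token set under k seeded hash functions."""
    items = set(tokens)
    if not items:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_hash(t, seed) for t in items) for seed in range(k)]


def match_fraction(sig_a, sig_b):
    """Fraction of slots on which two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def hash_join(left, right, k=64):
    """Join two {name: tokens} collections without comparing every pair.

    Only pairs that share at least one (slot, value) entry are scored.
    """
    index = {}
    right_sigs = {}
    for name, tokens in right.items():
        sig = signature(tokens, k)
        right_sigs[name] = sig
        for slot, value in enumerate(sig):
            index.setdefault((slot, value), []).append(name)

    result = {}
    for name, tokens in left.items():
        sig = signature(tokens, k)
        candidates = set()
        for slot, value in enumerate(sig):
            candidates.update(index.get((slot, value), ()))
        for other in candidates:
            result[(name, other)] = match_fraction(sig, right_sigs[other])
    return result


if __name__ == "__main__":
    # Hypothetical token sets; real input would be derived from XML trees.
    left = {"t1": ["*,a,b", "a,b,c", "a,c,*"]}
    right = {"t2": ["*,a,b", "a,b,c", "a,d,*"], "t3": ["x,y,z"]}
    for pair, score in sorted(hash_join(left, right).items()):
        print(pair, round(score, 2))
```

In this sketch the array length `k` trades accuracy for space: every tree costs `k` integers regardless of its size, and the fraction of matching slots estimates the overlap between the two token sets.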

Keywords

  • Hash Function
  • Anchor Node
  • Equal Ratio
  • Approximate Match
  • Tree Edit Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Augsten, N., Böhlen, M.H., Dyreson, C.E., Gamper, J.: Approximate joins for data-centric XML. In: ICDE, pp. 814–823 (2008)
  2. Augsten, N., Böhlen, M.H., Gamper, J.: Approximate matching of hierarchical data using pq-grams. In: VLDB, pp. 301–312 (2005)
  3. Augsten, N., Böhlen, M.H., Gamper, J.: An incrementally maintainable index for approximate lookups in hierarchical data. In: VLDB, pp. 247–258 (2006)
  4. Bille, P.: A survey on tree edit distance and related problems. Theor. Comput. Sci. 337(1-3), 217–239 (2005)
  5. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: STOC, pp. 327–336 (1998)
  6. Cobena, G., Abiteboul, S., Marian, A.: Detecting changes in XML documents. In: ICDE, pp. 41–52 (2002)
  7. Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J.D., Yang, C.: Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng. 13(1), 64–78 (2001)
  8. Demaine, E.D., Mozes, S., Rossman, B., Weimann, O.: An optimal decomposition algorithm for tree edit distance. In: Arge, L., Cachin, C., Jurdziński, T., Tarlecki, A. (eds.) ICALP 2007. LNCS, vol. 4596, pp. 146–157. Springer, Heidelberg (2007)
  9. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: VLDB, pp. 518–529 (1999)
  10. Haveliwala, T.H., Gionis, A., Indyk, P.: Scalable techniques for clustering the web. In: WebDB (Informal Proceedings), pp. 129–134 (2000)
  11. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31(2), 249–260 (1987)
  12. Klein, P.N.: Computing the edit-distance between unrooted ordered trees. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461, pp. 91–102. Springer, Heidelberg (1998)
  13. Lee, K.-H., Choy, Y.-C., Cho, S.-B.: An efficient algorithm to compute differences between structured documents. IEEE Trans. Knowl. Data Eng. 16(8), 965–979 (2004)
  14. Metwally, A., Agrawal, D., Abbadi, A.E.: Detectives: detecting coalition hit inflation attacks in advertising networks streams. In: WWW, pp. 241–250 (2007)
  15. Tai, K.-C.: The tree-to-tree correction problem. J. ACM 26(3), 422–433 (1979)
  16. Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18(6), 1245–1262 (1989)

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Fei Li (1)
  • Hongzhi Wang (1)
  • Liang Hao (1)
  • Jianzhong Li (1)
  • Hong Gao (1)

  1. The School of Computer Science and Technology, Harbin Institute of Technology, China
