Structure-Preserving Hashing for Tree-Structured Data

  • Original Paper
  • Signal, Image and Video Processing

Abstract

Many kinds of data are tree-structured, e.g., XML documents. This paper proposes a structure-preserving hashing method for rooted unordered trees that compresses a tree into a compact signature while preserving its structural information, enabling efficient structural similarity estimation and search, duplicate detection, and related tasks. The method exploits fixed-length subpaths of a tree. A theoretical analysis shows that, under moderate conditions, the signature contains enough information to reconstruct the original tree. With the signature, the similarity between trees can be estimated efficiently. The signature can be constructed in linear time, whereas the embedded pivot method [24] has quadratic worst-case construction time. A quantitative analysis of the relation to tree edit distance is also provided. Experiments on XML document de-duplication with real-world data demonstrate the effectiveness of the proposed method.
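
The general idea of a subpath-based signature can be illustrated with a minimal sketch. It is not the authors' construction, whose details are not given in the abstract. It assumes that a subpath is a downward path of exactly k node labels, that the signature is the multiset of hashed subpaths, and that similarity is measured by the multiset Jaccard index. The names `subpaths`, `path_hash`, `signature`, and `similarity` are hypothetical.

```python
import hashlib
from collections import Counter

# A tree is a (label, children) pair. Child order is never used,
# so the signature is the same for any ordering of siblings.

def subpaths(tree, k):
    """Yield every downward path of exactly k labels (assumed definition)."""
    stack = [(tree, ())]  # (node, labels of up to k-1 nearest ancestors)
    while stack:
        (label, children), prefix = stack.pop()
        path = (prefix + (label,))[-k:]  # last k labels ending at this node
        if len(path) == k:
            yield path
        for child in children:
            stack.append((child, path))

def path_hash(path):
    """Map a label path to a stable 64-bit integer."""
    data = "\x1f".join(path).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(tree, k=2):
    """Multiset of hashed length-k subpaths; one pass over the nodes."""
    return Counter(path_hash(p) for p in subpaths(tree, k))

def similarity(sig_a, sig_b):
    """Multiset Jaccard index between two signatures (an assumed measure)."""
    inter = sum((sig_a & sig_b).values())
    union = sum((sig_a | sig_b).values())
    return inter / union if union else 1.0

# Toy trees: t2 is t1 with siblings reordered; t3 moves leaf "d" under "c".
t1 = ("a", [("b", [("d", [])]), ("c", [])])
t2 = ("a", [("c", []), ("b", [("d", [])])])
t3 = ("a", [("b", []), ("c", [("d", [])])])

s1, s2, s3 = signature(t1), signature(t2), signature(t3)
print(similarity(s1, s2))  # reordered siblings give identical signatures
print(similarity(s1, s3))  # two of four distinct subpaths are shared
```

Each node contributes at most one subpath, so for a fixed k the sketch visits every node once, in keeping with the linear construction time mentioned in the abstract. A compact fixed-size signature would need a further step, such as min-wise hashing [5], which is omitted here.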

Notes

  1. The Treebank dataset is obtained from http://www.cs.washington.edu/research/xmldatasets/.

  2. https://doi.org/10.5281/zenodo.3785364.

References

  1. Augsten, N., Böhlen, M.H., Dyreson, C.E., Gamper, J.: Windowed pq-grams for approximate joins of data-centric XML. VLDB J. 21(4), 463–488 (2012)

  2. Augsten, N., Böhlen, M.H., Gamper, J.: Approximate matching of hierarchical data using pq-grams. In: Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, pp. 301–312. ACM (2005)

  3. Bille, P.: A survey on tree edit distance and related problems. Theor. Comput. Sci. 337(1–3), 217–239 (2005)

  4. Bohman, T., Cooper, C., Frieze, A.M.: Min-wise independent linear permutations. Electron. J. Comb. 7 (2000)

  5. Broder, A.Z.: On the resemblance and containment of documents. In: Compression and Complexity of SEQUENCES 1997, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997, Proceedings, pp. 21–29. IEEE (1997)

  6. Buttler, D.: A short survey of document structure similarity algorithms. In: Proceedings of the International Conference on Internet Computing, IC ’04, Las Vegas, Nevada, USA, June 21-24, 2004, Volume 1, pp. 3–9. CSREA Press (2004)

  7. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the 20th ACM Symposium on Computational Geometry, Brooklyn, New York, USA, June 8-11, 2004, pp. 253–262. ACM (2004)

  8. Garofalakis, M.N., Kumar, A.: XML stream processing using tree-edit distance embeddings. ACM Trans. Database Syst. 30(1), 279–332 (2005)

  9. Har-Peled, S., Indyk, P., Motwani, R.: Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory Comput. 8(1), 321–350 (2012)

  10. Hassanat, A.B.: Two-point-based binary search trees for accelerating big data classification using KNN. PLoS ONE 13(11), e0207772 (2018)

  11. Hassanat, A.B.A.: Furthest-pair-based binary search tree for speeding big data classification using k-nearest neighbors. Big Data 6(3), 225–235 (2018)

  12. Hassanat, A.B.A.: Furthest-pair-based decision trees: Experimental results on big data classification. Inf. 9(11), 284 (2018)

  13. Hassanat, A.B.A.: Norm-based binary search trees for speeding up KNN big data classification. Comput. 7(4), 54 (2018)

  14. Ji, J., Li, J., Tian, Q., Yan, S., Zhang, B.: Angular-similarity-preserving binary signatures for linear subspaces. IEEE Trans. Image Process. 24(11), 4372–4380 (2015)

  15. Ji, J., Li, J., Yan, S., Tian, Q., Zhang, B.: Min-max hash for Jaccard similarity. In: 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013, pp. 301–309. IEEE Computer Society (2013)

  16. Ji, J., Yan, S., Li, J., Gao, G., Tian, Q., Zhang, B.: Batch-orthogonal locality-sensitive hashing for angular similarity. IEEE Trans. Pattern Anal. Mach. Intell. 36(10), 1963–1974 (2014)

  17. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)

  18. Kimura, D., Kashima, H.: Fast computation of subpath kernel for trees. CoRR abs/1206.4642 (2012)

  19. Li, P., König, A.C.: b-bit minwise hashing. In: Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pp. 671–680. ACM (2010)

  20. Lin, Z., Wang, H., McClean, S.I.: A multidimensional sequence approach to measuring tree similarity. IEEE Trans. Knowl. Data Eng. 24(2), 197–208 (2012)

  21. Marçais, G., DeBlasio, D.F., Pandey, P., Kingsford, C.: Locality-sensitive hashing for the edit distance. Bioinform. 35(14), i127–i135 (2019)

  22. Shapira, D., Storer, J.A.: Edit distance with move operations. J. Discrete Algorithms 5(2), 380–392 (2007)

  23. Shervashidze, N., Schweitzer, P., van Leeuwen, E.J., Mehlhorn, K., Borgwardt, K.M.: Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res. 12, 2539–2561 (2011)

  24. Tatikonda, S., Parthasarathy, S.: Hashing tree-structured data: Methods and applications. In: Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, pp. 429–440. IEEE Computer Society (2010)

  25. Teixeira, C.H.C., Silva, A., Meira Jr., W.: Min-hash fingerprints for graph kernels: A trade-off among accuracy, efficiency, and compression. J. Inf. Data Manag. 3(3), 227–242 (2012)

  26. Zhang, K., Jiang, T.: Some MAX SNP-hard results concerning unordered labeled trees. Inf. Process. Lett. 49(5), 249–254 (1994)

  27. Zhang, K., Statman, R., Shasha, D.E.: On the editing distance between unordered labeled trees. Inf. Process. Lett. 42(3), 133–139 (1992)

  28. Zhang, W., Ji, J., Zhu, J., Li, J., Xu, H., Zhang, B.: Bithash: An efficient bitwise locality sensitive hashing method with applications. Knowl. Based Syst. 97, 40–47 (2016)

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61702130, 61866007), the Guangxi Natural Science Foundation (Nos. 2020GXNSFAA297186, 2020GXNSFAA159137), the Guangxi Project of Technology Base and Special Talent (No. AD19110022), and the Guangxi Science and Technology Major Project (No. 2018AA32001).

Author information

Corresponding author

Correspondence to Qinlin Li.

Cite this article

Xu, Z., Niu, L., Ji, J. et al. Structure-Preserving Hashing for Tree-Structured Data. SIViP 16, 2045–2053 (2022). https://doi.org/10.1007/s11760-022-02166-7

  • DOI: https://doi.org/10.1007/s11760-022-02166-7
