A Wavelet Transform Based Structural Similarity Model for Semi-structured Texts

  • Jie Su
  • Junpeng Bao
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 135)


Semi-structured texts, including XML and HTML texts, are a basic information format on the Internet and the World Wide Web. A semi-structured text has two aspects: its text content values and its tree-organized structure. Usually, the same text content organized in different structures implies different objects, so the structural similarity of semi-structured texts is essential for searching, indexing, retrieving, querying, or comparing information in web pages. We present a Wavelet Transform Based Structural Similarity Model (WTBSSM) that measures the structural similarity of semi-structured texts quickly and compresses the structural information into a short vector, with the aim of building an efficient index system for semi-structured texts. This paper introduces the Binary Encoding Method, which converts a semi-structured text into a {-1, 1} sequence. The resulting text structure signals are then decomposed by the Discrete Wavelet Transform (DWT) to obtain the approximation coefficients, which are only half the length of the original signals. Finally, the structural similarity is measured by the Euclidean distance between approximation coefficients. The experimental results show that WTBSSM, using only a half or a quarter of the information, keeps almost the same distance distribution as the distance computed directly on the original signals. A comparison with a method based on shortened DWT coefficients suggests that WTBSSM performs better.
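The three steps named in the abstract (encode, decompose, compare) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the Binary Encoding Method, so the rule used here (+1 for each opening tag, -1 for each closing tag) is an assumption, as are the choice of the Haar wavelet, the zero padding of signals to a common length, and every function name.

```python
import math
import re

# Matches opening, closing, and self-closing tags (simplified; not a full parser).
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")


def encode_structure(text):
    """Assumed encoding: +1 per opening tag, -1 per closing tag.

    The paper's Binary Encoding Method may differ; this is only a stand-in
    that yields a {-1, 1} sequence reflecting the tag nesting.
    """
    signal = []
    for match in TAG.finditer(text):
        closing, _name, self_closing = match.groups()
        if self_closing:
            signal.extend([1, -1])  # treat <x/> as <x></x>
        else:
            signal.append(-1 if closing else 1)
    return signal


def haar_approximation(signal, levels=1):
    """Return Haar DWT approximation coefficients after `levels` decompositions.

    Each level halves the length: a[k] = (x[2k] + x[2k+1]) / sqrt(2).
    Odd-length inputs are padded with one zero (an assumption).
    """
    coeffs = [float(v) for v in signal]
    for _ in range(levels):
        if len(coeffs) % 2:
            coeffs.append(0.0)
        coeffs = [
            (coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
            for i in range(0, len(coeffs), 2)
        ]
    return coeffs


def structural_distance(text_a, text_b, levels=1):
    """Euclidean distance between the approximation coefficients of two texts."""
    a = encode_structure(text_a)
    b = encode_structure(text_b)
    n = max(len(a), len(b))
    a += [0] * (n - len(a))  # zero-pad to a common length (assumption)
    b += [0] * (n - len(b))
    return math.dist(haar_approximation(a, levels), haar_approximation(b, levels))


if __name__ == "__main__":
    nested = "<a><b></b></a>"
    flat = "<a></a><b></b>"
    print(structural_distance(nested, nested))
    print(structural_distance(nested, flat))
```

With `levels=1` the compared vectors are half the length of the encoded signals, and with `levels=2` a quarter, which corresponds to the "half or a quarter of information" setting mentioned in the abstract.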


Semi-structured Text; Structural Similarity; Wavelet Transform





Copyright information

© Springer-Verlag GmbH Berlin Heidelberg 2012

Authors and Affiliations

  • Jie Su (1)
  • Junpeng Bao (1)
  1. Department of Computer Science & Technology, Xi’an Jiaotong University, Xi’an, P.R. China
