Semantic Structural Similarity Measure for Clustering XML Documents

  • Conference paper
Web Information Systems and Mining (WISM 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5854)

Abstract

Clustering XML documents semantically has become a major challenge in XML data management. The key research issue is defining a similarity function for XML documents. However, previous work gave more weight to topological structure than to semantic information. In this paper, the similarity between two XML documents is computed from both structural and semantic information. A minimal spanning tree clustering method is then used to cluster the XML documents. The experimental results show that the new method outperforms the baseline similarity measure in terms of purity and Rand index.
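
The abstract names the pipeline only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' method. It assumes a precomputed symmetric distance matrix (for example, one minus a pairwise similarity score), builds a minimum spanning tree with Prim's algorithm, and cuts the heaviest edges to obtain `k` clusters, which is one common form of MST clustering. The function names (`mst_edges`, `mst_clusters`, `purity`, `rand_index`) and the toy distances are hypothetical; purity and Rand index follow their standard textbook definitions.

```python
from collections import Counter
from itertools import combinations


def mst_edges(dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns the n-1 tree edges as (weight, u, v) tuples.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((dist[parent[u]][u], parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def mst_clusters(dist, k):
    """Cut the k-1 heaviest MST edges and label the connected components."""
    n = len(dist)
    kept = sorted(mst_edges(dist))[: n - k]  # keep the n-k lightest tree edges
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path compression
            x = root[x]
        return x

    for _, u, v in kept:
        root[find(u)] = find(v)
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


def purity(pred, truth):
    """Fraction of items that belong to the majority class of their cluster."""
    total = 0
    for c in set(pred):
        members = [t for p, t in zip(pred, truth) if p == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(pred)


def rand_index(pred, truth):
    """Fraction of item pairs on which the clustering and the labels agree."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (truth[i] == truth[j]) for i, j in pairs)
    return agree / len(pairs)


# Toy distances (assumed, not from the paper): documents 0 and 1 are close, as are 2 and 3.
toy_dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
toy_labels = mst_clusters(toy_dist, k=2)
```

On the toy matrix, the single heaviest tree edge is the one bridging the two pairs, so cutting it separates them. Real use would replace `toy_dist` with distances derived from whatever structural and semantic similarity measure is chosen.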

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Song, L., Ma, J., Lei, J., Zhang, D., Wang, Z. (2009). Semantic Structural Similarity Measure for Clustering XML Documents. In: Liu, W., Luo, X., Wang, F.L., Lei, J. (eds) Web Information Systems and Mining. WISM 2009. Lecture Notes in Computer Science, vol 5854. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-05250-7_25

  • DOI: https://doi.org/10.1007/978-3-642-05250-7_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-05249-1

  • Online ISBN: 978-3-642-05250-7

  • eBook Packages: Computer Science, Computer Science (R0)
