Semantic Structural Similarity Measure for Clustering XML Documents

Song, Ling; Ma, Jun; Lei, Jingsheng; Zhang, Dongmei; Wang, Zhen

doi:10.1007/978-3-642-05250-7_25

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5854)

Included in the following conference series:

International Conference on Web Information Systems and Mining

Abstract

Clustering XML documents semantically has become a major challenge in XML data management. The key research issue is finding suitable similarity functions for XML documents. Previous work, however, gave more weight to topological structure than to semantic information. In this paper, the similarity between two XML documents is computed from both structural and semantic information, and a minimal spanning tree clustering method is then used to cluster the documents. The experimental results show that the new method performs better than a baseline similarity measure in terms of purity and Rand index.
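
The clustering and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity formula, so the example assumes a precomputed symmetric distance matrix (for instance `1 - similarity`), and the function names `mst_edges`, `mst_clusters`, `purity`, and `rand_index` are hypothetical. The sketch uses the common MST clustering variant that builds the tree and cuts its `k - 1` heaviest edges, then scores the result with purity and the Rand index.

```python
from collections import Counter
from itertools import combinations


def mst_edges(dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns the n-1 tree edges as (weight, u, v) tuples.
    """
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        w, u, v = min(
            (dist[u][v], u, v)
            for u in in_tree
            for v in range(n)
            if v not in in_tree
        )
        edges.append((w, u, v))
        in_tree.add(v)
    return edges


def mst_clusters(dist, k):
    """Cut the k-1 heaviest MST edges and return one cluster label per document."""
    n = len(dist)
    kept = sorted(mst_edges(dist))[: n - k]  # keep the n-k lightest edges
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, u, v in kept:
        parent[find(u)] = find(v)
    roots = {}
    # Number clusters in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def purity(labels, truth):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = 0
    for c in set(labels):
        members = [t for l, t in zip(labels, truth) if l == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)


def rand_index(labels, truth):
    """Fraction of document pairs on which the clustering and the truth agree."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (labels[i] == labels[j]) == (truth[i] == truth[j]) for i, j in pairs
    )
    return agree / len(pairs)


# Illustrative distances for five documents (made-up values, not from the paper).
dist = [
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.70, 0.95],
    [0.90, 0.85, 0.70, 0.00, 0.05],
    [0.80, 0.90, 0.95, 0.05, 0.00],
]
labels = mst_clusters(dist, k=2)
truth = ["a", "a", "b", "b", "b"]  # hypothetical ground-truth classes
print(labels, purity(labels, truth), rand_index(labels, truth))
```

Cutting the heaviest edges is only one way to partition a spanning tree; the paper may use a different cut criterion or stopping rule.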

Author information

Authors and Affiliations

School of Computer Science & Technology, Shandong University, 250101, China
Ling Song, Jun Ma & Dongmei Zhang
School of Computer Science & Technology, Shandong Jianzhu University, 250101, China
Ling Song, Dongmei Zhang & Zhen Wang
College of Computer, Nanjing University of Posts and Telecommunications, Nanjing, 210003, China
Jingsheng Lei

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong, China
Wenyin Liu & Fu Lee Wang
Key Laboratory of Grid Technology, Digital Content Analysis and Semantic Grid Group, Shanghai University, 200072, Shanghai, China
Xiangfeng Luo
College of Information Science and Technology, Hainan University, 570228, Haikou, China
Jingsheng Lei

About this paper

Cite this paper

Song, L., Ma, J., Lei, J., Zhang, D., Wang, Z. (2009). Semantic Structural Similarity Measure for Clustering XML Documents. In: Liu, W., Luo, X., Wang, F.L., Lei, J. (eds) Web Information Systems and Mining. WISM 2009. Lecture Notes in Computer Science, vol 5854. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-05250-7_25

DOI: https://doi.org/10.1007/978-3-642-05250-7_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05249-1
Online ISBN: 978-3-642-05250-7
eBook Packages: Computer Science, Computer Science (R0)
