Abstract
Clustering XML documents semantically has become a major challenge in XML data management. The key research issue is defining similarity functions for XML documents. However, previous work gave more weight to topological structure than to semantic information. In this paper, the similarity between two XML documents is computed from both structural and semantic information. A minimal spanning tree clustering method is then used to cluster the XML documents. The experimental results show that the new method performs better than the baseline similarity measure in terms of purity and Rand index.
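To make the clustering and evaluation steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes the pairwise similarities have already been turned into a symmetric distance matrix (for example, distance = 1 - similarity); the abstract does not give the similarity function itself, so the toy matrix `D` and the names `mst_edges`, `mst_cluster`, `purity`, and `rand_index` are all hypothetical. The clustering shown is the common variant that builds a minimum spanning tree and cuts its k-1 heaviest edges.

```python
from collections import Counter
from itertools import combinations


def mst_edges(dist):
    """Prim's algorithm on a dense symmetric matrix; returns (weight, u, v) edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((dist[parent[u]][u], parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def mst_cluster(dist, k):
    """Drop the k-1 heaviest MST edges and label the connected components."""
    n = len(dist)
    kept = sorted(mst_edges(dist))[: n - k]  # the n-k lightest of n-1 edges
    root = list(range(n))  # union-find parent array

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for _, u, v in kept:
        root[find(u)] = find(v)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


def purity(pred, truth):
    """Fraction of items that belong to the majority true class of their cluster."""
    hits = 0
    for c in set(pred):
        members = [t for p, t in zip(pred, truth) if p == c]
        hits += Counter(members).most_common(1)[0][1]
    return hits / len(pred)


def rand_index(pred, truth):
    """Fraction of item pairs on which the clustering and the true classes agree."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (truth[i] == truth[j]) for i, j in pairs)
    return agree / len(pairs)


# Hypothetical distances for five documents: {0, 1, 2} and {3, 4} are close.
D = [
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.70, 0.75],
    [0.90, 0.85, 0.70, 0.00, 0.10],
    [0.80, 0.90, 0.75, 0.10, 0.00],
]
labels = mst_cluster(D, 2)
print(labels, purity(labels, list("aaabb")), rand_index(labels, list("aaabb")))
```

Cutting the heaviest tree edges is equivalent to single-linkage clustering, so the quality of the result depends almost entirely on the distance matrix supplied, which is where a combined structural and semantic measure would enter.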
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Song, L., Ma, J., Lei, J., Zhang, D., Wang, Z. (2009). Semantic Structural Similarity Measure for Clustering XML Documents. In: Liu, W., Luo, X., Wang, F.L., Lei, J. (eds) Web Information Systems and Mining. WISM 2009. Lecture Notes in Computer Science, vol 5854. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-05250-7_25
DOI: https://doi.org/10.1007/978-3-642-05250-7_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05249-1
Online ISBN: 978-3-642-05250-7
eBook Packages: Computer Science (R0)