Abstract
We present a novel clustering algorithm that groups XML documents by structural similarity. We introduce a Level structure format that represents XML documents for efficient processing. We develop a global criterion function that does not require pair-wise similarity to be computed between individual documents; instead, it measures similarity at the clustering level using the structural information of the XML documents. Experimental analysis shows the method to be fast and accurate.
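The abstract does not spell out the Level structure format, so the following is only a minimal sketch of one plausible reading: each XML document is reduced to the set of element names found at each depth of its tree. The function name `level_structure` and the sample document are hypothetical and are not taken from the paper, and the global criterion function is not reproduced here because the abstract gives no definition of it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def level_structure(xml_text):
    """Return a list whose i-th entry is the set of element tags at depth i.

    Assumption: a "level structure" groups element names by tree depth.
    This is an illustrative reading, not the paper's exact format.
    """
    root = ET.fromstring(xml_text)
    levels = defaultdict(set)
    stack = [(root, 0)]  # iterative traversal avoids recursion limits
    while stack:
        node, depth = stack.pop()
        levels[depth].add(node.tag)
        for child in node:
            stack.append((child, depth + 1))
    # Depths are contiguous from 0, so they can be listed in order.
    return [levels[d] for d in range(len(levels))]


# Hypothetical sample document, used only to exercise the function.
doc = "<book><title/><author><name/></author></book>"
structure = level_structure(doc)
```

Under this reading, two documents can be compared level by level on their tag sets without building or matching full trees, which is one way such a representation could make processing cheaper.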
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nayak, R., Xu, S. (2006). XCLS: A Fast and Effective Clustering Algorithm for Heterogenous XML Documents. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_35
DOI: https://doi.org/10.1007/11731139_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science, Computer Science (R0)