XCLS: A Fast and Effective Clustering Algorithm for Heterogenous XML Documents

  • Richi Nayak
  • Sumei Xu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3918)


We present a novel clustering algorithm that groups XML documents by structural similarity. We introduce a Level structure format that represents XML documents for efficient processing. We develop a global criterion function that does not require pair-wise similarity to be computed between individual documents; instead, it measures similarity at the clustering level, utilising the structural information of the XML documents. Experimental analysis shows the method to be fast and accurate.
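As a rough illustration of what a per-level representation of an XML document could look like, the sketch below groups element names by their depth in the tree. The abstract does not define the Level structure format, so the function name `level_structure` and the choice to store each level as a sorted list of distinct tags are assumptions made here for illustration only, not the authors' definition.

```python
import xml.etree.ElementTree as ET


def level_structure(xml_text):
    """Return a list whose i-th entry holds the distinct element tags at depth i.

    Hypothetical helper: the layout (one sorted list of tags per level)
    is an assumption, not the format defined in the paper.
    """
    root = ET.fromstring(xml_text)
    levels = []
    frontier = [root]  # all elements at the current depth
    while frontier:
        # Record the distinct tags seen at this depth.
        levels.append(sorted({node.tag for node in frontier}))
        # Move one level down: collect the children of every current element.
        frontier = [child for node in frontier for child in node]
    return levels


if __name__ == "__main__":
    doc = "<book><title/><author><name/></author><author><name/></author></book>"
    print(level_structure(doc))
```

Two documents with the same tags at the same depths would map to the same level lists under this sketch, which is the kind of compact structural summary a cluster-level similarity measure could compare without a full tree edit distance.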


Level Structure; Cluster Solution; Tree Edit Distance; Global Similarity Measure; Effective Cluster Algorithm





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Richi Nayak (1)
  • Sumei Xu (1)
  1. School of Information Systems, Queensland University of Technology, Brisbane, Australia
