Clustering XML Documents Using Frequent Subtrees

  • Sangeetha Kutty
  • Tien Tran
  • Richi Nayak
  • Yuefeng Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5631)


This paper presents an experimental study conducted over the INEX 2008 Document Mining Challenge corpus that uses both the structure and the content of XML documents to cluster them. Concise common substructures, known as closed frequent subtrees, are generated from the structural information of the XML documents. The closed frequent subtrees are then used to extract the constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is built from the extracted constrained content. The k-way clustering algorithm is applied to the matrix to obtain the required clusters. Despite the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based clustering approach clustered the documents successfully. This approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.
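The pipeline summarised in the abstract (mine frequent structure, keep only the content under that structure, build a term matrix) can be illustrated with a minimal sketch. It is not the authors' implementation: as a loudly simplified stand-in for closed frequent subtree mining, it treats root-to-node tag paths that occur in at least `min_support` documents as the "frequent structure". The function names, the `min_support` parameter, and the whitespace tokenisation are all assumptions made for illustration, and the final clustering step is left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(root):
    """Collect the set of root-to-node tag paths in one document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, ())
    return paths


def frequent_paths(roots, min_support):
    """Keep paths that occur in at least min_support documents.

    Assumption: this path-based filter only approximates the role of
    closed frequent subtrees; it is not a subtree miner.
    """
    counts = Counter()
    for root in roots:
        counts.update(tag_paths(root))
    return {path for path, count in counts.items() if count >= min_support}


def constrained_terms(root, allowed):
    """Count terms only from the text of nodes whose path is frequent."""
    terms = Counter()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        if path in allowed and node.text:
            terms.update(node.text.lower().split())
        for child in node:
            walk(child, path)

    walk(root, ())
    return terms


def term_matrix(docs_xml, min_support):
    """Return (vocabulary, document-by-term count matrix)."""
    roots = [ET.fromstring(xml) for xml in docs_xml]
    allowed = frequent_paths(roots, min_support)
    rows = [constrained_terms(root, allowed) for root in roots]
    vocab = sorted(set().union(*rows))
    matrix = [[row[term] for term in vocab] for row in rows]
    return vocab, matrix
```

Calling `term_matrix(docs, min_support=2)` on a list of XML strings yields a vocabulary restricted to content under the shared structure, so text under rare elements never enters the matrix. This mirrors the dimensionality reduction the abstract describes; the resulting rows could then be passed to any clustering routine.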


Clustering, XML document mining, Frequent mining, Frequent subtrees, INEX, Wikipedia, Structure and content





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Sangeetha Kutty
    • 1
  • Tien Tran
    • 1
  • Richi Nayak
    • 1
  • Yuefeng Li
    • 1
  1. Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia
