Document Clustering with K-tree

  • Christopher M. De Vries
  • Shlomo Geva
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5631)


This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many document clustering problems are large-scale, and K-tree scales well with large inputs due to its low complexity. It offers promising results in terms of both efficiency and quality. Document classification was completed using Support Vector Machines.


Keywords: INEX, XML Mining, Clustering, K-tree, Tree, Vector Quantization, Text Classification, Support Vector Machine





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Christopher M. De Vries (1)
  • Shlomo Geva (1)
  1. Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia
