Document Clustering with K-tree

  • Conference paper
Advances in Focused Retrieval (INEX 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5631)

Abstract

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Document clustering poses many large-scale problems, and K-tree scales well with large inputs due to its low complexity. It offers promising results in terms of both efficiency and quality. Document classification was completed using Support Vector Machines.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

De Vries, C.M., Geva, S. (2009). Document Clustering with K-tree. In: Geva, S., Kamps, J., Trotman, A. (eds) Advances in Focused Retrieval. INEX 2008. Lecture Notes in Computer Science, vol 5631. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03761-0_43

  • DOI: https://doi.org/10.1007/978-3-642-03761-0_43

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03760-3

  • Online ISBN: 978-3-642-03761-0

  • eBook Packages: Computer Science, Computer Science (R0)
