Document Clustering with K-tree

De Vries, Christopher M.; Geva, Shlomo

doi:10.1007/978-3-642-03761-0_43

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5631)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many large-scale problems exist in document clustering, and K-tree scales well with large inputs due to its low complexity. It offers promising results in terms of both efficiency and quality. Document classification was completed using Support Vector Machines.

Author information

Authors and Affiliations

Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia
Christopher M. De Vries & Shlomo Geva

Editor information

Editors and Affiliations

Faculty of Science and Technology, Queensland University of Technology, GPO Box 2434, 4001, Brisbane, Qld, Australia
Shlomo Geva
Archives and Information Studies/Humanities, University of Amsterdam, Turfdraagsterpad 9, 1012 XT, Amsterdam, The Netherlands
Jaap Kamps
Department of Computer Science, University of Otago, P.O. Box 56, 9054, Dunedin, New Zealand
Andrew Trotman

About this paper

Cite this paper

De Vries, C.M., Geva, S. (2009). Document Clustering with K-tree. In: Geva, S., Kamps, J., Trotman, A. (eds) Advances in Focused Retrieval. INEX 2008. Lecture Notes in Computer Science, vol 5631. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03761-0_43

DOI: https://doi.org/10.1007/978-3-642-03761-0_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03760-3
Online ISBN: 978-3-642-03761-0
eBook Packages: Computer Science, Computer Science (R0)