
Text Clustering with String Kernels in R

  • Alexandros Karatzoglou
  • Ingo Feinerer
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

We present a package that provides a general framework, including tools and algorithms, for text mining in R using the S4 class system. Using this package together with the kernlab R package, we explore the use of kernel methods for clustering (e.g., kernel k-means and spectral clustering) on a set of text documents, using string kernels. We compare these methods with a more traditional clustering technique, k-means on a bag-of-words representation of the text, and evaluate the viability of kernel-based methods for text clustering.
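
The idea of clustering documents through a string kernel can be illustrated with a small sketch. This is not the authors' code (the paper works in R with kernlab); it is a plain Python illustration that assumes a normalized spectrum kernel, one simple kind of string kernel, and a basic kernel k-means loop on the precomputed kernel matrix. The function names, the substring length n=3, the toy documents, and the initial cluster labels are all hypothetical choices made for the example.

```python
from collections import Counter
from math import sqrt

def spectrum_features(text, n=3):
    """Count every contiguous substring of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def spectrum_kernel(a, b, n=3):
    """Normalized spectrum kernel: cosine similarity of n-gram count vectors."""
    fa, fb = spectrum_features(a, n), spectrum_features(b, n)
    dot = sum(count * fb[gram] for gram, count in fa.items())
    norm = sqrt(sum(c * c for c in fa.values()) * sum(c * c for c in fb.values()))
    return dot / norm if norm else 0.0

def kernel_matrix(docs, n=3):
    """Pairwise kernel values for a list of documents."""
    return [[spectrum_kernel(a, b, n) for b in docs] for a in docs]

def kernel_kmeans(K, labels, k, max_iter=20):
    """Kernel k-means on a precomputed kernel matrix K, starting from given labels."""
    m = len(K)
    for _ in range(max_iter):
        clusters = [[i for i in range(m) if labels[i] == c] for c in range(k)]
        new_labels = []
        for i in range(m):
            best, best_dist = labels[i], float("inf")
            for c, members in enumerate(clusters):
                if not members:
                    continue  # skip empty clusters
                size = len(members)
                # Squared feature-space distance from document i to the cluster mean,
                # expressed only through kernel values.
                dist = (K[i][i]
                        - 2.0 * sum(K[i][j] for j in members) / size
                        + sum(K[j][l] for j in members for l in members) / size ** 2)
                if dist < best_dist:
                    best, best_dist = c, dist
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
    return labels

# Toy documents (hypothetical) and a deliberately poor starting assignment.
docs = [
    "kernel methods for text clustering",
    "kernel methods for text mining",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
K = kernel_matrix(docs, n=3)
labels = kernel_kmeans(K, [0, 0, 0, 1], k=2)
```

The clustering step never looks at the documents themselves, only at the kernel matrix, which is why the same loop would work with any other string kernel. Like ordinary k-means, this loop depends on its starting assignment and can settle in a poor local solution.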

Keywords

Spectral Cluster; Recall Rate; Text Document; Kernel Matrix; String Length
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Alexandros Karatzoglou (1)
  • Ingo Feinerer (2)
  1. Department of Statistics and Probability Theory, Technische Universität Wien, Wien, Austria
  2. Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Wien, Austria
