Machine Learning, Volume 46, Issue 1–3, pp 423–444

Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?

  • Edda Leopold
  • Jörg Kindermann


The choice of the kernel function is crucial to most applications of support vector machines (SVMs). In this paper, however, we show that in the case of text classification, term-frequency transformations have a larger impact on the performance of SVMs than the kernel itself. We discuss the role of importance weights (e.g. document frequency and redundancy), which is not yet fully understood in the light of model complexity and calculation cost, and we show that time-consuming lemmatization or stemming can be avoided even when classifying a highly inflectional language like German.
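As a rough illustration of what a term-frequency transformation combined with an importance weight can look like, the following sketch maps tokenized documents to weighted, length-normalized vectors. The specific choices here (a log(1 + tf) transformation, an inverse-document-frequency weight, and L2 normalization) are common textbook options assumed for illustration; the abstract does not state which formulas the paper evaluates, and the function name `weighted_vectors` is hypothetical. The tokens are unlemmatized word forms, in line with the claim that stemming can be skipped.

```python
import math
from collections import Counter


def weighted_vectors(docs):
    """Map tokenized documents to L2-normalized log-TF * IDF vectors (as dicts)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    # Importance weight (assumed form): inverse document frequency.
    idf = {term: math.log(n_docs / df[term]) for term in df}
    vectors = []
    for c in counts:
        # Frequency transformation (assumed form): log(1 + raw count).
        vec = {term: math.log(1 + tf) * idf[term] for term, tf in c.items()}
        # Normalize to unit Euclidean length so document length does not dominate.
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {term: v / norm for term, v in vec.items()}
        vectors.append(vec)
    return vectors


# Toy corpus of raw (unstemmed) word forms.
docs = [
    "der hund bellt".split(),
    "die katze miaut".split(),
    "der hund rennt".split(),
]
vecs = weighted_vectors(docs)
```

In this toy corpus, "bellt" occurs in only one document and therefore receives a larger weight in the first vector than "der", which occurs in two. The resulting sparse vectors could then be passed to an SVM with any kernel.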

support vector machines, text classification, lemmatization, stemming, kernel functions



Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Edda Leopold (1)
  • Jörg Kindermann (1)
  1. GMD German National Research Center for Information Technology, Institute for Autonomous intelligent Systems, Schloss Birlinghoven, Sankt Augustin, Germany
