Comparison of SVM and Some Older Classification Algorithms in Text Classification Tasks

  • Fabrice Colas
  • Pavel Brazdil
Part of the IFIP International Federation for Information Processing book series (IFIPAICT, volume 217)

Summary

Document classification has already been widely studied. Some studies compared feature selection techniques or feature space transformations, whereas others compared the performance of different algorithms. Recently, following the rising interest in the Support Vector Machine (SVM), various studies showed that SVM outperforms other classification algorithms. So should we simply disregard other classification algorithms and always opt for SVM?

We decided to investigate this issue and compared SVM to kNN and naive Bayes on binary classification tasks. An important point is that we compared optimized versions of these algorithms. Our results show that all the classifiers achieved comparable performance on most problems. One surprising result is that SVM was not a clear winner, despite quite good overall performance. If suitable preprocessing is used with kNN, this algorithm continues to achieve very good results and scales up well with the number of documents, which is not the case for SVM. Naive Bayes also achieved good performance.
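To make the kind of comparison described above concrete, the following is a minimal sketch of two of the three classifiers, kNN with cosine similarity and multinomial naive Bayes with add-one smoothing, evaluated on a binary task. It is not the authors' implementation: the toy documents, the function names, the choice of k = 3, and the smoothing scheme are all assumptions made for illustration. SVM is left out because a proper solver does not fit in a short example; any library implementation could be passed to the same `accuracy` helper.

```python
import math
from collections import Counter

# Hypothetical toy corpus (label 1 = spam, 0 = not spam); not data from the study.
TRAIN = [
    ("cheap pills buy now", 1),
    ("buy cheap watches now", 1),
    ("limited offer buy pills", 1),
    ("meeting schedule for monday", 0),
    ("project meeting notes attached", 0),
    ("monday project schedule update", 0),
]
TEST = [("buy cheap pills", 1), ("project meeting monday", 0)]


def tokenize(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_predict(train, doc, k=3):
    """Majority vote among the k training documents most similar to doc."""
    vec = tokenize(doc)
    ranked = sorted(train, key=lambda ex: cosine(vec, tokenize(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def nb_train(train):
    """Collect per-class document counts, per-class term counts, and the vocabulary."""
    class_docs, word_counts, vocab = Counter(), {}, set()
    for text, label in train:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok, n in tokenize(text).items():
            counts[tok] += n
            vocab.add(tok)
    return class_docs, word_counts, vocab


def nb_predict(model, doc):
    """Multinomial naive Bayes with add-one smoothing, scored in log space."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for tok, n in tokenize(doc).items():
            if tok in vocab:  # terms never seen in training are ignored
                score += n * math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(predict, test):
    """Fraction of test documents whose predicted label matches the true label."""
    return sum(predict(text) == label for text, label in test) / len(test)


nb_model = nb_train(TRAIN)
knn_acc = accuracy(lambda d: knn_predict(TRAIN, d), TEST)
nb_acc = accuracy(lambda d: nb_predict(nb_model, d), TEST)
print(f"kNN accuracy: {knn_acc:.2f}, naive Bayes accuracy: {nb_acc:.2f}")
```

On a corpus this small both classifiers separate the classes trivially, so the numbers say nothing about relative merit. The sketch only shows the shape of such an experiment: a shared representation, one prediction function per algorithm, and a common evaluation measure.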

Copyright information

© International Federation for Information Processing 2006

Authors and Affiliations

  • Fabrice Colas (1)
  • Pavel Brazdil (2)
  1. LIACS, Leiden University, The Netherlands
  2. LIACC-NIAAD, University of Porto, Portugal
