Interactions Between Document Representation and Feature Selection in Text Categorization

  • Miloš Radovanović
  • Mirjana Ivanović
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4080)


Many studies in automated Text Categorization focus on the performance of classifiers, with or without considering feature selection methods, but almost as a rule take into account just one document representation. Only relatively recently did detailed studies on the impact of various document representations step into the spotlight, showing that there may be statistically significant differences in classifier performance even among variations of the classical bag-of-words model. This paper examines the relationship between the idf transform and several widely used feature selection methods, in the context of Naïve Bayes and Support Vector Machines classifiers, on datasets extracted from the dmoz ontology of Web-page descriptions. The described experimental study shows that the idf transform considerably affects the distribution of classification performance over feature selection reduction rates. It also offers an evaluation method that permits the discovery of relationships between different document representations and feature selection methods, independently of absolute differences in classification performance.
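
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes one common idf variant, idf(t) = log(N / df(t)), and uses information gain as the feature selection criterion; the function names (`idf_weights`, `tfidf`, `information_gain`, `select_features`) and the `keep_fraction` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Assumed idf variant: idf(t) = log(N / df(t)); other variants exist."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    return {t: math.log(n / c) for t, c in df.items()}

def tfidf(doc, idf):
    """Bag-of-words term frequencies reweighted by idf."""
    tf = Counter(doc)
    return {t: f * idf[t] for t, f in tf.items()}

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs."""
    n = len(docs)
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    gain = entropy(list(Counter(labels).values()))
    for part in (with_t, without_t):
        if part:
            gain -= len(part) / n * entropy(list(Counter(part).values()))
    return gain

def select_features(docs, labels, keep_fraction):
    """Keep the top `keep_fraction` of the vocabulary ranked by information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    k = max(1, round(keep_fraction * len(vocab)))
    return ranked[:k]
```

Varying `keep_fraction` corresponds to varying the feature selection reduction rate; comparing classifier performance on raw term frequencies against the `tfidf` output at each rate is one way to look for the kind of interaction the abstract describes.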


Keywords: Support Vector Machine; Feature Selection; Information Gain; Feature Selection Method; Term Frequency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Miloš Radovanović (1)
  • Mirjana Ivanović (1)
  1. Department of Mathematics and Informatics, University of Novi Sad, Faculty of Science, Novi Sad, Serbia and Montenegro
