Document Representations for Classification of Short Web-Page Descriptions

  • Miloš Radovanović
  • Mirjana Ivanović
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4081)


Motivated by the application of Text Categorization to sorting Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts are short Web-page descriptions from the dmoz Open Directory Web-page ontology. Several transformations of the input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have statistically significant effects, either improving or degrading classification performance as measured by classical metrics: accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation for each classifier, but rather on describing the effect of each individual transformation on classification, together with the relationships among the transformations.
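
To make the transformations named in the abstract concrete, the following is a minimal Python sketch of three of them (logtf, idf and normalization) applied to a bag-of-words representation, plus the F-measure used for F1 and F2. The abstract does not give the exact formulas, so the sketch assumes common textbook definitions: logtf replaces a term frequency tf with log(1 + tf), idf multiplies each weight by log(N / df), and normalization scales each document vector to unit Euclidean length. All function names are hypothetical, tokenization is plain whitespace splitting, and stemming and dimensionality reduction are left out.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, raw term-frequency vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[term]) for term in vocab])
    return vocab, vectors


def logtf(vectors):
    """Assumed definition: dampen each weight x to log(1 + x)."""
    return [[math.log(1.0 + x) for x in vec] for vec in vectors]


def idf(vectors):
    """Assumed definition: multiply each weight by log(N / df) of its term."""
    n_docs = len(vectors)
    n_terms = len(vectors[0]) if vectors else 0
    doc_freq = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    weights = [math.log(n_docs / df) if df > 0 else 0.0 for df in doc_freq]
    return [[x * w for x, w in zip(vec, weights)] for vec in vectors]


def normalize(vectors):
    """Scale each vector to unit Euclidean length (zero vectors are kept)."""
    result = []
    for vec in vectors:
        length = math.sqrt(sum(x * x for x in vec))
        result.append([x / length for x in vec] if length > 0 else list(vec))
    return result


def f_beta(precision, recall, beta=1.0):
    """F-measure; beta=1 gives F1, beta=2 gives F2 (recall weighted higher)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    docs = ["open directory web web", "web search results"]
    vocab, tf = bag_of_words(docs)
    # The transformations compose; here: logtf, then idf, then normalization.
    for vec in normalize(idf(logtf(tf))):
        print(dict(zip(vocab, (round(x, 3) for x in vec))))
```

Because each function maps a list of vectors to a list of vectors, any subset of the transformations can be switched on or off independently, which mirrors the way the study examines the effect of each transformation separately and in combination.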


Term Frequency; Word Sense Disambiguation; Inverse Document Frequency; Sequential Minimal Optimization; Document Representation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Miloš Radovanović (1)
  • Mirjana Ivanović (1)
  1. Faculty of Science, Department of Mathematics and Informatics, University of Novi Sad, Novi Sad, Serbia and Montenegro
