News Articles Classification Using Random Forests and Weighted Multimodal Features

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8849)


This research investigates the problem of news article classification. Classification is performed using N-gram textual features extracted from the article text and visual features generated from one representative image. The application domain is English-language news articles belonging to four categories (Business-Finance, Lifestyle-Leisure, Science-Technology, and Sports), downloaded from three well-known news websites (BBC, Reuters, and The Guardian). Several classification experiments were performed with the Random Forests machine learning method. Using the N-gram textual features alone yielded much higher accuracy (84.4%) than using the visual features alone (53%), while combining the two feature types improved accuracy slightly further (86.2%). The main contribution of this work is a news article classification framework based on Random Forests and multimodal (textual and visual) features, together with a late fusion strategy that makes use of the operational capabilities of Random Forests.
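As a rough illustration of what late fusion means here, the sketch below combines the class-probability outputs of two modality-specific classifiers with a weighted average. It is a minimal sketch under stated assumptions and is not the weighting scheme of the paper, which the abstract does not specify: the function names `late_fusion` and `predict`, the example probability vectors, and the choice of each modality's standalone accuracy as its weight are all hypothetical.

```python
from typing import Dict

# The four categories named in the abstract.
CATEGORIES = ["Business-Finance", "Lifestyle-Leisure", "Science-Technology", "Sports"]


def late_fusion(
    text_probs: Dict[str, float],
    visual_probs: Dict[str, float],
    text_weight: float,
    visual_weight: float,
) -> Dict[str, float]:
    """Weighted average of the per-class probabilities of two classifiers."""
    total = text_weight + visual_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        c: (text_weight * text_probs[c] + visual_weight * visual_probs[c]) / total
        for c in CATEGORIES
    }


def predict(probs: Dict[str, float]) -> str:
    """Return the category with the highest probability."""
    return max(probs, key=probs.get)


# Hypothetical outputs of a text-only and an image-only classifier for one article.
text_probs = {"Business-Finance": 0.40, "Lifestyle-Leisure": 0.10,
              "Science-Technology": 0.45, "Sports": 0.05}
visual_probs = {"Business-Finance": 0.55, "Lifestyle-Leisure": 0.15,
                "Science-Technology": 0.20, "Sports": 0.10}

# Assumption: weight each modality by its standalone accuracy from the abstract
# (84.4% textual, 53% visual). This choice is illustrative only.
fused = late_fusion(text_probs, visual_probs, text_weight=0.844, visual_weight=0.53)

print(predict(text_probs))  # decision of the text-only classifier
print(predict(fused))       # decision after fusing both modalities
```

In this made-up example the text-only classifier narrowly prefers one category, and the visual evidence is strong enough to change the fused decision even though the visual modality carries the smaller weight.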


Keywords: Document classification, Supervised learning, Multimodal, News articles, N-gram features, Random Forests, Visual features, Fusion





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Information Technologies Institute, Centre for Research and Technology Hellas, Thermi-Thessaloniki, Greece
  2. Dept. of Computer Science, Jerusalem College of Technology - Lev Academic Center, Jerusalem, Israel
