Skip to main content

Classification Using Various Machine Learning Methods and Combinations of Key-Phrases and Visual Features

  • Conference paper
  • First Online:
Book cover Semantic Keyword-Based Search on Structured Data Sources (IKC 2015)

Abstract

In this paper, we present a comparative study of news documents classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from text) and visual features (extracted from a representative image from each document). The application domain is news documents written in English that belong to four categories: Health, Lifestyle-Leisure, Nature-Environment and Politics. The use of the N-gram textual feature set alone led to an accuracy result of 81.0 %, which is much better than the corresponding accuracy result (58.4 %) obtained through the use of the visual feature set alone. A competition between three classification methods, a feature selection method, and parameter tuning led to improved accuracy (86.7 %), achieved by the Random Forests method.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Institutional subscriptions

References

  1. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)

    Article  Google Scholar 

  2. Ozgür, A.: Supervised and unsupervised machine learning techniques for text document categorization. Doctoral dissertation, Bogaziçi University (2004)

    Google Scholar 

  3. Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Informatica 31, 249–268 (2007)

    MathSciNet  Google Scholar 

  4. Aggarwal, C.C., Zhai, C.: Mining Text Data. Springer, Heidelberg (2012)

    Book  Google Scholar 

  5. Pazienza, M.T.: Information Extraction A Multidisciplinary Approach to an Emerging Information Technology. LNCS, vol. 1299. Springer, Heidelberg (1997)

    Book  Google Scholar 

  6. Sebastiani, F.: Text categorization. In: Zanasi, A. (ed.) Text Mining and its Applications to Intelligence. CRM and Knowledge Management, pp. 109–129. WIT Press, Southampton (2005)

    Chapter  Google Scholar 

  7. Kim, S.M., Hovy, E.: Automatic identification of pro and con reasons in online reviews. In: Proceedings of the COLING/ACL on Main Conference Poster Sessions, pp. 483–490. Association for Computational Linguistics (2006)

    Google Scholar 

  8. Kešelj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING, vol. 3, pp. 255–264 (2003)

    Google Scholar 

  9. Reddy, D.K.S., Pujari, A.K.: N-gram analysis for computer virus detection. J. Comput. Virol. 2(3), 231–239 (2006)

    Article  Google Scholar 

  10. Wang, X., McCallum, A., Wei, X.: Topical N-grams: phrase and topic discovery, with an application to information retrieval. In: Seventh IEEE International Conference on ICDM, pp.697–702 (2007)

    Google Scholar 

  11. Ikeda, D., Takamura, H., Okumura, M.: Semi-supervised learning for blog classification. In: AAAI, pp. 1156–1161 (2008)

    Google Scholar 

  12. HaCohen-Kerner, Y., Rosenfeld, A., Tzidkani, M., Cohen, D.N.: Classifying papers from different computer science conferences. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds.) ADMA 2013, Part I. LNCS, vol. 8346, pp. 529–541. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  13. HaCohen-Kerner, Y., Beck, H., Yehudai, E., Mughaz, D.: Stylistic feature sets as classifiers of documents according to their historical period and ethnic origin. Appl. Artif. Intell. 24(9), 847–862 (2010)

    Article  Google Scholar 

  14. HaCohen-Kerner, Y., Beck, H., Yehudai, E., Rosenstein, M., Mughaz, D.: Cuisine: classification using stylistic feature sets and/or name-based feature sets. J. Am. Soc. Inf. Sci. Technol. 61(8), 1644–1657 (2010)

    Google Scholar 

  15. Kennedy, A., Inkpen, D.: Sentiment classification of movie reviews using contextual valence shifters. Comput. Intell. 22(2), 110–125 (2006)

    Article  MathSciNet  Google Scholar 

  16. Gamon, M., Basu, S., Belenko, D., Fisher, D., Hurst, M., König, A.C.: BLEWS: using blogs to provide context for news articles. In: Proceedings of the Second International AAAI Conference on Weblogs and Social Media (ICWSM), Seattle, Washington, 30 March–2 April 2008‏

    Google Scholar 

  17. Bandari, R., Asur, S., Huberman, B.A.: The pulse of news in social media: forecasting popularity. In: Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (ICWSM) (Arxiv preprint arXiv), Dublin, vol. 1202, pp. 26–33, 4–7 June 2012

    Google Scholar 

  18. Swezey, R.M.E., Sano, H., Shiramatsu, S., Ozono, T., Shintani, T.: Automatic detection of news articles of interest to regional communities. Int. J. Comput. Sci. Netw. Secur. (IJCSNS) 12(6), 99–106 (2012)

    Google Scholar 

  19. Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific word embedding for twitter sentiment classification. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1555–1565 (2014)

    Google Scholar 

  20. Shin, C., Doermann, D., Rosenfeld, A.: Classification of document pages using structure-based features. Int. J. Doc. Anal. Recogn. 3(4), 232–247 (2001)

    Article  Google Scholar 

  21. Chen, N., Shatkay, H., Blostein, D.: Exploring a new space of features for document classification: figure clustering. In: Proceedings of the 2006 Conference of the Center for Advanced Studies on Collaborative research, p. 35. IBM Corp (2006)

    Google Scholar 

  22. Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I.: News articles classification using random forests and weighted multimodal features. In: Lamas, D., Buitelaar, P. (eds.) IRFC 2014. LNCS, vol. 8849, pp. 63–75. Springer, Heidelberg (2014)

    Chapter  Google Scholar 

  23. Augereau, O., Journet, N., Vialard, A., Domenger, J.P.: Improving classification of an industrial document image database by combining visual and textual features. In: In Proceedings of the 11th IAPR International Workshop on Document Analysis Systems (DAS), pp. 314–318. IEEE (2014)

    Google Scholar 

  24. Fox, C.: A stop list for general text. ACM SIGIR Forum 24(1–2), 19–35 (1989)

    Article  Google Scholar 

  25. Van De Sande, K.E., Gevers, T., Snoek, C.G.: Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1582–1596 (2010)

    Article  Google Scholar 

  26. Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3304–3311 (2010)

    Google Scholar 

  27. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)

    Google Scholar 

  28. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. 11(1), 10–18 (2009)

    Article  Google Scholar 

  29. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

    Article  Google Scholar 

  30. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schoelkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1998)

    Google Scholar 

  31. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K.: Improvements to platt’s SMO algorithm for SVM classifier design. Neural Comput. 13(3), 637–649 (2001)

    Article  Google Scholar 

  32. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)

    MATH  Google Scholar 

  33. Hall, M.A.: Correlation-based Feature Subset Selection for Machine Learning. Hamilton, New Zealand (1998)

    Google Scholar 

  34. Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)

    MATH  Google Scholar 

Download references

Acknowledgments

This work was supported by MULTISENSOR project, partially funded by the European Commission, under the contract number FP7-610411. The authors would also like to thank Avi Rosenfeld, Maor Tzidkani and Daniel Nissim Cohen from the Jerusalem College of Technology, Lev Academic Center, for their assistance to the authors in providing the software tool to generate the textual features used in this research. The authors would also like to acknowledge the networking support by the COST Action IC1302: semantic KEYword-based Search on sTructured data sOurcEs (KEYSTONE) and the COST Action IC1307: The European Network on Integrating Vision and Language (iV&L Net).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yaakov HaCohen-Kerner .

Editor information

Editors and Affiliations

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

HaCohen-Kerner, Y., Sabag, A., Liparas, D., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I. (2015). Classification Using Various Machine Learning Methods and Combinations of Key-Phrases and Visual Features. In: Cardoso, J., Guerra, F., Houben, GJ., Pinto, A.M., Velegrakis, Y. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2015. Lecture Notes in Computer Science(), vol 9398. Springer, Cham. https://doi.org/10.1007/978-3-319-27932-9_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-27932-9_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27931-2

  • Online ISBN: 978-3-319-27932-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics