Classification Using Various Machine Learning Methods and Combinations of Key-Phrases and Visual Features

HaCohen-Kerner, Yaakov; Sabag, Asaf; Liparas, Dimitris; Moumtzidou, Anastasia; Vrochidis, Stefanos; Kompatsiaris, Ioannis

doi:10.1007/978-3-319-27932-9_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9398)

Included in the following conference series:

International KEYSTONE Conference on Semantic Keyword-Based Search on Structured Data Sources

Abstract

In this paper, we present a comparative study of news document classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from the text) and visual features (extracted from a representative image of each document). The application domain is news documents written in English that belong to four categories: Health, Lifestyle-Leisure, Nature-Environment and Politics. Using the N-gram textual feature set alone yielded an accuracy of 81.0%, much higher than the 58.4% obtained with the visual feature set alone. Comparing three classification methods, together with a feature selection method and parameter tuning, raised the accuracy to 86.7%, achieved by the Random Forests method.
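The combination of word N-gram features with visual features described above can be illustrated with a minimal sketch. It is not the authors' tool: the function names, the frequency-based vocabulary selection, the whitespace tokenization, and the two-element visual vector are all assumptions made for illustration, since the abstract does not specify how the features were extracted or merged.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of `text` as space-joined strings."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(documents, max_n=2, top_k=5):
    """Keep the top_k most frequent n-grams (n = 1..max_n) in the corpus."""
    counts = Counter()
    for doc in documents:
        for n in range(1, max_n + 1):
            counts.update(word_ngrams(doc, n))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def combined_features(text, visual, vocabulary, max_n=2):
    """Concatenate n-gram counts over `vocabulary` with a visual vector."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return [float(counts[gram]) for gram in vocabulary] + list(visual)


# Hypothetical toy corpus and a made-up visual descriptor for the first document.
docs = ["the senate passed the health bill", "the forest fire spread"]
vocab = build_vocabulary(docs, max_n=2, top_k=3)
vector = combined_features(docs[0], [0.25, 0.75], vocab)
```

A vector built this way could then be passed to any supervised classifier, such as a random forest; which feature subsets and parameters work best is exactly what the study compares.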

Acknowledgments

This work was supported by the MULTISENSOR project, partially funded by the European Commission under contract number FP7-610411. The authors would also like to thank Avi Rosenfeld, Maor Tzidkani and Daniel Nissim Cohen from the Jerusalem College of Technology, Lev Academic Center, for providing the software tool used to generate the textual features in this research. The authors would also like to acknowledge the networking support of the COST Action IC1302: semantic KEYword-based Search on sTructured data sOurcEs (KEYSTONE) and the COST Action IC1307: The European Network on Integrating Vision and Language (iV&L Net).

Author information

Authors and Affiliations

Department of Computer Science, Jerusalem College of Technology - Lev Academic Center, 9116001, Jerusalem, Israel
Yaakov HaCohen-Kerner & Asaf Sabag
Centre for Research and Technology Hellas, Information Technologies Institute, Thermi, Thessaloniki, Greece
Dimitris Liparas, Anastasia Moumtzidou, Stefanos Vrochidis & Ioannis Kompatsiaris

Corresponding author

Correspondence to Yaakov HaCohen-Kerner.

Editor information

Editors and Affiliations

University of Coimbra, Coimbra, Portugal
Jorge Cardoso
Huawei European Research Center, Munich, Germany
Jorge Cardoso
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Francesco Guerra
Delft University of Technology, Delft, Zuid-Holland, The Netherlands
Geert-Jan Houben
University of Coimbra, Coimbra, Portugal
Alexandre Miguel Pinto
Università degli Studi di Trento, Trento, Italy
Yannis Velegrakis

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

HaCohen-Kerner, Y., Sabag, A., Liparas, D., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I. (2015). Classification Using Various Machine Learning Methods and Combinations of Key-Phrases and Visual Features. In: Cardoso, J., Guerra, F., Houben, GJ., Pinto, A.M., Velegrakis, Y. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2015. Lecture Notes in Computer Science, vol 9398. Springer, Cham. https://doi.org/10.1007/978-3-319-27932-9_6

DOI: https://doi.org/10.1007/978-3-319-27932-9_6
Published: 07 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27931-2
Online ISBN: 978-3-319-27932-9
eBook Packages: Computer Science, Computer Science (R0)
