Sentiment Classification of the Slovenian News Texts

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 403)

Abstract

This paper deals with automatic two-class document-level sentiment classification. We retrieved textual documents with political, business, economic and financial content from five Slovenian web media. By annotating a sample of 10,427 documents, we obtained a labelled corpus in the Slovenian language. Five classifiers were evaluated on this corpus: multinomial naïve Bayes, support vector machines, random forest, k-nearest neighbour and naïve Bayes; the first three were also used to assess the pre-processing options. Among the selected classifiers, multinomial naïve Bayes outperforms the naïve Bayes, k-nearest neighbour, random forest and support vector machine classifiers in terms of classification accuracy. The best selection of pre-processing options achieves more than 95 % classification accuracy with multinomial naïve Bayes and more than 85 % with the support vector machine and random forest classifiers.
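As an illustration of the best-performing approach named above, the following is a minimal sketch of a two-class multinomial naïve Bayes document classifier. It is not the authors' implementation: the function names, the whitespace tokenizer, the Laplace smoothing value (alpha = 1) and the toy English sentences are all assumptions made for this example, and the corpus described in the abstract is not reproduced here.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical pre-processing: lowercase and split on whitespace.
    return text.lower().split()


def train_multinomial_nb(docs, labels, alpha=1.0):
    # Count documents per class and word occurrences per class.
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}

    model = {"log_prior": {}, "log_likelihood": {}}
    for c in class_docs:
        model["log_prior"][c] = math.log(class_docs[c] / len(docs))
        # Laplace (add-alpha) smoothing over the shared vocabulary.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_likelihood"][c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return model


def predict(model, text):
    # Sum log-probabilities; words outside the vocabulary are ignored.
    scores = {}
    for c, prior in model["log_prior"].items():
        ll = model["log_likelihood"][c]
        scores[c] = prior + sum(ll[w] for w in tokenize(text) if w in ll)
    return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    # Fraction of documents whose predicted label matches the given label.
    correct = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return correct / len(docs)


# Toy training data (invented for illustration only).
train_docs = [
    "profits rise as exports grow",
    "strong growth lifts the market",
    "company reports record profits",
    "losses deepen as exports fall",
    "bank collapse triggers crisis",
    "unemployment climbs amid crisis",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, "record growth in exports and profits"))
print(predict(model, "crisis deepens losses"))
```

Classification accuracy, the measure reported in the abstract, corresponds to `accuracy(model, docs, labels)` computed on held-out documents; on a realistic corpus it would be estimated with a train/test split or cross-validation rather than on the training data.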

Keywords

Sentiment analysis, Document classification, Machine learning, Slovenian language, Corpus

Notes

Acknowledgments

This work was supported by the Creative Core FISNM-3330-13-500033 ‘Simulations’ project, funded by the European Union, the European Regional Development Fund, and by the Young Researcher Programme of the Slovenian Research Agency. The operation is carried out within the framework of the Operational Programme for Strengthening Regional Development Potentials for the period 2007–2013, Development Priority 1: Competitiveness and research excellence, Priority Guideline 1.1: Improving the competitive skills and research excellence.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Faculty of Information Studies, Laboratory of Data Technologies, Novo mesto, Slovenia
  2. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
