Sentiment Classification from Multi-class Imbalanced Twitter Data Using Binarization

  • Bartosz Krawczyk (corresponding author)
  • Bridget T. McInnes
  • Alberto Cano
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10334)


Twitter has become one of the most dynamically developing areas of social media. Due to the concise nature of messages, rapid publication, and high outreach, people share more and more of their opinions, thoughts, and commentaries through this medium. Sentiment analysis is a subfield of natural language processing that concentrates on automatically categorizing the opinions and attitudes expressed in a given portion of text. This requires dedicated machine learning solutions that are able to handle the various difficulties embedded in the nature of the data. In this paper, we present an efficient framework for automatic sentiment analysis from high-dimensional and sparse datasets that suffer from multi-class imbalance. We propose to approach the problem by applying a one-vs-one binary decomposition and reducing the dimensionality of each pairwise class set using Multiple Correspondence Analysis. We then apply preprocessing in the reduced space to alleviate the skewed class distributions. Next, we train a binary classifier on each pair of classes and combine the classifiers using a weighted multi-class reconstruction that promotes minority classes. The proposal is evaluated on a large Twitter dataset, and the obtained results favor the proposed solution.
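
The pipeline outlined in the abstract (pairwise decomposition, per-pair balancing, binary training, weighted reconstruction) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the preprocessing method, the base classifier, or the weighting formula, so the sketch swaps in random oversampling, a nearest-centroid classifier, and square-root inverse-frequency weights as assumptions, and it omits the Multiple Correspondence Analysis step entirely. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def oversample(points, target_size, rng):
    """Randomly duplicate points until the list holds target_size items."""
    extra = [rng.choice(points) for _ in range(target_size - len(points))]
    return points + extra


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_ovo(X, y, seed=0):
    """Fit one nearest-centroid model per pair of classes (one-vs-one)."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    models = {}
    for a, b in combinations(sorted(by_class), 2):
        size = max(len(by_class[a]), len(by_class[b]))
        # Assumed stand-in for the paper's preprocessing: balance each pair
        # by random oversampling of the smaller class.
        pts_a = oversample(by_class[a], size, rng)
        pts_b = oversample(by_class[b], size, rng)
        models[(a, b)] = (centroid(pts_a), centroid(pts_b))
    return models


def class_weights(y):
    """Assumed weighting: sqrt(largest class size / class size)."""
    counts = Counter(y)
    largest = max(counts.values())
    return {c: (largest / n) ** 0.5 for c, n in counts.items()}


def predict(models, weights, x):
    """Combine pairwise confidences, scaled by the class weights."""
    scores = Counter()
    for (a, b), (cen_a, cen_b) in models.items():
        d_a, d_b = sq_dist(x, cen_a), sq_dist(x, cen_b)
        total = d_a + d_b
        # Confidence for class a grows as x gets closer to its centroid.
        conf_a = 0.5 if total == 0 else d_b / total
        scores[a] += weights[a] * conf_a
        scores[b] += weights[b] * (1.0 - conf_a)
    return max(scores, key=scores.get)


# Made-up toy data: four "neutral", two "positive", one "negative" example.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
     [5.0, 5.0], [5.2, 4.9], [0.0, 5.0]]
y = ["neutral"] * 4 + ["positive"] * 2 + ["negative"]

models = train_ovo(X, y)
weights = class_weights(y)
print(predict(models, weights, [0.1, 4.8]))
```

The square root in `class_weights` is a deliberate damping choice for this sketch: with plain inverse-frequency weights, a pairwise classifier that never saw the true class can push a rare class past a majority class that won both of its own comparisons.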


Machine learning, Text mining, Sentiment analysis, Imbalanced learning, Multi-class imbalance



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Bartosz Krawczyk (1), corresponding author
  • Bridget T. McInnes (1)
  • Alberto Cano (1)
  1. Department of Computer Science, Virginia Commonwealth University, Richmond, USA
