Sentiment Classification from Multi-class Imbalanced Twitter Data Using Binarization

Krawczyk, Bartosz; McInnes, Bridget T.; Cano, Alberto

doi:10.1007/978-3-319-59650-1_3


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10334)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems


Abstract

Twitter has become one of the most dynamically developing areas of social media. Due to the concise nature of messages, rapid publication and high outreach, people share more and more of their opinions, thoughts and commentaries using this medium. Sentiment analysis is a specific subfield of natural language processing that concentrates on automatically categorizing opinions and attitudes expressed in a given portion of textual information. This requires dedicated machine learning solutions that are able to handle various difficulties embedded in the nature of the data. In this paper, we present an efficient framework for automatic sentiment analysis from high-dimensional and sparse datasets that suffer from multi-class imbalance. We propose to approach it by applying a one-vs-one binary decomposition and reducing the dimensionality of each pairwise class set using Multiple Correspondence Analysis. Then we apply preprocessing to alleviate the skewed distributions in the reduced number of dimensions. After that, we train a binary classifier on each pair of classes and combine the classifiers using a weighted multi-class reconstruction that promotes minority classes. The proposal is evaluated on a large Twitter dataset, and the obtained results favor the proposed solution.
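The pipeline outlined in the abstract (pairwise decomposition, per-pair balancing, per-pair binary classifiers, weighted reconstruction) can be illustrated with a minimal sketch. This is not the authors' implementation. Several parts are assumptions made only for illustration: the Multiple Correspondence Analysis step is omitted, random oversampling stands in for the unspecified preprocessing, a nearest-centroid rule stands in for the binary classifier, and the log-ratio vote weights are a hypothetical way to promote minority classes. All function names and the toy data are likewise hypothetical.

```python
import math
import random
from collections import Counter
from itertools import combinations


def oversample(X, y, rng):
    """Duplicate random examples of smaller classes until all match the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def fit_centroids(X, y):
    """Stand-in binary classifier: store one mean vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)


def fit_ovo(X, y, seed=0):
    """Train one balanced binary model per pair of classes (one-vs-one)."""
    rng = random.Random(seed)
    counts = Counter(y)
    largest = max(counts.values())
    # Assumed weighting: rarer classes cast heavier votes (1 for the largest class).
    weights = {c: 1.0 + math.log(largest / n) for c, n in counts.items()}
    models = []
    for a, b in combinations(sorted(counts), 2):
        X_pair = [x for x, lab in zip(X, y) if lab in (a, b)]
        y_pair = [lab for lab in y if lab in (a, b)]
        # The paper reduces each pair with MCA at this point; omitted here.
        X_bal, y_bal = oversample(X_pair, y_pair, rng)
        models.append(fit_centroids(X_bal, y_bal))
    return models, weights


def predict_ovo(models, weights, x):
    """Weighted reconstruction: each pairwise winner adds its class weight."""
    votes = Counter()
    for centroids in models:
        winner = predict_centroid(centroids, x)
        votes[winner] += weights[winner]
    return max(votes, key=votes.get)


# Toy imbalanced data: 6 negative, 4 neutral, 2 positive examples.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0.5), (0.5, 0),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (10, 0), (10, 1)]
y = ["negative"] * 6 + ["neutral"] * 4 + ["positive"] * 2

models, weights = fit_ovo(X, y)
prediction = predict_ovo(models, weights, (9.5, 0.5))
```

With three classes the decomposition yields three pairwise models. The vote weights matter only in the reconstruction step, so the weighting scheme can be swapped without retraining the pairwise classifiers.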




Author information

Authors and Affiliations

Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
Bartosz Krawczyk, Bridget T. McInnes & Alberto Cano


Corresponding author

Correspondence to Bartosz Krawczyk.

Editor information

Editors and Affiliations

University of La Rioja, Logroño, La Rioja, Spain
Francisco Javier Martínez de Pisón
University of La Rioja, Logroño, La Rioja, Spain
Rubén Urraca
University of A Coruña, Ferrol, La Coruña, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
Emilio Corchado


About this paper

Cite this paper

Krawczyk, B., McInnes, B.T., Cano, A. (2017). Sentiment Classification from Multi-class Imbalanced Twitter Data Using Binarization. In: Martínez de Pisón, F., Urraca, R., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2017. Lecture Notes in Computer Science, vol 10334. Springer, Cham. https://doi.org/10.1007/978-3-319-59650-1_3


DOI: https://doi.org/10.1007/978-3-319-59650-1_3
Published: 02 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59649-5
Online ISBN: 978-3-319-59650-1
eBook Packages: Computer Science, Computer Science (R0)
