Big data analytics for critical information classification in online social networks using classifier chains

Peer-to-Peer Networking and Applications

Abstract

Industrial and academic organizations use online social networks (OSNs) for a range of social and economic purposes. OSNs have become a new means of obtaining information about people's preferences and interests. Because of the large volume of user-generated content, researchers use techniques such as sentiment analysis or data mining to evaluate this information automatically. Although sentiment analysis of OSN content is performed with a variety of methods, highly reliable results remain difficult to obtain, mainly because user profile information such as gender and age is often lacking. In this work, a novel dataset is built that contains the writing characteristics of 160,000 users of the Twitter OSN. Before classification models are created with Machine Learning (ML) techniques, feature transformation and feature selection methods are applied to determine the most relevant set of characteristics. To create the models, the Classifier Chain (CC) transformation technique and different ML algorithms are applied to the training set. Simulation results show that the Random Forest, XGBoost, and Decision Tree algorithms obtain the best performance. In the testing phase, these algorithms reached Hamming Loss values of 0.033, 0.033, and 0.034, respectively, and all three reached the same micro-averaged F1 value of 0.976. Therefore, our proposal, based on a multidimensional learning technique using the CC transformation, outperforms other similar proposals.
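
The Classifier Chain idea and the two reported metrics can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' pipeline: the nearest-centroid base learner, the toy data, and the names `NearestCentroid`, `ClassifierChain`, `hamming_loss`, and `f1_micro` are all assumptions made for illustration. Labels are taken to be binary (0/1), whereas the targets in the paper, such as age group, may have more than two classes. Each link in the chain is trained on the original features plus the labels that precede it, and at prediction time the earlier predictions are fed forward.

```python
from typing import Callable, List

Row = List[float]
Labels = List[int]


class NearestCentroid:
    """Tiny base learner used as a stand-in (assumption, not from the paper)."""

    def fit(self, X: List[Row], y: Labels) -> "NearestCentroid":
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            # Column-wise mean of all rows that carry this label.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x: Row) -> int:
        def sq_dist(label: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


class ClassifierChain:
    """One base model per label; model j also sees labels 0..j-1 as features."""

    def __init__(self, make_base: Callable[[], NearestCentroid]):
        self.make_base = make_base

    def fit(self, X: List[Row], Y: List[Labels]) -> "ClassifierChain":
        self.models = []
        for j in range(len(Y[0])):
            # Training uses the true values of the preceding labels.
            Xj = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            yj = [y[j] for y in Y]
            self.models.append(self.make_base().fit(Xj, yj))
        return self

    def predict(self, X: List[Row]) -> List[Labels]:
        out = []
        for x in X:
            preds: Labels = []
            for model in self.models:
                # Prediction uses the chain's own earlier predictions.
                preds.append(model.predict_one(list(x) + preds))
            out.append(preds)
        return out


def hamming_loss(Y_true: List[Labels], Y_pred: List[Labels]) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    pairs = [(t, p) for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp)]
    return sum(t != p for t, p in pairs) / len(pairs)


def f1_micro(Y_true: List[Labels], Y_pred: List[Labels]) -> float:
    """F1 computed from TP/FP/FN counts pooled over all labels (binary case)."""
    tp = fp = fn = 0
    for yt, yp in zip(Y_true, Y_pred):
        for t, p in zip(yt, yp):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up): two features, two binary labels per user.
X_train = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
Y_train = [[0, 0], [0, 0], [1, 1], [1, 1]]
chain = ClassifierChain(NearestCentroid).fit(X_train, Y_train)

X_test = [[0.2, 0.1], [0.8, 0.9]]
Y_test = [[0, 0], [1, 1]]
Y_pred = chain.predict(X_test)
print(Y_pred, hamming_loss(Y_test, Y_pred), f1_micro(Y_test, Y_pred))
```

A lower Hamming Loss and a higher micro-averaged F1 both indicate better predictions. In practice the base learner would be replaced by a stronger model such as a tree ensemble, which is the kind of algorithm the abstract reports as performing best.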

Author information

Corresponding author

Correspondence to Muhammad Saadi.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Silva, D.H., Maziero, E.G., Saadi, M. et al. Big data analytics for critical information classification in online social networks using classifier chains. Peer-to-Peer Netw. Appl. 15, 626–641 (2022). https://doi.org/10.1007/s12083-021-01269-1

  • DOI: https://doi.org/10.1007/s12083-021-01269-1
