Semi-supervised labeling: a proposed methodology for labeling the twitter datasets

Jan, Tabassum Gull; Khurana, Surinder Singh; Kumar, Munish

doi:10.1007/s11042-022-12221-7

Published: 28 January 2022

Multimedia Tools and Applications, Volume 81, pages 7669–7683 (2022)

Tabassum Gull Jan¹,
Surinder Singh Khurana¹ &
Munish Kumar²

Abstract

Twitter has become a popular microblogging and social media platform for news and discussion. The rapid growth of the platform has also triggered a sharp increase in spam on it. Supervised machine learning requires a labeled Twitter dataset, and preparing one takes a lot of human effort, so it is desirable to design a semi-supervised technique for labeling newly collected datasets. This issue has motivated us to propose an efficient approach for preparing labeled datasets so that time can be saved and human errors can be avoided. Our proposed approach relies on features that are readily available in real time, for better performance and wider applicability. This work aims to collect the most recent tweets of a user using Twitter streaming and to prepare a recent Twitter dataset. A semi-supervised machine learning algorithm based on the self-training technique was then designed for labeling the tweets. Semi-supervised support vector machine and semi-supervised decision tree classifiers were used as base classifiers in the self-training technique. Further, we applied the K-means clustering algorithm to the tweets based on tweet content. The proposed approach is an ensemble of semi-supervised and unsupervised learning, in which the semi-supervised algorithms were found to predict more accurately than the unsupervised ones. To assign labels to the tweets effectively, we implemented voting: the label predicted by the majority voting classifier is the label assigned to the tweet. A maximum accuracy of 99.0% is reported for spam labeling using the majority voting classifier.
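The self-training and voting steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for the support vector machine and decision tree base classifiers, the features are toy two-dimensional vectors, and the names fit_centroids, predict, self_train, and majority_vote, the 0.8 confidence threshold, and the round limit are all assumptions made for illustration.

```python
from collections import Counter
from math import dist
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Labeled = Tuple[Vector, str]


def fit_centroids(data: Sequence[Labeled]) -> Dict[str, Vector]:
    """Stand-in base classifier (assumption): one mean vector per class."""
    groups: Dict[str, List[Vector]] = {}
    for x, y in data:
        groups.setdefault(y, []).append(x)
    return {y: tuple(sum(col) / len(xs) for col in zip(*xs))
            for y, xs in groups.items()}


def predict(centroids: Dict[str, Vector], x: Vector) -> Tuple[str, float]:
    """Return the nearest class and a confidence score in [0.5, 1].

    Needs at least two classes in `centroids`.
    """
    ranked = sorted((dist(x, c), y) for y, c in centroids.items())
    (d_best, label), (d_next, _) = ranked[0], ranked[1]
    total = d_best + d_next
    return label, (d_next / total if total else 0.5)


def self_train(seed: Sequence[Labeled], pool: Sequence[Vector],
               threshold: float = 0.8, max_rounds: int = 10):
    """Self-training: repeatedly pseudo-label confident points and refit."""
    labeled = list(seed)
    remaining = list(pool)
    for _ in range(max_rounds):
        model = fit_centroids(labeled)
        confident, uncertain = [], []
        for x in remaining:
            label, conf = predict(model, x)
            if conf >= threshold:
                confident.append((x, label))  # accept the pseudo-label
            else:
                uncertain.append(x)           # leave for a later round
        if not confident:
            break  # nothing new was labeled, so another round cannot help
        labeled.extend(confident)
        remaining = uncertain
    return fit_centroids(labeled), remaining


def majority_vote(votes: Sequence[str]) -> str:
    """Final label: the one proposed by most of the individual labelers."""
    return Counter(votes).most_common(1)[0][0]
```

In this sketch, the votes passed to majority_vote would come from the individual labelers (for example, two self-trained classifiers and a clustering step); with an odd number of labelers and two classes there are no ties to resolve.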

Notes

https://backlinko.com/twitter-users

Author information

Authors and Affiliations

Department of Computer Science & Technology, Central University of Punjab, Bathinda, India
Tabassum Gull Jan & Surinder Singh Khurana
Department of Computational Sciences, Maharaja Ranjit Singh Punjab Technical University, Bathinda, India
Munish Kumar

Corresponding author

Correspondence to Munish Kumar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest in this work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jan, T.G., Khurana, S.S. & Kumar, M. Semi-supervised labeling: a proposed methodology for labeling the twitter datasets. Multimed Tools Appl 81, 7669–7683 (2022). https://doi.org/10.1007/s11042-022-12221-7

Received: 21 May 2021
Revised: 01 December 2021
Accepted: 10 January 2022
Published: 28 January 2022
Issue Date: March 2022
DOI: https://doi.org/10.1007/s11042-022-12221-7
