Unsupervised tweets categorization using semantic and statistical features

Devi, Maibam Debina; Saharia, Navanath

doi:10.1007/s11042-022-13042-4

1209: Recent Advances on Social Media Analytics and Multimedia Systems: Issues and Challenges
Published: 06 May 2022

Volume 82, pages 9047–9064, (2023)
Multimedia Tools and Applications

Abstract

Clustering is one of the most widely used techniques in information retrieval. This experiment categorizes Tweets by their content, treating them as representative of social media and other user-generated content, by exploiting statistical and semantic features. The widely used tf-idf weighting is combined with a synonym-based weighting scheme. The tf-idf weight vector is passed as input to the next phase, where the system generates a second weighted vector based on word synonyms. Both vectors are used as features for clustering. The synonym-based features add semantic information to the formation of the clusters. Using a density-based categorical clustering algorithm (with minpoints set to 8 and epsilon set to 1.5), we categorized the Tweets into clusters. Six clusters were formed from 1K Tweets; manual evaluation found them cohesive. The Silhouette coefficient score (0.47) is used to validate the clusters.
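The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy Tweets, the SYNONYMS table, and the rule that a term inherits the tf-idf weight of its synonyms are all assumptions made for the example, and DBSCAN stands in as one well-known density-based algorithm because the abstract does not name the exact one. The toy eps and min_pts values are also hypothetical; the abstract reports epsilon 1.5 and minpoints 8 for its 1K Tweets.

```python
import math
from collections import Counter

# Hypothetical toy data; the paper's corpus and synonym source are not reproduced.
TWEETS = [
    "happy birthday to my friend",
    "glad birthday wishes friend",
    "happy wishes to my friend",
    "traffic jam on the highway",
    "heavy traffic on the road",
    "road jam and heavy traffic",
]
SYNONYMS = {"happy": {"glad"}, "glad": {"happy"},
            "road": {"highway"}, "highway": {"road"}}

def tfidf_vectors(docs):
    """Return (vocabulary, one tf-idf weight vector per document)."""
    tokens = [d.split() for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    df = Counter(w for t in tokens for w in set(t))
    vectors = []
    for t in tokens:
        tf = Counter(t)
        vectors.append([tf[w] / len(t) * math.log(len(docs) / df[w]) for w in vocab])
    return vocab, vectors

def synonym_vectors(vocab, vectors, synonyms):
    """Assumed scheme: each term also receives the tf-idf weight of its synonyms."""
    index = {w: i for i, w in enumerate(vocab)}
    return [[vec[i] + sum(vec[index[s]] for s in synonyms.get(w, ()) if s in index)
             for i, w in enumerate(vocab)] for vec in vectors]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points)) if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient over non-noise points (noise is ignored here)."""
    ids = [i for i, lab in enumerate(labels) if lab != -1]
    clusters = sorted({labels[i] for i in ids})
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i in ids:
        mean_d = {}
        for c in clusters:
            members = [j for j in ids if labels[j] == c and j != i]
            if members:
                mean_d[c] = sum(distance(points[i], points[j]) for j in members) / len(members)
        if labels[i] not in mean_d:  # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = mean_d[labels[i]]
        b = min(v for c, v in mean_d.items() if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

vocab, tfidf = tfidf_vectors(TWEETS)
syn = synonym_vectors(vocab, tfidf, SYNONYMS)
features = [t + s for t, s in zip(tfidf, syn)]  # both vectors used as features
labels = dbscan(features, eps=0.9, min_pts=2)   # toy parameters, not the paper's
print(labels)
if len(set(labels) - {-1}) >= 2:
    print(round(silhouette(features, labels), 2))
```

Concatenating the two vectors is only one way to use both as features; the abstract does not say how they are combined, so a weighted sum or separate distance terms would be equally plausible readings.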

Notes

Text REtrieval Conference, https://trec.nist.gov
Forum for Information Retrieval Evaluation, http://fire.irsi.res.in
ACM’s Special Interest Group on Information Retrieval, https://sigir.org
https://www.kaggle.com/vkrahul/twitter-hate-speech?select=train_E6oV3lV.csv
https://radimrehurek.com/gensim/corpora/textcorpus.html?highlight=stopwords#gensim.corpora.textcorpus.remove_stopwords; Accessed on: 20 Dec 2020
https://norvig.com/mayzner.html; Accessed on: 20 Dec 2020
http://storage.googleapis.com/books/ngrams/books/data-setsv2.html; Accessed on: 15 Dec 2020
https://wordnet.princeton.edu; Accessed on: 15 Dec 2020

Acknowledgements

The authors would like to thank the anonymous reviewers for their insights and suggestions during the preparation of the draft.

Funding

The first author acknowledges the financial support received from TEQIP Phase III, NPIU (Ref. no.: IIITM/ACA-PhD/2017-18/10).

Author information

Authors and Affiliations

Data Engineering Lab, IIIT Senapati, Manipur, 795002, India
Maibam Debina Devi & Navanath Saharia

Corresponding author

Correspondence to Navanath Saharia.

Ethics declarations

Conflict of Interests

The authors declare that there is no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Maibam Debina Devi and Navanath Saharia contributed equally to this work.

About this article

Cite this article

Devi, M.D., Saharia, N. Unsupervised tweets categorization using semantic and statistical features. Multimed Tools Appl 82, 9047–9064 (2023). https://doi.org/10.1007/s11042-022-13042-4

Received: 03 January 2021
Revised: 11 February 2022
Accepted: 04 April 2022
Published: 06 May 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s11042-022-13042-4
