
Unsupervised tweets categorization using semantic and statistical features

  • 1209: Recent Advances on Social Media Analytics and Multimedia Systems: Issues and Challenges
Multimedia Tools and Applications

Abstract

Clustering is one of the most widely used techniques in information retrieval. This experiment categorizes Tweets by content, treating them as representative of social media and user-generated text, by exploiting statistical and semantic features. The widely used tf-idf scheme is combined with a synonym-based weighting scheme. The tf-idf weight vector is passed as input to the next phase, where the system generates a second weighted vector based on word synonyms. Both vectors are used as features for clustering; the synonym-based feature adds semantic information to the formation of the clusters. Using a density-based categorical clustering algorithm (with minpoints = 8 and epsilon = 1.5), we categorized the Tweets into clusters. Six clusters were formed from 1K Tweets; manual evaluation found them cohesive. The Silhouette coefficient (0.47) is used to validate the clusters.
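
The two-vector pipeline summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how the synonym weights are computed, so the sketch assumes the second vector is obtained by recomputing tf-idf after mapping each word to a representative of its synonym group. The toy table `SYNONYMS`, the helper names `tfidf`, `features`, and `dbscan`, the smoothed idf formula, and the use of plain DBSCAN with Euclidean distance are likewise hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy synonym table (assumption); a real system would use a lexical resource.
SYNONYMS = {"film": "movie", "great": "good", "awful": "bad"}

def tfidf(docs):
    """Return one tf-idf row per tokenized document, over the sorted vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    # Smoothed idf (assumption) so that terms present in every document keep a weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append([tf[t] / len(d) * idf[t] for t in vocab])
    return rows

def features(docs):
    """Concatenate the statistical vector with a synonym-based vector (assumed scheme)."""
    statistical = tfidf(docs)
    semantic = tfidf([[SYNONYMS.get(t, t) for t in d] for d in docs])
    return [a + b for a, b in zip(statistical, semantic)]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: return a cluster id (0, 1, ...) per point, or -1 for noise."""
    n = len(points)
    neighbours = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1          # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # only core points expand the cluster
    return labels

if __name__ == "__main__":
    tweets = [["great", "film"], ["good", "movie"], ["good", "film"],
              ["awful", "traffic"], ["bad", "traffic"], ["bad", "traffic", "today"]]
    # The abstract reports minpoints = 8 and epsilon = 1.5 on 1K Tweets;
    # min_pts is lowered here only because the toy corpus has six documents.
    print(dbscan(features(tweets), eps=1.5, min_pts=2))
```

Concatenating the two vectors lets tweets that share no surface words but do share synonyms sit closer together in feature space, which is the role the abstract attributes to the synonym-based feature.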

Notes

  1. Text REtrieval Conference, https://trec.nist.gov

  2. Forum for Information Retrieval Evaluation, http://fire.irsi.res.in

  3. ACM’s Special Interest Group on Information Retrieval, https://sigir.org

  4. https://www.kaggle.com/vkrahul/twitter-hate-speech?select=train_E6oV3lV.csv

  5. https://radimrehurek.com/gensim/corpora/textcorpus.html?highlight=stopwords#gensim.corpora.textcorpus.remove_stopwords; Accessed on: 20 Dec 2020

  6. https://norvig.com/mayzner.html; Accessed on: 20 Dec 2020

  7. http://storage.googleapis.com/books/ngrams/books/data-setsv2.html; Accessed on: 15 Dec 2020

  8. https://wordnet.princeton.edu; Accessed on: 15 Dec 2020

Acknowledgements

The authors would like to thank the anonymous reviewers for their insights and suggestions during the preparation of the draft.

Funding

The first author acknowledges the financial support received from TEQIP Phase III, NPIU (Ref. no.: IIITM/ACA-PhD/2017-18/10).

Author information

Corresponding author

Correspondence to Navanath Saharia.

Ethics declarations

Conflict of Interests

The authors declare that there is no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Maibam Debina Devi and Navanath Saharia contributed equally to this work.

About this article

Cite this article

Devi, M.D., Saharia, N. Unsupervised tweets categorization using semantic and statistical features. Multimed Tools Appl 82, 9047–9064 (2023). https://doi.org/10.1007/s11042-022-13042-4

  • DOI: https://doi.org/10.1007/s11042-022-13042-4
