Abstract
Clustering is one of the widely used techniques in information retrieval. This experiment intends to categorize Tweets (based on their content) as representative of social media/user-generated content by exploiting statistical and semantic features. tf-idf, being widespread, is employed in combination with a synonym-based weighting scheme. The output of tf-idf in the form of the weight vector is transferred to the next phase as input, where based on the word synonyms, the system generate another weighted vector. Both vectors are used as a feature for clustering. The synonym-based feature technique adds semantic importance to the formation of the clusters. Using a density-based categorical clustering algorithm (with 8 as minpoints and 1.5 as epsilon), we categorized the Tweets into clusters. Six clusters are formed from 1K Tweets, which are evaluated manually and found cohesive. The Silhouette coefficient score (0.47) is used to validate the clusters.
Similar content being viewed by others
Notes
Text REtrieval Conference, https://trec.nist.gov
Forum for Information Retrieval Evaluation, http://fire.irsi.res.in
ACM’s Special Interest Group on Information Retrieval, https://sigir.org
https://radimrehurek.com/gensim/corpora/textcorpus.html?highlight=stopwords#gensim.corpora.textcorpus.remove_stopwords. remove_stopwords; Accessed on: 20 Dec 2020
https://norvig.com/mayzner.html; visited on: December 20, 2020
http://storage.googleapis.com/books/ngrams/books/data-setsv2.html; Visited on: December 15, 2020
https://wordnet.princeton.edu; Accessed date: 15 Dec 2020
References
Agarwal V (2015) Research on data preprocessing and categorization technique for smartphone review analysis. Int J Comput Appl 131(4):30–36
Ali A, Zhu Y, Zakarya M (2021) Exploiting dynamic spatio-temporal correlations for citywide traffic flow prediction using attention based neural networks. Inf Sci 577:852–870
Aouicha MB, Taieb MAH, Hamadou AB (2016) Lwcr: multi-layered Wikipedia representation for computing word relatedness. Neurocomputing 216:816–843
Arachie C, Gaur M, Anzaroot S, Groves W, Zhang K, Jaimes A (2020) Unsupervised detection of sub-events in large scale disasters. In: AAAI Conference on Artificial Intelligence, vol 34, pp 354–361
Bafna P, Pramod D, Vaidya A (2016) Document clustering: Tf-idf approach. In: International Conference on Electrical, Electronics, and Optimization Techniques. IEEE, pp 61–66
Bradley PS, Fayyad U, Reina C, et al (1998) Scaling em (expectation-maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research
Chen J, Yan S, Wong K-C (2020) Verbal aggression detection on Twitter comments: Convolutional neural network for short-text sentiment analysis. Neural Comput Appl 32(15):10809–10818
Clark E, Araki K (2011) Text normalization in social media: progress, problems and applications for a pre-processing system of casual english. Procedia-Soc Behav Sci 27:2–11
Cotelo JM, Cruz FL, Enríquez F, Troyano JA (2016) Tweet categorization by combining content and structural knowledge. Inf Fusion 31:54–64
Daouadi KE, Rebaï RZ, Amous I (2021) Optimizing semantic deep forest for tweet topic classification. Inf Syst 101:101801
Day WHE, Edelsbrunner H (1984) Efficient algorithms for agglomerative hierarchical clustering methods. J Class 1(1):7–24
Devi MD, Saharia N (2020) Exploiting topic modelling to classify sentiment from lyrics. In: International Conference on Machine Learning, Image Processing, Network Security and Data Sciences, pp 411–423
Dos Santos C, Gatti M (2014) Deep convolutional neural networks for sentiment analysis of short texts. In: International Conference on Computational Linguistics, pp 69–78
Ester M, Kriegel H-P, Sander J, Xu X, et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining. AAAI Press, Portland, pp 226–231
Firdaus DH, Suyanto S (2020) Topic-based tweet clustering for public figures using ant clustering. In: 3rd International Seminar on Research of Information Technology and Intelligent Systems, pp 476–481
Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. CS224N project report, Stanford 1(12)
Hartigan JA, Wong MA (1979) Algorithm as 136: A k-means clustering algorithm. J R Stat Soc 28(1):100–108
Jianqiang Z, Xiaolin G, Xuejun Z (2018) Deep convolution neural networks for Twitter sentiment analysis. IEEE Access 6:23253–23260
Jianqiang Z, Xueliang C (2015) Combining semantic and prior polarity for boosting Twitter sentiment analysis. In: International Conference on Smart City. IEEE, pp 832–837
Link A-K (2018) Challenges for dbscan: Closely adjacent clusters and varying densities
Meetei LS, Singh TD, Borgohain SK, Bandyopadhyay S (2021) Low resource language specific pre-processing and features for sentiment analysis task. Lang Resour Eval:1–23
Miller GA, Newman EB, Friedman EA (1958) Length-frequency statistics for written english. Inf Control 1(4):370–389
Miyamoto S, Suzuki S, Takumi S (2012) Clustering in tweets using a fuzzy neighborhood model. In: International Conference on Fuzzy Systems, Brisbane, pp 1–6
Mojiri MM, Ravanmehr R (2020) Event detection in Twitter using multi timing chained windows. Comput Inf 39(6):1336–1359
Munková D, Munk M, Vozár M (2013) Influence of stop-words removal on sequence patterns identification within comparable corpora. In: International Conference on ICT Innovations. Springer, pp 67–76
Norvig P (2013) English letter frequency counts: Mayzner revisited or etaoin srhldcu. https://norvig.com/mayzner.html
O’Connor B, Krieger M, Ahn D (2010) Tweetmotif: exploratory search and topic summarization for Twitter. In: Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, George Washington University, pp 384–385
Park S, Kim Y (2016) Building thesaurus lexicon using dictionary-based approach for sentiment classification. In: International Conference on Software Engineering Research, Management and Applications. IEEE, pp 39–44
Rosa KD, Shah R, Lin B, Gershman A, Frederking R (2011) Topical clustering of tweets. Proceedings of the ACM SIGIR workshop on Social Web Search and Mining, Analysis under crisis, vol 63
Ruder S (2016) An overview of gradient descent optimization algorithms. arXiv:1609.04747
Rudrapal D, Das A (2017) Measuring the limit of semantic divergence for english tweets. In: Recent Advances in Natural Language Processing, Varna, pp 618–624
Saharia N (2015) Detecting emotion from short messages on Nepal earthquake. In: International Conference on Speech Technology and Human-Computer Dialogue. IEEE, Bucharest, pp 1–5
Sahni T, Chandak C, Chedeti NR, Singh M (2017) Efficient Twitter sentiment classification using subjective distant supervision. In: International Conference on Communication Systems and Networks, pp 548–553
Saif H, Fernandez M, He Y, Alani H (2013) Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the sts-gold
Singh TD, Singh TJ, Shadang M, Thokchom S (2021) Review comments of manipuri online video: Good, bad or ugly. In: International Conference on Computing and Communication Systems, vol 170. Springer, Shillong, p 45
Tang G, Xia Y, Wang W, Lau R, Zheng F (2014) Clustering tweets using wikipedia concepts. In: Proceedings of the Language Resources and Evaluation Conference, Reykjavik, pp 2262–2267
Teodorescu H-N, Saharia N (2015) An internet slang annotated dictionary and its use in assessing message attitude and sentiments. In: International Conference on Speech Technology and Human-Computer Dialogue. IEEE, Bucharest, pp 1–8
Vosoughi S, Vijayaraghavan P, Roy D (2016) Tweet2vec: Learning tweet embeddings using character-level CNN-LSTM encoder-decoder. In: ACM SIGIR conference on Research and Development in Information Retrieval, pp 1041–1044, DOI https://doi.org/10.1145/2911451.2914762, (to appear in print)
Zhou D, Chen L, He Y (2015) An unsupervised framework of exploring events on Twitter: Filtering, extraction and categorization. In: AAAI conference on Artificial Intelligence, vol 29
Zou L, Song WW (2016) LDA-TM: A two-step approach to Twitter topic data clustering. In: International Conference on Cloud Computing and Big Data Analysis, pp 342–347
Acknowledgements
Authors would like to thank anonymous reviewers for their insights and suggestions during preparation of the draft.
Funding
The first author acknowledge the financial supports received from TEQIP Phase III, NPIU (Ref. no.: IIITM/ACA-PhD/2017-18/10).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
The authors declare that there is no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Maibam Debina Devi and Navanath Saharia contributed equally to this work.
Rights and permissions
About this article
Cite this article
Devi, M.D., Saharia, N. Unsupervised tweets categorization using semantic and statistical features. Multimed Tools Appl 82, 9047–9064 (2023). https://doi.org/10.1007/s11042-022-13042-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-022-13042-4