Enhancing keyword correlation for event detection in social networks using SVD and k-means: Twitter case study

  • Ahmad Hany Hossny
  • Terry Moschuo
  • Grant Osborne
  • Lewis Mitchell
  • Nick Lothian
Original Article


Extracting textual features from tweets is challenging because the content is noisy and most individual words carry only a weak signal. In this paper, we propose using singular value decomposition (SVD) with clustering to group related words into enhanced signals for textual features in tweets, in order to improve their correlation with events. The proposed method applies SVD to the matrix of feature/day counts, whose rows are the time-series vectors of the features, to ensure the independence of the feature vectors. Then, k-means clustering is applied to build a look-up table that maps the members of each cluster to the cluster centroid. The look-up table is used to map each feature in the original data to the centroid of its cluster. We then add the term-frequency vectors of all features in each cluster to the term-frequency vector of the cluster centroid. To evaluate the method, we calculated the correlations of the cluster centroids with the gold-standard record vector before and after adding the vectors of the cluster members to the centroid vector. The method is applied with multiple correlation techniques, including Pearson, Spearman, distance correlation, and Kendall's tau. The experiments also considered different word forms and lengths of the features, including keywords, n-grams, skip-grams, and bags-of-words. The correlation results are enhanced significantly: the highest correlation scores increased from 0.22 to 0.70, and the average correlation scores increased from 0.22 to 0.60.
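
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `kmeans` and `cluster_signals`, the `rank` and `k` parameters, the hand-written Lloyd's k-means, and the synthetic data are all assumptions made for the example. It also simplifies the centroid step by summing the count vectors of all members of a cluster into one series per cluster, and it uses only Pearson correlation.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one label per row."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels


def cluster_signals(counts, k, rank):
    """counts has shape (n_features, n_days).

    Returns the cluster label of each feature (the look-up table) and a
    (k, n_days) array holding the summed count series of each cluster.
    """
    # Factorize the feature/day matrix and keep the leading components.
    U, s, _ = np.linalg.svd(counts, full_matrices=False)
    embedding = U[:, :rank] * s[:rank]
    labels = kmeans(embedding, k)
    # Add every feature's daily counts to the series of its cluster.
    summed = np.zeros((k, counts.shape[1]))
    np.add.at(summed, labels, counts)
    return labels, summed


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D series."""
    return np.corrcoef(a, b)[0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    days = 60
    # Synthetic stand-in for the gold-standard daily event record.
    event = rng.poisson(3.0, size=days)
    # Five weak keywords that each capture part of the event signal,
    # and five keywords unrelated to it.
    related = np.array([rng.binomial(event, 0.3) for _ in range(5)], dtype=float)
    unrelated = rng.poisson(1.0, size=(5, days)).astype(float)
    counts = np.vstack([related, unrelated])

    labels, summed = cluster_signals(counts, k=2, rank=2)
    print("best single keyword:", max(pearson(row, event) for row in counts))
    print("per-cluster sums:   ", [pearson(row, event) for row in summed])
```

Summing the members of a cluster conserves the total daily counts, so the clustered series are a regrouping of the original data rather than new measurements. The other measures named in the abstract could be substituted for `pearson`, for instance via `scipy.stats.spearmanr` and `scipy.stats.kendalltau` if SciPy is available.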


Social network; Event detection; Feature extraction; Correlation; SVD



Copyright information

© Springer-Verlag GmbH Austria, part of Springer Nature 2018

Authors and Affiliations

  • Ahmad Hany Hossny (1)
  • Terry Moschuo (2)
  • Grant Osborne (2)
  • Lewis Mitchell (1)
  • Nick Lothian (2)
  1. School of Mathematical Sciences, University of Adelaide, Adelaide, Australia
  2. Data to Decision Research Collaboration Centre, Adelaide, Australia
