Text Mining and Real-Time Analytics of Twitter Data: A Case Study of Australian Hay Fever Prediction

  • Sudha Subramani
  • Sandra Michalska
  • Hua Wang
  • Frank Whittaker
  • Benjamin Heyward
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11148)


Social media platforms such as Twitter contain a wealth of user-generated data and have over time become a virtual treasure trove of information for knowledge discovery, with applications in healthcare, politics, and social initiatives, to name a few. Despite the evident benefits of exploring tweets, processing such data poses numerous challenges owing to the specific characteristics of tweets. This study gives a brief overview of the steps involved in manipulating Twitter data and offers examples of the machine learning algorithms most commonly used in text analysis. It concludes with a case study on Australian hay fever prediction that applies selected techniques described in the overview. The case study demonstrates real-time Twitter analytics for health condition surveillance, using interactive visualisations to assist knowledge discovery and the dissemination of findings. The results demonstrate the potential of social media to play an important role in extracting meaningful results and providing guidance for decision makers.


Keywords: Twitter, Machine learning, Text mining, Information retrieval, Knowledge discovery



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Sudha Subramani (1)
  • Sandra Michalska (1)
  • Hua Wang (1)
  • Frank Whittaker (1)
  • Benjamin Heyward (2)
  1. Institute for Sustainable Industries and Liveable Cities, Victoria University, Melbourne, Australia
  2. Nexus Online Pty Ltd., Greenock, Australia
