An Ontology for Generalized Disease Incidence Detection on Twitter

  • Mark Abraham Magumba
  • Peter Nabende
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10334)


In this paper, we present an ontology of disease-related concepts designed for detecting disease incidence in tweets. Unlike previous keyword-based systems and topic modeling approaches, our ontological approach lets us apply more stringent criteria, such as spatial and temporal characteristics, when determining which messages are relevant, while giving a stronger guarantee that the resulting models will perform well on new data that may be lexically divergent. We achieve this by training supervised learners on concepts rather than on individual words. Effectively, we map every possible word to a fixed-length lexicon of concepts, thereby eliminating lexical divergence between the training data and new data. For training, we use a dataset containing mentions of influenza, the common cold and Listeria, and we use the learned models to classify datasets containing mentions of an arbitrary selection of other diseases. We show that our ontological approach yields models whose performance is not only good but also stable on lexically divergent data, compared with a word-level unigram bag-of-words baseline. We also show that word vectors can be learned directly from our concepts to achieve even better results.
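The idea of mapping words to a fixed-length concept lexicon can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the concept names, the `WORD_TO_CONCEPT` dictionary and the helper functions are hypothetical and are not the paper's ontology or implementation. The sketch only shows why two tweets about different diseases, sharing almost no vocabulary, can still produce feature vectors of the same length and similar shape.

```python
from collections import Counter

# Hypothetical toy concept lexicon (NOT the ontology described in the paper).
# Its fixed order defines the positions of the feature vector.
CONCEPTS = ["DISEASE", "SYMPTOM", "SELF_REFERENCE", "TEMPORAL", "OTHER"]

# Hypothetical word-to-concept lookup; a real ontology would be far richer.
WORD_TO_CONCEPT = {
    "flu": "DISEASE", "influenza": "DISEASE", "cold": "DISEASE",
    "listeria": "DISEASE", "malaria": "DISEASE", "cholera": "DISEASE",
    "fever": "SYMPTOM", "cough": "SYMPTOM", "headache": "SYMPTOM",
    "i": "SELF_REFERENCE", "my": "SELF_REFERENCE", "me": "SELF_REFERENCE",
    "today": "TEMPORAL", "yesterday": "TEMPORAL", "now": "TEMPORAL",
}


def to_concepts(text):
    """Replace each whitespace-separated token with its concept label."""
    tokens = text.lower().split()
    # Unknown words fall back to OTHER, so every word maps to some concept.
    return [WORD_TO_CONCEPT.get(tok.strip(".,!?#@"), "OTHER") for tok in tokens]


def concept_vector(text):
    """Count concepts and return a vector with one slot per concept."""
    counts = Counter(to_concepts(text))
    return [counts[c] for c in CONCEPTS]


# A tweet resembling the training diseases, and one about an unseen disease.
train_tweet = "I have the flu and a fever today"
new_tweet = "My malaria headache started yesterday, ugh"

print(concept_vector(train_tweet))
print(concept_vector(new_tweet))
```

Because the vector length depends only on the number of concepts, a classifier trained on such vectors never encounters an out-of-vocabulary feature at prediction time; whether that translates into stable accuracy is what the paper's experiments evaluate.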


Keywords: Epidemiology, Twitter, Sentiment analysis, Text classification, Concept ontology, Data mining, Knowledge engineering



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Information Systems, College of Computing and Information Sciences, Makerere University, Kampala, Uganda
