Smart Health, pp. 209–235

Linking Biomedical Data to the Cloud

  • Stefan Zwicklbauer
  • Christin Seifert
  • Michael Granitzer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8700)


Applying Knowledge Discovery and Data Mining approaches is the basis for realizing the vision of Smart Hospitals. For instance, the automated creation of high-quality knowledge bases from clinical reports is important for supporting the decision-making processes of clinicians. One subtask of creating such structured knowledge is entity disambiguation, which establishes links by identifying, from a set of candidate meanings, the correct semantic meaning of a text fragment. This paper provides a concise overview of entity disambiguation in the biomedical domain, with a focus on annotated corpora (e.g. CalbC), term disambiguation algorithms (e.g. abbreviation disambiguation), and gene and protein disambiguation algorithms (e.g. inter-species gene name disambiguation). Finally, we outline open problems and challenges that we expect future research to address.
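To make the task definition above concrete, the following is a minimal sketch of disambiguation as candidate ranking: given a text fragment and a set of candidate meanings, pick the candidate whose description is most similar to the surrounding context. The scoring function (bag-of-words cosine similarity) is a simple stand-in chosen for illustration, not necessarily a method covered by the paper. The function names, the candidate identifiers, and the descriptions are all hypothetical and do not come from a real knowledge base.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def disambiguate(context, candidates):
    """Rank candidate meanings by similarity to the context.

    `candidates` maps a (hypothetical) entity id to a short description.
    Returns the best-scoring id and the full score table.
    """
    ctx = Counter(tokenize(context))
    scores = {
        cid: cosine(ctx, Counter(tokenize(desc)))
        for cid, desc in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Toy candidate meanings for the abbreviation "RA". The ids and descriptions
# are invented for illustration only.
CANDIDATES = {
    "ra:rheumatoid_arthritis": "chronic inflammatory disease of the joints",
    "ra:right_atrium": "upper right chamber of the heart receiving venous blood",
}

best, scores = disambiguate(
    "The patient with RA reported swollen joints and chronic inflammatory pain.",
    CANDIDATES,
)
print(best)
```

In this toy setting the context shares more tokens with the first description than with the second, so the first candidate is selected. A realistic system would draw candidates and descriptions from a curated knowledge base and would likely use a stronger similarity model.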


Keywords: Linked data cloud, Entity disambiguation, Text annotation, Natural language processing, Knowledge bases



The presented work was developed within the EEXCESS project funded by the European Union Seventh Framework Programme FP7/2007–2013 under grant agreement number 600601.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Stefan Zwicklbauer (1)
  • Christin Seifert (1)
  • Michael Granitzer (1)
  1. University of Passau, Passau, Germany
