Identifying Clinical Terms in Free-Text Notes Using Ontology-Guided Machine Learning

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2019)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11467)

Abstract

Objective: Automatic recognition of medical concepts in unstructured text is an important component of many clinical and research applications, and its accuracy has a large impact on electronic health record analysis. The mining of such terms is complicated by the broad use of synonyms and non-standard terms in medical documents. Here we present a machine learning model for concept recognition in large unstructured text that optimizes the use of ontological structures and can identify previously unobserved synonyms for concepts in the ontology.

Materials and Methods: We present a neural dictionary model that predicts whether a phrase is synonymous with a concept in a reference ontology. Our model, called Neural Concept Recognizer (NCR), uses a convolutional neural network together with the taxonomy structure to encode input phrases, then ranks medical concepts by their similarity in that embedding space. It also uses the biomedical ontology structure to optimize the embedding of various terms and has fewer training constraints than previous methods. We train our model on two biomedical ontologies, the Human Phenotype Ontology (HPO) and SNOMED-CT.
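The encode-then-rank idea described above can be illustrated with a minimal sketch. It is not the NCR implementation: the abstract does not specify the architecture in detail, so the toy ontology, the untrained random weights, the width-2 convolution with ELU and max-over-time pooling, and the choice to build each concept embedding by summing the raw vectors of the concept and its ancestors are all assumptions made for illustration. Every name in the block (PARENTS, encode_phrase, rank_concepts, and so on) is hypothetical.

```python
import math
import random

# Hypothetical toy ontology (child -> parents); not real HPO identifiers.
PARENTS = {
    "root": [],
    "cardiac_abnormality": ["root"],
    "arrhythmia": ["cardiac_abnormality"],
    "tachycardia": ["arrhythmia"],
}
DIM = 8
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def ancestors(concept):
    """Return the concept together with all of its ancestors."""
    seen, stack = set(), [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


# Assumption: a concept's embedding sums its own raw vector with those of
# its ancestors, so taxonomically related concepts share components.
RAW = {c: rand_vec(DIM) for c in PARENTS}


def concept_embedding(concept):
    vecs = [RAW[a] for a in sorted(ancestors(concept))]
    return [sum(col) for col in zip(*vecs)]


# Untrained stand-ins for word vectors and one width-2 convolution layer.
WORD_VECS = {}
FILTERS = [rand_vec(2 * DIM) for _ in range(DIM)]


def word_vec(word):
    if word not in WORD_VECS:
        WORD_VECS[word] = rand_vec(DIM)
    return WORD_VECS[word]


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def encode_phrase(phrase):
    """Width-2 convolution over word vectors, ELU, max-over-time pooling."""
    words = [word_vec(w) for w in phrase.lower().split()]
    words.append([0.0] * DIM)  # pad so a one-word phrase yields one window
    windows = [words[i] + words[i + 1] for i in range(len(words) - 1)]
    feats = [[elu(sum(f * x for f, x in zip(flt, win))) for flt in FILTERS]
             for win in windows]
    return [max(col) for col in zip(*feats)]


def rank_concepts(phrase):
    """Return (concept, probability) pairs, most similar first."""
    enc = encode_phrase(phrase)
    scores = {c: sum(e * v for e, v in zip(enc, concept_embedding(c)))
              for c in PARENTS}
    top = max(scores.values())  # subtract the max for a stable softmax
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return sorted(((c, e / total) for c, e in exps.items()),
                  key=lambda pair: pair[1], reverse=True)


ranking = rank_concepts("fast heart rate")
```

Because the weights are random, the resulting order carries no meaning; the sketch only shows the data flow. In a trained model, the filters, word vectors, and raw concept vectors would be learned so that the names and synonyms listed in the ontology score highest for their own concept.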

Results: We tested our model trained on HPO on two different data sets: 288 annotated PubMed abstracts and 39 clinical reports. We also tested our model trained on SNOMED-CT on 2,000 MIMIC-III ICU discharge summaries. Our experiments show that the model achieves high accuracy and that using the taxonomy structure of the ontology benefits concept recognition.

Conclusion: Most popular medical concept recognizers rely on rule-based models, which cannot generalize well to unseen synonyms. In addition, most machine learning methods require large corpora of annotated text that cover all classes of concepts, which can be extremely difficult to obtain for biomedical ontologies. Without relying on large-scale labeled training data or requiring any custom training, our model can efficiently generalize to new synonyms and performs as well as or better than state-of-the-art methods custom built for specific ontologies.

Acknowledgements

We thank Michael Glueck for his valuable comments and discussions. We also thank Tudor Groza for his helpful comments and for providing us the BioLarK API used for the experiments.

Author information

Corresponding author

Correspondence to Michael Brudno.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Arbabi, A., Adams, D.R., Fidler, S., Brudno, M. (2019). Identifying Clinical Terms in Free-Text Notes Using Ontology-Guided Machine Learning. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science, vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_2

  • DOI: https://doi.org/10.1007/978-3-030-17083-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-17082-0

  • Online ISBN: 978-3-030-17083-7

  • eBook Packages: Computer Science (R0)
