Identifying Clinical Terms in Free-Text Notes Using Ontology-Guided Machine Learning

Arbabi, Aryan; Adams, David R.; Fidler, Sanja; Brudno, Michael

doi:10.1007/978-3-030-17083-7_2

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11467)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

Objective: Automatic recognition of medical concepts in unstructured text is an important component of many clinical and research applications, and its accuracy has a large impact on electronic health record analysis. The mining of such terms is complicated by the broad use of synonyms and non-standard terms in medical documents. Here we present a machine learning model for concept recognition in large unstructured text that optimizes the use of ontological structures and can identify previously unobserved synonyms for concepts in the ontology.

Materials and Methods: We present a neural dictionary model that can be used to predict whether a phrase is synonymous with a concept in a reference ontology. Our model, called Neural Concept Recognizer (NCR), uses a convolutional neural network and the taxonomy structure to encode input phrases, then ranks medical concepts based on their similarity in that space. It also uses the biomedical ontology structure to optimize the embedding of various terms and has fewer training constraints than previous methods. We train our model on two biomedical ontologies, the Human Phenotype Ontology (HPO) and SNOMED-CT.
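
The ranking step described above can be illustrated with a minimal sketch. It is not the NCR implementation. The toy taxonomy, the one-hot raw vectors, the ancestor-sum rule used to inject taxonomy information, and the hand-written phrase vector (a stand-in for the output of a convolutional encoder) are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical toy taxonomy (child -> parents); not taken from HPO or SNOMED-CT.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "heart_abnormality": ["root"],
    "arrhythmia": ["heart_abnormality"],
    "tachycardia": ["arrhythmia"],
    "bradycardia": ["arrhythmia"],
}


def ancestors(concept: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the concept itself plus every ancestor reachable via parents."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def taxonomy_embeddings(
    raw: Dict[str, List[float]], parents: Dict[str, List[str]]
) -> Dict[str, List[float]]:
    """Assumed rule: a concept's vector is the sum of its own raw vector
    and the raw vectors of all its ancestors, so related concepts overlap."""
    dim = len(next(iter(raw.values())))
    out: Dict[str, List[float]] = {}
    for concept in parents:
        vec = [0.0] * dim
        for anc in ancestors(concept, parents):
            for i, x in enumerate(raw[anc]):
                vec[i] += x
        out[concept] = vec
    return out


def rank_concepts(
    phrase_vec: List[float], concept_vecs: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score concepts by dot product with the phrase vector, apply a softmax,
    and return (concept, probability) pairs sorted from best to worst."""
    scores = {
        c: sum(p * v for p, v in zip(phrase_vec, vec))
        for c, vec in concept_vecs.items()
    }
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return sorted(
        ((c, e / total) for c, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# One-hot raw vectors, one dimension per concept (illustrative only).
NAMES = list(PARENTS)
RAW = {c: [1.0 if i == j else 0.0 for j in range(len(NAMES))]
       for i, c in enumerate(NAMES)}
CONCEPT_VECS = taxonomy_embeddings(RAW, PARENTS)

# Hand-written stand-in for the encoder output of a phrase such as
# "fast heart rate"; a trained encoder would produce this vector.
phrase_vec = [0.0, 0.2, 0.5, 1.0, 0.1]
ranked = rank_concepts(phrase_vec, CONCEPT_VECS)
for concept, prob in ranked:
    print(f"{concept}: {prob:.3f}")
```

In this sketch, sibling concepts such as the two arrhythmia subtypes share most of their vector components, which is one simple way a taxonomy could shape the embedding space that phrases are ranked in.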

Results: We tested our model trained on HPO on two different data sets: 288 annotated PubMed abstracts and 39 clinical reports. We also tested our model trained on SNOMED-CT on 2000 MIMIC-III ICU discharge summaries. The results of our experiments show the high accuracy of our model, as well as the value of utilizing the taxonomy structure of the ontology in concept recognition.

Conclusion: Most popular medical concept recognizers rely on rule-based models, which cannot generalize well to unseen synonyms. In addition, most machine learning methods require large corpora of annotated text that cover all classes of concepts, which can be extremely difficult to obtain for biomedical ontologies. Without relying on large-scale labeled training data or requiring any custom training, our model can efficiently generalize to new synonyms and performs as well as or better than state-of-the-art methods custom built for specific ontologies.

Acknowledgements

We thank Michael Glueck for his valuable comments and discussions. We also thank Tudor Groza for his helpful comments and for providing us with the BioLarK API used for the experiments.

Author information

Authors and Affiliations

Department of Computer Science, University of Toronto, Toronto, ON, Canada
Aryan Arbabi, Sanja Fidler & Michael Brudno
Center for Computational Medicine, Hospital for Sick Children, Toronto, ON, Canada
Aryan Arbabi & Michael Brudno
Vector Institute, Toronto, ON, Canada
Aryan Arbabi, Sanja Fidler & Michael Brudno
Section on Human Biochemical Genetics, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
David R. Adams

Corresponding author

Correspondence to Michael Brudno.

Editor information

Editors and Affiliations

Tufts University, Cambridge, MA, USA
Lenore J. Cowen

About this paper

Cite this paper

Arbabi, A., Adams, D.R., Fidler, S., Brudno, M. (2019). Identifying Clinical Terms in Free-Text Notes Using Ontology-Guided Machine Learning. In: Cowen, L. (ed.) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science, vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_2

DOI: https://doi.org/10.1007/978-3-030-17083-7_2
Published: 02 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer Science, Computer Science (R0)
