Aligning Biomedical Metadata with Ontologies Using Clustering and Embeddings
The metadata about scientific experiments published in online repositories have been shown to suffer from a high degree of representational heterogeneity: there are often many ways to represent the same type of information, such as a geographical location via its latitude and longitude. To harness the potential that metadata have for discovering scientific data, it is crucial that they be represented in a uniform way that can be queried effectively. One step toward uniformly represented metadata is to normalize the multiple, distinct field names used in metadata to describe the same type of value (e.g., "lat lon" and "lat and long"). To that end, we present a new method based on clustering and embeddings (i.e., vector representations of words) to align metadata field names with ontology terms. We apply our method to biomedical metadata by generating embeddings for terms in biomedical ontologies from the BioPortal repository. We carried out a comparative study between our method and the NCBO Annotator, which revealed that our method yields more and substantially better alignments between metadata and ontology terms.
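As an illustration of the general idea of aligning field names with ontology terms through vector similarity, the following is a minimal sketch and not the method evaluated in this chapter. It assumes a toy stand-in for learned embeddings (character-trigram counts), omits the clustering step entirely, and uses hypothetical names throughout (`trigram_vector`, `cosine`, `align`, and the example label list are not taken from the paper or from BioPortal).

```python
from collections import Counter
from math import sqrt


def trigram_vector(text):
    """Toy stand-in for a learned embedding: character-trigram counts.

    Assumption: the chapter uses embeddings learned from text; this is
    only a placeholder so the sketch runs without external resources.
    """
    padded = "  " + text.lower().replace("_", " ") + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def align(field_names, ontology_labels):
    """Map each field name to its most similar ontology label.

    Returns {field_name: (best_label, similarity)}.
    """
    term_vecs = {label: trigram_vector(label) for label in ontology_labels}
    result = {}
    for name in field_names:
        vec = trigram_vector(name)
        best = max(term_vecs, key=lambda label: cosine(vec, term_vecs[label]))
        result[name] = (best, cosine(vec, term_vecs[best]))
    return result


if __name__ == "__main__":
    # Hypothetical ontology labels, chosen only for illustration.
    labels = ["latitude", "disease", "tissue"]
    for field, (label, score) in align(["lat_lon", "lat and long"], labels).items():
        print(f"{field!r} -> {label!r} (similarity {score:.2f})")
```

With real embeddings the lookup would stay the same in shape; only `trigram_vector` would be replaced by a vector lookup, and a clustering step could group field-name variants before alignment.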
Keywords: Biomedical metadata, Ontologies, Alignment, Embeddings
This work is supported by grant U54 AI117925 awarded by the U.S. National Institute of Allergy and Infectious Diseases (NIAID) through funds provided by the Big Data to Knowledge (BD2K) initiative. BioPortal has been supported by the NIH Common Fund under grant U54 HG004028.
We thank the experts on our evaluation panel for their participation: John Graybeal, Josef Hardi, Marcos Martínez-Romero, and Csongor Nyulas, all of whom are from the Center for Biomedical Informatics Research at Stanford University.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.