Aligning Biomedical Metadata with Ontologies Using Clustering and Embeddings

  • Rafael S. Gonçalves
  • Maulik R. Kamdar
  • Mark A. Musen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11503)


The metadata about scientific experiments published in online repositories have been shown to suffer from a high degree of representational heterogeneity: there are often many ways to represent the same type of information, such as a geographical location via its latitude and longitude. To harness the potential that metadata have for discovering scientific data, it is crucial that they be represented in a uniform way that can be queried effectively. One step toward uniformly represented metadata is to normalize the multiple, distinct field names used in metadata to describe the same type of value (e.g., "lat lon" and "lat and long"). To that end, we present a new method based on clustering and embeddings (i.e., vector representations of words) to align metadata field names with ontology terms. We apply our method to biomedical metadata by generating embeddings for terms in biomedical ontologies from the BioPortal repository. We carried out a comparative study between our method and the NCBO Annotator, which revealed that our method yields more, and substantially better, alignments between metadata and ontology terms.
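To make the alignment idea concrete, the following is a minimal sketch of matching field names to ontology term labels by vector similarity. It is not the method of the paper: the clustering step is omitted, and a bag of character trigrams stands in for learned embeddings. The function names (`embed`, `cosine`, `align`), the threshold of 0.2, and the term labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for a learned embedding: a bag of character n-grams."""
    # Lowercase and turn every non-alphanumeric character into a space,
    # so that "lat_lon" and "lat lon" produce the same vector.
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    padded = f" {' '.join(cleaned.split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(field_names, term_labels, threshold=0.2):
    """Map each field name to its most similar term label, or None.

    The threshold is an assumed cut-off below which no alignment is made.
    """
    term_vecs = {label: embed(label) for label in term_labels}
    result = {}
    for name in field_names:
        vec = embed(name)
        best = max(term_vecs, key=lambda label: cosine(vec, term_vecs[label]))
        result[name] = best if cosine(vec, term_vecs[best]) >= threshold else None
    return result


# Field names "lat lon" and "lat and long" come from the abstract;
# the remaining names and all term labels are hypothetical.
fields = ["lat lon", "lat and long", "tissue type", "zzz"]
terms = ["latitude and longitude", "tissue", "disease"]
alignments = align(fields, terms)
for field, term in alignments.items():
    print(f"{field!r} -> {term!r}")
```

In this toy setting both spellings of the location field map to the same label, and a field name with no plausible match is left unaligned. Embeddings learned from a text corpus or an ontology would replace `embed` in a realistic setting, since character overlap cannot capture synonyms that share no letters.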


Biomedical metadata, Ontologies, Alignment, Embeddings



This work is supported by grant U54 AI117925 awarded by the U.S. National Institute of Allergy and Infectious Diseases (NIAID) through funds provided by the Big Data to Knowledge (BD2K) initiative. BioPortal has been supported by the NIH Common Fund under grant U54 HG004028.

We thank the experts in our evaluation panel for their participation: John Graybeal, Josef Hardi, Marcos Martínez-Romero, and Csongor Nyulas, all of whom are from the Center for Biomedical Informatics Research at Stanford University.



Copyright information

© Springer Nature Switzerland AG 2019

Open Access. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Rafael S. Gonçalves (1)
  • Maulik R. Kamdar (1)
  • Mark A. Musen (1)
  1. Center for Biomedical Informatics Research, Stanford University, Stanford, USA
