Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data

  • Anna Bernasconi
  • Stefano Ceri
  • Alessandro Campi
  • Marco Masseroli
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10650)


Many repositories of open data for genomics, collected by world-wide consortia, are important enablers of biological research; moreover, all experimental datasets leading to publications in genomics must be deposited in public repositories and made available to the research community. Biologists typically use these datasets to validate or enrich their own experiments, and the content of each dataset is documented by metadata. However, the emphasis on data sharing is not matched by accuracy in data documentation: metadata are not standardized across sources and are often unstructured and incomplete.

In this paper, we propose a conceptual model of genomic metadata, whose purpose is to support querying the underlying data sources in order to locate relevant experimental datasets. First, we analyze the most typical metadata attributes of genomic sources and define their semantic properties. Then, we use a top-down method to build a global-as-view integrated schema by abstracting the most important conceptual properties of genomic sources. Finally, we describe the validation of the conceptual model by mapping it to three well-known data sources: TCGA, ENCODE, and Gene Expression Omnibus.
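As a minimal sketch of the global-as-view idea mentioned above, each attribute of a global schema can be defined as a view over the native metadata of each source, so that one query runs across all sources. The global attributes (`tissue`, `technique`), the per-source field names, and the sample records below are illustrative placeholders only; they are assumptions, not the schema proposed in the paper nor the sources' actual metadata fields.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical global attributes (assumption, not the paper's schema).
GLOBAL_ATTRIBUTES = ("tissue", "technique")

# Global-as-view: every global attribute is expressed as a view
# (here, a function) over one source's native metadata record.
# All native field names are made-up placeholders.
MAPPINGS: Dict[str, Dict[str, Callable[[Record], str]]] = {
    "TCGA": {
        "tissue": lambda r: r.get("tumor_tissue_site", ""),
        "technique": lambda r: r.get("experimental_strategy", ""),
    },
    "ENCODE": {
        "tissue": lambda r: r.get("biosample_term_name", ""),
        "technique": lambda r: r.get("assay", ""),
    },
    "GEO": {
        "tissue": lambda r: r.get("characteristics_tissue", ""),
        "technique": lambda r: r.get("library_strategy", ""),
    },
}


def to_global(source: str, record: Record) -> Record:
    """Evaluate the views of one source on one native record."""
    views = MAPPINGS[source]
    row = {"source": source}
    for attr in GLOBAL_ATTRIBUTES:
        # Light normalization so values from different sources compare equal.
        row[attr] = views[attr](record).strip().lower()
    return row


def query(rows: List[Record], **conditions: str) -> List[Record]:
    """Return the rows whose global attributes match all conditions."""
    return [r for r in rows if all(r.get(k) == v for k, v in conditions.items())]


# Invented sample records, one per source.
rows = [
    to_global("TCGA", {"tumor_tissue_site": "Liver", "experimental_strategy": "RNA-Seq"}),
    to_global("ENCODE", {"biosample_term_name": "liver", "assay": "ChIP-seq"}),
    to_global("GEO", {"characteristics_tissue": "Brain ", "library_strategy": "RNA-Seq"}),
]

liver_rows = query(rows, tissue="liver")
```

With this arrangement a user writes conditions only against the global attributes; adding a further source would mean adding one entry to `MAPPINGS`, without changing the query code.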


Keywords: Conceptual model · Data integration · Genomics · Next Generation Sequencing · Open data



This research is funded by the ERC Advanced Grant project GeCo (Data-Driven Genomic Computing), 2016–2021.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Anna Bernasconi (1, corresponding author)
  • Stefano Ceri (1)
  • Alessandro Campi (1)
  • Marco Masseroli (1)
  1. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
