Semantic Web Approach to Database Integration in the Life Sciences

  • Kei-Hoi Cheung
  • Andrew K. Smith
  • Kevin Y. L. Yip
  • Christopher J. O. Baker
  • Mark B. Gerstein


This chapter describes the challenges involved in integrating databases that store diverse but related types of life sciences data. A major challenge is the syntactic and semantic heterogeneity of these databases, which creates a strong need to standardize syntactic and semantic data representations. We discuss how to address this need using emerging Semantic Web technologies based on the Resource Description Framework (RDF) standard. Two use cases, YeastHub and LinkHub, demonstrate how RDF database technology can be used to build data warehouses that facilitate the integration of genomic/proteomic data and identifiers.

Key words

RDF, database integration, Semantic Web, molecular biology



Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Kei-Hoi Cheung (1, 2, 3, 4)
  • Andrew K. Smith (4)
  • Kevin Y. L. Yip (4)
  • Christopher J. O. Baker (6, 7)
  • Mark B. Gerstein (4, 5)

  1. Yale Center for Medical Informatics, Yale University, USA
  2. Anesthesiology, Yale University, USA
  3. Genetics, Yale University, USA
  4. Computer Science, Yale University, USA
  5. Molecular Biophysics and Biochemistry, Yale University, USA
  6. Computer Science and Software Engineering, Concordia University, Canada
  7. Institute for Infocomm Research, Singapore
