Molecular Information Fusion in Ondex

  • Jan Taubert
  • Jacob Köhler


Current biological knowledge is buried in hundreds of proprietary and public life-science databases available on the World Wide Web (WWW) and in millions of scientific publications. Gaining access to this knowledge can be difficult because each database may provide different tools for querying or displaying its data, may differ in structure and user interface, and may interpret biological knowledge differently from the others. Systems approaches to biological research require that existing biological knowledge (data) be made available to support both the analysis of experimental results and the construction and enrichment of models. Data integration methods are being developed to address these issues by providing a consolidated view of molecular information fused together from multiple databases. However, a key challenge for data integration is identifying links between closely related entries in different life-science databases when no direct information provides a reliable cross-reference. Here we describe and evaluate three data integration methods that address this challenge in the context of a graph-based data integration framework (the Ondex system). We give a quantitative evaluation of their performance in two situations: the integration and analysis of different metabolic pathway resources, and the mapping of equivalent elements between the Gene Ontology and a nomenclature describing enzyme function.


Data integration, Systems Biology, Life sciences, Biological databases, Molecular information, Ontologies, Pathways, Ondex



We would like to thank all current and previous contributors to the Ondex system (see The main part of this work has been carried out at Rothamsted Research. Rothamsted Research receives grant-in-aid from the Biotechnology and Biological Sciences Research Council (BBSRC). This work was supported by BBSRC SABR award BB/F006039/1 and TSB project TP 5082–33372. JT would also like to thank EMBL-EBI for allowing time to write this chapter.



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK
  2. Dow AgroSciences LLC, Indianapolis, USA
