Abstract
A prerequisite to systems biology is the integration of heterogeneous experimental data, which are stored in numerous life-science databases. However, a wide range of obstacles that relate to access, handling and integration impede the efficient use of the contents of these databases. Addressing these issues will not only be essential for progress in systems biology, it will also be crucial for sustaining the more traditional uses of life-science databases.
Acknowledgements
The authors would like to thank C. Rawlings and P. Verrier for commenting on an earlier version of this article. Furthermore, we would like to thank the following individuals for exploring with us the pitfalls of life-science databases over the past years: J. Baumbach, J. Butz, E. Kirchem, F. Klingert, S. Knop, B. Kormeier, I. Kupp, A. Neu, A. Rüegg, A. Skusa, B. Steuernagel, J. Taubert, P. Verrier and R. Winnenburg. S.P. gratefully acknowledges funding by the European Science Foundation. Rothamsted Research receives grant-aided support from the UK Biotechnology and Biological Sciences Research Council.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Related links
FURTHER INFORMATION
BioPAX — Biological Pathways Exchange
EC (enzyme class) numbers of the enzyme nomenclature
European Bioinformatics Institute SRS server
European Bioinformatics Institute
Extensible Markup Language (XML)
Kyoto Encyclopedia of Genes and Genomes
Microarray Gene Expression Data Society
Nucleic Acids Research Database Categories List
Open Source Initiative License Index
Proteomics Standards Initiative — molecular interaction
Glossary
- Controlled vocabulary: A standardized set of terms that can be used in a given application domain. A prominent example is the enzyme class nomenclature, which describes classes of biochemical reaction.
- Database management system: A system that provides a means of storing, modifying and extracting data from a database.
- Evidence code: A controlled vocabulary that is used to track the types of evidence that support a gene annotation.
- Flat file: A human-readable, non-standardized file that can be used to exchange the contents of life-science databases.
- Ontology: A commonly agreed definition of real-world concepts, such as 'protein' and 'enzyme', and their particular relationships, for example, an enzyme 'is a' protein.
- Parser: Software that reads a given input, such as a flat file, for further processing.
- Web service: A standardized way to allow for interoperable machine-to-machine interaction over a network.
- XML: The extensible markup language (XML) is a standard for the creation of application-specific, self-descriptive markup languages, which, for example, can be used for the definition of data-exchange formats.
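The glossary's notions of a flat file and a parser can be made concrete with a minimal sketch. The line-code layout used here (a short code, whitespace, a value, and `//` closing each record) is an assumption chosen for illustration and is not the format of any particular database; the function name `parse_flat_file` and the sample records are likewise hypothetical.

```python
from typing import Dict, Iterator, List


def parse_flat_file(text: str) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping each line code to its values.

    Assumed (hypothetical) layout: "<CODE>   <value>" per line,
    with a line starting with "//" terminating each record.
    """
    record: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            if record:
                yield record
            record = {}
            continue
        # Split the line code from the rest of the line at the first space.
        code, _, value = line.partition(" ")
        record.setdefault(code, []).append(value.strip())
    if record:
        # Tolerate a final record that lacks a terminator.
        yield record


# Hypothetical sample data, invented for this example.
SAMPLE = """\
ID   EXAMPLE_1
DE   Hypothetical enzyme
EC   1.1.1.1
//
ID   EXAMPLE_2
DE   Hypothetical transporter
//
"""

records = list(parse_flat_file(SAMPLE))
for rec in records:
    print(rec["ID"][0], rec.get("EC", ["no EC number"])[0])
```

Because flat files are non-standardized, a parser like this has to be written and maintained separately for each source format, which is one reason self-descriptive exchange formats such as XML are attractive.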
Cite this article
Philippi, S., Köhler, J. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet 7, 482–488 (2006). https://doi.org/10.1038/nrg1872
DOI: https://doi.org/10.1038/nrg1872
- Springer Nature Limited