A survey of metabolic databases emphasizing the MetaCyc family
- Cite this article as:
- Karp, P.D. & Caspi, R. Arch Toxicol (2011) 85: 1015. doi:10.1007/s00204-011-0705-2
Thanks to the confluence of genome sequencing and bioinformatics, the number of metabolic databases has expanded from a handful in the mid-1990s to several thousand today. These databases lie within distinct families that have common ancestry and common attributes. The main families are the MetaCyc, KEGG, Reactome, Model SEED, and BiGG families. We survey these database families, as well as important individual metabolic databases, including multiple human metabolic databases. The MetaCyc family is described in particular detail. It contains well over 1,000 databases, including highly curated databases for Escherichia coli, Saccharomyces cerevisiae, Mus musculus, and Arabidopsis thaliana. These databases are available through a number of web sites that offer a range of software tools for querying and visualizing metabolic networks. These web sites also provide multiple tools for analysis of gene expression and metabolomics data, including visualization of those datasets on metabolic network diagrams and over-representation analysis of gene sets and metabolite sets.
Keywords: Metabolic databases, Bioinformatics, Metabolic pathways, Databases, Genome databases
A large number of metabolic databases have been developed during the last two decades to describe the known and predicted metabolism of a wide variety of organisms. Initially small in number, thousands of databases are now available and constitute an important resource for researchers in toxicology, metabolic engineering, drug discovery, and many other disciplines. This article surveys metabolic databases with an emphasis on the MetaCyc family.
It is the confluence of genome sequencing and bioinformatics that has yielded this large number of metabolic databases. Genome sequencing efforts have produced complete genome sequences for more than one thousand organisms to date, and the pace of sequencing is accelerating. Bioinformaticists have developed computational methods for predicting the locations and functions of genes in a sequenced genome, and the field of pathway bioinformatics has developed algorithms for predicting the presence of metabolic pathways in organisms with a sequenced genome. Methods for predicting gene functions and pathways take similar approaches: gene functions are predicted by programs that detect similarity between the sequences of newly sequenced genes and previously sequenced genes of known function. Analogously, computational pathway prediction methods recognize previously elucidated pathways based on the set of enzymes present in a genome.
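The idea of recognizing known pathways from the set of enzymes in a genome can be illustrated with a minimal sketch. It is not the PathoLogic algorithm; it simply assumes that a reference pathway is called present when a chosen fraction of its reactions has a matching enzyme annotation. The function name `score_pathways`, the threshold, and the identifiers `E1` to `E5` are hypothetical.

```python
from typing import Dict, Set


def score_pathways(
    reference_pathways: Dict[str, Set[str]],
    genome_functions: Set[str],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Return reference pathways whose enzyme coverage meets the threshold.

    Coverage is the fraction of a pathway's required enzyme functions
    that are annotated in the genome (an assumed, simplified criterion).
    """
    predicted = {}
    for name, required in reference_pathways.items():
        if not required:
            continue  # skip empty pathway definitions
        coverage = len(required & genome_functions) / len(required)
        if coverage >= threshold:
            predicted[name] = coverage
    return predicted


# Hypothetical toy data: enzyme functions are opaque labels.
reference = {
    "toy pathway A": {"E1", "E2", "E3"},
    "toy pathway B": {"E4", "E5"},
}
annotated = {"E1", "E2", "E5"}

predicted = score_pathways(reference, annotated, threshold=0.6)
```

With these toy inputs, pathway A has two of its three enzymes annotated and passes the 0.6 threshold, whereas pathway B has one of two and does not. A real predictor would weigh additional evidence, such as the expected taxonomic range of a pathway.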
Survey of metabolic databases
The main families of metabolic databases and their properties
The databases within each of the DB families listed in Table 1 have common properties. They share a common DB schema, and the same set of software tools is used to query and manipulate the DBs in each family. DBs in each family also tend to share common methodologies, such as the approach to curation. The DBs in the BiGG family were not computationally derived from a common reference DB—each DB was created through a manual process. But these DBs do share a common lineage in that all were created by the group of Dr. B. Palsson at the University of California San Diego (Schellenberger et al. 2010). The DBs in the Model SEED family (Henry et al. 2010) were created by the SEED team using a custom pipeline that involves a combination of automated and manual steps. In contrast, DBs within the KEGG (Kanehisa et al. 2010) and MetaCyc (Caspi et al. 2010) families [and Reactome, to a smaller degree (Croft et al. 2010)] were created by many different research groups that made use of the KEGG or MetaCyc reference DBs and the KEGG or MetaCyc software tools. The KEGG and MetaCyc families each contain more than one thousand organisms from all domains of life.
Metabolic DBs differ along a number of other dimensions in addition to their family membership. We consider the following additional dimensions: the types of data they contain, and the amount of manual curation they have received.
Types of data available
The MetaCyc, KEGG, and Model SEED families provide genome data in conjunction with their metabolic data, along with tools such as genome map viewers and access to nucleotide and amino acid sequence data. These tools allow users to view the chromosomal organization of genes coding for a given pathway. Reactome and BiGG do not provide genome data. In addition, some databases within the MetaCyc family (e.g., EcoCyc) provide extensive regulatory data.
KEGG and MetaCyc each provide approximately 9,000 biochemical reactions. Reactome provides 3,800 reactions for its human database, and the remaining databases are probably subsets of those 3,800 reactions since Reactome uses its human database as the reference for predicting reactions and pathways in other organisms.
The KEGG and MetaCyc families are based on fairly different notions of biological pathways (Green and Karp 2006). KEGG reference pathways are typically mosaics of related pathways and reactions from multiple species. KEGG pathways are typically 3–4 times larger than are MetaCyc pathways because MetaCyc pathways attempt to model individual biological pathways from individual organisms. For example, the KEGG pathway called “methionine metabolism” combines pathways for the biosynthesis of methionine, charging of methionyl-tRNA, and conversion of methionine to other compounds such as N-formyl-methionine. MetaCyc defines pathways that correspond to a single biological function, are regulated as a unit, and are conserved through evolution.
Databases in the BiGG and Model SEED families differ from the other families in containing constraint-based equilibrium models of the metabolic networks of each organism. The models in BiGG were manually constructed, whereas those in Model SEED were constructed computationally. These models can be used to predict essential metabolic genes and to predict growth media for an organism. Similar models will be available for the MetaCyc family in the near future.
The different families have different approaches to curation, meaning manual incorporation of information from the biomedical literature into their databases. All the families undergo some amount of manual curation. For the KEGG family, the majority of curation occurs for reference pathways only and includes only the basic pathway structure, reactions, and compounds. BiGG goes a bit further, sometimes including short commentary and literature references. Curated MetaCyc and Reactome family databases include multiparagraph minireviews for proteins and pathways and include extensive literature citations—this information helps explain the biological role of a pathway or an enzyme and clearly identifies the source of information. Curated MetaCyc family databases go still further in extracting many additional data from publications, including enzyme cofactors, activators, inhibitors, subunit composition, and kinetic constants, along with citations to the source of each datum.
Additional metabolic databases
Large collection of enzyme functional data
Official database of the International Union of Biochemistry and Molecular Biology (IUBMB) Enzyme Nomenclature List
Manually annotated database of biochemical reactions
Curated resource of metabolic pathways for the UniProtKB/Swiss-Prot knowledge base
Analysis and display tools in the main metabolic databases
Regulatory network browser
Full metabolic map browser
Zoomable metabolic map
Paint data onto metabolic map
Paint data onto pathway diagram
Automatic pathway layout
Pathway diagrams include metabolite structures
Flux balance analysis
Choke point analysis
Dead-end metabolite analysis
Metabolite tracing tool
Path search tool
Reachability analysis tool
Human metabolic pathway databases (we also note the existence of the MouseCyc metabolic pathway database for the laboratory mouse)
Edinburgh Human Metabolic Network
Manually curated, compartmentalized database of human reactions and pathways. Requires registration
Human Metabolome Database (HMDB)
Database of small molecule metabolites found in the human body
Manually curated PGDB of known and predicted metabolic pathways
Ingenuity Knowledge Base
Curated commercial database that spans signaling pathways, metabolic pathways, and protein interactions. Requires subscription
Manually curated global reconstruction of the human metabolic network. Requires registration
Manually curated, peer-reviewed pathway database
Contains reference pathways for metabolism, genetic and environmental information processing, cellular processes, organismal systems and human diseases. None of these has been curated specifically for human genes and proteins
Contains an archive of human pathways, in MAPP format
Tier 2 manually curated PGDB of both known and predicted metabolic pathways for the laboratory mouse
The MetaCyc family of metabolic databases
In conjunction with its role as a general reference on metabolism, the MetaCyc DB can be used as a reference DB for the PathoLogic component of the Pathway Tools software, which computationally predicts the metabolic network of any organism having a sequenced and annotated genome (Dale et al. 2010). In this automated process, a predicted metabolic network is created in the form of a Pathway/Genome Database (PGDB). MetaCyc has been used by SRI to create more than one thousand PGDBs (as of November 2010), which are available through the BioCyc web site at BioCyc.org. In addition, MetaCyc has been used by other scientists to create hundreds of additional PGDBs, many of which are available to the general public through the scientists’ own web sites. Since PGDBs created in this fashion share many properties, we refer to them in this manuscript as the MetaCyc family of databases.
Common attributes of DBs within the MetaCyc family are as follows. (1) They share a common lineage in that all were initially derived from MetaCyc through computational prediction of their metabolic pathways with reference to the pathways in MetaCyc. Some of the databases have undergone subsequent curation to add additional pathways and other information. (2) Databases in the MetaCyc family share a common database schema (the underlying database structure used to organize each database). (3) MetaCyc family DBs were created, updated, and are queried using a common software environment called the Pathway Tools software (Karp et al. 2010). As a result, databases within this family share a large degree of standardization and compatibility. For example, comparative pathway analysis tools within Pathway Tools can be used to compare any DBs within the MetaCyc family.
It is important to realize that MetaCyc is different from virtually all other PGDBs within the MetaCyc family in that MetaCyc is a multiorganism PGDB, whereas all other PGDBs within the family describe a single organism [the exception is PlantCyc (Zhang et al. 2010), a multiorganism PGDB that contains only plant information]. More specifically, MetaCyc contains metabolic pathways and enzymes from more than 2,000 organisms that have been curated from the experimental literature. MetaCyc contains only experimentally elucidated pathways, so as to provide a solid foundation for predicting the metabolic pathways of other organisms. In contrast, organism-specific PGDBs contain a mixture of computationally predicted pathways and (depending on the degree of curation) experimentally elucidated pathways and attempt to model the metabolic network of that organism as accurately as possible. For example, 67 experimentally elucidated pathways in MetaCyc (version 14.6) list Bacillus subtilis as a taxon known to possess the pathway. In contrast, the BioCyc PGDB for that organism, BsubCyc (version 1.4), contains 219 pathways as well as the complete genome for that organism. Unlike MetaCyc, which does not contain sequence information, the organism-specific databases of the MetaCyc family include the full genome sequence and provide an excellent platform for the integration of genome information with many other types of data regarding metabolism, regulation, and genetics.
We assign to PGDBs a rating of Tier 1, Tier 2, or Tier 3 to reflect the amount of manual curation that has been applied to that PGDB. Tier 3 PGDBs result from computational predictions only and underwent no manual curation. Tier 2 PGDBs have undergone less than 1 year of manual curation, and Tier 1 PGDBs have undergone more than 1 year of curation. Currently, there are only four Tier 1 databases—MetaCyc, EcoCyc, AraCyc and YeastCyc.
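The tier rule can be written as a small function. This is a sketch only: the text does not say how exactly 1 year of curation is classified, so the code assumes it counts as Tier 1, and the function name `pgdb_tier` is hypothetical.

```python
def pgdb_tier(curation_years: float) -> int:
    """Map the amount of manual curation (in years) to a PGDB tier.

    Assumption: exactly 1 year is treated as Tier 1, since the text
    only defines "less than" and "more than" 1 year.
    """
    if curation_years < 0:
        raise ValueError("curation time cannot be negative")
    if curation_years == 0:
        return 3  # computational predictions only
    if curation_years < 1:
        return 2  # less than 1 year of manual curation
    return 1      # 1 year or more of manual curation
```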
PGDBs curated by external groups that are available on the Internet
Cryptosporidium hominis TU502
Cryptosporidium parvum IOWA
Plasmodium berghei ANKA
Plasmodium chabaudi AS
Plasmodium falciparum 3D7
Plasmodium vivax Sal-1
Plasmodium yoelii 17XNL
Toxoplasma gondii ME49
EuPathDB (Eukaryotic pathogens database resources), USA
Universite de Lyon, France
Carnegie Institution, USA
33 BioEnergy organisms
BioEnergy Science Center, USA
Department of Genetics, Stanford U., USA
dictyBase, Northwestern U., USA
PGDBs for 23 organisms
Broad Institute, USA
Leishmania major Friedlin
Bio21 Institute, University of Melbourne, Australia
Samuel Roberts Noble Foundation, USA
PGDBs for 484 organisms
Laboratorio de Genomica e Expressao, Brazil
Pseudomonas Genome Project, Simon Fraser U., Canada
Center for Genomic Sciences, Mexico
Gramene curators, Cornell U. and CSHL, USA
Streptomyces coelicolor A3(2)
John Innes Centre, UK
18 Shewanella genomes
Marine Biological Laboratory, USA
Sol genomics network, USA
Integrated genetics and molecular biology for soybean researchers, USA
Taxonomically broad EST database
TBestDB group, Canada
Mycobacterium tuberculosis H37Rv
TB database, Stanford U., USA
Stanford U., USA
BioCyc is a collection of more than 1,000 organism-specific PGDBs that is available from SRI via BioCyc.org. Most of these PGDBs were generated by SRI, although some were created by other groups and are hosted by SRI.
Interested scientists may adopt and curate existing PGDBs through the BioCyc web site (http://biocyc.org/intro.shtml#adoption). To adopt a PGDB is to assume ongoing responsibility for updating and improving its content.
The MetaCyc database
MetaCyc (MetaCyc.org) is a highly curated multiorganism database of small-molecule metabolism. MetaCyc is unique among metabolic pathway databases in that it only contains data that have been experimentally demonstrated in the scientific literature (Caspi et al. 2010). The experimentally determined pathways and enzymes are tightly integrated with references, making MetaCyc a valuable resource in the field of metabolism, used by researchers from many disciplines, including biochemistry, molecular biology, biotechnology, bioinformatics, metabolic engineering, toxicology, and systems biology (Valdes et al. 2003; Kim et al. 2007; Aanensen et al. 2007; Bernal et al. 2009). The data in MetaCyc are derived from organisms representing all domains of life, with a particular emphasis on microbial and plant metabolism, and are backed by evidence codes and extensive commentary, which include citations of the original literature.
MetaCyc utilizes several ontologies, some of which were developed internally (for example, the pathway and cell component ontologies) and some developed externally (such as the NCBI organism taxonomy ontology). MetaCyc data are obtained from several sources. Most of the data have been manually curated from 26,800 publications (since the start of the MetaCyc project in 1997). In addition, other curated PGDBs periodically submit data to MetaCyc; for example, curated pathways and enzymes for E. coli and yeast are obtained periodically from EcoCyc and YeastCyc. MetaCyc also directly imports some data from other DBs: reactions assigned by the Enzyme Commission are imported from the ENZYME DB (Bairoch 2000) and from the ExplorEnz DB (McDonald et al. 2009), and some proteins (those that were imported from EcoCyc) contain protein feature data that were imported from UniProt.
The MetaCyc Pathway Ontology, a hierarchical classification system for metabolic pathways
Secondary metabolites biosynthesis (411)
Cofactors, prosthetic groups, electron carriers biosynthesis (179)
Fatty acids and lipids biosynthesis (115)
Amino acids biosynthesis (109)
Carbohydrates biosynthesis (87)
Cell structures biosynthesis (45)
Amines and polyamines biosynthesis (36)
Hormones biosynthesis (43)
Nucleosides and nucleotides biosynthesis (28)
Aromatic compounds biosynthesis (24)
Other biosynthesis (19)
Siderophore biosynthesis (17)
Metabolic regulators biosynthesis (5)
Aminoacyl-tRNA charging (4)
Amino acids degradation (165)
Aromatic compounds degradation (152)
Inorganic nutrients metabolism (86)
Secondary metabolites degradation (78)
Carbohydrates degradation (57)
Amines and polyamines degradation (48)
Chlorinated compounds degradation (39)
Carboxylates degradation (35)
C1 Compounds utilization and assimilation (26)
Nucleosides and nucleotides degradation (25)
Hormones degradation (23)
Fatty acids and lipids degradation (20)
Alcohols degradation (16)
Aldehyde degradation (12)
Polymeric compounds degradation (10)
Cofactors, prosthetic groups, electron carriers degradation (3)
Protein degradation (3)
Generation of precursor metabolites and energy (141)
Chemoautotrophic energy metabolism (15)
Electron transfer (12)
TCA cycle (8)
Pentose phosphate pathways (4)
Methylglyoxal detoxification (8)
Arsenate detoxification (4)
Antibiotic resistance (4)
Acid resistance (2)
Mercury detoxification (1)
Distribution of pathways in MetaCyc based on the taxonomic classification of associated species
MetaCyc version 14.6 contains 8,983 metabolic reactions. Of these, 5,446 reactions are assigned to metabolic pathways; the remaining reactions are not components of any pathway. Reactions may or may not have enzymes associated with them. Each reaction refers to its substrates as links to the corresponding compound entries in MetaCyc. Thus, each substrate is captured one time in MetaCyc and is referenced in every reaction using the same name and chemical structure. This basic principle of DB normalization ensures that MetaCyc does not contain duplicate information about the same compound and ensures that every compound will always have the same name and chemical structure in every reaction in which it appears.
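The normalization principle described above can be sketched with two record types: compounds are stored once, and reactions hold only compound identifiers. The class names, field names, and identifiers below are hypothetical and do not reflect the actual MetaCyc schema.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Compound:
    compound_id: str
    name: str


@dataclass(frozen=True)
class Reaction:
    reaction_id: str
    left: Tuple[str, ...]   # compound identifiers, not copies of compounds
    right: Tuple[str, ...]


# Each compound is stored exactly once, keyed by its identifier.
compounds: Dict[str, Compound] = {
    "CPD-A": Compound("CPD-A", "compound A"),
    "CPD-B": Compound("CPD-B", "compound B"),
    "CPD-C": Compound("CPD-C", "compound C"),
}

reactions: List[Reaction] = [
    Reaction("RXN-1", ("CPD-A",), ("CPD-B",)),
    Reaction("RXN-2", ("CPD-B",), ("CPD-C",)),
]


def participant_names(reaction: Reaction, table: Dict[str, Compound]) -> List[str]:
    """Resolve a reaction's compound identifiers to their current names."""
    return [table[cid].name for cid in reaction.left + reaction.right]


# Renaming compound B in one place is reflected in every reaction that uses it.
compounds["CPD-B"] = replace(compounds["CPD-B"], name="compound B (renamed)")
```

Because the reactions store identifiers rather than names, the single update to `CPD-B` is visible from both `RXN-1` and `RXN-2` without touching either reaction record.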
MetaCyc 14.6 contains 6,912 enzymes. MetaCyc curators attempt to capture the following information for each enzyme, although not all of this information is available for each enzyme in the literature. MetaCyc encodes the subunit structure of each enzyme (e.g., homodimer, heterotrimer) and the genes encoding each subunit. It captures the reaction(s) catalyzed by the enzyme, its cofactors, activators, and inhibitors and distinguishes the mechanism of activation or inhibition, as well as indicating which activators and inhibitors are thought to act in vivo. Kinetic constants are captured when available, as are molecular weight and pI. Web links to other biological DBs such as UniProt are provided.
MetaCyc 14.6 contains 8,869 compounds. The vast majority of compounds in MetaCyc include chemical structures. All MetaCyc structures have been protonated to pH 7.3 to represent a consistent and biologically relevant protonation state.
Releases of MetaCyc occur four times per year. On each release, MetaCyc is subject to a large number of computational validation checks, including searching for duplicate reactions and duplicate compounds, and searching for unbalanced reactions.
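Two of these checks, unbalanced reactions and duplicate reactions, can be illustrated with a minimal sketch. It assumes simple molecular formulas without parentheses or charges and treats a reaction and its reverse as duplicates; these are simplifying assumptions, and the function names and reaction identifiers are hypothetical rather than part of the MetaCyc validation code.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

Side = List[Tuple[int, str]]  # (stoichiometric coefficient, formula)


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'H2O' (no parentheses)."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_atoms(side: Side) -> Counter:
    """Total atom counts for one side of a reaction."""
    total: Counter = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(left: Side, right: Side) -> bool:
    """True when both sides contain the same atoms (charge is ignored)."""
    return side_atoms(left) == side_atoms(right)


def find_duplicates(reactions: Dict[str, Tuple[Side, Side]]) -> List[Tuple[str, str]]:
    """Pair each reaction with an earlier one that has the same two sides,
    regardless of the direction in which it was written."""
    seen: Dict[frozenset, str] = {}
    duplicates = []
    for rid, (left, right) in reactions.items():
        key = frozenset((frozenset(left), frozenset(right)))
        if key in seen:
            duplicates.append((seen[key], rid))
        else:
            seen[key] = rid
    return duplicates


toy_reactions = {
    "RXN-1": ([(2, "H2"), (1, "O2")], [(2, "H2O")]),
    "RXN-2": ([(2, "H2O")], [(2, "H2"), (1, "O2")]),  # reverse of RXN-1
    "RXN-3": ([(1, "H2"), (1, "O2")], [(1, "H2O")]),  # oxygen does not balance
}

unbalanced = [rid for rid, (l, r) in toy_reactions.items() if not is_balanced(l, r)]
duplicate_pairs = find_duplicates(toy_reactions)
```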
Uses of the MetaCyc family of PGDBs
When combined with the Pathway Tools software, PGDBs offer sophisticated tools for query, navigation, and analysis. In this section, we will cover a few of those tools.
Searching a PGDB
The main objects that users query for in a metabolic database are compounds, reactions, pathways, genes, and proteins. Querying PGDBs can be performed at different levels and by different mechanisms.
For simple searches, a “quick search” box is available at the upper right-hand corner of every web page. This type of search queries all object types simultaneously and is useful if you know the name (or part of the name) or database identifier of the object for which you are searching.
The quick search box can be used to search for genes, proteins, compounds, RNAs, reactions, pathways, operons, and GO terms. If the query string matches a single object, the page for that object will be displayed immediately. If there are multiple matches, the full list of matches will be shown, organized by the type of object. When users enter long text strings in the box, the search will return all objects that contain the text rather than match it exactly. To limit the results to exact matches, users can add the special flag search:exact at the end of the input string. For example, the search “d-glucose search:exact” will return the compound d-glucose, while the search “d-glucose” will return many results, for example, “abscisic acid glucose ester biosynthesis”.
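The difference between the default search and the search:exact flag can be sketched as follows. How Pathway Tools actually matches text is not described here; the sketch assumes that the default search ignores case, spaces, and punctuation and then looks for the query as a substring, an assumption that happens to reproduce the example above. The function names are hypothetical.

```python
import re
from typing import List

EXACT_FLAG = "search:exact"


def _normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def quick_search(query: str, object_names: List[str]) -> List[str]:
    """Return matching names; a trailing 'search:exact' requires equality."""
    query = query.strip()
    exact = query.lower().endswith(EXACT_FLAG)
    if exact:
        # Remove the flag and compare whole names, ignoring case only.
        wanted = query[: -len(EXACT_FLAG)].strip().lower()
        return [name for name in object_names if name.lower() == wanted]
    needle = _normalize(query)
    return [name for name in object_names if needle in _normalize(name)]


names = [
    "d-glucose",
    "abscisic acid glucose ester biosynthesis",
    "l-lysine",
]

loose = quick_search("d-glucose", names)
strict = quick_search("d-glucose search:exact", names)
```

Under this assumed normalization, the loose query matches both "d-glucose" and "abscisic acid glucose ester biosynthesis", whereas the flagged query returns only the compound itself.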
For intermediate searches, Pathway Tools provides specialized search pages for the main object types (Compounds, Genes/Proteins/RNAs, Reactions, and Pathways), available under the Search menu. In designing these pages, we tried to accommodate the search criteria that we expected users to need most often, so that such searches are simple and user-friendly. Each page contains options for searching by a number of different criteria, either individually or in combination.
The Genes/Proteins/RNAs search page has many search options, including searching by name, database identifier, or the protein’s EC number; by sequence length, replicon and/or gene map position; by a protein’s molecular weight, pI, or small molecule regulator, cofactor, substrate or ligand; by evidence code, cellular location, GO term, MultiFun term, or by organism (when searching MetaCyc or other multiorganism PGDBs). It is also possible to search by publication, using a PubMed ID, author name, or an article title. For example, searching HumanCyc for publications by author “Wilson PJ” returns two articles, associated with the products of the IDS and TSC2 genes.
Reactions can be searched by EC number or name, substrates or products, and ontology.
Pathways can be searched by name, ontology, number of reactions, compounds that participate in the pathway, evidence code, organism (for multiorganism PGDBs), expected taxonomic range, and publication. For example, searching HumanCyc for pathways that contain the substrate l-lysine returns three pathways—two l-lysine degradation pathways and one tRNA charging pathway.
Other types of simple searches
Pathway Tools offers several additional search options under the search menu, including the search of the full web site for a text string using Google Search (available only for websites that have been indexed by Google), browsing the different ontologies, and performing sequence searches using BLAST (currently not available in MetaCyc).
For more complex searches, users can use the Advanced Search tool, which permits writing queries that combine data from multiple organisms or multiple types of objects. To enable this powerful query tool, a dedicated query language, named BioVelo, has been developed (Latendresse and Karp 2010).
The structured advanced query page (SAQP)
The Structured Advanced Query Page (SAQP) enables the user to compose complex queries by selecting options from pull-down menus and combining them with simple text strings entered into text boxes. This interface enables the user to formulate a query without knowing the underlying query language. While not representing the full capabilities of the BioVelo language, this interface provides a simple way to construct a powerful range of queries.
In addition to composing the query, a user can specify the exact output format for the results by specifying any number of output columns and assigning the desired data fields to each column. Users can also select between HTML output, which permits viewing the results immediately in the web browser, and a text tabulated file that can be imported into spreadsheet programs. For example, searching HumanCyc for proteins curated with GO terms that contain the word “lysine” returns a list of 25 proteins that meet these criteria.
Navigation within PGDBs
A key feature of PGDBs served by Pathway Tools is connectivity among data objects. Almost all object displays are clickable, making it easy to navigate from one object to a related object. For example, in addition to displaying a chemical structure, links to other databases, and other compound-related data, the compound pages in PGDBs also include a list of all the reactions in the database in which the compound participates, as well as the pathways that include these reactions. Both the reactions and the pathway names are clickable, making it very easy to navigate from the compound page to the pathways that include that compound. Similarly, reaction pages list all the enzymes known to catalyze the reaction, the genes that encode those enzymes, and the pathways that include the reaction. Gene pages include the transcription units that contain the gene as well as a diagram of the gene local content, the enzyme encoded by the gene, the reactions catalyzed by the enzyme, relevant pathways, and buttons that allow the user to display the organism’s genome browser centered around the gene, display sequence information, compare orthologs from multiple organisms, and align orthologous genes in a multi-genome browser.
Analysis and display tools
Pathway Tools provides a plethora of data analysis and visualization capabilities to the MetaCyc family of PGDBs, including overview diagrams, ChIP-chip data visualization, omics viewers, enrichment analysis, comparative analysis of different organisms, and dead-end metabolite analysis.
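As an illustration of one of these analyses, dead-end metabolites can be found with a few set operations. The sketch assumes a common working definition, namely a metabolite that is only produced or only consumed within the network, and ignores reversibility, transport, and compartments; the Pathway Tools implementation may define the term differently. The function name and toy reactions are hypothetical.

```python
from typing import Dict, List, Set, Tuple

ReactionSides = Tuple[Set[str], Set[str]]  # (consumed metabolites, produced metabolites)


def dead_end_metabolites(reactions: List[ReactionSides]) -> Dict[str, Set[str]]:
    """Split metabolites into those never consumed and those never produced."""
    consumed: Set[str] = set()
    produced: Set[str] = set()
    for left, right in reactions:
        consumed.update(left)
        produced.update(right)
    return {
        "only_produced": produced - consumed,
        "only_consumed": consumed - produced,
    }


# Hypothetical toy network: A -> B, B -> C, D -> B
toy_network = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"D"}, {"B"})]
dead_ends = dead_end_metabolites(toy_network)
```

In this toy network, C is produced but never consumed, and A and D are consumed but never produced. In a real network, nutrients and secreted products would also appear in these sets unless transport reactions are included.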
The genome browser
The genetic elements (replicons) of the organism can be viewed by a dedicated viewer called the Genome Browser that is built into the Pathway Tools software and can be invoked using the web menu bar command Tools → Genome Browser (this tool is available for all PGDBs with the exception of MetaCyc, since the latter does not contain genome information). The user can specify the region of the element to be viewed, using either exact coordinates, or through zooming and lateral translation navigational controls at the upper left. The browser distinguishes between protein-coding genes, RNA genes, and open reading frames and indicates the transcription direction of the genes. Depending on the level of detail, the browser can show additional information. For example, at the “operons” level, the browser depicts the transcription units (a different color indicates whether the transcription unit is based on computational prediction or experimental evidence). At the “genes” level, the browser adds transcription start sites and terminator binding sites. The user can display positional data (such as predicted promoters) by using the tracks feature (see Data Tracks Visualization below).
The overview diagrams integrate information to provide system-level views of molecular machinery in a single diagram. Three such diagrams are available—a cellular overview, a regulatory overview, and a genome overview. Again, these tools are not available for MetaCyc since it does not describe a single organism.
The cellular overview
The diagram can be magnified using the zoom ladder at the upper left. The diagram can be interrogated in many ways, allowing users to highlight sets of data according to their specifications, with the commands under the Cellular Overview menu.
The regulatory overview
The Regulatory Overview diagram enables the user to visually analyze the regulatory relationships between genes (Tools → Regulatory Overview). Each node represents a gene, and arrows represent a regulatory interaction between the product of one gene and the transcription of another. Since regulatory information cannot be predicted computationally in an accurate manner, this functionality is available only in PGDBs where such information has been entered manually.
The genome overview
The Genome Overview diagram describes the full genome in a compact diagram (Tools → Genome Overview).
The Omics viewers
Data tracks visualization
Using the tracks feature (accessible from the Genome Browser), it is possible to display any type of information with numeric values that correspond to positions on a genetic element, such as ChIP-chip data. The data need to be stored in GFF format (for more information about this format, see http://www.sanger.ac.uk/resources/software/gff/). Any number of additional tracks can be added. Once a data track has been added, it is possible to toggle its display on and off by checking the appropriate check box under “External Annotation Tracks”.
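To make the expected input concrete, the following sketch reads positional data from GFF text. It relies only on the standard tab-separated GFF column order (sequence name, source, feature, start, end, score, strand, frame, optional attributes); which columns Pathway Tools actually uses for a track is not stated here, and the sample records and names such as `TrackFeature` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class TrackFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str


def parse_gff(text: str) -> List[TrackFeature]:
    """Parse GFF lines, skipping blank lines and '#' comments."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 tab-separated fields: {line!r}")
        seqname, source, feature, start, end, score, strand = fields[:7]
        features.append(
            TrackFeature(
                seqname, source, feature, int(start), int(end),
                None if score == "." else float(score), strand,
            )
        )
    return features


# Hypothetical two-record track, built with explicit tab separators.
SAMPLE = "\n".join([
    "# hypothetical ChIP-chip track",
    "\t".join(["chr1", "chipchip", "binding_site", "1200", "1250", "3.7", "+", ".", "peak1"]),
    "\t".join(["chr1", "chipchip", "binding_site", "5300", "5340", ".", "-", "."]),
])

track = parse_gff(SAMPLE)
```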
Object groups and enrichment analysis
Experimental protocols often yield a set of genes of interest, but the relationships among these genes are not always clear. This tool was designed to help answer the question “What do groups of genes have in common?”
Enrichment analysis enables users to evaluate over- or under-representation of certain qualities or traits within an object group—for example, determining which genes out of a specified gene group are involved in one or more biological processes. To enable this type of analysis, Pathway Tools includes two functions—a flexible interface that permits the user to define and manipulate object groups, and a statistical analysis engine.
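A minimal sketch of an over-representation test is given below. It assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis; the statistical method used by the Pathway Tools engine is not specified here, and the function name and numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, group: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in a group.

    population: total number of genes considered
    annotated:  genes in the population carrying the trait (e.g., a pathway)
    group:      size of the user's gene group
    hits:       genes in the group carrying the trait
    Assumes sampling without replacement (hypergeometric distribution).
    """
    max_hits = min(annotated, group)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, group - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / comb(population, group)


# Hypothetical example: 20 genes, 5 in the pathway, group of 4 with 3 hits.
p = enrichment_p_value(population=20, annotated=5, group=4, hits=3)
```

For these toy numbers the tail sums to 155 of the 4,845 possible groups, a probability of about 0.032. Under-representation would be tested with the lower tail instead, and in practice a correction for multiple testing is applied when many traits are examined.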
Users can create groups of objects in several ways (see Groups menu in desktop software): they can import objects from text files (e.g., a list of gene names) or omics datasets, they can search the database and convert the search results to a group (e.g., a group of all the lysine biosynthetic pathways), or they can simply type in the names of the objects that they want to include. Once a group has been created, it is extremely easy to automatically transform it to a different type of group. For example, the user can convert a group of pathways to a group of the genes that are involved in these pathways or convert a group of genes to a group of GO terms, to the enzymes encoded by those genes, to the pathways in which those enzymes participate, or to the compounds that are included in those pathways.
It is even possible with a few mouse clicks to combine groups or to filter them and to display the contents of groups on any of the overview diagrams. This last option enables users to instantaneously answer questions, such as “do the genes in my list tend to cluster on the chromosome?” or “do the genes in my list tend to share a regulation scheme?” (by highlighting the genes on the Genome Overview and Regulatory Overview, respectively).
Comparative analyses allow users to compare data and statistics between PGDBs and generate summaries of individual PGDBs. Currently, Pathway Tools supports comparative analysis of reactions, pathways, compounds, proteins, orthologs, transporters, and transcription units. Once users invoke this type of analysis by selecting Tools → Comparative Analysis, they can select the types of objects to be compared and specify the organisms they would like to compare. The system will generate comparison tables that include hyperlinks for most of the included objects. These tools can also be used to generate statistical analysis for a single organism.
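The kind of comparison table described above can be sketched for pathways as a presence/absence matrix. This is an illustration only, assuming each PGDB is reduced to a set of pathway names; the organism labels, pathway names, and function name are hypothetical.

```python
from typing import Dict, List, Set


def comparison_table(pgdbs: Dict[str, Set[str]]) -> List[List[str]]:
    """Build a table with one row per pathway and one column per organism."""
    organisms = sorted(pgdbs)
    all_pathways = sorted(set().union(*pgdbs.values()))
    header = ["pathway"] + organisms
    rows = [
        [pathway] + ["yes" if pathway in pgdbs[org] else "no" for org in organisms]
        for pathway in all_pathways
    ]
    return [header] + rows


# Hypothetical pathway sets for two organisms.
table = comparison_table({
    "organism A": {"pathway 1", "pathway 2"},
    "organism B": {"pathway 2", "pathway 3"},
})
```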
Modes and platforms of the pathway tools software
The Pathway Tools software can run either as a desktop application or as a web server. Examples of the latter are the BioCyc.org web site and many other web sites around the world, where users can access PGDBs via the Internet. However, users can also download and install Pathway Tools on their own computers (Windows, Macintosh, and Linux versions are available) and run it in desktop mode.
Although the software shares most of its functionality between these two modes of operation, some functionality is available in only one mode or the other. Thus, installing Pathway Tools locally provides access to operations that are not available through the BioCyc.org web site. In addition, local installation is likely to speed up many operations because it eliminates network delays and the sharing of computer resources with other users. The main options that are available only via the desktop mode are PathoLogic (enabling the creation of new PGDBs) and editing of PGDBs. In addition, a number of operations, including some Omics analysis tools, are available only in desktop mode. On the other hand, some comparative genomics tools, the Advanced Query pages, and BLAST searching are available only via the web server. Future versions of Pathway Tools will include more functionality in web server mode. A full comparison of current differences can be found at http://biocyc.org/desktop-vs-web-mode.shtml.
How to learn more about the MetaCyc family
Publications describing new releases of MetaCyc, BioCyc, and EcoCyc occur every other year in the Nucleic Acids Research database issue (Caspi et al. 2010; Keseler et al. 2009). In addition, the web-based guides for MetaCyc, BioCyc, and EcoCyc provide more detailed information about the databases (Guide to the MetaCyc Database; Guide to the BioCyc database collection; Guide to the EcoCyc Database, http://biocyc.org/ecocyc/EcoCycUserGuide.shtml). A survey of Pathway Tools capabilities is available (Karp et al. 2010), as is a guide to using the Pathway Tools-based websites (How to Use a Pathway Tools Website, http://biocyc.org/PToolsWebsiteHowto.shtml). Tutorial videos on how to use the MetaCyc family of databases and Pathway Tools-based web sites can be downloaded from our web site (BioCyc webinars, http://biocyc.org/webinar.shtml).
Recent years have seen a dramatic increase in the number of publicly available metabolic databases, from a handful in the mid-1990s to several thousand today. Many of these databases are generated automatically via software pipelines, resulting in distinct families of databases. The main families are MetaCyc, KEGG, Model SEED, Reactome, and BiGG. The different database families offer different functionalities—for example, the ability to annotate or curate the databases, the availability of different query and analysis options, or the ability to create new custom databases. In this review, we focused on the MetaCyc family, which contains well over one thousand databases, including highly curated model organism databases for Escherichia coli, Saccharomyces cerevisiae, Mus musculus, and Arabidopsis thaliana. Many of the databases in the MetaCyc family were created by scientists, rather than by the SRI developers of the software. The Pathway Tools software, which supports these databases, offers a range of software tools for querying and visualizing metabolic networks, as well as for analysis of gene expression and metabolomics data, including visualization of those datasets on metabolic network diagrams, and analysis of over-representation of gene and metabolite sets. The MetaCyc family databases are available through a number of web sites around the world, including a collection of more than 1,000 databases at BioCyc.org.
The projects described were supported by award numbers GM75742 and GM080746 from the National Institute of General Medical Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.