Background

A central goal of integrative systems biology is the accurate representation of molecular interaction networks. Ultimately, such networks can underpin mathematical models, consisting of stochastic or ordinary differential equations, that permit the simulation of biological behaviour. The first step in generating such models is constructing a network of biochemical reactions and interactions between molecular components of the system to form a qualitative (unparameterised) model. Several groups have reconstructed the metabolic network of baker's yeast from genomic and literature data [1-3]. Variation in the approaches used, and contradictory interpretations of the available literature, mean that most reconstructions differ considerably. To resolve these problems, a cohort of the yeast systems biology community collaborated to create a consensus reconstruction. In April 2007, a large focused meeting brought together experts from various groups and disciplines to resolve discrepancies between the reactions and metabolites described by the available reconstructions and to form a consensus. The resultant reconstruction [4], subsequently referred to as "Yeast 1.0", removed the ambiguities inherent in its predecessors through the use of principled and computer-readable annotations. Whilst previous reconstructions had defined entities using subjective names, which lacked precision, Yeast 1.0 directly referenced chemical and protein descriptions in persistent databases or used standardised, database-independent, computer-readable representations. This allowed the new reconstruction to be used effectively as the basis for automated analyses.

A limitation of Yeast 1.0 came about through the very generation of the consensus; the network became considerably fragmented as reactions that could not be readily annotated (due to the presence of structural ambiguities) were removed. This led to underrepresentation of a number of pathways, particularly those involved in lipid biosynthesis. Since Yeast 1.0, many improvements have been made to the reconstruction. The latest release, described here, is considerably larger (in terms of numbers of metabolites and reactions), of higher quality (by reference to literature evidence), exhibits greater coverage of known metabolic enzymes, and is better connected than all previous efforts.

The reconstruction is described and made available in Systems Biology Markup Language (SBML) [5], an established community XML format for the mark-up of biochemical models. With the introduction of SBML Level 2, specific model entities, such as species or reactions, can be annotated using ontological terms. These annotations, encoded using the Resource Description Framework (RDF) [6], provide the facility to assign definitive terms to individual components, allowing software to identify such components unambiguously and thus link model components to existing data resources [7]. Annotations compliant with Minimum Information Requested in the Annotation of Models (MIRIAM) [8] have been used to identify components unambiguously by associating them with one or more terms from publicly available databases registered in MIRIAM Resources [9]. An example of such an annotation is presented in Figure 1, where an enzyme is identified by MIRIAM-compliant references to the UniProt [10], SGD [11], and PubMed [12] databases. Metabolites are annotated with reference to the ChEBI (Chemical Entities of Biological Interest) database [13]. Whilst SBML is the primary format for dissemination of the reconstruction, we also make the reconstruction available in an online database [14], B-Net, that enables easy searching of the content. B-Net [15] is able to represent all of the SBML features utilised in the current reconstruction. Searches can be performed using synonyms, and the user is also able to navigate through the network from any point (e.g. a metabolite, reaction or enzyme) to its connected neighbours. Query results can also be exported in SBML, which is an effective mechanism for extracting subsets of the entire model in this exchange format.

Figure 1

SBML example. Simplified example of MIRIAM-compliant SBML, whereby an enzyme is annotated with reference to the databases UniProt, SGD and PubMed, respectively.
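As an illustration of the kind of annotation the caption describes, the following is a minimal sketch, not the reconstruction's actual mark-up. It embeds a simplified SBML species element carrying RDF annotations and shows how software might read the MIRIAM references back out. The element identifiers, the compartment name and all three database accessions are placeholders invented for this example, and the choice of the `bqbiol:is` and `bqbiol:isDescribedBy` qualifiers is an assumption about how such references would typically be expressed.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Simplified, hypothetical SBML fragment. All ids and accessions are
# placeholders, not entries taken from the reconstruction.
SPECIES_XML = """\
<species xmlns="http://www.sbml.org/sbml/level2/version4"
         metaid="meta_e_0001" id="e_0001" name="example enzyme"
         compartment="c_01">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
      <rdf:Description rdf:about="#meta_e_0001">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:uniprot:P00000"/>
            <rdf:li rdf:resource="urn:miriam:sgd:S000000000"/>
          </rdf:Bag>
        </bqbiol:is>
        <bqbiol:isDescribedBy>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:pubmed:00000000"/>
          </rdf:Bag>
        </bqbiol:isDescribedBy>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
"""


def miriam_resources(xml_text):
    """Return (qualifier, resource URI) pairs found in the RDF annotation."""
    root = ET.fromstring(xml_text)
    pairs = []
    for description in root.iter("{%s}Description" % RDF_NS):
        for qualifier in description:
            # Strip the namespace part of the tag, e.g. "{...}is" -> "is".
            qualifier_name = qualifier.tag.split("}", 1)[1]
            for item in qualifier.iter("{%s}li" % RDF_NS):
                pairs.append((qualifier_name, item.get("{%s}resource" % RDF_NS)))
    return pairs


for qualifier_name, uri in miriam_resources(SPECIES_XML):
    print(qualifier_name, uri)
```

Because each reference is a database-qualified identifier rather than a free-text name, a program can resolve the enzyme without having to interpret its label.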

Results and Discussion

Improvements in the representation of yeast metabolism in this release as compared to Yeast 1.0 primarily consist of its enhanced representation of lipid metabolism and greater connectivity, thereby permitting constraint-based flux analyses. Many of the extensions to Yeast 1.0 are reactions garnered from the literature, which are entirely novel to any genome-wide yeast metabolic reconstruction. Data were also incorporated, when backed up by traceable evidence, from two other reconstructions: iMM904 [16] and iIN800 [17]. The resulting consensus network (reported in Additional File 1) consists, in decompartmentalised form, of 1102 metabolic reactions involving 924 metabolites and 924 proteins (Table 1) and is therewith larger in scope than any previous reconstruction.

Table 1 Reconstruction scope

Careful curation does not simply involve increasing the scope of the reconstruction. Indeed, 32 enzymes from Yeast 1.0 were considered insufficiently evidenced and have been removed, whilst a number of metabolites were relocalised to a different compartment. A typical example of an enzyme removed from the reconstruction is Gpm2p; whilst a homologue of Gpm1p, its phosphoglycerate mutase activity could not be evidenced and may be non-functional [18]. Four reconstructions are compared in Figure 2 in terms of enzymes present. In addition to the 32 enzymes removed, the reactions of a further 37 enzymes from iMM904 and iIN800 have not been added for lack of supporting evidence. In total, the new reconstruction considers 124 more enzymes than its predecessor, with half of these (61) being retrieved manually from the literature and therefore new to all reconstructions.

Figure 2

Comparison of reconstructions in terms of enzymes present. The reconstruction presented here contains 124 more enzymes than Yeast 1.0, 61 of which have not been considered by any of the other reconstructions. Yeast 1.0 was also improved upon through better curation leading to the removal of (2 + 9 + 21 =) 32 enzymes. A further (6 + 13 + 18 =) 37 enzymes from iMM904 and iIN800 were not added to the reconstruction.

Lipid metabolism

The correct and complete representation of lipid metabolism is important, not only to meet the ultimate goal of genome-scale coverage, but also because understanding and engineering lipid metabolism through systems and synthetic biology is likely to play a major role in the replacement of fossil energy sources and chemical feedstocks with biofuels and bioplastics [19]. In Yeast 1.0, lipid metabolism was poorly captured. To move towards a better representation, the literature, database annotations and homology relationships were used to identify the set of lipid-related yeast enzymes. Homology with mouse and human enzymes reported in LipidMaps [20], and with enzymes from all organisms reported in KEGG lipid pathways [21], indicated lipid enzymes in yeast (homology relationships predefined by Ensembl [22]). Further enzymes were added to the set manually by examination of SGD and Ensembl annotations. A total of 268 yeast enzymes were identified as likely to be part of lipid metabolism. Although the boundaries of this set are unavoidably subjective, it appears to capture the majority of lipid-related genes in yeast.

With reference to this set of lipid enzymes, the iIN800 reconstruction of Nookaew et al. improved upon the original community reconstruction (Yeast 1.0) by increasing set coverage from 48% to 62% (with at least one reaction being associated with each enzyme). In the present release set coverage has further improved to 87%. Coverage of the lipid enzyme set by the various reconstructions is summarised in Figure 3. From iIN800 and iMM904, 56 lipid enzymes were added to Yeast 1.0, while three enzymes from these sources were not added. The current reconstruction describes activities for 49 enzymes that no other reconstruction has ever considered. Combining these, the reconstruction extends the Yeast 1.0 description of lipid metabolism by a total of 105 new enzymes, extends iMM904 by 59 enzymes, and iIN800 by 70 enzymes. This is by far the most comprehensive reconstruction of yeast lipid metabolism to date.

Figure 3

Comparison of the coverage of lipid metabolism enzymes by the different reconstructions. An enzyme is counted as covered by a reconstruction if it catalyses at least one reaction in that reconstruction. On top of extending Yeast 1.0 by (1 + 9 + 46 =) 56 enzymes from iMM904 and iIN800, a further 49 enzymes uniquely appear in this latest reconstruction. Three enzymes common to iMM904 and iIN800, plus 31 others, have not been incorporated for lack of evidence.

The 34 remaining lipid enzymes from the set (in Figure 3 these are the 31 not found in any reconstruction, plus the three found in both iMM904 and iIN800) are either too poorly characterised functionally to be included or cannot be represented within the current description of the cell's compartmentalisation. Flippases, for example, require a more detailed description of membrane faces to capture their role in membrane asymmetry. Improving compartmental representation will be a goal for future releases.
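The coverage figures quoted in this section can be cross-checked against one another with simple arithmetic. The sketch below does this under two assumptions that are not stated explicitly in the text: that every lipid enzyme covered by Yeast 1.0 is retained in the present release, and that the three enzymes shared by iMM904 and iIN800 but not incorporated here still count towards iIN800's own coverage. The variable names are illustrative only.

```python
LIPID_SET_SIZE = 268   # enzymes identified as likely lipid-related
NOT_COVERED = 34       # lipid enzymes with no reaction in this release

# Enzymes from the lipid set covered by the present release.
current = LIPID_SET_SIZE - NOT_COVERED


def percent(count):
    """Coverage of the lipid set, rounded to the nearest whole percent."""
    return round(100 * count / LIPID_SET_SIZE)


# The release extends Yeast 1.0 by 56 + 49 = 105 enzymes
# (assumes no Yeast 1.0 lipid enzyme was dropped).
yeast_1_0 = current - (56 + 49)

# The release extends iIN800 by 70 enzymes; iIN800 additionally holds the
# three enzymes that were not incorporated (assumption, see lead-in).
iin800 = current - 70 + 3

print("present release:", current, "enzymes,", percent(current), "%")
print("Yeast 1.0:      ", yeast_1_0, "enzymes,", percent(yeast_1_0), "%")
print("iIN800:         ", iin800, "enzymes,", percent(iin800), "%")
```

Under these assumptions the counts reproduce the reported 87%, 48% and 62% coverage values.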

Connectivity

Structural improvement, achieved by identifying and rectifying unconnected regions of the network, was a major focus of the advancements made to the reconstruction. Two measures were used to describe connectivity. First, we identified clusters of unreachable metabolites; that is, clusters of metabolites that are disconnected from the extracellular medium, in a graph-theoretic sense, and thus cannot ever be produced by the reaction network. Second, we used flux variability analysis [23] to identify reactions that, by mass balancing, must have zero flux, for example because of dead-end metabolites (products that are not the substrates of another reaction). Led by these analyses, which are explained graphically in Figure 4, we looked for literature evidence describing these missing elements of our network. By targeting unreachable clusters and those reactions whose reconnection has the most influence on the network's connectivity, we maximised the impact of literature curation on modelling. By both measures, the present release improves upon the previous release and, in particular, upon iMM904 and iIN800 (Table 2). More than 90% of metabolites can be reached from the extracellular medium and only 12.7% of reactions must have zero flux.
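The two connectivity measures can be illustrated on a toy network. The following is a minimal sketch rather than the analysis actually performed: the network and its metabolite names are invented, reactions are treated as irreversible directed edges from substrates to products, and the zero-flux criterion is reduced to the dead-end case mentioned above instead of a full flux variability analysis, which would require a linear programming solver.

```python
from collections import deque

# Hypothetical toy network: reaction id -> (substrates, products).
REACTIONS = {
    "uptake": (["glc_ext"], ["glc"]),
    "r1": (["glc"], ["g6p"]),
    "r2": (["g6p"], ["f6p"]),   # f6p is never consumed: a dead end
    "r3": (["x"], ["y"]),       # x and y form a cluster cut off
    "r4": (["y"], ["x"]),       # from the extracellular medium
}


def all_metabolites(reactions):
    """Every metabolite appearing as a substrate or product."""
    names = set()
    for substrates, products in reactions.values():
        names.update(substrates)
        names.update(products)
    return names


def reachable_metabolites(reactions, seeds):
    """Breadth-first search along substrate -> product edges from the seeds."""
    successors = {}
    for substrates, products in reactions.values():
        for substrate in substrates:
            successors.setdefault(substrate, set()).update(products)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        metabolite = queue.popleft()
        for nxt in successors.get(metabolite, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def dead_end_products(reactions):
    """Metabolites that are produced but never used as a substrate."""
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed.update(substrates)
        produced.update(products)
    return produced - consumed


unreachable = all_metabolites(REACTIONS) - reachable_metabolites(REACTIONS, ["glc_ext"])
print("unreachable:", sorted(unreachable))
print("dead ends:  ", sorted(dead_end_products(REACTIONS)))
```

In this toy case the `x`/`y` cluster is unreachable from the medium, and `f6p` is a dead end, so at steady state the reaction producing it could carry no flux.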

Figure 4

Visualisation of connectivity analysis. Metabolites that are unreachable (in red) were identified by a graph-theoretic analysis that locates metabolites disconnected from the extracellular medium. Flux variability analysis identified reactions that must have zero flux (in blue) because they lead to dead-end metabolites.

Table 2 Network connectivity

Our approach towards structural improvement is also an example of the iterative "cycle of knowledge" approach [24], where the model is first used to guide biological research and can subsequently be updated and improved as specific new knowledge becomes available. In this case the iteration consisted of discovery and collation of experimental evidence previously obtained but which had never been identified in this context. Such discovery of knowledge was informed by the previous models and was unlikely to have happened in their absence.

Constraint-based analysis

New reconstructions are often validated through constraint-based approaches like Flux Balance Analysis (FBA) [25] to assess their ability to predict experimental results. While there is clear utility in deploying such methods to explore biochemical capacity, using improved agreement with experimental observations to determine whether the reconstruction is, in some sense, 'better' than previous efforts is potentially misleading. In the current release, non-inferred reactions are supported by evidence from the literature and it is in this sense that the reconstruction is validated and improved. That said, the updates improved the connectivity considerably and, together with the inclusion of a reaction describing biomass composition, now allow FBA to be performed. The availability of the model in SBML means that it is accessible through many generic and systems-biology-specific software packages, including the COBRA (COnstraint-Based Reconstruction and Analysis) toolbox [26].

The model was used to predict single-gene knockout viability through FBA. Growth conditions exactly followed those set out in iMM904, namely a glucose-limited minimal medium. Cellular biomass was defined as in iIN800 (carbon-limited version), due to its high level of detail regarding lipid composition. As the reaction producing biomass does not represent a real metabolic process, it is semantically annotated as such using SBO (Systems Biology Ontology) [27] identifiers and GO (Gene Ontology) [28] evidence codes to ensure this distinction is maintained (therefore allowing one to easily remove this reaction based on its annotation). Simulations were performed using COBRA (which is reliant on libSBML [29] and the GNU linear programming kit [30]). The simulation predictions were compared to a list of lethal gene knockouts. This list was generated by considering results from viability experiments under both rich [31] and glucose minimal growth medium conditions [32]. Results demonstrate similar performance to that of previous reconstructions in terms of the accuracy of prediction of single-gene knockout viability (Table 3).

Table 3 Gene knockout analysis

Closer inspection of predictions reveals that relatively subtle network variations often underlie prediction differences. Four experimentally lethal knockouts were not initially predicted as such by the new reconstruction, but are correctly predicted using iMM904. Three of these genes encode enzymes that are essential to riboflavin biosynthesis. The capacity of iMM904 to predict lethality correctly is due to its biomass definition including a small contribution from riboflavin, whereas this was not part of the initial iIN800 or current network's biomass definition. Subsequent addition of riboflavin to the (empirical) biomass description has resolved these differences. Note that this is not therefore a reflection of the quality of the underlying network but only of the empirical biomass estimation, which is itself dependent on the growth conditions.

In places, the added richness of the new reconstruction combines with certain known limitations to defeat total agreement with experiment. An example is seen by knocking out the acs2 gene, encoding acetyl-CoA synthetase (Acs2p). By experiment this should be lethal, yet in the current network the cytoplasmic reaction is also catalysed by Acs1p, consistent with experimental data [33]. When the Acs2p-catalysed reaction is eliminated, flux simply re-routes through the Acs1p reaction. Importantly, it is only the fortuitous incompleteness of iMM904, which lacks the cytosolic Acs1 isozyme, that reveals the inviability of the acs2 knockout. The proper basis of the inviability of the acs2 mutant is that ACS1 is transcriptionally repressed in the high glucose conditions of viability experiments and so is unable to compensate for the loss of ACS2 [34]. Transcriptional control is not captured in the metabolic network and thus cannot be captured in metabolic reconstructions of this type.
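The isozyme effect described above can be sketched with a simple gene-to-reaction association, in which a reaction remains available as long as at least one gene encoding a catalysing enzyme is still present. This is an illustration only: the reaction identifier is hypothetical, the rule is a plain OR over isozymes rather than the encoding used in the SBML file, and the `repressed` argument is an addition made here to show the effect of transcriptional repression, which the reconstruction itself does not represent.

```python
# Hypothetical association: reaction id -> genes whose products each
# catalyse the reaction independently (isozymes, combined with OR).
ISOZYMES = {
    "cytoplasmic_acetyl_coa_synthetase": {"ACS1", "ACS2"},
}


def active_reactions(isozymes, knocked_out, repressed=frozenset()):
    """Reactions that keep at least one available catalysing gene."""
    unavailable = set(knocked_out) | set(repressed)
    return {
        reaction
        for reaction, genes in isozymes.items()
        if genes - unavailable   # non-empty set: some isozyme remains
    }


# Purely metabolic view: ACS1 compensates for the loss of ACS2.
print(active_reactions(ISOZYMES, knocked_out={"ACS2"}))

# With ACS1 treated as repressed (high glucose), nothing compensates.
print(active_reactions(ISOZYMES, knocked_out={"ACS2"}, repressed={"ACS1"}))
```

The first call leaves the reaction active, which corresponds to the viable prediction; only when regulatory information is supplied from outside the network does the knockout remove the reaction.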

Both these examples highlight the caution required when using approaches such as FBA to validate reconstructions. The added detail in the present network can naturally lead to an increase in false positive outcomes: in silico knockouts that are overcome by alternative routings in the network but are actually lethal in vivo. This is, however, tempered by a decrease in false negative outcomes (i.e. knockouts that appear lethal computationally but are viable in vivo, as presented in Table 3).
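The terminology used here treats "viable" as the positive class. A minimal sketch of how predictions might be tallied against experiment under that convention follows; the gene labels and outcomes are invented for illustration and are not taken from Table 3.

```python
from collections import Counter


def classify(predicted_viable, observed_viable):
    """Label one knockout, with 'viable' as the positive class."""
    if predicted_viable and observed_viable:
        return "true positive"
    if predicted_viable and not observed_viable:
        return "false positive"   # predicted viable, lethal in vivo
    if not predicted_viable and observed_viable:
        return "false negative"   # predicted lethal, viable in vivo
    return "true negative"


def tally(predictions, observations):
    """Count outcome labels over all genes present in the predictions."""
    return Counter(
        classify(predictions[gene], observations[gene]) for gene in predictions
    )


# Hypothetical data: gene -> True if the knockout strain is viable.
PREDICTED = {"geneA": True, "geneB": True, "geneC": False, "geneD": False}
OBSERVED = {"geneA": True, "geneB": False, "geneC": True, "geneD": False}

for label, count in sorted(tally(PREDICTED, OBSERVED).items()):
    print(label, count)
```

An alternative route through the network, such as an added isozyme, moves a knockout from the "true negative" cell to the "false positive" cell when the alternative does not operate in vivo.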

Uncharacterised enzymes

Despite the much-increased coverage of the current reconstruction, 451 genes that probably encode metabolic enzymes still have no associated reaction (Additional File 2). For the majority of these, very little is known about their function and further characterisation is required. From the viewpoint of furthering systems biology reconstruction efforts, these enzymes are important targets for reductionist molecular biology studies, including, for instance, systematic analyses using the Robot Scientist approach [35]. Their listing here is a motivation for further iterations on the cycle of knowledge.

Conclusions

The development of high-quality, well-annotated, genome-scale metabolic networks is an ambitious, challenging, but necessary step towards the realisation of integrative systems biology. While networks predicted through bioinformatics approaches are useful, particularly for the extension of systems biology approaches to less well-studied organisms, reconstructions built upon solid biochemical evidence provide a gold standard upon which predictions can be reliably based. For metabolic reconstructions, where the goal is to capture maximally our current understanding of metabolism, the problems are primarily those of data integration and quality. It has proven essential to involve the extended systems biology and yeast communities in this process, both to establish the mechanisms and structures for acquiring and representing information, and also to tap into expert knowledge from the various sub-disciplines of biology and biochemistry. In the recent very large-scale reconstruction of the yeast molecular interaction network by Aho et al. [36], genomic, transcriptomic, proteomic and metabolomic data were integrated. These authors note that incorporating the higher quality data of Yeast 1.0 (and therefore even more of this contribution) would considerably improve their reconstruction over the metabolic information extracted from KEGG, and also that standards compliance is essential to this integration task.

Yeast 1.0 set standards and amalgamated existing networks, enhancing annotation and removing less reliable data. In this latest reconstruction, we have made significant headway on the process of filling gaps in the network. There is still some way to go before realising the goal of at least one reaction for each putative metabolic enzyme and, if one also considers enzyme promiscuity [37, 38], even this will represent an incomplete picture of metabolism. This latest reconstruction is a considerable improvement on previous releases, particularly in describing lipid metabolism and addressing gaps in the original reconstruction that hindered modelling efforts. Information from other reconstructions since Yeast 1.0 has been incorporated, although not indiscriminately, and very many reactions not found in other reconstructions have been garnered from the literature. It is considerably larger than all previous efforts, while maintaining compliance with community-defined standards.

While Yeast 1.0 represented a major advance, particularly through the definition of standards and by the involvement of the wider yeast community, a major flaw was that it was not amenable to constraint-based analysis. The current reconstruction rectifies this, mostly by filling in gaps but also by inclusion of an appropriately annotated "biomass" reaction, without compromising the strict evidence requirements of its predecessor. When compared to experimental knockout data, this reconstruction did not identify certain lethal knockouts that other yeast reconstructions correctly predicted, but proved better than them at recognising viable deletions. This is a direct result of the richness of the model; as with the example of the acetyl-CoA synthetases (above), addition of isoenzymes of specific reactions that do not exist in earlier reconstructions can reduce the predictive power of the model. Nonetheless, such enzymes are included due to literature support. This reconstruction continues the shift in focus, started with the consensus model Yeast 1.0, toward realistic representation and evidence-based selection of reactions, rather than creating a reconstruction with simulation in mind. Reactions with a lower level of confidence (e.g. biomass definition) are characterised with specialised evidence codes and SBO terms, allowing the easy extraction of subsets of the network from the SBML code for specific purposes.

To facilitate further improvements, we encourage the community to provide information and/or corrections to the current release. We have set up a dedicated point of contact to this end: network.reconstruction@manchester.ac.uk. We also highlight gaps in the network that cannot be resolved from current literature, as well as the little-studied enzymes for which we have not yet identified any function (see Additional File 2). These represent potentially important research opportunities for the community and we welcome efforts towards an improved understanding of their functions.