Metabolic network discovery through reverse engineering of metabolome data
Reverse engineering of high-throughput omics data to infer underlying biological networks is one of the challenges in systems biology. However, applications in the field of metabolomics are rather limited. We have focused on a systematic analysis of metabolic network inference from in silico metabolome data based on statistical similarity measures. Three different data types based on biological/environmental variability around steady state were analyzed to compare the relative information content of the data types for inferring the network. Comparing the inference power of different similarity scores indicated the clear superiority of conditioning or pruning based scores as they have the ability to eliminate indirect interactions. We also show that a mathematical measure based on the Fisher information matrix gives clues on the information quality of different data types to better represent the underlying metabolic network topology. Results on several datasets of increasing complexity consistently show that metabolic variations observed at steady state, the simplest experimental analysis, are already informative to reveal the connectivity of the underlying metabolic network with a low false-positive rate when proper similarity-score approaches are employed. For experimental situations this implies that a single organism under slightly varying conditions may already generate more than enough information to rightly infer networks. Detailed examination of the strengths of interactions of the underlying metabolic networks demonstrates that the edges that cannot be captured by similarity scores mainly belong to metabolites connected with weak interaction strength.
Keywords: Network inference; Interaction strength; Metabolome modeling; Indirect interactions; Biological/environmental variability; Similarity scores
The cell’s phenotype emerges from the coordinated behaviour of a web of interactions among its genes, proteins and metabolites. This implies a close relationship between the structure of interaction networks and functionality (Futschik et al. 2007; Stelling et al. 2002; Vazquez et al. 2003). Therefore, one of the challenges in systems biology is to infer cellular networks from data collected through high-throughput techniques. The so-called ‘omics’ data hold information on the network from which they are derived. A proper analysis of such data, therefore, can reveal the structural properties of the network in question, enabling discovery of direct interactions among the measured transcripts, proteins or metabolites. In this regard, network inference is a step towards elucidating functional properties in cellular systems since the network structure is the backbone behind normal as well as abnormal phenotypic states such as disease, malfunctioning, and overproduction.
Network inference approaches are highly popular in transcriptomics to infer genetic regulatory networks (Bansal et al. 2007; Soranzo et al. 2007). In this study, we focus on a relatively untouched area with the overall goal of inferring metabolic networks from metabolome data. The reverse engineering approach employed is a top–down approach to network reconstruction. In the widely used bottom–up approach, the metabolic network topology is compiled from the literature and is later used as a scaffold in analyzing omics data (Gonzalez et al. 2008; Notebaart et al. 2006; Price et al. 2004; Rahnenführer et al. 2004), leading to the construction of genome-scale metabolic models. Such bottom–up models are mainly limited to stoichiometric interactions between metabolites, ignoring the interactions due to regulatory mechanisms such as inhibition or activation. Additionally, stoichiometric interactions in such models are not complete, as revealed by the presence of a considerable percentage of totally inactive ‘dead-end’ metabolites (Förster et al. 2003). These facts are the main reason why the reconstructed bottom–up models lead to some erroneous predictions of phenotypic states (Förster et al. 2003). The top–down approach, on the other hand, does not have these limitations provided that the collected data capture the variation in all pathways.
Two major issues in the top–down approach are the type of experiment or perturbation to be performed and the type of network inference method to be used. Biological data collected in different ways (steady-state or dynamic experiments, under genetic or environmental perturbations) differ in the information content they carry about the underlying network (Soranzo et al. 2007). Some researchers focus on methods that require complicated experimental design such as perturbation of each node in the system separately (Sontag et al. 2004; de la Fuente et al. 2002). In contrast, we concentrate on analyzing the information content of observational metabolome data based on emerging biological or environmental variations around steady state without any sophisticated targeted design. Thereby, we aim to answer the question of whether natural variation observed around steady state, which is the simplest experimental analysis, is informative enough to reveal the connectivity of the underlying metabolic network. Various reverse engineering methods for omics data exist in the literature (Bansal et al. 2007; Markowetz and Spang 2007). We choose statistical similarity measures as a network inference tool since they are widely employed (Margolin et al. 2006; Nemenman et al. 2007; Soranzo et al. 2007; Werhli et al. 2006; de la Fuente et al. 2004), and they best suit analysis of steady-state data. Moreover, a detailed application of similarity measures to metabolome data is missing in the literature, unlike their popular usage in transcriptome data-based genetic network inference attempts (Markowetz and Spang 2007; Soranzo et al. 2007). The number of applications in metabolomics so far has been limited (Nemenman et al. 2007), with no detailed comparative investigation of non-linear measures or of conditioning and pruning approaches, which eliminate indirect interactions.
A good starting point for metabolic network inference is the use of in silico generated metabolome data from kinetic metabolic models available in the literature (Mendes et al. 2003). This approach makes it easier to draw conclusions on the quality of data and the perturbations needed for metabolic network inference in real systems, and it enables quick testing of inference quality. Kinetic models of three example systems (threonine synthesis pathway of E. coli consisting of 4 metabolites, S. cerevisiae glycolysis pathway with 13 metabolites, and E. coli central metabolism pathway with 18 metabolites) were used in this study for in silico data generation. We test the effect of the following parameters on network inference: (a) different types of (natural) variability, (b) different types of similarity measures and (c) elimination of indirect interactions through conditioning and/or pruning.
2 Materials and methods
2.1 In silico data generation
Kinetic details of models describing the studied systems were taken from JWS Online (Olivier and Snoep 2004). The systems were solved either using MATLAB’s built-in ordinary differential equation (ODE) solver ode15s for the enzymatic variability data or using the Milstein method of the stochastic differential equation (SDE) Toolbox (Picchini 2007) for the intrinsic variability and the environmental variability data (see next subsection for details of the data types). A thousand steady-state data points were collected from independent runs for each variability type analyzed. Initial values of concentrations were kept the same among different independent runs since they were found to have no effect on steady-state concentrations. For stochastic simulations, a real steady state is not possible due to fluctuating profiles, and data was collected after a few seconds of simulations starting from different near-steady-state concentrations to assure that the fluctuations were stabilized.
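To make the stochastic data-generation step concrete, the following is a minimal Python sketch of the Milstein scheme applied to a one-metabolite toy system. It is not the SDE Toolbox implementation and not one of the kinetic models used in the study: the model form dX = (v_in - k*X) dt + sigma*X dW, the parameter names, and the helper `sample_near_steady_state` are all assumptions chosen only for illustration.

```python
import math
import random


def milstein_path(x0, v_in, k, sigma, dt, n_steps, rng):
    """Integrate dX = (v_in - k*X) dt + sigma*X dW with the Milstein scheme.

    Hypothetical toy model: constant influx v_in, first-order consumption k*X,
    multiplicative noise sigma*X. Returns the concentration after n_steps.
    """
    x = x0
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_steps):
        dw = rng.gauss(0.0, 1.0) * sqrt_dt        # Wiener increment
        drift = (v_in - k * x) * dt               # deterministic part
        diffusion = sigma * x * dw                # Euler-Maruyama noise term
        # Milstein correction 0.5*b(x)*b'(x)*(dW^2 - dt) with b(x) = sigma*x
        correction = 0.5 * sigma * sigma * x * (dw * dw - dt)
        x = x + drift + diffusion + correction
    return x


def sample_near_steady_state(n_samples, seed=0, **model):
    """Collect one end-of-run value from each of n_samples independent runs."""
    rng = random.Random(seed)
    return [milstein_path(rng=rng, **model) for _ in range(n_samples)]


if __name__ == "__main__":
    data = sample_near_steady_state(
        1000, seed=1, x0=2.0, v_in=2.0, k=1.0, sigma=0.05, dt=0.01, n_steps=500
    )
    print(min(data), max(data))
```

With `sigma=0` the scheme reduces to explicit Euler and the value relaxes to the deterministic steady state v_in/k, which is a convenient sanity check before noise is switched on.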
2.2 Biological/environmental variability
Metabolomics experiments conducted under identical conditions with the same genetic background do not necessarily lead to identical results (Fiehn et al. 2000; Martins et al. 2004). This has been attributed to natural variability inherent to living organisms originating from systems properties, leading to consistent correlation patterns among metabolites (Steuer 2006). In this study, we focus on three common factors causing this variability. Our goal is to compare the relative information content of each of these factors in revealing true systemic interactions.
2.2.1 Enzymatic variability
2.2.2 Intrinsic variability
2.2.3 Environmental variability
2.3 Similarity measures
2.3.1 Relevance networks
2.3.2 Conditioned networks
2.3.3 Pruned networks
2.4 Significance measure of similarity scores
A distribution-free test, the permutation test, was applied to the collected in silico data to assign a P-value to each possible edge by shuffling the data 5,000 times. A P-value cut-off of 0.01 was used to select edges with significant similarity scores. These selected edges are combined to give the connectivity pattern of the inferred network, which can then be compared with the actual metabolic network derived from the in silico model. The actual metabolic interaction network is constructed from the ODE balances around metabolites, which show whether the level of one metabolite is influenced by the levels of others (calculating the Jacobian matrix of the system gives the same information quantitatively, see Sect. 2.5). In this way, not only the intuitive substrate-product interactions are counted, but also the influences between substrates of the reactions are covered. This corresponds to the substrate-graph representation of metabolic networks in graph theoretical analyses (Wagner and Fell 2001).
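The permutation test described above can be sketched in Python as follows. This is a minimal illustration that uses the absolute Pearson correlation as the similarity score; the function names, the two-sided comparison, and the add-one correction in the P-value estimate are assumptions of this sketch and are not taken from the study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Estimate how often shuffled data reach the observed |correlation|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                     # break the x-y pairing
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction (an assumption here) avoids reporting exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    a = [float(i) for i in range(30)]
    b = [2.0 * v + 1.0 for v in a]                # perfectly dependent pair
    p = permutation_p_value(a, b, n_perm=500)
    print("edge kept" if p < 0.01 else "edge dropped", p)
```

An edge would then be retained when its estimated P-value falls below the chosen cut-off (0.01 in the text).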
2.5 Effect of strength of interactions on network inference
To calculate the Jacobian strength of an interaction, (a) we have calculated the Jacobian matrix of the system at steady state based on Eq. 6 and (b) we have assigned the absolute maximum of upper- and lower-diagonal entries as the Jacobian strength of each metabolite pair since the two entries may differ depending on the reversibility of interactions.
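As a small illustration of step (b), the sketch below takes an already computed steady-state Jacobian and assigns each metabolite pair the larger absolute value of its two off-diagonal entries. The example matrix is hypothetical, and the helper `weak_pairs` (which lists existing interactions below a threshold) is an assumption added for convenience.

```python
def jacobian_strengths(jac):
    """Map each metabolite pair (i, j), i < j, to max(|J_ij|, |J_ji|)."""
    n = len(jac)
    strengths = {}
    for i in range(n):
        for j in range(i + 1, n):
            strengths[(i, j)] = max(abs(jac[i][j]), abs(jac[j][i]))
    return strengths


def weak_pairs(strengths, threshold=1.0):
    """Pairs that interact (strength > 0) but fall below the threshold."""
    return sorted(pair for pair, s in strengths.items() if 0.0 < s < threshold)


if __name__ == "__main__":
    # Hypothetical 3-metabolite Jacobian, not taken from any model in the text.
    jac = [
        [-1.0, 0.5, 0.0],
        [2.0, -3.0, 0.0],
        [0.0, 0.001, -1.0],
    ]
    s = jacobian_strengths(jac)
    print(s)
    print(weak_pairs(s))
```

Taking the maximum rather than, say, the mean means that an interaction that is strong in only one direction still counts as strong, which matches the stated reason that the two entries may differ depending on reversibility.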
3 Results and discussion
3.1 Threonine synthesis pathway in E. coli
Similarity score calculations are based on a dataset of 1,000 generated data points. A hundred such datasets were generated to test the reproducibility of the resulting networks. The solid lines in Fig. 2 correspond to perfectly reproducible edges whereas the presence of dotted or dashed edges indicates variability in the inference results among the 100 independent datasets (see legend to Fig. 2 for details). Figure 2 reveals that conditioning reduces the number of false positives: PPC1 and CMI1 perform noticeably better compared to the non-conditioned counterparts that give networks with more connectivity. The performance of GGM based PPCn is also comparable. Additionally, DPI pruning is very effective, especially with the intrinsic and environmental variability approaches (Fig. 2c, d), leading to the full inference of the original network by all used similarity measures without leaving any ambiguous edges behind. This shows the refining power of pruning on the inferred network. It is more obvious for environmental variations where the non-pruned results are relatively less promising, especially for the linear measure tests. Even conditioned approaches lead to a set of false-positive edges for this type of data. Application of DPI pruning (gray lines), on the other hand, successfully infers the original network for all types of similarity measures (Fig. 2d).
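How conditioning and pruning remove an indirect edge can be seen on a three-node chain A–B–C. The sketch below is a simplified Python illustration, not the implementation used in the study: it uses the textbook first-order partial correlation formula, takes the minimum over all single conditioning nodes as one possible convention for a first-order conditioned score, and applies an ARACNE-style data processing inequality (DPI) step without a tolerance parameter. The similarity values are hypothetical.

```python
import math
from itertools import combinations


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def min_first_order_partial(r, i, j):
    """Smallest |partial correlation| of (i, j) over each single other node.

    Needs at least three nodes in the correlation matrix r.
    """
    others = [k for k in range(len(r)) if k not in (i, j)]
    return min(abs(partial_corr(r[i][j], r[i][k], r[j][k])) for k in others)


def dpi_prune(scores, edges):
    """Remove the weakest edge of every fully connected triplet.

    All triplets are inspected on the unpruned edge set first, and the marked
    edges are removed afterwards, so the result does not depend on the order.
    """
    edges = {frozenset(e) for e in edges}
    nodes = sorted({n for e in edges for n in e})
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(e in edges for e in triangle):
            weakest = min(triangle, key=lambda e: abs(scores[min(e)][max(e)]))
            to_remove.add(weakest)
    return edges - to_remove


if __name__ == "__main__":
    # Hypothetical chain 0 - 1 - 2: the 0-2 similarity is purely indirect.
    r = [[1.0, 0.9, 0.72], [0.9, 1.0, 0.8], [0.72, 0.8, 1.0]]
    print(min_first_order_partial(r, 0, 2))       # close to zero
    print(dpi_prune(r, [(0, 1), (0, 2), (1, 2)]))
```

In this example the unconditioned scores would connect all three nodes, whereas both the conditioned score and the DPI step discard the indirect 0–2 edge and keep the two direct ones.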
In terms of linear vs. nonlinear measures, no clear difference was observed for the two systems between PC and MI, or PPC1 and CMI1, regardless of the variability approach. This implies that relationships between metabolites around steady state are mainly linear for the analyzed conditions, in parallel with previous findings for transcriptome data (Steuer et al. 2002).
The three data types used in this study can be grouped in two classes. Enzymatic variability data are based on variations of enzymatic properties across different experiments, leading to slight differences in individual reaction rates, which in turn cause variability in metabolite levels. Intrinsic and environmental variability, on the other hand, cover the net effect of several types of dynamic fluctuations on metabolite levels. It was shown for Vmax-dependent enzymatic variability (Camacho et al. 2005) that two neighbouring metabolites in the network may have little or no similarity when the enzymes that regulate them vary in different directions, causing a low correlation. That said, it may not be possible to have a perfect network inference based on enzymatic variability as it is dependent on enzyme mechanisms behind metabolic conversions. In other words, enzyme mechanisms play an important role in metabolic network inference. This was partly observed when conditioning or pruning is applied to the edge E23 in Fig. 2b, leading to ambiguous edges for neighbouring metabolites. The other two data types (Fig. 2c, d), on the other hand, did not have this limitation. This suggests that a different data type makes it possible to break the barriers due to enzyme mechanisms and to infer the edges connecting metabolites whose co-responses are controlled by multiple enzymes in different directions.
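The point that neighbouring metabolites can show little similarity under enzymatic variability can be illustrated with a deliberately simple toy pathway. The Python sketch below is not one of the kinetic models of the study and is not the mechanism analyzed by Camacho et al.: it assumes an irreversible chain (influx v0, then S1 consumed at rate k1*S1, then S2 consumed at rate k2*S2), so that the steady states are S1 = v0/k1 and S2 = v0/k2, and it draws the rate constants from hypothetical log-normal distributions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def steady_states(n, cv_influx, cv_enzyme, seed=0):
    """Steady states of the toy chain for n independently sampled 'experiments'."""
    rng = random.Random(seed)
    s1, s2 = [], []
    for _ in range(n):
        v0 = math.exp(rng.gauss(0.0, cv_influx))  # influx variation
        k1 = math.exp(rng.gauss(0.0, cv_enzyme))  # enzyme 1 variation
        k2 = math.exp(rng.gauss(0.0, cv_enzyme))  # enzyme 2 variation
        s1.append(v0 / k1)                        # steady state of S1
        s2.append(v0 / k2)                        # steady state of S2
    return s1, s2


if __name__ == "__main__":
    only_enzymes = steady_states(2000, cv_influx=0.0, cv_enzyme=0.05, seed=1)
    with_influx = steady_states(2000, cv_influx=0.10, cv_enzyme=0.05, seed=1)
    print(pearson(*only_enzymes), pearson(*with_influx))
```

In this toy case, S1 and S2 are direct neighbours, yet they are essentially uncorrelated when only the two enzymes vary independently; a shared source of variation (here the influx) is needed before the edge becomes visible to a similarity score.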
Comparison of the three variability approaches indicates that, for this small example, intrinsic variability leads to the best results, with identification of the original network not only by conditioning but also by pruning regardless of the similarity method employed.
3.2 Application to larger networks: glycolytic pathway in S. cerevisiae and central metabolism in E. coli
The next examples are of larger networks. The first one is the glycolysis pathway of S. cerevisiae which consists of 13 metabolites and 18 reactions (Teusink et al. 2000). The 13 metabolites correspond to 78 possible interactions, whereas the number of real edges in the network is 21. Additionally, the 18-metabolite and 30-reaction network of E. coli central carbon metabolism was considered (Chassagnole et al. 2002), which has 153 possible pairwise interactions, of which 39 are genuine.
In practice, the ROC curves are not available because the true network is unknown. Hence, one has to select a P-value, and usually a value of 0.01 is chosen. The consequence of this selection is shown with the black dots on the ROC curves of Fig. 3. The choice of the cut-off point for the P-value can lead to unfavorable results, e.g., in the case of the CMI of E. coli for the enzymatic variability: a better compromise between false-positive rate and true-positive rate would have been obtained at another P-value (i.e. at another point on the ROC curve). Unfortunately, the position of the ‘P-value point’ on the ROC curve is not known for practical cases. This serves as a warning for practitioners: the P-value cut-off is an arbitrary choice, and a different cut-off leads to different results, namely a more- or less-connected graph with more or fewer false positives and false negatives. The choice of P-value can thereby lead to a suboptimal recovery of the underlying network.
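The dependence of the result on the P-value cut-off can be made explicit with a short Python sketch. Each cut-off yields one inferred edge set and therefore one (true-positive rate, false-positive rate) point on the ROC curve; the number of candidate edges among n metabolites is n(n-1)/2, which gives 78 for 13 metabolites and 153 for 18. The four-node network and the P-values below are hypothetical, and the strict `<` comparison at the cut-off is an assumption.

```python
def n_possible_pairs(n_nodes):
    """Number of undirected metabolite pairs, n(n-1)/2."""
    return n_nodes * (n_nodes - 1) // 2


def edges_at_cutoff(p_values, cutoff=0.01):
    """Pairs whose P-value falls below the cut-off."""
    return [pair for pair, p in p_values.items() if p < cutoff]


def confusion_rates(n_nodes, true_edges, inferred_edges):
    """Return (true-positive rate, false-positive rate) of an inferred network."""
    true_set = {frozenset(e) for e in true_edges}
    inferred_set = {frozenset(e) for e in inferred_edges}
    tp = len(true_set & inferred_set)
    fp = len(inferred_set - true_set)
    fn = len(true_set - inferred_set)
    tn = n_possible_pairs(n_nodes) - tp - fp - fn
    return tp / (tp + fn), fp / (fp + tn)


if __name__ == "__main__":
    true_edges = [(0, 1), (1, 2), (2, 3)]         # hypothetical linear chain
    p_values = {                                  # hypothetical P-values
        (0, 1): 0.001, (1, 2): 0.004, (2, 3): 0.2,
        (0, 2): 0.005, (0, 3): 0.5, (1, 3): 0.6,
    }
    for cutoff in (0.01, 0.3):
        inferred = edges_at_cutoff(p_values, cutoff)
        print(cutoff, confusion_rates(4, true_edges, inferred))
```

In this made-up case the looser cut-off recovers one more true edge at no extra cost in false positives, which is exactly the kind of better compromise that cannot be identified in practice when the true network is unknown.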
The importance of quantitative measures for the information quality of experimental data to be used in network inference has been pointed out (Camacho et al. 2007). The Fisher Information Matrix has been in use for this purpose to judge the quality of experiments (Kresnowati et al. 2005). The multiplication of a data matrix with its transpose is called the Fisher Information Matrix, and the condition number (called modified E-optimality) of this matrix is one of the most widely used criteria for the information content of data (Balsa-Canto et al. 2007). In this measure, lower scores correspond to better data types. We calculated the condition number of the Fisher Information Matrices corresponding to each of the three data types for both systems. Data were standardized before the calculation of the modified E-optimality score. The condition numbers of data from environmental variability are on the order of 10^9 and 10^12 for the S. cerevisiae and E. coli systems, respectively, while those of enzymatic and intrinsic variability data are at least 10^6-fold lower. This fact points to the low quality of the environmental variability data, in parallel with the observations in Fig. 4a and b. To further strengthen these results, environmental variability data with 50 times higher weight for the stochastic term were generated for E. coli, resulting in a dataset with much higher variation. The corresponding condition number of the Fisher Information Matrix was, albeit lower than the original, still 10^4-fold higher than those of the other data matrices, suggesting that environmental variability does not result in informative data for the inference of intracellular networks.
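A dependency-free Python sketch of this condition-number diagnostic is given below. Several choices are assumptions of the sketch rather than details from the text: the data matrix has samples in rows and metabolites in columns, the product is formed as X^T X (metabolite by metabolite), standardization uses the population standard deviation, and eigenvalues are obtained with a plain cyclic Jacobi iteration instead of a library routine.

```python
import math


def standardize(data):
    """Z-score every column (metabolite) of a samples-by-metabolites table."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in data]


def gram_matrix(data):
    """X^T X for a samples-by-metabolites table X."""
    p = len(data[0])
    return [[sum(row[i] * row[j] for row in data) for j in range(p)] for i in range(p)]


def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-24):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        total = sum(a[i][j] ** 2 for i in range(n) for j in range(n))
        if off <= tol * total:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):                # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):                # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def condition_number(data):
    """Largest over smallest eigenvalue of X^T X (infinite if singular)."""
    eig = symmetric_eigenvalues(gram_matrix(data))
    return math.inf if eig[0] <= 0.0 else eig[-1] / eig[0]


if __name__ == "__main__":
    independent = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
    collinear = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.0]]
    print(condition_number(standardize(independent)))
    print(condition_number(standardize(collinear)))
```

Nearly collinear columns inflate the condition number, which is why a very large value signals that the data carry little independent information about individual metabolites.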
3.3 Validation of the results
Figure 5 reveals that, especially for the E. coli system, there are a number of interactions that cannot be captured by either the enzymatic or the intrinsic variability-based data types (false negatives). Therefore, to investigate the role of weak-strength interactions in the false negatives encountered in similarity-based inference methods, we first focus on the E. coli system. We have classified weak interactions as the ones with interaction strengths lower than 1. Of the 39 interactions in the E. coli system, 12 fall into this category. Further inspection of these weakest 12 interaction strengths (with a range of 8 × 10^-6 to 0.17) reveals that 9 and 11 of them have insignificant PPCn P-values, respectively, for data based on enzymatic and intrinsic variability. This explains why these interactions cannot be captured by the PPCn score. Ignoring these interactions can lead to a true-positive rate of as high as 0.84, compared to current values of around 0.60 (Fig. 3, Supplementary Table 1). Further calculation of Spearman rank correlations between the strengths of the 39 interactions and the corresponding PPCn scores gives 0.64 (P-value: 1 × 10^-5) and 0.72 (P-value: 2 × 10^-7), respectively, for the enzymatic and intrinsic variability datasets. That is, there is a significant relationship between these two entities for both data types. For S. cerevisiae, a very low number of false negatives was observed, which is in accordance with the fact that no weak interactions were present in this system. In summary, false negatives in metabolic network discovery are present because of low interaction strength and not primarily because of the failure of the network inference methods.
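The Spearman rank correlation used here to relate interaction strengths to similarity scores can be sketched in a few lines of Python. The implementation below ranks both variables (ties receive their average rank) and then computes the Pearson correlation of the ranks; the numbers in the example are made up for illustration and are not values from the study.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1               # mean of positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    strengths = [0.001, 0.2, 1.5, 20.0, 300.0]    # hypothetical Jacobian strengths
    scores = [0.01, 0.05, 0.3, 0.2, 0.9]          # hypothetical similarity scores
    print(spearman(strengths, scores))
```

Because only ranks enter the calculation, the measure is insensitive to the several orders of magnitude spanned by the interaction strengths.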
4 Concluding remarks
A systematic analysis of metabolic network inference was performed based on different types of in silico steady-state metabolome data. A comprehensive investigation of similarity measures for network inference on metabolomics data enabled the testing of nonlinear measures as well as measures eliminating indirect interactions. Linear versus nonlinear similarity measures were shown not to differ noticeably implying the lack of non-linear relationships among metabolites around steady-state conditions, which is especially true for datasets with relatively small perturbations around steady state. Conditioning and pruning approaches were found to improve results considerably by eliminating a high percentage of indirect links. The false negatives encountered were shown to be related to intrinsic properties of the network, i.e. weak interactions. Along the way, we extended the ARACNE approach, which is specific to the MI scores, to other similarity scores including conditioned ones and concluded that PPCn has a better inference capacity than any of the pruned scores.
Comparison of different variability methods reveals that intrinsic variability is generally more informative. Translating this result to experimental situations, this implies that a single organism under slightly varying conditions may already generate more than enough information to rightly infer networks, without having to turn to more genetic diversity or to more complicated experimental design. However, solely perturbing substrate conditions will not reveal the underlying network.
Use of Fisher Information Matrix-based testing gave hints on the quality of different datasets, suggesting a diagnostic for the quantitative pre-inspection of data. Use of environmental variability was not promising even when conditioning was applied. Pruning, however, improved the results of this variability type considerably, albeit still being inferior to the two other variability approaches.
A disadvantage of the similarity-based approaches presented here is the requirement of a high number of replicate measurements. However, no complicated experimental design is needed, making it more practical to employ this approach. Additionally, we have shown that pruning and conditioning approaches have the power to eliminate some ambiguous edges arising due to non-reproducible datasets. We have focused on data from steady-state variations without any designed perturbation since designed perturbations (e.g. knock-out or overexpression of selected enzymes) correspond to different cellular states with different similarity patterns. Therefore, one should be cautious when analyzing such data as it can lead to misleading correlations (Camacho et al. 2005).
It is not yet possible to have a perfect inference for metabolic networks with the presented approach. However, the finding that different data types hold different information about a network points to the importance of integrated analysis of different data types. It can be argued that all three different types of variation analyzed can be present under normal conditions. Integration of results from different data types was shown to result in much higher true-positive rates, pointing to the higher information content of a dataset including the effect of all three variations. The focus on proper experimental setup for reverse engineering approaches, together with the measures quantifying the information content of omics datasets, will be the future trend in this top–down systems biology approach.
This work is supported by Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research (NWO). Joke Blom (CWI, Amsterdam) is gratefully acknowledged for her invaluable help on the solution of stochastic differential equations. Prof. Ruud Berger (UMC Utrecht) and Prof. Joost Teixeira de Mattos (SILS, University of Amsterdam) are acknowledged for discussions on variability in mammalian and microbial systems, respectively. We thank Daniel J. Vis (SILS, University of Amsterdam) for his comments on the manuscript; Kaustubh Patil (Max-Planck Institute, Saarbruecken) for discussions on pathfinder networks and reading the manuscript; and Emrah Nikerel (Technical University of Delft) for discussions on intrinsic fluctuations.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Kresnowati, M. T. A. P., van Winden, W. A., Almering, M. J. H., ten Pierick, A., Ras, C., Knijnenburg, T. A., et al. (2006). When transcriptome meets metabolome: Fast cellular responses of yeast to sudden relief of glucose limitation. Molecular Systems Biology, 2, 49. doi:10.1038/msb4100083.
- Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R., et al. (2006). ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1), S7. doi:10.1186/1471-2105-7-S1-S7.
- Nemenman, I., Escola, G. S., Hlavacek, W. S., Unkefer, P. J., Unkefer, C. J., & Wall, M. E. (2007). Reconstruction of metabolic networks from high-throughput metabolite profiling data: In silico analysis of red blood cell metabolism. Annals of the New York Academy of Sciences, 1115, 102–115. doi:10.1196/annals.1407.013.
- Picchini, U. (2007). SDE Toolbox: Simulation and estimation of stochastic differential equations with MATLAB. Retrieved from http://sdetoolbox.sourceforge.net.
- Rahnenführer, J., Domingues, F. S., Maydt, J., & Lengauer, T. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3, Article 16.
- Steuer, R., Kurths, J., Daub, C. O., Weise, J., & Selbig, J. (2002). The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics (Oxford, England), 18(Suppl 2), S231–S240.
- Teusink, B., Passarge, J., Reijenga, C. A., Esgalhado, E., van der Weijden, C. C., Schepper, M., et al. (2000). Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. European Journal of Biochemistry, 267, 5313–5329. doi:10.1046/j.1432-1327.2000.01527.x.
- Werhli, A. V., Grzegorczyk, M., & Husmeier, D. (2006). Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics (Oxford, England), 22, 2523–2531. doi:10.1093/bioinformatics/btl391.
- Wu, L., Mashego, M. R., van Dam, J. C., Proell, A. M., Vinke, J. L., Ras, C., et al. (2005). Quantitative analysis of the microbial metabolome by isotope dilution mass spectrometry using uniformly 13C-labeled cell extracts as internal standards. Analytical Biochemistry, 336, 164–171. doi:10.1016/j.ab.2004.09.001.