Estimation of dynamic flux profiles from metabolic time series data
- First Online:
- Cite this article as:
- Chou, IC. & Voit, E.O. BMC Syst Biol (2012) 6: 84. doi:10.1186/1752-0509-6-84
- 3.6k Downloads
Advances in modern high-throughput techniques of molecular biology have enabled top-down approaches for the estimation of parameter values in metabolic systems, based on time series data. Special among them is the recent method of dynamic flux estimation (DFE), which uses such data not only for parameter estimation but also for the identification of functional forms of the processes governing a metabolic system. DFE furthermore provides diagnostic tools for the evaluation of model validity and of the quality of a model fit beyond residual errors. Unfortunately, DFE works only when the data are more or less complete and the system contains as many independent fluxes as metabolites. These drawbacks may be ameliorated with other types of estimation and information. However, such supplementations incur their own limitations. In particular, assumptions must be made regarding the functional forms of some processes and detailed kinetic information must be available, in addition to the time series data.
The authors propose here a systematic approach that supplements DFE and overcomes some of its shortcomings. Like DFE, the approach is model-free and requires only minimal assumptions. If sufficient time series data are available, the approach allows the determination of a subset of fluxes that enables the subsequent applicability of DFE to the rest of the flux system. The authors demonstrate the procedure with three artificial pathway systems exhibiting distinct characteristics and with actual data of the trehalose pathway in Saccharomyces cerevisiae.
The results demonstrate that the proposed method successfully complements DFE under various situations and without a priori assumptions regarding the model representation. The proposed method also permits an examination of whether at all, to what degree, or within what range the available time series data can be validly represented in a particular functional format of a flux within a pathway system. Based on these results, further experiments may be designed to generate data points that genuinely add new information to the structure identification and parameter estimation tasks at hand.
KeywordsBiochemical systems theory Dynamic flux estimation Metabolic pathways Parameter estimation Structure identification Time series data
A grand challenge of biomathematical modeling is the conversion of a biological system into a computational structure that formalizes the underlying system. An important and very challenging component of this process is the estimation of parameter values. The task is typically pursued with one of two generic approaches, namely a forward (bottom-up) or an inverse (top-down) method. Until recently, essentially all models of metabolic pathway systems were developed according to the first strategy, that is, by characterizing model components and processes one at a time and subsequently merging all “local” information about kinetic reaction steps into one comprehensive dynamic model. Although this forward approach is theoretically straightforward, implementation procedures often fail and, moreover, have intrinsic disadvantages . For instance, the necessary information is usually obtained in vitro and from different experiments, so that there is no guarantee that it is entirely compatible and consistent.
The second, top-down approach uses data that characterize the entire system and attempts to estimate all parameter values at once with a sophisticated optimization algorithm. Specifically, this type of method employs time series data that describe the full dynamic response of a pathway system to some stimulus, such as an environmental stress (e.g., heat shock) or the availability of food (e.g., glucose uptake). In contrast to the local data obtained from traditional experiments, the great appeal of using these types of “global” data is that most, if not all, measurements are taken from the same biological system under the same conditions. Furthermore, these data contain enormous and essentially unadulterated information about the structure, dynamics and regulatory mechanisms that govern the biological system under investigation. The main drawback of top-down approaches is that the actual extraction and integration of this information into fully functional, explanatory models is challenging. In fact, more than one hundred articles addressing this task appeared within the past ten years. They focused on various aspects of the estimation process, but most of them were dedicated to the main issue of optimizing parameter values against the observed time series data (e.g., [2, 3]).
Whether a forward or inverse approach is used, the estimation of parameter values necessitates assumptions regarding the functions or rate laws that describe the reactions of interest. As a prominent example, the typical default for enzymatic reactions in a metabolic pathway is the Michaelis-Menten rate law (MMRL) or one of its variations. While such assumptions are understandable, they create an immediate conundrum. Namely, the true mechanisms governing a biological process are in reality unknown or at least unclear. As a result, the estimation process is from the start unguided, uncertain, or maybe even based on modestly or entirely wrong assumptions. Also, descriptions of more complex enzyme mechanisms contain numerous parameters if several substrates or reactions are involved, so that the alleged functions cannot be identified from the typically sparse data [4, 5].
In addition to the troublesome issue of model selection, most proposed methods for estimation from time series data face significant problems related to the data themselves, to inefficient algorithms, and to a variety of computational issues. To complicate matters further, these issues are usually superimposed. The data may be overly noisy, incomplete, collinear with each other, or non-informative. The computational algorithms are often slow to converge, converge to a locally but not globally optimal solution, or do not converge at all. Finally, there is a mathematical issue, especially for systems with many parameters, namely that a system may admit solutions that are distinctly different yet equivalent, or essentially equivalent, with respect to the residual error. This type of result, referred to as sloppiness and unidentifiability, may be due to redundancies in candidate parameter sets and has received much attention in recent times [6, 7, 8].
A different type of sloppiness may be caused by the fact that different model structures may give essentially identical residual errors. For instance, several probability density functions often model the same data equally well . Moreover, two “wrong” structures or representations within a model may permit compensation of errors between different terms or equations . It is not even clear whether the residual error (SSE) is always the best metric for the quality of fit . For instance, the “best” models in terms of having the smallest SSE tend to have too many parameters and therefore encounter over-fitting problems. This issue can be serious, because an over-fitted model often lacks the capacity of extrapolation and predictive power with respect to data not used in the estimation or untested conditions. Therefore, it is necessary to develop tools for the evaluation of model validity and quality beyond residual errors. For instance, one should establish criteria to determine the appropriateness of the chosen mathematical representations, develop methods for assessing whether residual errors are due to idiosyncrasies or noise in the data, and develop diagnostic tools of discriminating between valid and invalid model structures.
The left-hand side of this ODE can be interpreted as the slope of the time course of the variable Xi at a given point in time. Therefore, assuming that the time series data are more or less complete and smooth—or can be validly smoothed (see later) —one can estimate the slope of the time course at each time point and substitute the slopes for the derivatives. If the system contains N equations, and if data are measured at K time points, this substitution decouples the system of N differential equations into N sets of K algebraic equations each. This system is linear in the fluxes and can be assessed with methods of linear algebra. In particular, it is easily solved at each time point if the system has full rank.
The result of this first phase of DFE is a representation of each flux as a numerically characterized function of time and as a function of all contributing metabolites. This representation is not explicit, but purely numerical and consists of points in plots of flux vs. time or flux vs. metabolites and modulators. The second phase of DFE addresses the mathematical formulation of each process in the biological system by attempting to convert these numerical plots into mathematical representations, such as a Michaelis-Menten or Hill rate law or a power-law description. In contrast to most other methods, where a functional form had to be assumed a priori, this step allows quantitative diagnostics of whether a candidate of a mathematical formulation may be appropriate, at least within certain ranges of the contributing variables. The subsequent determination of parameter values is now much easier, because it involves explicit functions that are addressed one flux at a time.
DFE offers substantial advantages. It makes almost no assumptions and is straightforward if the right data are available. It reveals inconsistencies within the data, avoids compensation among and within equations, and permits quantitative diagnostic tools of whether the assumed mathematical formulations are appropriate or in need of improvement. In addition, since DFE identifies parameters based on explicit single-flux representations, the estimation of parameter values is much easier and more reliable than in other top-down approaches. As a result, DFE promises significantly improved extrapolation capacity toward new data or experimental conditions.
Alas, DFE also has limitations and drawbacks. First, it requires more or less complete time series data that characterize the investigated system. These data are still relatively seldom, although they are being generated at an increasing rate and with rapidly improving quality. Second, and arguably more limiting, a unique solution of the flux equations in the first phase of DFE is only possible if the flux system is of full rank. However, most actually pathway systems contain more fluxes than metabolites and are therefore underdetermined.
Several constraint-based optimization techniques have been proposed for stoichiometric analyses of underdetermined metabolic systems . They have become a mainstay of flux balance analysis (FBA ) and work well under steady-state and pseudo-steady-state (PSS) assumptions [14, 15, 16, 17]. Mahadevan and co-workers  extended the traditional FBA to account for dynamics and presented two different formulations: the dynamic optimization approach (DOA) and the static optimization approach (SOA). DOA involves optimization over the entire time period to obtain flux profiles over time, while SOA involves dividing the batch time into several time intervals and solving the instantaneous optimization problem at the beginning of each time interval. These methods basically are variations of FBA and need, for the determination of flux profiles at each time point, constraints and objective functions, which describe some goal the cell aims to reach. For the case of microbial systems, a reasonable objective may be maximization of the growth rate. However, determining an unbiased objective function in a eukaryotic system is often difficult.
In contrast to these methods that require objective functions, we proposed extending DFE with the infusion of additional information . We distinguished four cases. First, the connectivity of the systems is not fully known or some of the connections are uncertain. Second, some of the time series data were not measured, although it is known how the corresponding metabolites are involved in the pathway. Third, the system contains “missing” metabolites which are neither known nor measured, but in actuality affect the system significantly. And fourth, the flux system is underdetermined, even though the time series of all relevant metabolites are measured.
The first issue might be ameliorated by methods developed for structure identification of unknown of ill-characterized pathways. These methods include a wide spectrum of techniques, such as perturbation methods, causality models, correlation-based approaches, or probabilistic models, some of which are based on time series data (see  for review). The lack of certain data in the second case could be complemented by deducing the unknown time profiles from time series of neighboring metabolites, if the corresponding enzymatic information is available for fluxes producing and degrading a metabolite in the equation. However, this approach of using kinetic information obtained in vitro, or maybe even from different organisms, is naturally problematic due to some degree of bias and uncertainty. A possible solution strategy for the third case is to check the mass balance in the entire system throughout the time period. If significant changes in mass balance are observed, additional biological insight will be needed to check the pathway model and identify possible sources of leakage or gain of mass. If the masses are more or less balanced, it is still possible that important fluxes or metabolites are missing. However, there is currently no obvious defense in this situation. Finally, to complement an underdetermined flux system, some of the fluxes need to be estimated with information from other sources. For instance, it might be possible to obtain fluxes directly from experiments, but such data are rare. As an alternative, one may assume the functional form for an enzymatic reaction, and if corresponding kinetic information is available, for instance from BRENDA , parameter values may be estimated for this functional form. As a variation on this strategy, one could assume some canonical model, such as power-law functions  or lin-log approximations [22, 23], if some of the variables and fluxes operate within relatively small ranges. Clearly, this option runs counter to the model-free nature of DFE, but might be the only feasible solution. Instead of using kinetic information, one could also select some of the decoupled equations and use optimization methods to fit the selected model to the time series data (e.g.[24, 25]).
Although we presented proof of concept that the different approaches described above can be used to supplement DFE, these approaches are not always optimal, because they require additional information and assumptions that are a priori not validated. The question thus arises: can we directly squeeze additional information out of the time series data, without the need of further assumptions and additional information? And if so, under what conditions is that possible? Providing at least partial answers to these questions is the topic of this article.
Specifically, we propose here a distinct approach to supplementing DFE with information hidden in suitable metabolic time series. Extracting this information permits the determination of a sufficient subset of fluxes to execute DFE on the rest of the flux system. In contrast to all other solutions presented so far for the complementation of DFE, the method proposed here does not require any assumptions regarding the mathematical representation of the fluxes. Furthermore, kinetic information or knowledge of the functional forms of the enzymatic reactions is not required. We will demonstrate in the following that the proposed method can succeed even if some of the time series data are not measured or when there is mass leakage in the pathway systems. In addition, the new method allows us to address a recurring unanswered question, namely how many time series data are needed to estimate the structure and parameters of a system.
Suppose we have time series data, so that we can estimate for every measured time point with sufficient accuracy. Suppose further that the time series data are such that we have m time points (in the same or in different datasets) where Xi has the same value (e.g., ci), whereas Xj has a different value at each of these time points. It is reasonable to assume that is a function in a strict mathematical sense, which means that has one unique (although yet unknown) value vci. If so, we have m equations of the type , where the values of Xj and are known directly from the data and vci always has the same value. Using these quantities, the methods proposed here allow us to estimate the functional format of , at least over some pertinent range of Xj values. Once we know , we can determine . Thus, we now have numerically quantified two fluxes, which reduce the discrepancy between the number of equations and the number of independent fluxes. Repeated application of the method allows DFE for the entire system. An illustrative example is shown in Methods and other examples are presented in the Results. If the function depends on more than one variable, the procedure is the same in concept but more complicated in detail (see Additional file 1).
where s is a vector of slopes, N is the stoichiometric matrix, v is a vector of fluxes, and K is the number of time points t1, t2,…, tj,…, tK where measurements are available.
Next we check the rank of the linear set of algebraic equations in Eq. (2). The system can be easily solved at each time step to obtain dynamic profiles of all fluxes if the system has full rank. Over-determined systems may be solved by pooling fluxes, the use of pseudo-inverse methods, or regression. However, if the system is underdetermined, the solution space is infinite. To overcome this issue, some of the fluxes need to be estimated independently, until the system has full rank and can be solved uniquely. Elsewhere we showed that additional information maybe used to characterize selected fluxes . Here, the goal is to estimate some fluxes directly from the time series data, without evoking other sources of information.
Suppose time series data were measured and they are without noise (Figure 1(b)). Eq. (3) immediately indicates that the flux system is underdetermined and therefore has infinitely many solutions. A unique solution could be obtained if one of the fluxes could be determined independently. To achieve this independent determination, one may choose any one of the three equations in the system. For ease of computation, one will typically prefer an equation that contains as few fluxes and as few substrates and modulators as possible. In this linear system, all equations have two fluxes and each of them depends on only one metabolite, so that there is no advantage to choosing one equation rather than another.
We repeat this type of screening for different sets of the same or very similar values of Xin. The result is a set of sets with equal Xin_const values within each set but different Xin_const values for different sets. These sets form a histogram with a bin for each Xin_const. If the range of each bin is small enough, we can assume every Xin in the same bin to have very similar values, so that their corresponding vin_const are also very similar. Henceforth, we only retain bins with at least two entries. An example in the illustrative example consists of time points 3.4 and 9.6, where X1 has again similar values. In this case, the value is ~1.26, which is different from the value we screened before. Similarly, for time points 1, 8.4, and 9.4, X1 has a value of ~0.6 (Figure 2(a)). Figure 2(b) shows many situations in the dataset where X1 has approximately some fixed value, and these sets of X1 are reflected in a “bin database of values.” Within each bin, the corresponding value of v2c is very similar as well.
Since we do not know the functional form of vin, we do not know the numerical value of vin_const (binp). However, since vin_const (binp) is a constant for each bin, the relative positions of a group of values of vout (binp) depend on each value –Si (binp) within a given bin, and the slope values can be measured directly from the time series data. In addition, since vout (binp) is solely determined by Xout (binp), we can characterize the relative positions of a set of Xout (binp) and their corresponding values –Si (binp). Collecting these relationships, we can establish a plot of Xout (binp) versus –Si (binp). If the bin contains only two points of Xout, we consider them as a pair and link them with a connecting line. If the bin contains q points of Xout (where q > 2), we sort Xout based on their values and connect every two adjacent points as a pair to form a total of q-1 pairs. In order to address these pairs, we use an additional index for the position of each pair in each bin, such as (Xout (pairr)(1), –Si (pairr)(1)) for the first point and (Xout (pairr) (2), –Si (pairr) (2)) for the second point, where r = 1, …, q–1.
Equation (12) indicates that Xout (binp) and –Si (binp) differ by a constant, since we do not know the value of vin_const (binp). This fact translates into a constant vertical shift in the y direction for each pair of points. In other words, the relative y positions of the pair are preserved and the pair has to be shifted together by a yet unknown amount. While we do not know the size of the shift for each individual pair of points, all points collectively represent the graph of Xout versus vout, and it is reasonable to assume that this graph is continuous and usually even monotonic. Therefore, the next step is to merge the individual pairs by determining a proper shift for each pair.
Intuitively, it is easy to see how to shift all pairs so that they are close to one continuous line. Automation of the process requires an algorithm that is not quite straightforward, but can be facilitated with a graphical user interface; technical details of a possible merging process are presented in Figure S1 of the Additional file 1. A pseudo-code of the merging is the following:
SET each pair of points as a node
SET each node as a subgraph
WHILE the graph is not connected
FOR each subgraph in the graph
FOR each node in the current subgraph
SET other-subgraphs as the subgraphs; exclude the current subgraph
CALCULATE the distance from the current node to every node contained in other-subgraphs
FIND the shortest distance and its corresponding nodes
CONNECT these two nodes
When the merging is completed, all pairs of points are close to a relatively smooth line, but the overall shift of the group of pairs is not known. We do know that essentially all metabolic fluxes will have values close to zero when their substrate concentration approaches zero. Thus, if sufficiently small substrate values are available in one of the bins, one easily estimates a reasonable shift. Should the flux value associated with some substrate concentration be known, the shift can be determined from this information. A further alternative is the following. If the inferred trend line suggests that the flux follows some rate law, such as a Hill function, the parameters of this function, together with the appropriate shift, can be obtained in a single optimization step.
Figure 4(b) shows, for the illustrative example, the process of merging and shifting. The human eye has no problem accomplishing this task intuitively. In the automated process (see Additional file 1), one connects each “node” (pair of points) with its closest-neighbor node and positioning the pair of points. This process creates two sub-groups of points. We recalculate the distance between each node in a sub-group with the nodes in the other sub-group, determine the closest pair of nodes, connect them, and shift the corresponding pairs into one sub-group as shown in Figure 4(c). Suppose the value of v3 is known for X2 = 1. If so, we ultimately shift the entire trend accordingly. The result is shown in Figure 4(d). A shift based on the association between zero flux and zero substrate concentration is an alternative, although it does not uniquely prescribe a solution in this case.
Finally, based on the numerical or graphical flux profile thus determined, one may test candidate functions that capture the flux-substrate relationship. For instance, the result in the illustrative example shows that the functional relationship of X2vs. v3 is s-shaped. It could thus be consistent with the (true) Hill function in Eq. (4), although the computed result itself certainly would not prove that this format is correct. If one assumes, based on the results, that a Hill function is appropriate, one may fit this functional form to data to find the optimal parameter values of the flux-metabolite dependency. Without making such an assumption, one may alternatively connect the dots in Figure 4(d) with a continuous line and interpolate the values of fluxes using a spline or another smoothing method. The resulting trend line can be used as a “look-up” plot.
Now that we have determined v3, it is easy to compute v2 from the measured slopes of X2. The plot of v3 is slightly curved, which is consistent with its power-law function in Eq. (4), although again, there is no proof. The Results section discusses further examples.
The procedure described above has generated one or two additional flux estimates. For the example in Eq. (3), the determination of v2 and v3 “fills” the rank, and the system can be uniquely solved. In fact, only one of the two is needed. For examples where one or two additional fluxes are not sufficient for a unique solution, the same procedure has to be performed with other equations until enough fluxes are determined to make the flux system full rank. DFE subsequently identifies all other fluxes as plots against time or against their substrates and modulators.
In cases where fluxes contain more than one variable, the time courses have to be screened for combinations where the contributing variables have the same values. The concepts of the procedure are exactly the same as for the univariate case, but the implementation is obviously more involved (see Additional file 1). Also, such combinations are rarer than in the cases described above, so that these situations require more diverse datasets for structure identification.
The simple linear pathway shown in the previous section illustrated the concepts of the proposed extension to DFE. This section describes applications of the proposed methods in the context of further didactic and actual examples that become increasingly more complicated. We begin with two artificial cases with distinct characteristics and conclude with the analysis of experimental observations describing trehalose metabolism in the yeast Saccharomyces cerevisiae.
Branched pathway with feedforward activation and feedback inhibition
The system in Eq. (13) is not of full rank. Thus, some of the fluxes need to be determined with the proposed method. For our illustration, we select the third equation in Eq. (13), because it contains only two fluxes; also, v3 depends only on X2, and v4 depends only on X3, which we know from the topology of the pathway. In the previous example, all time series were oscillating and it was easy to find enough data points where one variable is fixed and other variables display different values. In the present example, each single dataset displays changes over time that show few repeated concentration values (see Figure 6(c)). In such a situation, which is more typical than the earlier illustration example, one can use exactly the same method, applied to multiple datasets with different dynamic profiles, as long as one can validly assume that the functional forms are not affected by differences among the datasets.
Instead of v4, one could also estimate an additional flux from another equation in Eq. (13) using the same procedure, for example, by solving v5 and v6 in the fourth equation. Flux v6 depends only on X4 but v5 depends on two variables X1 and X2. The steps of estimating v5 and v6 are described in Additional file 1. We also tested the proposed method by using six datasets in Figure 6(b) and the results similarly recover the true functional form (data not shown).
The proposed method was also tested on a five-variable system that has been used as a benchmark problem in many articles (e.g.[24, 31, 32, 33]). To demonstrate the applicability of the method, we also added artificial noise to the time series data in this example and randomly picked sub-datasets from data generated with ten conditions. The details and results are shown in Additional file 1.
Glycolysis and trehalose production
The model contains eight dependent variables and eight fluxes, as shown in Eq. (15), where Vext and Vint represent the extracellular (0.05 L) and intracellular (0.00717 L) volume of the bioreactor and the cell population, respectively. Each of the fluxes is a function of some of the variables, as shown in Eq. (16), but it is important to note that we do not make any assumptions regarding the functional forms of the fluxes. In principle, DFE seems to be directly applicable. However, the time series data contain the measurements of only five of the metabolites, namely Glc (X1), G6P (X3), Tre (X4), FBP (X5), and extracellularly accumulated end products (EtOH, Gly, and Ace; X6). Without the measurements of X2, X7, and X8, the system in Eq. (15) is not of full rank and, due to the experimental set-up, v7 and v8 cannot be measured or determined directly by estimating slopes.
Of all steps in the generic mathematical modeling process, parameter estimation and structure identification continue to be among the most severe bottlenecks for modeling biological systems. Until recently, this task was typically pursued from the bottom up by using local data from individual enzymatic steps. However, modern techniques of molecular biology have provided us with a strikingly different estimation strategy, namely a top-down or inverse approach, which is based on dynamic time series data that are being generated with rapidly increasing frequency and quality. Many recent articles have proposed various methods to tackle this inverse estimation problem using time series data. However, none of these methods are effective in all cases. Furthermore, almost all methods have been focusing on the goodness of fit and the speed of the algorithm, but not necessarily the quality of fit in terms of the validity of the model, extrapolation ability, and predictive power with respect to data not used in the estimation. In addition, there has been little discussion of the diagnostic tools for data fits beyond the residual error. For instance, it is possible that a fit is good in terms of the residual error, but that the estimated fluxes are incorrect because of numerical compensations between terms within the model .
Dynamic Flux Estimation (DFE)  addresses several of these issues successfully, but only if the data are rather comprehensive. More limiting, DFE requires that the system of fluxes is of full rank. When the number of fluxes exceeds that of the dependent metabolites, either because of the stoichiometry of the pathway or due to the lack of measurements of some metabolites, DFE cannot be applied directly, because the system of fluxes is underdetermined. To supplement such a system, we recently proposed methods for supplementing DFE with other information that may be used as a substitute for unknown fluxes . However, these methods are successful only under certain restrictive conditions, for instance, when the enzymes in the system are well characterized under pertinent conditions, sufficient kinetic information is available, and all significant metabolic time series are measured. One could also determine some of the fluxes within the system by fitting pre-selected models to time series data. However, this pre-selection requires the definition of functional forms for the reactions in question, which in truth are often unknown.
In this article we propose a model-free approach with minimal assumptions to supplement DFE with information already embedded in the time series data. The proposed method starts with the selection of a decoupled equation; preferable one that contains a minimal number of terms and contributing metabolites. Within this equation, we repeatedly fix one or a few variables that have constant or very similar values within certain small ranges, and find the corresponding values of the variables that appear in another flux of the equation. The result of this step is a plot of a flux versus a metabolite, with several pairs of points showing the relative positions of the true metabolite concentration and the flux values in each pair. The position of each pair is initially subject to shifting in the y direction by an unknown amount. The correct shifting of pairs may be accomplished with an automated or manual merging process that, for instance, accounts for the fact that the flux value should be zero when the metabolite concentration is zero. One could also measure the flux value at some metabolite concentration experimentally. Furthermore, if an enzymatic rate function is deemed correct and corresponding kinetic information is available, the vertical shift in the flux can be calculated. Once the metabolite-flux plots are established for all fluxes, one can select a suitable mathematical representation for the entire dependency or use a piecewise approximation for different ranges of data. One could also use the metabolite-flux relationships directly as “look-up” plots.
The proposed method may appear cumbersome or even baroque. However, one should consider that it solves a problem that so far has not even been addressed—let alone solved—with any systematic approach. Also, the method is presently likely to suffer from a lack of suitable data. But judging by the development of high-throughput experimental methods and the number and increasing quality of published time series over the past decade, this issue seems to be primarily a matter of time. Indeed, one should expect that it will soon be feasible to generate strategically selected, multiple datasets for the identification of a system, which differ slightly in their settings. These datasets must come from experiments that do not alter the functional characteristics of the fluxes in the system but might, for instance, measure system responses under modestly different substrate or inhibitor conditions. At the same time, the data should be representative of the dynamics of the system within the pertinent ranges of its variables.
The method involves one step that is subject to bias. Namely, the overall shifting of the flux-metabolite relationship requires extrapolation or some other information, unless metabolite concentrations close to zero are available. To resolve this issue, it might be possible to determine a reference point for the shift from enzymatic or kinetic information. However, in many cases, this information will have been obtained in vitro and possibly under different conditions. A more direct approach would be to measure a flux value experimentally at some point, for instance, at the steady state. Such a measurement is relatively simple when the flux of interest is an input or output flux. It might also be possible to measure some fluxes directly by estimating the rate of consumption and production of the initial substrate or the end product, respectively. However, the measurements of fluxes at these locations are usually of lower interest since they are seldom associated with the underdetermined subsystems of the internal fluxes. One or more intracellular fluxes could also possibly be characterized through measurements of a suitable isotopomer distribution at steady state (i.e.[17, 35]), but such data are still rare. Finally, if one has valid reason to assume a particular format for a flux, such as a Michaelis-Menten or Hill function, the shift may be obtained through optimization. All estimation methods are negatively affected by noise, and the proposed method is no different (see Additional file 1). However, issues of noise can be logistically separated from this method to some degree. Namely, the time series data used as basis for the proposed method may (in fact, should) be smoothed in a preprocessing step, for instance with a filter [2, 26, 27]. This well-established step allows an assessment of the characteristics of the noise as well as the smoothing process itself. Once the data are smoothed and noise is thus reduced, the proposed method is applied as if the data had been noise free.
Outside these remaining details, the proposed method has several notable advantages. First, no assumptions are needed regarding the mathematical representations when determining the individual fluxes. Second, the application of the method is not limited to a small range of a metabolite or its flux. Instead, it allows the modeler to examine the full spectrum of the functional form, depending on how widely the available time series data cover metabolite concentrations along the x axis of the metabolite-flux plot. Third, even under the condition that some of the time series are missing, the proposed method can still recover—at least to some degree—the governing flux profiles. Finally, since the range of coverage depends on the available datasets, we can, arguably for the first time, estimate how many data points are necessary to identify the functional format of a flux or what values of metabolite concentrations are needed to cover the concentration range of interest. Namely, if it is possible to implement the proposed method for a system at hand with sufficient reliability, then we know that we have enough data to assess the range over which the flux can be determined. If a wider range needs to be known, additional data will have to be made available in that extended range. This insight in itself will aid the design of specific experiments that can be used to generate more extensive functional plots.
In this article we propose a systematic strategy to supplement and ameliorate the limitations of the method of Dynamic Flux Estimation (DFE). The proposed strategy makes no a priori assumptions regarding the model representation and uses instead information embedded in the time series data. The results demonstrate that the proposed approach successfully complements DFE in various situations. The method permits the examination of a full spectrum of functional forms, as well as a determination of whether at all, to what degree, or within what range, the available time series data can be validly represented in a particular functional flux format within a pathway model. Based on these results, one can, arguably for the first time, estimate how many data points are required to identify the functional format of processes within a system model and design experiments to generate data points that genuinely add new information to the parameter estimation and structure identification tasks.
The authors are grateful to Dr. Luis L. Fonseca for constructive discussions and for allowing us to use some of his data. They also acknowledge David Fieni’s work on an automated shifting algorithm. This work was supported in part by a Molecular and Cellular Biosciences Grant (MCB-0946595; E.O. Voit, PI) from the National Science Foundation, a grant from the National Institutes of Health (R01 GM063265; Y.A. Hannun, PI), and an endowment from the Georgia Research Alliance. The work was also in part funded by the BioEnergy Science Center (BESC), which is a U.S. Department of Energy Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring institutions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.