Data integration across conditions improves turnover number estimates and metabolic predictions

Wendering, Philipp; Arend, Marius; Razaghi-Moghadam, Zahra; Nikoloski, Zoran

doi:10.1038/s41467-023-37151-2

Article
Open access
Published: 17 March 2023

Volume 14, article number 1485, (2023)

Abstract

Turnover numbers characterize a key property of enzymes, and their usage in constraint-based metabolic modeling is expected to increase the prediction accuracy of diverse cellular phenotypes. In vivo turnover numbers can be obtained by integrating reaction rate and enzyme abundance measurements from individual experiments. Yet, their contribution to improving predictions of condition-specific cellular phenotypes remains elusive. Here, we show that available in vitro and in vivo turnover numbers lead to poor prediction of condition-specific growth rates with protein-constrained models of Escherichia coli and Saccharomyces cerevisiae, particularly when protein abundances are considered. We demonstrate that correction of turnover numbers by simultaneous consideration of proteomics and physiological data leads to improved predictions of condition-specific growth rates. Moreover, the obtained estimates are more precise than corresponding in vitro turnover numbers. Therefore, our approach provides the means to correct turnover numbers and paves the way towards cataloguing kcatomes of other organisms.

Introduction

Genome-scale metabolic models (GEMs) together with advances in constraint-based modeling have led to an improved understanding of how cellular resources are used to fulfill different cellular tasks^1,2,3. Recent advances are largely propelled by the development of protein-constrained GEMs (pcGEMs) in which the catalytic capacities of individual enzymes are linked to the allocation of enzyme abundances⁴. Such models have led to more accurate predictions of maximum specific growth rates on different carbon sources^5,6,7, flux distributions⁷, and other complex phenotypes⁸ in Escherichia coli and Saccharomyces cerevisiae. However, the development of pcGEMs critically depends on the integration of organism-specific enzyme turnover numbers, ${k}_{{{\rm {cat}}}}$, comprising the kcatome of an organism⁹.

Measuring the kcatome of an organism based on in vitro characterization is limited by the impossibility of purifying specific enzymes, the lack of available substrates, and incomplete knowledge of required cofactors, such that the relevance of in vitro values for studies of in vivo phenotypes remains questionable^10,11. Proxies for in vivo turnover numbers, also termed maximal apparent catalytic rates, can be estimated by combining constraint-based approaches for flux prediction with measurements of protein abundance under different growth conditions or genetic modifications^12,13,14. The results from this approach, which entails ranking condition-specific estimates that use individual data sets, have shown that the proxies for in vivo turnover numbers generally concur with in vitro ${k}_{{{\rm {cat}}}}$ values in E. coli¹². However, applications with data from S. cerevisiae¹⁵ and A. thaliana¹⁶ indicated that these proxies for in vivo turnover numbers do not reflect in vitro measurements. Another approach to estimate the kcatome relies exclusively on machine and deep learning methods that use a variety of features of enzymes (e.g. network-based, structure-based, and biochemical)^17,18,19, resulting in predictive models that can explain up to 70% of the variance in turnover numbers obtained in vitro.

The estimates of turnover numbers are integrated into metabolic models by different constraint-based approaches that have been grouped into coarse-grained (e.g. MOMENT⁵, sMOMENT²⁰, eMOMENT²¹, and GECKO⁷, which all result in the same feasible space in protein limited growth scenarios) and fine-grained (e.g. resource balance analysis¹ and ME-models^2,3). Of these, GECKO⁷ has been adopted in several recent studies due to the elegantly structured formulation of the protein constraints. In addition, GECKO allows for the integration of protein contents and correction factors that account for the mass fraction of enzymes ($f$) included in the model as well as the average in vivo saturation ($\sigma$) of all enzymes, facilitating the development of condition-specific models. While data-driven estimation of in vivo turnover numbers improves the coverage of ${k}_{{{\rm {cat}}}}$ values in pcGEMs, the available estimates usually lead to over-constrained models when using the allocation of total protein mass, not considered in flux balance analysis (FBA)^22,23.

Here, we propose PRESTO (for protein-abundance-based correction of turnover numbers), a scalable constraint-based approach to correct turnover numbers by matching predictions from pcGEMs with measurements of cellular phenotypes simultaneously over multiple conditions. As a constraint-based approach, PRESTO facilitates the investigation of the variability of the proposed corrections. We show that predictions of growth by pcGEMs of S. cerevisiae with turnover numbers corrected by PRESTO are more accurate than those based on the models that include ${k}_{{{\rm {cat}}}}$ values corrected based on a contending heuristic that relies on enzyme control coefficients²². We also demonstrate that the same conclusions hold when enzyme abundances are integrated into the E. coli pcGEM using PRESTO. Therefore, PRESTO paves the way to broaden the applicability of pcGEMs for organisms with biotechnological applications and to arrive at genotype-specific estimates of the kcatome.

Results

Protein-abundance-based correction of turnover numbers

For a given data set of protein abundances over a set of conditions, the enzymes with turnover numbers in a pcGEM can be partitioned into three groups. For instance, in a data set of protein abundances that was recently used to estimate in vivo turnover numbers in S. cerevisiae¹⁵, 45%, 41%, and 14% of the enzymes were measured in all, in at least one (but not all), and in none of the 27 conditions used, respectively. Therefore, the data support for correcting the ${k}_{{{\rm {cat}}}}$ values differs between these classes of proteins. PRESTO relies on solving a linear program that minimizes a weighted linear combination of the average relative error for predicted specific growth rates and the correction of the initial turnover numbers integrated into the pcGEM (Fig. 1, see the “Methods” section). It further employs K-fold cross-validation (here, K = 3) with 10 repetitions while ensuring a steady state and integrating protein constraints for proteins measured over all conditions (Fig. 1, see the “Methods” section). The training set of conditions is used to generate a single set of corrected in vitro ${k}_{{{\rm {cat}}}}$ values, by using the respective in vivo protein abundances. The resulting corrected ${k}_{{{\rm {cat}}}}$ values are in turn used to determine the relative error of the predicted specific growth rate for each condition in the test set using flux balance analysis with the pcGEM, while only constraining the total protein content and measured uptake rates. The relative error of the predicted specific growth rate along with the sum of introduced corrections is lastly used to select the value for the tuning parameter $\lambda$ in the objective function of PRESTO, as done in machine learning approaches that rely on regularization.
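The repeated cross-validation scheme described above can be illustrated with a minimal sketch. It only shows how 27 conditions could be split into three folds over 10 repetitions; the function name `repeated_kfold`, the condition labels, and the random seed are assumptions made for illustration and are not taken from the PRESTO implementation.

```python
import random


def repeated_kfold(conditions, k=3, repetitions=10, seed=0):
    """Yield (train, test) splits of condition labels for repeated K-fold CV."""
    rng = random.Random(seed)
    for _ in range(repetitions):
        shuffled = list(conditions)
        rng.shuffle(shuffled)
        # Deal the shuffled conditions round-robin into k folds of near-equal size.
        folds = [shuffled[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            # All folds except fold i form the training set.
            train = [c for j, fold in enumerate(folds) if j != i for c in fold]
            yield train, test


# Hypothetical labels standing in for the 27 experimental conditions.
conditions = [f"cond_{i:02d}" for i in range(27)]
splits = list(repeated_kfold(conditions))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In such a scheme, each training set would be used to fit one set of corrections and the held-out conditions would be used to score the predicted growth rates.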

**Fig. 1: Schematic overview of the PRESTO approach for k_cat correction.**

PRESTO outperforms a contending heuristic in S. cerevisiae

To determine the performance of PRESTO and compare it to that of contending heuristics, we used a data set comprising protein abundances and exchange fluxes from 27 diverse conditions, as supported by the principal component analysis (Supplementary Fig. 1). Application of PRESTO with a pcGEM of S. cerevisiae with initial in vitro turnover numbers obtained from BRENDA resulted in a mean relative error of 0.68 from the cross-validation procedure, yielding a correction of on average 213 turnover numbers (Supplementary Fig. 2a). For the S. cerevisiae pcGEM, we found that a value of ${10}^{-7}$ for the parameter λ in the PRESTO objective provides the optimal trade-off between the relative error and the sum of introduced corrections (see the “Methods” section). Moreover, we observed a high overlap between the sets of proteins with corrected turnover numbers in the cross-validation (average Jaccard distance of 0.07; Supplementary Fig. 2b, c), suggesting that the integrated data from different conditions point to a specific subset of enzymes that need to be corrected to improve the performance of growth prediction.
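For readers unfamiliar with the overlap measure used here, the following minimal sketch computes the Jaccard distance between two sets of corrected enzymes. A distance of 0 means identical sets and 1 means disjoint sets, so a small average distance corresponds to high overlap. The enzyme labels are hypothetical placeholders.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of labels."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical sets of enzymes corrected in two cross-validation folds.
fold_1 = {"enz_A", "enz_B", "enz_C", "enz_D"}
fold_2 = {"enz_A", "enz_B", "enz_C", "enz_E"}
distance = jaccard_distance(fold_1, fold_2)
print(distance)
```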

Unlike PRESTO, GECKO implements a heuristic for the correction of turnover numbers that is based on the objective control coefficient calculated for each protein in a given condition (Supplementary Fig. 3)²². The control coefficient of a protein is determined by increasing the turnover number by 1000-fold and scoring the effect on the predicted specific growth rate. The proteins are then ranked in decreasing order of their control coefficients, and the turnover number of the first enzyme in the list is changed to the maximum value found in BRENDA for this enzyme across all organisms. This procedure is repeated with the remaining enzymes until the pcGEM predicts a growth rate that is at most 10% smaller than the measured specific growth rate for that condition or no additional ${k}_{{{\rm {cat}}}}$ value that strongly constrains the solution can be found (Supplementary Fig. 3). This leads to condition-specific sets of corrected ${k}_{{{\rm {cat}}}}$ with large intersections or full containment over a considered order of conditions (Supplementary Fig. 4a).
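The iterative logic of this heuristic can be sketched on a toy example. The sketch below is not the GECKO implementation: it replaces the pcGEM with a hypothetical bottleneck model in which growth equals the smallest product of turnover number and enzyme abundance, and all names (`toy_growth`, `heuristic_correction`, the enzyme labels, and the numbers) are illustrative assumptions.

```python
def toy_growth(kcat, abundance):
    """Toy stand-in for a pcGEM: growth is limited by the slowest enzyme."""
    return min(kcat[e] * abundance[e] for e in kcat)


def control_coefficients(kcat, abundance, factor=1000.0):
    """Relative growth change after raising each kcat by `factor`, one at a time."""
    base = toy_growth(kcat, abundance)
    coeffs = {}
    for e in kcat:
        perturbed = dict(kcat)
        perturbed[e] = kcat[e] * factor
        coeffs[e] = (toy_growth(perturbed, abundance) - base) / base
    return coeffs


def heuristic_correction(kcat, abundance, kcat_max, mu_measured, tolerance=0.10):
    """Raise the most limiting kcat to its database maximum until growth fits."""
    kcat = dict(kcat)
    corrected = []
    while toy_growth(kcat, abundance) < (1 - tolerance) * mu_measured:
        coeffs = control_coefficients(kcat, abundance)
        candidates = [e for e in kcat
                      if e not in corrected and coeffs[e] > 0 and kcat_max[e] > kcat[e]]
        if not candidates:
            break  # no remaining kcat strongly constrains the solution
        top = max(candidates, key=coeffs.get)
        kcat[top] = kcat_max[top]
        corrected.append(top)
    return kcat, corrected


# Hypothetical values: kcat in 1/s, abundances in arbitrary units.
kcat = {"A": 10.0, "B": 2.0, "C": 50.0}
abundance = {"A": 0.01, "B": 0.01, "C": 0.01}
kcat_max = {"A": 40.0, "B": 100.0, "C": 50.0}
new_kcat, corrected = heuristic_correction(kcat, abundance, kcat_max, mu_measured=0.3)
print(corrected, new_kcat)
```

Because the loop runs separately for each condition, the set of corrected enzymes depends on the measured growth rate of that condition, which is why the heuristic yields condition-specific corrections.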

In contrast to this procedure, PRESTO corrects at once the turnover numbers of multiple enzymes that are measured in all investigated conditions by simultaneously leveraging the data from the different conditions, considerably reducing runtime and the number of solved problems. As a result, rather than deriving condition-specific corrected ${k}_{{{\rm {cat}}}}$ values, which are difficult to use in making predictions for unseen scenarios or for building large-scale kinetic metabolic models^24,25, PRESTO results in a single set of corrected ${k}_{{{\rm {cat}}}}$ values.

We compared the performance of PRESTO with the heuristic implemented in GECKO in three modeling scenarios that consider: (i) only condition-specific total protein content, (ii) both total protein content and uptake constraints, and (iii) additional constraints from abundances of enzymes measured in all conditions (Fig. 2). For corrections of turnover number from PRESTO, we observed that the relative error spans the range from 0.15 to 0.88 in the least constrained scenario (i) (Fig. 2a) and from 0.69 to 0.98 in the most constrained scenario (iii) (Fig. 2c). In contrast, the relative error with the corrections of turnover numbers from the GECKO heuristic is in the range from 0.96 to 1.00 in scenario (iii) (Fig. 2c). In addition, in scenario (iii), the median relative error in the case of the GECKO heuristic for each condition is larger than the relative error of the PRESTO predicted specific growth rate (Fig. 2c). We observed that predictions from FBA, considering enzyme abundances, without a constraint on the total protein content, led to an average relative error of 0.70 with ${k}_{{{\rm {cat}}}}$ values corrected according to PRESTO and 0.99 with ${k}_{{{\rm {cat}}}}$ values corrected according to GECKO (Supplementary Table 1).
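As a reading aid for the reported numbers, the sketch below computes a relative error per condition and its mean. It assumes the common definition |predicted − measured| / measured; the exact definition used in the study is given in its Methods. The condition labels and growth rates are hypothetical.

```python
def relative_error(predicted, measured):
    """Relative error of a predicted growth rate (assumed definition)."""
    if measured <= 0:
        raise ValueError("measured growth rate must be positive")
    return abs(predicted - measured) / measured


# Hypothetical specific growth rates in 1/h.
measured = {"cond_A": 0.10, "cond_B": 0.20, "cond_C": 0.30}
predicted = {"cond_A": 0.03, "cond_B": 0.18, "cond_C": 0.90}

errors = {c: relative_error(predicted[c], measured[c]) for c in measured}
mean_error = sum(errors.values()) / len(errors)
print(errors, mean_error)
```

Under this assumed definition, a non-negative prediction that underestimates growth gives an error between 0 and 1, whereas an error above 1 can only arise when growth is overpredicted by more than two-fold.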

**Fig. 2: Comparison of predicted growth of *S. cerevisiae* from pcGEMs with k_cat corrections from GECKO and PRESTO.**

We also performed a sensitivity analysis by investigating a smaller value, ${10}^{-10}$, for the weighting factor λ used in the PRESTO objective. We found that when the weighting factor is ${10}^{-10}$ (at which the total correction of the initial ${k}_{{{\rm {cat}}}}$ values plateaus), the relative errors from PRESTO cross-validation can be further decreased to 0.69 considering the constraint on the total protein content, with no effects on the other findings (Supplementary Fig. 2a). We also note that the relative error lies in the range from 0.35 to 0.80 over the considered weighting factors in the range from ${10}^{-14}$ to ${10}^{-1}$. Together, these results demonstrated that ${k}_{{{\rm {cat}}}}$ values corrected according to PRESTO provide better model performance than the values obtained by the contending heuristic in the case of S. cerevisiae in the scenarios where all available data are integrated into the model constraints.

PRESTO provides precise corrections of turnover numbers

In the following, we investigated the precision of the corrected ${k}_{{{\rm {cat}}}}$ values from the application of PRESTO to data and a pcGEM of S. cerevisiae. To this end, we determined the range that the correction of the ${k}_{{{\rm {cat}}}}$ value of each enzyme can take while fixing the relative error in specific growth rate and total corrections from the optimum of PRESTO (see the “Methods” section). Moreover, we complemented this analysis by sampling corrected ${k}_{{{\rm {cat}}}}$ values that achieve the optimum of PRESTO with two values of the weighting factor λ, ${10}^{-7}$ and ${10}^{-10}$.

In the case of the corrected ${k}_{{{\rm {cat}}}}$ values for S. cerevisiae with a weighting factor of 10⁻⁷, we found that the ${k}_{{{\rm {cat}}}}$ values with the largest corrections are more precisely determined (Supplementary Fig. 5). In addition, the sampled corrections per enzyme show an average Euclidean distance to the respective mean of 4.88 s⁻¹, indicating that the corrected values are more precise than the values in BRENDA, which exhibit an average Euclidean distance of 27.54 s⁻¹ to the mean per EC number (Supplementary Fig. 6). Importantly, while ${k}_{{{\rm {cat}}}}$ values with smaller corrections showed larger variability, the 25th and 75th percentiles of the sampled corrections for 42 enzymes are concentrated around those resulting from PRESTO. Repeating the analysis with a weighting factor of 10⁻¹⁰ showed that the larger total corrections of the initial ${k}_{{{\rm {cat}}}}$ values also resulted in larger variability of the corrections over all ${k}_{{{\rm {cat}}}}$ values (Supplementary Fig. 7). Here, too, for 62 enzymes the 25th and 75th percentiles of the sampled corrections are concentrated around those resulting from PRESTO. Therefore, we concluded that the corrections from PRESTO are precise and can be used in downstream analyses.

Pathway enrichment for corrected turnover numbers

In pcGEMs generated by the GECKO toolbox⁷, turnover numbers are assigned to each of the enzymes in the GEM using a fuzzy matching algorithm. It takes into account the organism, substrate, and EC number of a BRENDA entry. When we investigated the magnitude of the turnover number correction dependent on the quality of the match between BRENDA entry and the corresponding enzyme, we found that ${k}_{{{\rm {cat}}}}$ values measured in S. cerevisiae were associated with smaller corrections than those from other organisms (Supplementary Fig. 8a).

To check which metabolic processes are more likely to require correction of in vitro ${k}_{{{\rm {cat}}}}$ values, we next conducted an enrichment analysis based on the KEGG pathway terms linked to corrected ${k}_{{{\rm {cat}}}}$ values (see the “Methods” section). The most prominent pathway in this analysis was the synthesis of secondary metabolites, particularly the synthesis of cofactors and terpenoids (Fig. 3a). However, several terms linked to central carbon metabolism, such as the tricarboxylic acid cycle and oxidative phosphorylation, were also significantly enriched. Interestingly, amino acid synthesis was the only term linked to nitrogen metabolism that came up in this analysis, although many pathways of nitrogen metabolism were among the tested terms. This analysis suggested that in vitro turnover numbers in carbon metabolism in particular need to be corrected, owing to underestimation in in vitro assays.
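To make the idea of pathway enrichment concrete, the following sketch computes a one-sided hypergeometric p-value, which is a common choice for over-representation tests. Whether the study used this exact test, and how it corrected for multiple testing, is specified in its Methods; the test choice, the function name, and the numbers below are assumptions for illustration only.

```python
from math import comb


def enrichment_p_value(n_background, n_in_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when drawing n_selected enzymes without replacement."""
    total = comb(n_background, n_selected)
    upper = min(n_in_pathway, n_selected)
    tail = sum(
        comb(n_in_pathway, i) * comb(n_background - n_in_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 enzymes in the model, 4 annotated to a pathway,
# 3 enzymes with corrected kcat values, 2 of which belong to the pathway.
p = enrichment_p_value(n_background=10, n_in_pathway=4, n_selected=3, n_overlap=2)
print(p)
```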

**Fig. 3: Comparison of enzymes with corrected k_cat values by both GECKO and PRESTO.**

Comparison of turnover number corrections from GECKO

Next, we aimed to identify the extent to which the corrected ${k}_{{{\rm {cat}}}}$ values differ between PRESTO and the GECKO approach. To this end, we determined the intersection of enzymes with ${k}_{{{\rm {cat}}}}$ values corrected manually⁷, by PRESTO, and by the GECKO heuristic. For this comparison, we considered the union of all condition-specific corrected ${k}_{{{\rm {cat}}}}$ values from the GECKO approach. With the weighting factor $\lambda={10}^{-7}$, PRESTO adapted the ${k}_{{{\rm {cat}}}}$ values of 48% of enzymes corrected by the GECKO heuristic (Fig. 3b, Supplementary Data 1). We did not find a significant Spearman correlation (${\rho }_{{\rm {S}}}=0.17$, $P=0.45$) between the log-transformed ${k}_{{{\rm {cat}}}}$ values in this intersection (Fig. 3c), owing to the different principles employed in the two procedures. To determine the pathways that comprise enzymes whose turnover numbers are corrected by both GECKO and PRESTO, we next repeated the pathway enrichment analysis for the enzymes in the overlap. Among the significant terms, as for PRESTO alone, we again found 2-oxocarboxylic acid, amino acid, and secondary metabolism to be enriched (Fig. 3a, Supplementary Fig. 9). However, the more specific pathway terms were associated with pathways that are part of carbohydrate metabolism and aromatic amino acid metabolism corrected by both approaches (Supplementary Fig. 9, Supplementary Data 2). In addition, the intersection between enzymes with manually corrected values and those corrected by the GECKO heuristic is higher than with PRESTO. This is expected since the manual curation is partly aimed at correcting the most constraining turnover numbers⁷.
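The rank correlation used for this comparison can be sketched with the standard library alone. The helper names and the example turnover numbers are hypothetical; the sketch computes Spearman's coefficient as the Pearson correlation of average ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical corrected kcat values (1/s) for four shared enzymes.
kcat_method_1 = [1.2, 35.0, 410.0, 8.0]
kcat_method_2 = [3.0, 20.0, 900.0, 50.0]
rho = spearman(kcat_method_1, kcat_method_2)
rho_log = spearman([math.log10(v) for v in kcat_method_1],
                   [math.log10(v) for v in kcat_method_2])
print(rho, rho_log)
```

Because the logarithm is monotone, it leaves the ranks and hence the Spearman coefficient unchanged; the transformation matters for Pearson correlations and for plotting values that span several orders of magnitude.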

We also compared the ${k}_{{{\rm {cat}}}}$ values adjusted by GECKO against estimates of in vivo ${k}_{{{\rm {cat}}}}$ values obtained by parsimonious FBA (pFBA) using the same proteomics data¹⁵ (Supplementary Fig. 10a, b). We confirmed the low correspondence (${\rho }_{{\rm {S}}}=0.23$) between the ${k}_{{{\rm {cat}}}}$ values obtained from BRENDA, included in the GECKO model without manual modifications, and the in vivo ${k}_{{{\rm {cat}}}}$ estimates. As expected, the correspondence of the estimated in vivo ${k}_{{{\rm {cat}}}}$ values to the turnover numbers corrected based on PRESTO was higher (${\rho }_{{\rm {S}}}=0.34$). To investigate how these estimates perform as model parameters, we also generated a pcGEM in which BRENDA values were substituted by in vivo ${k}_{{{\rm {cat}}}}$ values from pFBA¹⁵, whenever available. In scenarios without enzyme abundance values, this model performed worse than that including the ${k}_{{{\rm {cat}}}}$ values corrected by PRESTO as well as the model combining the maximum of all condition-specific GECKO corrections (Supplementary Fig. 11a, b). In the enzyme abundance-constrained scenario, the model with in vivo turnover numbers estimated by pFBA performed slightly better than GECKO but still only achieved a minimum relative error of 0.93, which is larger than 0.69 resulting from PRESTO (Supplementary Fig. 11c). These results demonstrated the value of PRESTO in combining the genome-scale coverage of BRENDA with in vivo proteomics chemostat measurements to obtain less biased estimates of ${k}_{{{\rm {cat}}}}$ values.

PRESTO with protein-constrained model of E. coli metabolism

To demonstrate the applicability of PRESTO across species, we deployed it with a pcGEM of E. coli (eciML1515)^22,26. To this end, we used a large dataset comprising 31 different growth conditions^12,27,28,29. Due to the lack of data on nutrient exchange rates, the same GAM value (i.e., 75.55 ${\rm {mmol}}\,{\rm {gDW}}^{-1}\,{\rm {h}}^{-1}$) was used across all conditions. Similarly, we used the same value for total protein content since condition-specific measurements were not available (see the “Methods” section).

By applying three-fold cross-validation, we found the optimal value for the λ parameter to be ${10}^{-5}$ (Supplementary Fig. 12a). This value was associated with an average relative error of 1.95 (average over all λ: 3.32) and 73 corrected turnover numbers, while on average 156 ${k}_{{{\rm {cat}}}}$ values were corrected across all explored values for λ. On average, the Jaccard distance between cross-validation folds was 0.13 (Supplementary Fig. 12b), while the average Jaccard distance between unique sets of enzymes with corrected turnover numbers for each λ parameter was three-fold larger (0.4, Supplementary Fig. 12c). Thus, the corrected ${k}_{{{\rm {cat}}}}$ values among cross-validation folds for each λ are more similar (maximum Jaccard distance of 0.29). Moreover, the union of the set of enzymes with corrected ${k}_{{{\rm {cat}}}}$ values can remain similar over a range of chosen λ parameters up to four orders of magnitude (Supplementary Fig. 12c), demonstrating the robustness of the method.

The performance of PRESTO was assessed and compared to GECKO using scenarios (i) and (iii) since no condition-specific uptake rates were available. With default uptake rates, the relative error for predicted growth ranged between 0.01 and 8.56 in the less constrained scenario (i) (Fig. 4a). Further, we obtained relative errors between 0.01 and 0.88 for the more constrained scenario (iii), when using the ${k}_{{{\rm {cat}}}}$ values corrected by PRESTO (Fig. 4b). In contrast, when using the ${k}_{{{\rm {cat}}}}$ values from the GECKO approach, the relative error was in the range between 0.01 and 4.89 for scenario (i) and between 0.89 and 0.99 for scenario (iii). In this scenario, too, we observed that the relative error using ${k}_{{{\rm {cat}}}}$ values corrected by GECKO was consistently larger than the relative error resulting from the single set of corrected ${k}_{{{\rm {cat}}}}$ values obtained by PRESTO (Fig. 4b).

**Fig. 4: Comparison of predicted growth of *E. coli* from pcGEMs with k_cat corrections from GECKO and PRESTO.**

Since we observed high relative errors in the less-constrained scenario (Fig. 4a), we added a second step to PRESTO, which introduces negative corrections that lead to the same relative errors with consideration of proteomics data. This is an optional step that a user can choose to perform in addition to the positive corrections (i.e. relaxation of turnover numbers), introduced in the first step (Supplementary Method 1). As a result, we found 170 negative corrections that reduced the relative error in scenario (i) (Supplementary Fig. 13).

We do not perform a simultaneous search for positive and negative corrections because negative corrections can only reduce the relative error when the current ${k}_{{{\rm {cat}}}}$ values lead to an overprediction of growth, which is not the case when considering proteomics data. Therefore, no negative corrections are found if the absolute value of introduced positive and negative corrections are to be considered in a single step.

Importantly, the aim of PRESTO is to correct turnover numbers, which represent upper limits on the catalytic efficiency of enzymes. Therefore, we can assume that in vitro turnover numbers that lead to underprediction of specific growth rates when paired with protein abundance data are too low. However, an overprediction of specific growth rates in the same scenario can be caused by thermodynamic effects, temperature effects, or in-vivo-specific effects. Thus, a reduction of in vitro turnover numbers results in average apparent catalytic rates for the considered conditions, rather than corrected turnover numbers.

Considering the models with positive ${k}_{{{\rm {cat}}}}$ corrections, the sum of corrections reached a plateau at ${10}^{-11}$ for the weighting factor $\lambda$ in the PRESTO objective. We found that the relative cross-validation error at this value was 5.26, which is 2.7-fold larger than the relative error obtained using the optimal λ. Hence, allowing for more and larger corrections in PRESTO leads to a decrease in the overall relative error within the PRESTO program at the cost of highly biased parameters. The predictions with the highly biased parameters are worse in the test conditions and result in a larger specific growth rate when no enzyme abundance constraints are enforced. This observation is in line with the small number of corrections introduced by the GECKO approach, where only the pool constraint is considered. We conclude that the prediction performance of the eciML1515 model was improved by using turnover numbers corrected by PRESTO only when enzyme abundances are integrated.

To assess the precision of the introduced ${k}_{{{\rm {cat}}}}$ corrections, we performed variability analysis and sampling (see the “Methods” section) of the introduced corrections to the initial ${k}_{{{\rm {cat}}}}$ values for two values of the weighting factor $\lambda$, namely ${10}^{-5}$ and ${10}^{-11}$. We observed that the 25th and 75th percentiles enclose a narrow interval around the values resulting from PRESTO (Supplementary Fig. 14) and are thus not evenly distributed across the respective interval determined by the variability analysis. We further noted that here, the predictions of smaller δ are generally more precise than the large corrections ($\delta \ge {p}_{50}$), which span ~2 orders of magnitude (small δ (<p₅₀): 1.83, Supplementary Fig. 14). However, we also observed that the precision decreased when more corrections were allowed in PRESTO. This further justified our choice for the optimal parameter λ, which results in a lower number of 73 corrections compared to 204 at $\lambda={10}^{-11}$; moreover, this value guarantees more precise estimates (Supplementary Fig. 17). In conclusion, the application of PRESTO is not limited to a single species but presents a versatile tool for the correction of turnover numbers across species.

In contrast to the observations made in S. cerevisiae, we found that a model parameterized with in vivo turnover numbers estimated by pFBA¹² outperformed both PRESTO and GECKO in the modeling scenario where no enzyme abundance constraints are taken into account (Supplementary Fig. 11d). This is due to the fact that pFBA, in contrast to PRESTO and GECKO, can generate estimates lower than the in vitro ${k}_{{{\rm {cat}}}}$ values, in turn leading to more accurate predictions. However, in the scenario with enzyme abundance constraints, PRESTO predicts specific growth rates closer to the experimental observation in 87% of the conditions (Supplementary Fig. 11e). Thus, in this scenario the integration of information from different modeling conditions achieved in PRESTO serves to obtain ${k}_{{{\rm {cat}}}}$ values that perform better than those from the pFBA approach applied in ref. 12.

Finally, we compared the flux distributions and protein abundances predicted by models that are parameterized with ${k}_{{{\rm {cat}}}}$ values that were corrected using either GECKO or PRESTO. Overall, the feasible ranges (${v}^{\max }-{v}^{\min }$) for both approaches resulted in Pearson correlation coefficients of 0.985 across all conditions (Supplementary Method 2). The difference between both flux distributions is manifested in a smaller interquartile range in feasible ranges with PRESTO (Supplementary Fig. 15). More specifically, there are fewer reactions with highly constrained flux after introducing ${k}_{{{\rm {cat}}}}$ corrections with PRESTO compared to GECKO. Moreover, we used the models that were corrected using GECKO and PRESTO to predict protein abundances (Supplementary Method 3), which were then compared to the measured proteomics data using Spearman correlation. Since PRESTO only considers abundances of proteins that were measured across all considered conditions, we computed the correlations for (1) the set of proteins that are measured across all conditions and (2) all protein abundances per condition. In the first scenario, the correlation after correction with PRESTO was higher than the median correlation with GECKO in 70% of conditions, and PRESTO outperformed all GECKO pcGEMs in 30% of conditions (Supplementary Fig. 16a). When all measured proteins were considered, PRESTO only performed better in 54% of conditions (Supplementary Fig. 16b). Similar to the predicted specific growth rate, we also observed better performance for PRESTO models that were subjected to an additional ${k}_{{{\rm {cat}}}}$ down-correction step (Supplementary Method 1). However, we note that the reduced ${k}_{{{\rm {cat}}}}$ values cannot strictly be considered condition-independent ${k}_{{{\rm {cat}}}}$ values because there may exist physiological states where these enzymes achieve the efficiency given by the original ${k}_{{{\rm {cat}}}}$ value (see the “Discussion” section). Since PRESTO considers protein abundances for the correction, which is not the case for GECKO, we expected to find increased prediction performance with PRESTO compared to GECKO; however, we still observe low overall predictability of protein abundances using the resulting models. Recently, a more sophisticated protein abundance prediction approach using pcGEMs was introduced that can be used to predict more reliable values and might further be improved by considering corrections introduced by PRESTO³⁰.

Interestingly, in contrast to S. cerevisiae, we did not observe larger corrections by PRESTO for organism-unspecific in vitro ${k}_{{{\rm {cat}}}}$ values (Supplementary Fig. 8b). For the condition-specific sets of ${k}_{{{\rm {cat}}}}$ corrections introduced by GECKO, we observed that all smaller sets were proper subsets of larger sets (Supplementary Fig. 4b). The low number of corrections introduced by GECKO leads to an overlap of only three (75%) enzymes whose ${k}_{{{\rm {cat}}}}$ values were also corrected by PRESTO (Supplementary Data 3, Supplementary Fig. 18a). These three enzymes catalyze reactions in three distinct metabolic pathways: phosphoribosylformylglycinamidine synthase acts in the synthesis of purines, while serine acetyltransferase and NADP-dependent ketol-acid reductoisomerase are involved in the synthesis of sulfur amino acids and hydrophobic amino acids, respectively (Supplementary Data 4). The pathway enrichment analysis for all PRESTO corrections at $\lambda={10}^{-5}$ indeed also identified amino acid and secondary metabolite synthesis as significantly enriched terms among the enzymes with corrected turnover numbers (Supplementary Fig. 18b). These results argue for a systematic underestimation of in vivo turnover numbers in these pathways by in vitro experiments, irrespective of the investigated organism. However, the lower-order KEGG pathway terms enriched in E. coli do not overlap with the ones found in S. cerevisiae. Here, fatty acid metabolism and the synthesis of hydrophobic amino acids are among the pathways requiring the correction of turnover numbers.

Robustness of turnover number corrections

All of the approaches for estimation of in vivo turnover numbers rely on predicted (or estimated) fluxes and protein abundances from multiple conditions^12,14,15, but have not investigated the robustness of the estimates to the number of conditions used. Therefore, we next investigated the difference in the sets of enzymes with corrected turnover numbers and the concordance of their corrections when ten randomly sampled subcollections of M experimental conditions (M = 3, 5, 10, 15) were used instead of all experiments. The differences and concordance were quantified with respect to the estimates obtained by considering data from all available experiments using the Jaccard distance and the Pearson correlation coefficient, respectively. In the case of S. cerevisiae, we found that the smallest Jaccard distance over 200 scenarios was 0.36, while for E. coli this was 0.41 (Supplementary Figs. 19 and 20). In addition, the Pearson correlation coefficient between the (log-transformed) corrected turnover numbers with consideration of all versus a subcollection of M experiments in S. cerevisiae ranged from 0.99 to 1.00 for M = 15 and from 0.11 to 1.00 for M = 3 (Supplementary Fig. 19). Repeating the analysis in the case of E. coli, we found that the Pearson correlation coefficient ranged from 0.15 to 1.00 for M = 15 and from 0.14 to 1.00 for M = 3 (Supplementary Fig. 20). This is in line with the expectation that the corrections stabilize with an increasing number of experiments. Altogether, these findings pointed out the robustness of turnover number corrections derived from PRESTO with respect to the number of available experiments.

Discussion

Characterization of enzyme parameters that can inform models of reaction rates is key to expanding and further propelling the usage of metabolic models in diverse biotechnological applications. While the generation of pcGEMs has facilitated the integration of more biophysically relevant constraints, it necessitates access to estimates of turnover numbers as key enzyme parameters. We assessed the bias in the available in vitro and in vivo estimates of turnover numbers as the discrepancy between measured and predicted growth in the ultimate validation scenario when they are combined with constraints from protein abundances. We use the modeling scenario that considers measured protein abundances as the ultimate validation scenario not only for the prediction of metabolic fluxes but also for the prediction of specific growth rates, as it contains considerably more biochemically relevant constraints. Indeed, for this scenario, we showed that condition-specific growth rates cannot be reliably predicted with pcGEMs of S. cerevisiae and E. coli when available in vitro (Figs. 2c and 4b) and in vivo estimates (Supplementary Fig. 11c, e) of turnover numbers are used. GECKO resolves this issue by flexibilizing measured protein abundances without considering physiological information during the procedure. To overcome this limitation, we developed PRESTO, which corrects turnover numbers and facilitates the integration of enzyme abundance constraints.

In contrast to PRESTO, GECKO uses measured total protein content from a single condition to achieve specific growth rates in the process of correcting the turnover numbers. As a result, the corrected turnover numbers vary between different experiments. Like all existing approaches for the estimation of in vivo turnover numbers based on GEMs, we integrated protein abundance data directly to correct turnover numbers. Following this strategy in PRESTO is further justified by the observation that the turnover numbers included in pcGEMs are often neither from the same enzyme (i.e., EC number), substrate, nor organism. While in vivo turnover number estimates can be adjusted by considering recently proposed Bayesian statistical learning¹⁸, this approach has not considered protein abundance information from proteomics measurements.

We employed PRESTO with the largest data set of these measurements available to date for S. cerevisiae¹⁵ and E. coli¹². Through a series of comparative analyses, we demonstrated that the corrections of turnover numbers from PRESTO ultimately increase the prediction accuracy of condition-specific growth for the two organisms when enzyme abundance data are integrated into the corresponding pcGEMs.

Since PRESTO generates a condition-independent ${k}_{{{\rm {cat}}}}$ set, it is bound to correct parameters that lead to the underprediction of biological fluxes. The same reasoning is applied when obtaining in vivo ${k}_{{{\rm {cat}}}}$ estimates from pFBA by taking the maximum of apparent catalytic rates over all conditions^12,15. Nevertheless, we included an optional step that additionally allows for the reduction of in vitro ${k}_{{{\rm {cat}}}}$ values, which can be considered average apparent catalytic rate estimates and result in better performance in the E. coli experiments (Supplementary Figs. 13 and 16). Moreover, we showed that in vivo turnover number proxies, obtained by the ranking of condition-specific estimates that use proteomics and fluxomics data, are more highly (but modestly) correlated to estimates from PRESTO than to in vitro turnover numbers. Owing to the constraint-based formulation of PRESTO, we also determined the precision of the correction of turnover numbers. Previous studies have shown that even for the well-studied model organism Saccharomyces cerevisiae, only 52% of enzyme turnover numbers in the pcGEM can be obtained from organism-specific in vitro measurements²². Using organism-unspecific ${k}_{{{\rm {cat}}}}$ values for parameterization and correction of pcGEMs, as done in the GECKO pipeline, assumes that enzyme kinetic properties are comparable within one EC number class^31,32. However, we did not identify clear differences between EC classes, down to the second digit, when considering the distribution of ${k}_{{{\rm {cat}}}}$ similarities within EC classes (Supplementary Fig. 21). Indeed, it has been reported that EC class plays only a minor role in the prediction of turnover numbers¹⁹ and that they show stronger similarity with concordant GO categories³³. Interestingly, our findings show that the turnover number corrections obtained from PRESTO are more precise than EC class-based corrections (Supplementary Fig. 6).
Together, these findings demonstrate that PRESTO can be readily used to decrease the bias of turnover numbers. This paves the way for employing the outcome of PRESTO and future extensions toward effectively predicting the kcatome from available protein sequences.

Methods

Experimental data

For S. cerevisiae, we made use of a dataset gathered by Chen and Nielsen¹⁵ from four different studies^34,35,36,37, which included protein abundance data ($\mathrm{mmol\,gDW^{-1}}$) as well as measured growth or dilution rates ($\mathrm{h^{-1}}$) and nutrient exchange fluxes ($\mathrm{mmol\,gDW^{-1}\,h^{-1}}$). Exchange fluxes missing in certain conditions were set to $1000\,\mathrm{mmol\,gDW^{-1}\,h^{-1}}$ if the nutrient was present in the culture medium used. We further extended this data set by total protein content measurements ($\mathrm{g\,gDW^{-1}}$) from the original studies. For subsequent analyses, we used the maximum abundance of each protein over all replicates per experimental condition. Similarly, we used the average value for specific growth rates and nutrient exchange rates. Since no measurement of total protein content was available for the two conditions evaluated in the Di Bartolomeo study³⁷, we used the maximum protein content measured across the remaining conditions for these conditions (i.e., $0.67\,\mathrm{g\,gDW^{-1}}$). Moreover, we excluded three temperature stress conditions (i.e., Lahtvee2017_Temp33, Lahtvee2017_Temp36, Lahtvee2017_Temp38) from the analysis since the temperature can have a large effect on the catalytic activity of an enzyme. Gene names in the proteomics dataset were translated to UniProt identifiers using the batch retrieval service of the UniProt REST API³⁸.
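The replicate handling described above (maximum abundance per protein, average growth rate) can be illustrated as follows. This is a minimal sketch under stated assumptions: the data layout (one dict per replicate), the function name, and the identifiers `ENZ_A` and `ENZ_B` are hypothetical and do not come from the original scripts.

```python
from statistics import mean

def aggregate_condition(protein_replicates, growth_replicates):
    """Aggregate replicates of one experimental condition.

    protein_replicates: list of dicts (protein ID -> abundance in mmol/gDW),
                        one dict per replicate; proteins may be missing in some.
    growth_replicates:  list of specific growth rates (1/h).
    Returns the per-protein maximum abundance and the mean growth rate.
    """
    proteins = set().union(*protein_replicates)
    max_abundance = {
        p: max(rep[p] for rep in protein_replicates if p in rep)
        for p in proteins
    }
    return max_abundance, mean(growth_replicates)

# Two hypothetical replicates of one condition.
replicates = [{"ENZ_A": 1.2e-5, "ENZ_B": 3.0e-5}, {"ENZ_A": 1.5e-5}]
abundance, mu = aggregate_condition(replicates, [0.10, 0.12])
```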

For E. coli, we used a dataset comprising 31 experimental conditions, which was gathered by Davidi and colleagues and augmented by Xu et al.^12,14 from three publications^27,28,29. Here, too, we used the maximum protein abundance over all replicates (in $\mathrm{mmol\,gDW^{-1}}$). Due to the absence of total protein content measurements in two of the original studies, we used the maximum protein content measured in the Valgepea study (i.e., $0.61\,\mathrm{g\,gDW^{-1}}$) for all conditions. Since precise data on nutrient uptake rates were only given for a few conditions, we assigned a default upper bound of $1000\,\mathrm{mmol\,gDW^{-1}\,h^{-1}}$ to all nutrients contained in the minimal medium (Supplementary Table 2) with additional carbon sources as specified. Gene identifiers were translated to UniProt identifiers in the same way as for S. cerevisiae.

Model preparation

The proposed approach aims at parsimonious correction of turnover values in genome-scale enzyme-constrained metabolic models using measured protein abundances. Therefore, it is important to consider the differential association between enzymes and reactions, i.e., isozymes, enzyme complexes, and promiscuous enzymes. We decided to use the GECKO formalism⁷, which deals with these problems elegantly by directly encoding the required information in the stoichiometric matrix. The genome-scale metabolic models for S. cerevisiae (YeastGEM v.8.5.0) and E. coli (iML1515) were obtained from the yeast-GEM and ecModels GitHub repositories, respectively^22,39 [https://github.com/SysBioChalmers; accessed on 22.08.2021]. For subsequent steps, functions of the COBRA v3.0 toolbox⁴⁰ and GECKO2.0 toolbox²² were employed, of which several functions were adapted for our purposes.

To arrive at raw protein-constrained models for both organisms, the GECKO2.0 model enhancement pipeline was adapted to allow the ${k}_{{{\rm {cat}}}}$ correction procedure to be omitted. Moreover, any manual corrections of turnover numbers were excluded from model generation. In the process of adapting the raw pcGEM to the respective experimental conditions for both organisms, the GAM value per condition was fitted using the scaleBioMass function of GECKO2.0, based solely on the condition-specific nutrient exchange rates, and returning the minimum ($9\,\mathrm{mmol\,gDW^{-1}\,h^{-1}}$) or maximum ($161\,\mathrm{mmol\,gDW^{-1}\,h^{-1}}$) interval boundary if reached (only S. cerevisiae). Furthermore, we omitted enzyme abundances that were not measured across all experiments, as the approach proposed here is only applicable to enzymes in the set with measured abundances ($M$).

PRESTO approach

In the design of PRESTO, we modified the enzyme mass-balance constraints of the augmented stoichiometric matrix, created by GECKO, from

$$-\frac{1}{{k}_{{{\rm {cat}}}}^{{ij}}}{v}_{j}+{e}_{i}=0$$

(1)

to inequality constraints that use the measured protein abundance directly. The variable $e$ denotes the predicted protein abundance in the pcGEM, while $E$ represents the vector of measured enzyme abundances. Further, we assume a single turnover number per enzyme $i$ over all catalyzed reactions, namely the minimum over the reactions $j$ it catalyzes (${k}_{{{\rm {cat}}},i}^{\min }=\mathop{\min }\limits_{j}{k}_{{{\rm {cat}}}}^{ij}$):

$$\forall r\in R\mathop{\sum }_{i\in {{\rm {GPR}}}\left(r\right)}{v}_{r}\le {k}_{{{\rm {ca}{t}}}_{i}}^{\min }\cdot {e}_{i}.$$

(2)

GPR stands for gene–protein-reaction rule that associates reactions ($R$) with underlying genes and proteins. The variable $E$ denotes the measured protein abundance in ${{\rm {mmol}}\; {\rm {gD}{W}}}^{-1}$. We justify making the assumption for Eq. (2) based on our observation that most enzymes in the S. cerevisiae model are associated with no more than four reactions (Supplementary Fig. 22a, c). Further, the vast majority of enzymes are assigned a single unique turnover number even though they catalyze multiple reactions (Supplementary Fig. 22b, d).

We then introduced a correction factor δ, which is added to each ${k}_{{{\rm {cat}}}}$ if the protein abundances for the underlying enzyme were available:

$$\forall r\in R\mathop{\sum }_{i\in {{\rm {GPR}}}\left(r\right)}{v}_{r}\le \left({k}_{{{\rm {ca}{t}}}_{i}}^{\min }+{\delta }_{i}\right)\left[{E}_{i}\right].$$

(3)
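To make the role of δ in Eq. (3) concrete, consider the simplest case of one reaction catalyzed by one enzyme. Rearranging $v_r \le (k_{\rm cat}^{\min}+\delta)\,E$ gives $\delta \ge v_r/E - k_{\rm cat}^{\min}$, and a single condition-independent δ has to satisfy this in every condition. The sketch below assumes fixed, known fluxes (in PRESTO the fluxes are optimization variables), consistent units (fluxes in mmol/gDW/h, abundances in mmol/gDW, turnover numbers in 1/h), and made-up numbers.

```python
def minimal_delta(kcat, fluxes, abundances):
    """Smallest non-negative delta such that v_j <= (kcat + delta) * E_j
    holds in every condition j, for one reaction with one enzyme.

    kcat:       initial turnover number (1/h)
    fluxes:     flux through the reaction per condition (mmol/gDW/h)
    abundances: measured enzyme abundance per condition (mmol/gDW)
    """
    # Apparent catalytic rate required in the most demanding condition.
    required = max(v / e for v, e in zip(fluxes, abundances))
    return max(0.0, required - kcat)

# Two hypothetical conditions: the required rates are 200 and 250 per hour.
delta_needed = minimal_delta(100.0, [100.0, 62.5], [0.5, 0.25])
# A turnover number that is already large enough needs no correction.
delta_none = minimal_delta(300.0, [100.0, 62.5], [0.5, 0.25])
```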

To find a biologically relevant minimal set of adaptations with respect to the sum of δ, we minimized the weighted sum of the average absolute relative errors, ω, between measured (${\mu }^{\exp }$) and predicted specific growth rates (${v}_{{{\rm {bio}}}}$) over all experimental conditions $C$, and the average δ:

$$\mathop{{{\min }}}\limits_{v,\delta,\omega }\frac{1}{\left|C\right|}\mathop{\sum}\limits_{j\in C}{\omega }_{j}+\frac{\lambda }{\left|M\right|}\mathop{\sum}\limits_{i\in M}{\delta }_{i}.$$

(4)

Finally, the linear programming formulation of the ${k}_{{{\rm {cat}}}}$ correction in PRESTO is the following:

$$\mathop{{{\min }}}\limits_{v,\delta,\omega }\frac{1}{\left|C\right|}\mathop{\sum}\limits_{j\in C}{\omega }_{j}+\frac{\lambda }{\left|M\right|}\mathop{\sum}\limits_{i\in M}{\delta }_{i}$$

subject to

$${N}{{v}}^{{\,j}}=0,\,\forall j\in C$$

(5)

$$\mathop{\sum}\limits_{i\,\in {{{{{\rm{GPR}}}}}}\left(r\right)}{v}_{r}^{\,j}\le \left({k}_{{{\rm {cat}}},i}^{\min }+{\delta }_{i}\right)\left[{E}_{i}^{j}\right],\,\forall r\in R,\,i\in M,\,\forall j\in C$$

$${{v}}_{{{\min }}}^{{\;j}}\le {{v}}^{{\;j}}\le {{v}}_{{{\max }}}^{{\;j}};\forall j\in C$$

(6)

$${\delta }_{i}\le \left(\varepsilon -1\right)\cdot {k}_{{{\rm {cat}}},i}^{\min },\,\forall i\in M$$

(7)

$${k}_{{{\rm {cat}}},i}^{\min }+{\delta }_{i}\le {K}^{\max },\,\forall i\in M$$

(8)

$${v}_{{{\rm {bio}}}}^{\;j}\cdot {\omega }_{j}\ge {\mu }_{\exp }^{\;j}-{v}_{{{\rm {bio}}}}^{j},\,\forall j\in C$$

(9)

$${{v}_{{{\rm {bio}}}}^{\;j}\cdot \omega }_{j}\ge {v}_{{{\rm {bio}}}}^{\;j}-{\mu }_{\exp }^{\;j},\,\forall j\in C$$

(10)

$$\begin{array}{cc}{{{{{\rm{\omega }}}}}}\le {{{{{\rm{\theta }}}}}},& {{{{{\rm{\delta }}}}}}\ge 0.\end{array}$$

The value for ω was bounded from above by a value θ, which was set to 0.6. The constraints in Eq. (5) enforce metabolic steady state, and those in Eq. (6) represent the lower and upper bounds on the flux through each reaction in each condition. The constraints in Eqs. (7) and (8) impose an upper bound on δ given by the smaller of two limits: the one implied by the allowed fold change in ${k}_{{{\rm {cat}}}}$ values, ε, and the one implied by a cut-off value ${K}^{\max }$, which denotes the maximum allowed ${k}_{{{\rm {cat}}}}$ value. The value for ε was set to ${10}^{5}$ since lower values did not yield solutions, and ${K}^{\max }$ was set to 57,500,000 s⁻¹ (EC 5.3.1.1, Pyrococcus furiosus⁴¹). Equations (9) and (10) ensure that ω is equal to $\frac{\left|{\mu }_{j}^{\exp }-{v}_{{{\rm {bio}}},j}\right|}{{v}_{{{\rm {bio}}},j}}$.
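As a small worked example of these bounds, the snippet below evaluates the upper limit on δ implied by Eqs. (7) and (8) and the relative error that Eqs. (9) and (10) encode, using the parameter values stated in the text (ε = 10⁵, K^max = 57,500,000 s⁻¹, θ = 0.6). The function names and example inputs are illustrative assumptions; turnover numbers are taken in s⁻¹ here.

```python
EPSILON = 1e5          # maximum allowed fold change of a k_cat value
K_MAX = 57_500_000.0   # maximum allowed k_cat (1/s)
THETA = 0.6            # upper bound on the relative growth error omega

def delta_upper_bound(kcat_min, epsilon=EPSILON, k_max=K_MAX):
    """Largest delta allowed by Eq. (7) and Eq. (8) for one enzyme."""
    fold_change_limit = (epsilon - 1.0) * kcat_min   # Eq. (7)
    cutoff_limit = k_max - kcat_min                  # Eq. (8)
    return min(fold_change_limit, cutoff_limit)

def relative_error(mu_exp, v_bio):
    """|mu_exp - v_bio| / v_bio, the quantity omega is tied to."""
    return abs(mu_exp - v_bio) / v_bio

# For a slow enzyme the fold-change limit is binding ...
bound_slow = delta_upper_bound(10.0)      # (1e5 - 1) * 10
# ... for a fast enzyme the cut-off K_MAX is binding.
bound_fast = delta_upper_bound(1000.0)    # 57,500,000 - 1000
omega = relative_error(mu_exp=0.4, v_bio=0.5)
omega_ok = omega <= THETA
```

Note that the relative error is expressed with respect to the predicted growth rate, as in the expression above, not the measured one.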

The parameter λ controls the trade-off between both minimization objectives (see Eq. (4)). As λ is unknown and may also be condition- and model-specific, it was fitted using a 3-fold cross-validation scheme, which was repeated for 10 iterations. To this end, we scanned a log-scale interval between ${10}^{-14}$ and ${10}^{-1}$. In each iteration, we performed ${k}_{{{\rm {cat}}}}$ corrections on two folds of condition-specific models and validated the obtained corrections on the remaining fold of condition-specific models. The validation was done by predicting growth only with a constraint on total protein content, without constraints from measured protein abundances. This was done to counteract overprediction in the scenario without constraints from proteomics data. The relative errors (ω) and the sum of δ (i.e., Δ) were then used to calculate the scores ${s}_{\lambda }$, which helped us choose the optimal value for λ:

$${s}_{\lambda }=\frac{1}{10}\mathop{\sum }\limits_{\tau=1}^{10}\frac{{\omega }_{\lambda,\tau }-{\omega }_{\lambda,\tau }^{\min }}{{\omega }_{\lambda,\tau }^{\max }-{\omega }_{\lambda,\tau }^{\min }}\cdot \frac{{{{\log }}}_{10}\frac{{\Delta }_{\lambda,\tau }}{{\Delta }_{\lambda,\tau }^{\min }}}{{{{\log }}}_{10}\frac{{\Delta }_{\lambda,\tau }^{\max }}{{\Delta }_{\lambda,\tau }^{\min }}}.$$

(11)

The score can be described as the average product of min-max-scaled ω and Δ across the 10 cross-validation iterations per explored λ. The optimal value was then determined by finding the first sign change in the second numerical gradient over ${s}_{\lambda }$, starting from the maximum value for λ. In addition to the optimal λ, we also compared our results to a second λ, where the sum of δ reached a plateau ($\lambda={10}^{-10}$ for S. cerevisiae and $\lambda={10}^{-11}$ for E. coli, Supplementary Figs. 2a and 12a). The presented approach and analysis scripts were implemented using MATLAB⁴².
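The score in Eq. (11) can be sketched as below. Two points are assumptions of this illustration rather than statements from the text: the minima and maxima are taken over the scanned λ values within each cross-validation iteration τ, and ω and Δ are assumed to vary across λ (otherwise the scaling would divide by zero). The selection step based on the second numerical gradient is not reproduced here.

```python
import math

def lambda_scores(omega, delta_sum):
    """Score s_lambda for every scanned lambda value.

    omega[t][k]:     relative error in iteration t for the k-th lambda
    delta_sum[t][k]: sum of corrections (Delta) in iteration t for the k-th lambda
    Returns a list with one score per lambda, averaged over iterations.
    """
    n_iter = len(omega)
    n_lambda = len(omega[0])
    scores = [0.0] * n_lambda
    for w, d in zip(omega, delta_sum):
        w_min, w_max = min(w), max(w)
        d_min, d_max = min(d), max(d)
        for k in range(n_lambda):
            # Min-max scaling of omega, log-scale min-max scaling of Delta.
            w_scaled = (w[k] - w_min) / (w_max - w_min)
            d_scaled = math.log10(d[k] / d_min) / math.log10(d_max / d_min)
            scores[k] += w_scaled * d_scaled / n_iter
    return scores

# Hypothetical results for three lambda values and two iterations.
omega = [[0.1, 0.2, 0.5], [0.1, 0.3, 0.5]]
delta_sum = [[1000.0, 100.0, 10.0], [1000.0, 100.0, 10.0]]
scores = lambda_scores(omega, delta_sum)
```

In this toy input, small errors coincide with large total corrections, so the product vanishes at both ends of the scan and is positive only for the intermediate λ.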

Variability analysis for δ

While PRESTO considers multiple experimental conditions to find a set of universal corrections for ${k}_{{{\rm {cat}}}}$ values, it does not provide an exhaustive view over all possible solutions to this problem. To assess the precision of the corrections, we first performed a variability analysis for δ to find the minimum and maximum possible values. To guarantee that a solution of equal quality is found with respect to the previously determined sum of δ and the relative errors to experimentally measured specific growth rates (i.e., ${\omega }^{{{\rm {opt}}}}$), corresponding constraints were added to arrive at the following linear programming problem:

$$\mathop{{{\min }}}\limits_{v,\delta,\,\omega }/\mathop{{{\max }}}\limits_{v,\delta,\omega }{\delta }_{i}$$

s.t.

$${N}{{v}}^{{\;j}}=0,\,\forall j\in C$$

$$\mathop{\sum}\limits_{i\in {{{{{\rm{GPR}}}}}}\left(r\right)}{v}_{r}^{\;j}\le \left({k}_{{{\rm {cat}}},i}^{\min }+{\delta }_{i}\right)\left[{E}_{i}^{\;j}\right],\,\forall r\in R,\,i\in M,\,\forall j\in C$$

$${{v}}_{{{\min }}}^{{\;j}}\le {{v}}^{{\;j}}\le {{v}}_{{{\max }}}^{{\;j}},\,\forall j\in C$$

$${\delta }_{i}\le \left(\varepsilon -1\right)\cdot {k}_{{{\rm {cat}}},i}^{\min },\,\forall i\in M$$

$${k}_{{{\rm {cat}}},i}^{\min }+{\delta }_{i}\le {K}^{\max },\,\forall i\in M$$

$${v}_{{{\rm {bio}}}}^{\;j}\cdot {\omega }_{j}\ge {\mu }_{\exp }^{\;j}-{v}_{{{\rm {bio}}}}^{\;j},\,\forall j\in C$$

$${{v}_{{{\rm {bio}}}}^{\;j}\cdot \omega }_{j}\ge {v}_{{{\rm {bio}}}}^{\;j}-{\mu }_{\exp }^{\;j},\,\forall j\in C$$

$$0.99\cdot {\omega }_{j}^{{{\rm {opt}}}}\le {\omega }_{j}\le 1.01\cdot {\omega }_{j}^{{{\rm {opt}}}},\,\forall j\in C$$

(12)

$${\Delta }^{{{\rm {opt}}}}-{10}^{-3}\le \mathop{\sum}\limits_{i\in M}{\delta }_{i}\le {\Delta }^{{{\rm {opt}}}}+{10}^{-3}$$

(13)

$$\begin{array}{cc}{{{{{\rm{\omega }}}}}}\le {{{{{\rm{\theta }}}}}},& {{{{{\rm{\delta }}}}}}\ge 0.\end{array}$$

The minimal relative error determined for each condition $j$ was fixed within a narrow tolerance (±1%, Eq. (12)) and the minimum sum of corrections $\Delta$ was fixed with a tolerance of $\pm {10}^{-3}{{\rm {h}}}^{-1}$ (Eq. (13)).

As the distribution within the obtained min/max intervals can be skewed, we sampled 10,000 points within the obtained intervals. For uniform random sampling, we created random vectors of corrections ${\delta }^{*}$ within the determined intervals and projected them onto the solution space by minimizing the distance of δ to the respective random vector. Therefore, we updated the objective of the program above:

$$\mathop{{{\min }}}\limits_{v,\delta,\omega }\mathop{\sum}\limits_{i\in M}\left|{\delta }_{i}-{\delta }_{i}^{*}\right|.$$

(14)
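The objective in Eq. (14) contains absolute values and is therefore not linear as written. A standard way to keep the problem a linear program, shown here as an assumption since the text does not state which reformulation was used, is to introduce one auxiliary variable $t_i$ per enzyme (a symbol introduced only for this illustration):

$$\mathop{\min }\limits_{v,\delta,\omega,t}\mathop{\sum}\limits_{i\in M}{t}_{i}\quad {\rm s.t.}\quad {t}_{i}\ge {\delta }_{i}-{\delta }_{i}^{*},\quad {t}_{i}\ge {\delta }_{i}^{*}-{\delta }_{i},\quad \forall i\in M,$$

together with all constraints of the variability problem above. At the optimum, each $t_i$ equals $\left|{\delta }_{i}-{\delta }_{i}^{*}\right|$, so the minimizer is the feasible point closest to the random vector ${\delta }^{*}$ in the L1 sense.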

To ensure reproducibility and compatibility with the COBRA toolbox⁴⁰, we solved all optimization problems using the optimizeCbModel function of the COBRA toolbox. Within this environment, we used the Gurobi solver v9.1.1⁴³, but we note that any other supported solver can also be used. As we observed numerical instability of the problems in some cases, we decreased the feasibility tolerance (i.e., feasTol parameter) for the COBRA solver to ${10}^{-9}$ for all predictions. The results were visualized using MATLAB⁴².

Validation of corrected models

We used the adapted GECKO pipeline (fitting a condition-specific GAM; excluding manual ${k}_{{{\rm {cat}}}}$ adaptions) to obtain models with ${k}_{{{\rm {cat}}}}$ values adapted according to the objective control coefficient heuristic. We note that, when no manual modifications were introduced to the S. cerevisiae models, the ${k}_{{{\rm {cat}}}}$ adaption of the GECKO pipeline would stop because no objective control coefficient above the threshold of 0.001 could be found, and corrected models would still be below the predicted growth error tolerance of 10%. To compare the predictive performance of PRESTO- and GECKO-corrected models, the models were adapted with the same condition-specific GAM, biomass reaction, and total protein content, ${P}_{{{\rm {tot}}}}$. Additionally, PRESTO models were constrained using the same condition-specific saturation rate, $\sigma$, and enzyme mass fraction, $f$, as obtained from the GECKO pipeline. In contrast to the GECKO formulation, we did not subtract the mass of measured enzymes from the total protein pool constraint but instead introduced the measured protein concentration as the upper bound on the enzyme usage reaction, ${E}_{i}$, in the respective scenario. This formulation still guarantees that the mass of all used enzymes is lower than or equal to the approximated cellular protein pool according to

$$\mathop{\sum}\limits_{i\in {\rm {GPR}}(r)}{e}_{i,j}\cdot {{\rm {MW}}}_{i}\le {P}_{{\rm {tot}},j}\cdot f\cdot {\sigma }_{j},\,\forall j\in C$$

(15)

where ${{{{{\rm{MW}}}}}}$ is the respective molecular weight of the protein. By considering measured and unmeasured enzymes in Eq. (15) we do not have to change f and use the same factor as for the scenario where no protein abundance measures are used⁷. Maximum growth was predicted in three different constraint scenarios: (i) using only the protein pool constraint and default uptake rates (1000 mmol/gDW/h), (ii) using the pool constraint and experimentally measured uptake rates, (iii) using the previous constraints plus the absolute enzyme abundance.
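The combination of the pool constraint in Eq. (15) with measured abundances as upper bounds on enzyme usage can be checked for a given enzyme usage vector as follows. This is a hedged sketch: the function names, enzyme identifiers, molecular weights, and the values of f and σ are illustrative; only the total protein content of 0.61 g/gDW is taken from the E. coli data description above. Molecular weights are given in g/mmol so that the sum is in g/gDW.

```python
def pool_usage(enzyme_usage, molecular_weight):
    """Return sum_i e_i * MW_i in g/gDW (e_i in mmol/gDW, MW_i in g/mmol)."""
    return sum(e * molecular_weight[i] for i, e in enzyme_usage.items())

def check_constraints(enzyme_usage, measured, molecular_weight, p_tot, f, sigma):
    """Check the protein pool constraint and the measured-abundance bounds."""
    pool_ok = pool_usage(enzyme_usage, molecular_weight) <= p_tot * f * sigma
    # Enzymes without a measurement have no individual bound; they are
    # limited only through the shared pool constraint.
    bounds_ok = all(e <= measured.get(i, float("inf"))
                    for i, e in enzyme_usage.items())
    return pool_ok, bounds_ok

# Hypothetical predicted usage (mmol/gDW); ENZ_C has no measurement.
usage = {"ENZ_A": 1.0e-3, "ENZ_B": 2.0e-3, "ENZ_C": 0.5e-3}
measured = {"ENZ_A": 1.5e-3, "ENZ_B": 1.0e-3}
mw = {"ENZ_A": 50.0, "ENZ_B": 30.0, "ENZ_C": 40.0}
pool_ok, bounds_ok = check_constraints(usage, measured, mw,
                                       p_tot=0.61, f=0.5, sigma=0.5)
```

Here the pool constraint holds (0.13 g/gDW used against 0.1525 g/gDW available), while the usage of `ENZ_B` exceeds its measured abundance, which is the kind of violation that only the third constraint scenario would detect.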

The two studies which generated in vivo ${k}_{{{\rm {cat}}}}$ values from pFBA^12,15 calculated a single value per reaction irrespective of the presence of isoenzymes. Thus, to parameterize the raw pcGEM (containing only uncorrected BRENDA values) we substituted the in vitro ${k}_{{{\rm {cat}}}}$ values of all isoenzyme reactions with the respective estimate provided in the study. Reactions catalyzed by complexes were not corrected. Since PRESTO and the pFBA studies provide a single condition-independent model, we generated a condition-independent GECKO model by following the maximum-over-all-conditions approach: for the comparisons shown in Supplementary Figs. 11 and 15, the condition-wise GECKO models were aggregated into a single union model in which, for each reaction, the maximum ${k}_{{{\rm {cat}}}}$ value was used.
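The aggregation into a union model amounts to a per-reaction maximum over the condition-specific parameter sets. A minimal sketch, assuming each condition-wise model is reduced to a dict from reaction ID to ${k}_{{{\rm {cat}}}}$ (the reaction IDs and values are hypothetical):

```python
def union_kcats(condition_models):
    """Per-reaction maximum k_cat over all condition-specific models."""
    union = {}
    for kcats in condition_models:
        for reaction, kcat in kcats.items():
            union[reaction] = max(kcat, union.get(reaction, kcat))
    return union

# Two hypothetical condition-wise parameter sets.
models = [{"R1": 10.0, "R2": 200.0},
          {"R1": 35.0, "R2": 150.0, "R3": 5.0}]
union = union_kcats(models)
```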

Pathway enrichment analysis

The KEGG pathway terms⁴⁴, associated with each enzyme that was measured in all conditions, were acquired using the KEGG REST API. The one-sided p-value, p, for significant enrichment of a pathway term among the enzymes with corrected ${k}_{{{\rm {cat}}}}$ values was calculated using the hypergeometric distribution:

$$p\left(x\right)=1-\mathop{\sum }\limits_{i=1}^{x-1}\frac{\left(\begin{array}{c}K\\ i\end{array}\right)\left(\begin{array}{c}M-K\\ N-i\end{array}\right)}{\left(\begin{array}{c}M\\ N\end{array}\right)}.$$

(16)

Only KEGG pathway terms associated with at least two corrected enzymes were taken into consideration. The p-values associated with all tested pathway terms were corrected for a false discovery rate of 0.05 using the Benjamini–Hochberg correction⁴⁵.
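The enrichment test and the multiple-testing correction can be sketched as follows. The symbols are interpreted here, as an assumption, in the usual way: M is the number of enzymes measured in all conditions, K the number of those annotated with the pathway term, N the number of enzymes with corrected turnover numbers, and x the number of corrected enzymes carrying the term. The sketch computes the upper-tail probability P(X ≥ x) by summing the tail directly; whether the original analysis included the i = 0 term in the complementary sum is not settled by the text. The Benjamini–Hochberg step returns adjusted p-values, which can then be compared with 0.05.

```python
from math import comb

def hypergeom_upper_tail(x, M, K, N):
    """P(X >= x) for X ~ Hypergeometric(population M, K annotated, N drawn)."""
    upper = sum(comb(K, i) * comb(M - K, N - i)
                for i in range(x, min(K, N) + 1))
    return upper / comb(M, N)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy example: 10 measured enzymes, 4 in the pathway, 3 corrected, 2 of them in the pathway.
p_single = hypergeom_upper_tail(x=2, M=10, K=4, N=3)
# Hypothetical p-values for four tested pathway terms.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q <= 0.05 for q in adjusted]
```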

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The protein abundance data used in this study have been previously published^12,14,15. The UniProt database⁴⁶ (www.uniprot.org) was used for mapping gene IDs to protein IDs, and the KEGG⁴⁴ (www.kegg.jp) database was used to retrieve pathway information for genes. Source data are provided with this paper.

Code availability

All code that was used to generate the results of this study, including the PRESTO method, is available at GitHub [https://github.com/pwendering/PRESTO] and at Zenodo [https://zenodo.org/record/7675009]⁴⁷.

References

Goelzer, A. et al. Quantitative prediction of genome-wide resource allocation in bacteria. Metab. Eng. 32, 232–243 (2015).
Article CAS PubMed Google Scholar
Lerman, J. A. et al. In silico method for modelling metabolism and gene product expression at genome scale. Nat. Commun. 3, 929 (2012).
Article ADS PubMed Google Scholar
O’Brien, E. J., Lerman, J. A., Chang, R. L., Hyduke, D. R. & Palsson, B. Ø. Genome‐scale models of metabolism and gene expression extend and refine growth phenotype prediction. Mol. Syst. Biol. 9, 693 (2013).
Article PubMed PubMed Central Google Scholar
Chen, Y. & Nielsen, J. Mathematical modeling of proteome constraints within metabolism. Curr. Opin. Syst. Biol. 25, 50–56 (2021).
Article CAS Google Scholar
Adadi, R., Volkmer, B., Milo, R., Heinemann, M. & Shlomi, T. Prediction of microbial growth rate versus biomass yield by a metabolic network with kinetic parameters. PLoS Comput. Biol. 8, e1002575 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Beg, Q. K. et al. Intracellular crowding defines the mode and sequence of substrate uptake by Escherichia coli and constrains its metabolic activity. Proc. Natl Acad. Sci. USA 104, 12663–12668 (2007).
Article ADS CAS PubMed PubMed Central Google Scholar
Sánchez, B. J. et al. Improving the phenotype predictions of a yeast genome‐scale metabolic model by incorporating enzymatic constraints. Mol. Syst. Biol. 13, 935 (2017).
Article PubMed PubMed Central Google Scholar
Malina, C., Yu, R., Bjorkeroth, J., Kerkhoven, E. J. & Nielsen, J. Adaptations in metabolism and protein translation give rise to the Crabtree effect in yeast. Proc. Natl Acad. Sci. USA 118, e2112836118 (2021).
Article CAS PubMed PubMed Central Google Scholar
Nilsson, A., Nielsen, J. & Palsson, B. O. Metabolic models of protein allocation call for the kinetome. Cell Syst. 5, 538–541 (2017).
Article CAS PubMed Google Scholar
van Eunen, K. & Bakker, B. M. The importance and challenges of in vivo-like enzyme kinetics. Perspect. Sci. 1, 126–130 (2014).
Article Google Scholar
Labhsetwar, P., Melo, M. C. R., Cole, J. A. & Luthey-Schulten, Z. Population FBA predicts metabolic phenotypes in yeast. PLoS Comput. Biol. 13, e1005728 (2017).
Article ADS PubMed PubMed Central Google Scholar
Davidi, D. et al. Global characterization of in vivo enzyme catalytic rates and their correspondence to in vitro kcat measurements. Proc. Natl Acad. Sci. USA 113, 3401–3406 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Heckmann, D. et al. Kinetic profiling of metabolic specialists demonstrates stability and consistency of in vivo enzyme turnover numbers. Proc. Natl Acad. Sci. USA 117, 23182–23190 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Xu, R., Razaghi-Moghadam, Z. & Nikoloski, Z. Maximization of non-idle enzymes improves the coverage of the estimated maximal in vivo enzyme catalytic rates in Escherichia coli. Bioinformatics 37, 3848–3855 (2021).
Article CAS Google Scholar
Chen, Y. & Nielsen, J. In vitro turnover numbers do not reflect in vivo activities of yeast enzymes. Proc. Natl Acad. Sci. USA 118, 2108391118 (2021).
Article Google Scholar
Küken, A., Gennermann, K. & Nikoloski, Z. Characterization of maximal enzyme catalytic rates in central metabolism of Arabidopsis thaliana. Plant J. 103, 2168–2177 (2020).
Article PubMed Google Scholar
Zikmanis, P. & Kampenusa, I. Relationships between kinetic constants and the amino acid composition of enzymes from the yeast Saccharomyces cerevisiae glycolysis pathway. Eurasip J. Bioinforma. Syst. Biol. 2012, 11 (2012).
Article Google Scholar
Li, F. et al. Deep learning-based k_cat prediction enables improved enzyme-constrained model reconstruction. Nat. Catal. 5, 662–672 (2022).
Article CAS Google Scholar
Heckmann, D. et al. Machine learning applied to enzyme turnover numbers reveals protein structural correlates and improves metabolic models. Nat. Commun. 9, 5252 (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Bekiaris, P. S. & Klamt, S. Automatic construction of metabolic models with enzyme constraints. BMC Bioinforma. 21, 19 (2020).
Article CAS Google Scholar
Wendering, P. & Nikoloski, Z. Genome-scale modeling specifies the metabolic capabilities of Rhizophagus irregularis. mSystems 7, e01216–e01221 (2022).
Article CAS PubMed PubMed Central Google Scholar
Domenzain, I. et al. Reconstruction of a catalogue of genome-scale metabolic models with enzymatic constraints using GECKO 2.0. Nat. Commun. 13, 3766 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Arend, M. et al. Proteomics and constraint-based modelling reveal enzyme kinetic properties of Chlamydomonas reinhardtii on a genome scale. Preprint at bioRxiv https://doi.org/10.1101/2022.11.06.515318 (2022).
Khodayari, A. & Maranas, C. D. A genome-scale Escherichia coli kinetic metabolic model k-ecoli457 satisfying flux data for multiple mutant strains. Nat. Commun. 7, 13806 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Hu, M. et al. Comparative study of two Saccharomyces cerevisiae strains with kinetic models at genome-scale. Metab. Eng. 76, 1–17 (2023).
Article CAS PubMed Google Scholar
Monk, J. M. et al. iML1515, a knowledgebase that computes Escherichia coli traits. Nat. Biotechnol. 35, 904–908 (2017).
Article CAS PubMed PubMed Central Google Scholar
Valgepea, K., Adamberg, K., Seiman, A. & Vilu, R. Escherichia coli achieves faster growth by increasing catalytic and translation rates of proteins. Mol. Biosyst. 9, 2344–2358 (2013).
Article CAS PubMed Google Scholar
Peebo, K. et al. Proteome reallocation in Escherichia coli with increasing specific growth rate. Mol. Biosyst. 11, 1184–1193 (2015).
Article CAS PubMed Google Scholar
Schmidt, A. et al. The quantitative and condition-dependent Escherichia coli proteome. Nat. Biotechnol. 34, 104–110 (2016).
Article CAS PubMed Google Scholar
Ferreira, D. M., Batista, W. & Nikoloski, Z. PARROT: prediction of enzyme abundances using protein-constrained metabolic models. Authorea Preprints https://doi.org/10.22541/au.166117417.77605988/v1 (2022).
Bar-Even, A. et al. The moderately efficient enzyme: evolutionary and physicochemical trends shaping enzyme parameters. Biochemistry 50, 4402–4410 (2011).
Article CAS PubMed Google Scholar
Davidi, D., Longo, L. M., Jabłońska, J., Milo, R. & Tawfik, D. S. A bird’s-eye view of enzyme evolution: chemical, physicochemical, and physiological considerations. Chem. Rev. 118, 8786–8797 (2018).
Article CAS PubMed Google Scholar
Mao, Z. & Ma, H. iMTBGO: an algorithm for integrating metabolic networks with transcriptomes based on gene ontology analysis. Curr. Genom. 20, 252–259 (2019).
Article CAS Google Scholar
Lahtvee, P. J. et al. Absolute quantification of protein and mRNA abundances demonstrate variability in gene-specific translation efficiency in yeast. Cell Syst. 4, 495–504.e5 (2017).
Article CAS PubMed Google Scholar
Yu, R. et al. Nitrogen limitation reveals large reserves in metabolic and translational capacities of yeast. Nat. Commun. 11, 1881 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Yu, R., Vorontsov, E., Sihlbom, C. & Nielsen, J. Quantifying absolute gene expression profiles reveals distinct regulation of central carbon metabolism genes in yeast. Elife 10, e65722 (2021).
Article CAS PubMed PubMed Central Google Scholar
Di Bartolomeo, F. et al. Absolute yeast mitochondrial proteome quantification reveals trade-off between biosynthesis and energy generation during diauxic shift. Proc. Natl Acad. Sci. USA 117, 7524–7535 (2020).
Article ADS PubMed PubMed Central Google Scholar
Bateman, A. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
Article Google Scholar
Lu, H. et al. A consensus S. cerevisiae metabolic model Yeast8 and its ecosystem for comprehensively probing cellular metabolism. Nat. Commun. 10, 3586 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Heirendt, L. et al. Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v.3.0. Nat. Protoc. 14, 639–702 (2019).
Article CAS PubMed PubMed Central Google Scholar
Sharma, P. & Guptasarma, P. ‘Super-perfect’ enzymes: Structural stabilities and activities of recombinant triose phosphate isomerases from Pyrococcus furiosus and Thermococcus onnurineus produced in Escherichia coli. Biochem. Biophys. Res. Commun. 460, 753–758 (2015).
Article CAS PubMed Google Scholar
MATLAB. version 9.9.0.1524771 (R2020b) Update 2. (The Mathworks, Inc., 2020).
Gurobi Optimization, L. Gurobi Optimizer Reference Manual https://www.gurobi.com (2021).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Article CAS PubMed PubMed Central Google Scholar
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
MathSciNet MATH Google Scholar
Bateman, A. et al. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).
Wendering, P. & Arend, M. Correction of turnover numbers in enzyme-constraint metabolic models. Repository name: PRESTO. https://doi.org/10.5281/zenodo.7675009 (2023).
Jeske, L., Placzek, S., Schomburg, I., Chang, A. & Schomburg, D. BRENDA in 2019: a European ELIXIR core data resource. Nucleic Acids Res. 47, D542–D549 (2019).

Acknowledgements

P.W. and Z.N. would like to thank the Research Focus Group “Evolutionary Systems Biology” of University of Potsdam for funding. Z.N., M.A., and Z.R. would like to thank the Max Planck Society for funding. Z.R. was supported by the European Union’s Horizon 2020 research and innovation program grant 862201 (to Z.N.). M.A. and Z.N. were supported by the European Union’s Horizon 2020 research and innovation program, project PlantaSYST (SGA-CSA No. 739582 under FPA No. 664620, to Z.N.) (this publication reflects only the author’s view and the Commission is not responsible for any use that may be made of the information it contains).

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

These authors contributed equally: Philipp Wendering, Marius Arend.

Authors and Affiliations

Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, Germany
Philipp Wendering, Marius Arend & Zoran Nikoloski
Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
Philipp Wendering, Marius Arend, Zahra Razaghi-Moghadam & Zoran Nikoloski

Authors

Philipp Wendering
Marius Arend
Zahra Razaghi-Moghadam
Zoran Nikoloski

Contributions

P.W. and M.A. performed research and analyzed data; P.W. contributed code for the PRESTO approach; M.A. assessed model performance and performed the statistical analysis; P.W., M.A., Z.R., and Z.N. designed research; P.W., M.A., and Z.N. wrote the paper.

Corresponding author

Correspondence to Zoran Nikoloski.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Claudio Angione and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplemental Data 1

Supplemental Data 2

Supplemental Data 3

Supplemental Data 4

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wendering, P., Arend, M., Razaghi-Moghadam, Z. et al. Data integration across conditions improves turnover number estimates and metabolic predictions. Nat Commun 14, 1485 (2023). https://doi.org/10.1038/s41467-023-37151-2

Received: 15 September 2022
Accepted: 03 March 2023
Published: 17 March 2023
DOI: https://doi.org/10.1038/s41467-023-37151-2
Springer Nature Limited
