Background

The gram-negative bacterium E. coli is one of the best-studied microorganisms. This bacterial species includes pathogenic strains that cause disease in various tissues in mammalian and other vertebrate hosts. Some of the more common diseases associated with pathogenic E. coli strains are caused by bacteria found in the gastrointestinal tract or urinary tract, and is a major cause of human morbidity and mortality worldwide. E. coli infections cost the healthcare industry over a billion dollars annually with the enterohemorrhagic (EHEC) and uropathogenic (UPEC) E. coli strains alone responsible for more than 73,000 and 7,000,000 illnesses annually in the United States, respectively [13]. A number of genome sequences for these pathovars exist, and comparative analysis between commensal and pathogenic strains has revealed different virulence strategies [410]. However, the metabolic properties that differentiate these strains have not been thoroughly investigated. The metabolic content of the genomes of these strains is complex with each strain predicted to contain over 1,000 genes encoding metabolic enzymes and transporters [11]. One method to investigate the complexity of genome-scale metabolic networks is through the construction of computational models.

Computational modeling of bacterial metabolism offers a promising approach to predict strain-to-strain variation in metabolic capabilities and microbial strategies used during host association. The number of available genome-scale metabolic models (GEMs) has grown recently, and they capture the metabolic capabilities of numerous microbial taxa important to human health, biotechnology and bioengineering [12, 13]. Systems biology combines computational and experimental approaches to study the complexity of biological networks at a systems level, where the cellular components and their interactions lead to complex cellular behaviors. Genome-scale biological networks have proven useful for interpreting high-throughput data and generating computational models. Mathematical models are constructed from network reconstructions, and they include variables, parameters, and equations to describe the potential behavior of these networks. Numerous types of genome-scale biological networks have been constructed including metabolic, regulatory, and transcriptional and translational machinery for E. coli K-12 [1417].

To date, GEMs have been constructed for only two commensal strains of E. coli, E. coli K-12 (strain MG1655) and E. coli W [15, 18]. The E. coli K-12 GEM has been used to engineer strains to increase valuable product formation [1923], facilitate enzyme function discoveries [24], provide insight into the genome evolution of other enterobacteria [25, 26], and improve the understanding of the connectivity of metabolic reactions within the cell [27]. Furthermore, computational metabolic models can be validated and refined by comparing in silico predictions with experimental data, where the discovery of disagreements or incorrect in silico predictions can lead to improvements and/or hypotheses about component interactions and unknown network components. An iterative process thus develops where the models are used to analyze experimental data and discrepancies lead to improved models and additional biological discovery. Such approaches have proven successful for updates to the E. coli models for regulation and metabolism [14, 24, 28, 29].

Currently the construction of metabolic networks relies primarily on information derived from genome annotations, enzymatic/pathway databases, and published literature. By combining these resources, the elementally- and charged-balanced reactions catalyzed by enzymes associated with a given gene can be identified [30, 31]. These reactions incorporate pertinent information such co-factors, substrates, products, reversibility, stoichiometry, and subcellular location. A genome-scale metabolic network contains a list of reactions, as well as the gene to protein to reaction (GPR) associations, and is used to formulate constraint-based GEMs. By comparing GEMs for pathogenic and non-pathogenic E. coli strains, metabolic differences can be identified that may lead to the development of new control strategies for E. coli associated disease.

Here we describe the construction of a detailed GEM for the pangenome of the species E. coli, and the use of this GEM to rapidly generate six strain-specific GEMs to compare genome-scale metabolism between four strains from two pathogenic lineages with two commensal K-12 strains. In addition, an ancestral E. coli core GEM was constructed consisting of only those metabolic reactions associated with genes that are conserved across 16 E. coli genomes. The metabolic potential of this ancestral core was also examined. Experiments were performed to iteratively refine and validate the six strain-specific GEMs under aerobic and anaerobic conditions. Once strain-specific GEMs were validated, the properties and metabolic differences distinguishing these pathogenic and commensal E. coli strains were computationally investigated, revealing that some pathogenic E. coli strains are more metabolically efficient than other strains in some environmental conditions. The E. coli GEMs generated in this work provide new tools for investigating the evolutionary and metabolic differences of these strains in conditions reflecting those environments encountered in human hosts. This is the first study to examine the metabolic properties of numerous strains of such a phylogenetically related group of microorganisms, and provides insight into the evolution of metabolism for the species E. coli.

Methods

Bacterial strains and growth conditions

Six E. coli strains and one Salmonella strain were used in this study (listed in Table 1). Frozen cultures were streaked onto Luria Bertani (LB) agar plates and grown overnight at 37°C. Isolated colonies were then used to inoculate MOPS (morpholinepropanesulfonic acid) minimal media (TekNova, Hollister, CA) and incubated overnight with shaking (220 rpm) at 37°C, and then overnight cultures were used to inoculate batch cultures grown with continuous sparging aerobically (70% N2, 25% O2, and 5% CO2) or anaerobically (95% N2 and 5% CO2) as previously described [32]. For carbon plate utilization assays, isolated colonies were used to inoculate Sheep Blood Agar plates (Biolog, Hayward, CA) and incubated at 37°C overnight aerobically or anaerobically in sealed Whirl-Pak® Long-Term Sample Retention Bags (Nasco, Fort Atkinson, Wisconsin) saturated with an anaerobic gas mixture (95% N2 and 5% CO2) as described [33, 34]. Anaerobic conditions were confirmed using an obligate aerobic bacterium that exhibited no growth and no respiration in any of the anaerobic conditions examined. Cells were collected and used to inoculate Biolog PM1 plates following the manufacturers recommendations with a minor modification of adding a top layer of mineral oil to each well for anaerobic culture conditions.

Table 1 List of bacterial strains used in this study.

Updates to the E. coli K-12 MG1655 metabolic network

Prior to generating a pangenome GEM, additional genes in the genome of E. coli K-12 MG1655 were evaluated as possible updates to the most recent E. coli GEM (iAF1260)[15]. The annotations for E. coli K-12 MG1655 were obtained and examined from the ASAP, EcoGene, KEGG, and EcoCyc databases [3538]. ORFs encoding enzymes that were not included in iAF1260 were further investigated to develop elementally and charge-balanced reactions and to assign the reaction to a subcellular location based on pSORT predictions [39]. In some instances new ORFs were added as isozymes to existing reactions and the gene-to-protein-to-reaction associations updated. This resulted in the addition of 79 new ORFs to the iAF1260 GEM to create iEco1339_MG1655 (Additional file 1). Of the new 79 ORFs (Additional file 2), 62% were based on experimental data from the literature for E. coli strains [4068] and the rest were based on sequence homology to enzymes already existing in E. coli metabolic networks or to experimentally characterized enzymes from other enterobacteria. These gene additions resulted in 42 new reactions, 32 new isozymes, and 30 new metabolites to the in silico model for E. coli K-12 MG1655.

Generation of an E. coli pangenome metabolic network

Draft and complete enterobacterial genomes in the ASAP database have been continually updated using new publicly accessible genomes since the database's inception [35]. There are more than 150 genomes of enterobacteria in the ASAP database (along with predicted orthologs), 39 of which are E. coli genomes. Of these E. coli genomes, 16 are completely finished, and we have used the information from these genomes and that of Salmonella typhimurium LT2 (Table 2) to generate an E. coli pangenome metabolic network based on metabolic enzymes present in the union of 76,990 ORFs. Each ORF was assigned in the ASAP database to an ortholog cluster group (OCG), and the 76,990 ORFs map to 17,647 OCGs. This reduced the number of genes that had to be evaluated for inclusion in the metabolic network and allowed generation of strain-specific GPRs to rapidly be formulated (Additional file 3). Not all of these OCGs play a metabolic role and/or have sufficient experimentally determined details for metabolic network inclusion. For example only 32.3% of ORFs in the total genome (1,339/4,141) are accounted for in the updated GEM for E. coli K-12 MG1655 (iEco1339_MG1655). All of the gene to protein to reaction association information from the E. coli K-12 MG1655 (iEco1339_MG1655) and the Salmonella LT2 (iRR1083)[69] GEMs were mapped to the OCGs that contained the respective ORFs. The annotations for the genes composing the remaining OCGs were analyzed for additional new metabolic reactions and isozyme additions (Additional file 4) leading to the generation of an E. coli pan-GEM named iEco1712_pan (Additional file 5). All eight SBML files generated in this work were checked for syntax and internal consistency using the validation tool (http://sbml.org/validator/validate.php) and found to conform to all specifications of SBML through Level 3 Version 1 Core (Release 1).

Table 2 E. coli genomes used to construct the pangenome metabolic network.

Flux Balance Analysis

Fluxes through metabolic network reactions can be predicted using flux balance analysis (FBA) [70]. In FBA, fluxes are constrained by steady-state mass balances, enzyme capacities and reaction directionality. These constraints yield a solution space of possible flux values, and FBA uses an objective function to identify flux distributions that maximize (or minimize) the physiologically relevant predicted solution. Cellular growth rate (or biomass production) is often used as an objective function for FBA [71], and was used for FBA analyses performed in this study. The same biomass equation, growth (GAM) and non-growth (NGAM) associated ATP requirement values, and PO (number of ATP molecules produced per pair of electrons donated to the electron transport system) ratio were used for all E. coli developed models, and were the same as that in iAF1260[15]. For FBA and dynamic simulations the reported [15] wildtype biomass was used and for determination of essential reactions the core biomass was used. Using FBA, in silico predictions of growth yield, growth rate, and carbon source utilization were compared to experimentally determined values for all six E. coli strains and for Salmonella LT2 in both aerobic and anaerobic conditions (Additional files 6 and 7). For carbon source utilization and gene deletion simulations, a maximum uptake rate of 10 mmol per gram of dry weight per hour (mmol/gDW cell/h) was used. FBA was also used to predict essential reactions by constraining reactions to have zero flux and maximizing growth rate. If the resulting maximum predicted growth rate (using FBA) was zero then the reaction was considered to be essential. Reaction deletion simulations were evaluated under both aerobic and anaerobic conditions.

Batch Growth Experiments and Simulations

We performed dynamic FBA simulations of batch growth [72] and compared these results with experimental data (Additional file 6). In the laboratory, cells were grown overnight in MOPS minimal medium with the addition of glucose (11 mM) as the sole carbon source, and used to inoculate batch cultures to an optical density (OD600 nm) of ~0.02. Batch growth under aerobic and anaerobic conditions was conducted at 37°C and spectrophotometric measurements, viable cells/ml, and biomass (g dry cell weight) were determined at each time point. Samples were collected every hour and passed through a 0.2 μm syringe filter, and then frozen at -70°C for subsequent UPLC analysis. The filtered supernatants were then analyzed to determine glucose concentrations at each time point using a UPLC following manufacturers recommendations (Waters Co., Milford, MA). Growth rates, growth yields, and glucose specific uptake rates were determined from experimental data using the linear least squares estimate as described [73]. Biomass to OD600 conversion values were also calculated for each strain (Table 3). These conversion values were used to estimate initial biomass (T0) values using the initial OD600 measurements (O.D.600 = ~0.02) for each experiment. The T0 values were used as parameters for dynamic FBA simulations of batch growth to determine the exponential growth rates (1/h), biomass yields (g biomass/1g glucose), and times (h) needed to reach stationary phase for the corresponding experimental conditions (aerobic or anaerobic growth in MOPS with 11 mM glucose). For each strain, experimental values from three biological replicates were then compared to those calculated from three computational simulations with matching starting biomass values under aerobic and anaerobic conditions.

Table 3 Experimental strain-specific conversion factors for aerobic or anaerobic growth conditions.

UPLC analysis

Glucose analysis was conducted using an Acquity UPLC equipped with an Acquity BEH Glycan column (Waters). A mobile phase (75% [v/v] Acetonitrile/25% [v/v] H2O with 0.2% [v/v] Triethylamine; pH 9.1) was used at a flow rate of 0.1 ml/min to separate small molecules on a Waters Acquity UPLC equipped with an evaporative light scattering detector and photodiode array.

Phylogenetic Analysis

A maximum parsimony phylogenetic analyses of seven taxa were conducted in MEGA4 [74] using a concatenated protein sequence data set of AcnA, GapA, IcdA, Mdh, MtlD, Pgi, and ProA with S. typhimurium LT2 used as the outgroup species. These genes were chosen since they have been successfully used for phylogenetic analyses of enterobacteria [75]. The alignment for this data set and subsequent maximum likelihood phylogenetic analyses was performed in MEGA4 using default parameters.

Results

The metabolic model for E. coli K-12 MG1655 was developed 20 years ago and has undergone numerous improvements and updates. It is now a sophisticated compartmentalized GEM containing over 1,200 genes and 2,000 reactions. It has been used extensively for biotechnology and discovery applications. Here we generated a GEM for the pangenome of E. coli, and used the information from this larger metabolic network to generate strain-specific E. coli GEMs for two pathogenic lineages and an ancestral core GEM containing reactions conserved across all E. coli strains. Using this new collection of GEMs we validated strain-specific models by comparing predictions to experimental data, conducted a comparison of strain-specific GEMs from three E. coli lineages (commensal, EHEC, and UPEC), and examined the metabolic networks of numerous E. coli strains in an evolutionary perspective based on phenotypic traits.

Updating the E. coliK-12 MG1655 metabolic model

The contents of the E. coli K-12 MG1655 genome were surveyed for new genes/reactions to add to the existing GEM (iAF1260). This effort added a total of 79 genes to iAF1260, of which 15 encoded proteins with significant similarity to proteins with characterized enzymatic activity, 15 were added based on orthology to genes found in the S. typhimurium LT2 GEM (iRR1083), and 49 were added based on experimental evidence from the scientific literature (Additional file 2). Three of the new genes were linked to metabolic reactions that were already included in iAF1260, but whose associated genes were previously unknown (i.e. orphan reactions). The 79 new genes added 42 new metabolic reactions and 30 new metabolites to the GEM. Exchange/transport reactions to permit cis-dihydrodiol-phenylacetyl-CoA utilization were also added resulting in an updated GEM for E. coli K-12 MG1655 designated as iEco1339_MG1655 composed of 1,339 genes, 1,069 metabolites, and 2,428 reactions (Additional file 1). This includes eight new reactions for phenylacetate metabolism that were added following our observation that E. coli K-12 MG1655 can grow in minimal media with phenylacetate as a sole carbon source (data not shown). The 42 new reactions in iEco1339_MG1655 were classified into 15 metabolic subsystems (Figure 1A). There were 24 genes added to the GEM that likely encode isozymes that participate in 32 existing reactions across 13 metabolic subsystems (Additional file 2). A total of 370 reactions in iEco1339_MG1655 contain multiple isozymes.

Figure 1
figure 1

A summary of metabolic reaction additions and deletions to GEMs used in this study. In comparison to the previous E. coli K-12 MG1655 GEM (iAF1260), subsystem classification for new reaction additions to (A) iEco1339_MG1655. In addition, in comparison to iEco1339_MG1655, subsystem classification for reactions additions to (B) iEco1712_pan, or reaction deletions for (C) iEco1053_core, (D) reaction deletions shared in both EHEC strains (iEco1344_EDL933 and iEco1345_Sakai), (E) reaction deletions shared in both UPEC strains (iEco1288_CFT073 and iEco1301_UTI89), and (F) reaction deletions shared in both EHEC and EPEC strains.

Generation of an E. colipangenome metabolic model

Understanding the evolution of metabolism for the species E. coli requires comparing genome-scale metabolic content among different strains of E. coli and its relatives. To faciltate these comparisons we mined the contents of 16 E. coli genomes to identify reactions that could be added to iEco1339_MG1655 to generate a pangenome metabolic network representing all metabolic reactions associated with genes present in any one of the E. coli genomes (Table 2). All genes from the sixteen E. coli and the S. typhimurium LT2 genomes (81,496 ORFs total) were classified into orthologous cluster groups (OCGs) based on ortholog relationships from the ASAP database [35]. This analysis resulted in a total of 17,647 OCGs with 16,417 representing the E. coli pangenome and 1,230 OCGs unique to S. typhimurium LT2. Of the 16,417 E. coli pangenome OCGs, 2,894 are found in all sixteen E. coli genomes, 4,146 are shared by two or more E. coli strains, and 9,377 are unique to individual E. coli genomes. Each additional genome added on average 806 new genes; however this number decreased as more genomes were analyzed (Figure 2).

Figure 2
figure 2

Escherichia coli core and pangenome metabolic network evolution according to the number of sequenced genomes listed in table 2. Number of conserved genes (black diamonds), total number of unique genes without orthologs in prior genomes (blue squares), and total number of gene additions to the pan-GEM (red triangles), and total number of metabolic reactions additions to the pan-GEM (green circles) for a given number of genomes analyzed for the different strains of E. coli.

A GEM for the pangenome was then constructed. Existing gene to protein to reaction (GPR) associations from GEMs of iEco1339_MG1655 and S. typhimurium LT2 (iRR1083) were mapped to their corresponding OCGs. The remaining OCGs were analyzed to see if they could be added to the pangenome metabolic network by adding additional isozymes or new reactions. This led to the addition of 373 OCGs and 32 new reactions beyond those found in the updated E. coli K-12 MG1655 GEM (iEco1339_MG1655), resulting in a pan-GEM, iEco1712_pan (Additional file 5), consisting of 1,712 genes, 1093 metabolites, and 2,452 reactions (Additional file 4). Each additional E. coli genome added to the pangenome metabolic analysis resulted on average added 27 new metabolic genes, 20 isozymes, and approximately 2 new metabolic reactions to the pan-GEM (Figure 2). The 32 reactions added to iEco1712_pan fall into 11 metabolic subsystems (Figure 1B), with the majority being related to alternate carbon metabolism (56%, associated with 4-hydroxyphenylacetate and propanediol metabolism) and oxidative phosphorylation (9%). Other added OCGs resulted in addition of isozymes associated with 14 metabolic subsystems, with the most abundant being alternate carbon metabolism (18%), cell envelope biosynthesis (11%), oxidative phosphorylation (11%), nitrogen metabolism (8%), glutamate metabolism (6%), and the remaining 9 subsystems consisted of a single reaction addition (Additional file 4).

Generation of an E. colicore metabolic model

E. coli strains are thought to have diverged from a common ancestor ~10 million years ago (mya) [76] and it is of interest in understanding how strain-specific metabolism has evolved over time to have an estimate of the metabolic capabilities of an ancestral core for the species E. coli. We assume that genes conserved across the genomes of all strains represents a conservative estimate of the core genome of the ancestor of modern E. coli strains and used this collection of 2,894 conserved genes to construct an ancestral core GEM named iEco1053_core (Additional file 8). There are 1,053 of these genes in the E. coli K-12 MG1655 GEM (iEco1339_MG1655). The GEM for the E. coli ancestral core was made by removing OCGs and their associated reactions from the iEco1339_MG1655 GEM if one or more of the sixteen E. coli genomes did not have a gene assigned to the OCG (Additional file 3). If removing a reaction prevented biomass production for anaerobic growth on glucose (predicted using FBA) then the reaction was added back to the metabolic reconstruction without a gene associated with it and this occurred 24 times (Additional file 9). Using this approach 286 ORFS associated with 184 reactions and 177 isozymes were removed from iEco1339_MG1655 resulting in an E. coli ancestral core GEM (iEco1053_core) consisting of a total of 1,053 ORFs and 2,244 reactions (Table 4), and these 184 reactions we removed were classified based on metabolic subsystem (Figure 1C).

Table 4 E. coli strain-specific metabolic model information.

Characteristics of five new E. colistrain specific models

The pan-GEM was used to expedite the process of generating five new strain-specific E. coli GEMs, since the pangenome has reactions connected to cluster groups, a given strains genome contents were analyzed to identify what cluster groups its genes belong to and those associated reactions were included (Additional file 3). The E. coli strains we selected include an additional E. coli K-12 strain (W3110), two enterohemmoraghic E. coli O157:H7 strains (EDL933 and Sakai), and two uropathogenic strains (CFT073 and UTI89). Comparisons to iEco1339_MG1655, including the total number of strain-specific gene additions and deletions and the corresponding metabolic reactions are shown in Table 4. The two K-12 strains are laboratory strains derived from the same isolate and not surprisingly their GEMs, were very similar with the sole difference being removal of a few isozymes and a gene associated with galactitol transport from the W3110 GEM named iEco1335_W3110 (Additional file 10) due to a W3110-specific IS insertion in the gatA gene [77]. We built the five new E. coli GEMs named iEco1335_W3110 (Additional file 10), iEco1344_EDL933 (Additional file 11), iEco1345_Sakai (Additional file 12), iEco1288_CFT073 (Additional file 13), and iEco1301_UTI89 (Additional file 14) by deleting genes and reactions from the pan-GEM when missing from the genome under consideration (Additional file 15). If removing a reaction prevented biomass production for anaerobic growth on glucose (predicted using FBA) then the reaction was added back to the metabolic reconstruction without a gene associated with it. The number of these reactions without associated genes varied from 5 to 10 for each of the pathogenic E. coli strains (Table 4). Five of these reactions without associated genes were required in all four pathogenic E. coli GEMs (Additional file 9). Of the remaining five reactions, two were required for strains EDL933 and Sakai, one was required for strains EDL933, Sakai, and UTI89, and the remaining three are specific to E. coli strain CFT073.

Even though the genomes of the four pathogenic E. coli strains contain between 700-1,000 genes not present in the genome of E. coli K-12 MG1655, relatively few pathogen-specific metabolic genes were added to each GEM (Table 4). Eight new reaction additions were unique to the GEMs of the EHEC strains (iEco1344_EDL933 and iEco1345_Sakai) and consisted of urease, UDP-N-acetylglucosamine 4-epimerase, salicylate hydroxylase, gentisate 1,2,-dioxygenase, sucrose transport, tellurite reduction, fucose synthetase, and perosamine synthetase reactions. The two UPEC strain GEMs (iEco1288_CFT073, and iEco1301_UTI89) shared only one lineage-specific reaction addition for propionate CoA-transferase and each has a single strain-specific reaction addition unique to each strain; galactose isomerase activity for iEco1288_CFT073 and hydroxypyruvate reductase activity for iEco1301_UTI89.

In contrast to the relatively small number of gene and reaction additions there were a large number of reaction deletions for the pathogenic strain GEMs compared to iEco1339_MG1655 (Figure 3, Additional file 15). For EHEC strain GEMs, iEco1344_EDL933 and iEco1345_Sakai, there were 52 genes found in the E. coli K-12 MG1655 GEM (iEco1339_MG1655) that had no orthologous gene in the genomes of the two EHEC strains. These missing genes resulted in 60 reaction deletions in both iEco1344_EDL933 and iEco1345_Sakai, and these were classified based on metabolic subsystem (Figure 1D). There was only one additional reaction deletion unique to iEco1345_Sakai for D-cysteine desulfhydrase, whereas the ORF encoding this enzyme was still intact in the EDL933 strain. For each of the two UPEC strains there were 55 reactions that were missing in both iEco1288_CFT073 and iEco1301_UTI89 compared to iEco1339_MG1655, and these were further classified into metabolic subsystems (Figure 1E). Each of the two UPEC strains also contained numerous reaction deletions unique to each strain (Additional file 5).

Figure 3
figure 3

A summary of lineage-specific reaction additions and deletions in comparison to the E. coli K-12 GEMs.

When GEMs for all four pathogens (iEco1344_EDL933, iEco1345_Sakai, iEco1288_CFT073, and iEco1301_UTI89) were compared to those of the two K-12 strains (iEco1339_MG1655 and iEco1335_W3110), 21 shared reaction deletions were common to all four pathogenic E. coli strains, and they were categorized into the metabolic subsystems of alternate carbon metabolism, cell envelope biosynthesis, inorganic ion transport and metabolism, lipopolysaccharide biosynthesis, methionine metabolism, nitrogen metabolism, inner membrane transport, and outer membrane transport (Figure 1F).

Assessment and validation of models for carbon source utilization

To evaluate the accuracy of the GEMs for all six E. coli strains, we examined each strain's ability to use different carbon sources under aerobic and anaerobic conditions using Biolog phenotypic arrays. There were numerous strain-specific differences in carbon source utilization in both aerobic (Figure 4A) and anaerobic conditions (Figure 4B). These experimental results were then compared to FBA predictions of growth using different carbon sources. For those compounds included in the Biolog plates that have transporters in the model, FBA was used to predict if they could be used for growth as sole carbon source. This included 76 potential carbon sources for the six E. coli strains and 54 potential carbon sources for Salmonella LT2. If FBA calculated a zero growth rate then the compound was predicted not to be usable as a sole carbon source, while positive calculated growth rates indicated that the model predicted the compound could be used as the sole carbon source. Of the 76 compounds, there were 59 (aerobic) and 56 (anaerobic) carbon sources where model predictions and experiments agreed for all six E. coli models (Additional file 7). The 16 and 19 carbon sources with discrepancies between in silico and experimental results in at least one model (shown in Figure 5) fall into two categories i) when strains did not grow and the model predicted growth (false positive), and ii) instances where strains grew and the model predicted no growth (false negative).

Figure 4
figure 4

Carbon source utilization results based on phylogeny of E. coli and S. typhimurium strains used in this study. Experimental carbon source utilization results for both aerobic (A) and anaerobic conditions (B).

Figure 5
figure 5

Resolution of in silico and experimental carbon source discrepancies. Carbon source utilization discrepancies for comparison of experimental and in silico data and the respective objective function (Z) used for flux balance analysis.

For aerobic carbon source utilization, the number of false positives varied from one to four, and the most accurate models (one false positive each) were for E. coli K-12 W3110 (iEco1335_W3110) and UTI89 (iEco1301_UTI89) (Figure 5). The number of aerobic false negatives was greater than false positives and ranged from three to seven with the least observed for iEco1339_MG1655 and the most observed for iEco1288_CFT073. Of these model-data discrepancies, there were some carbon sources that led to inaccurate predictions by all six E. coli strain specific models such as the utilization of D-aspartic acid (false positive) and M-inositol and tyramine (false negative). The two pathogenic lineages (EHEC or UPEC) exhibited some lineage-specific false negatives for alpha-keto-butyric acid and acetoacetic acid utilization (iEco1344_EDL933 and iEco1345_Sakai) or D-melibiose (iEco1288_CFT073 and iEco1301_UTI89). FBA predictions using a mixture of L- and R- isomers of 1,2 propanediol (a racemic mixture of 1,2 propanediol is used as the sole carbon source in Biolog PM1 plates) resolved aerobic false negative discrepancies for 1,2 propanediol for five E. coli strains, without introducing any new false positives (data not shown).

For anaerobic carbon source utilization, there were no false positives observed for three E. coli strains (iEco1335_W3110, iEco1288_CFT073, and iEco1301_UTI89) and either one or two false positives observed for carbon sources such as D-Threonine (iEco1339_MG1655), D-Aspartic acid (iEco1344_EDL933), and L-Lyxose (iEco1344_EDL933 and iEco1345_Sakai). In contrast to the aerobic results, there were generally more false negative than false positive predictions, with the number of false negatives ranging from 10 (for iEco1339_MG1655) to 14 (for iEco1288_CFT073) for the six E. coli strains (Figure 5). Of these compounds associated with anaerobic false negatives, there were 10 that led to inaccurate growth predictions for all six E. coli strains (succinic acid, formic acid, acetic acid, 1,2-propanediol, alpha-keto-butyric acid, M-inositol, propionic acid, glycolic acid, acetoacetic acid, and tyramine) (Figure 5).

The Biolog phenotype assay uses reduction of a colorimetric tetrazolium dye to measure microbial respiration. Our initial FBA predictions used an objective function that relates to the ability of the bacterium to convert a particular carbon source into biomass. There may be carbon compounds that the bacteria are able to metabolize but which do not result in measurable growth, thus leading to false negatives. To see if changing the objective function from biomass production to indicator dye reduction improves the FBA predictions under both aerobic and anaerobic conditions, two additional reactions were added to each of the models representing the movement of electrons from reduced quinones to the indicator dye used in Biolog plates (RESP1: mql8 => 2H+ + mqn8; and RESP2: q8h2 => 2H+ + q8). FBA was used again but a new objective function, equal to the sum of flux through these two new reactions, was maximized. If the maximum sum of fluxes was zero then the model predicted the carbon source could not be metabolized, while a positive sum of fluxes indicated a carbon source could be metabolized. These new FBA predictions (using respiration instead of growth as an objective) significantly reduced the number of anaerobic false negatives to between two (iEco1339_MG1655) and eight (iEco1288_CFT073) (Figure 5 and Table 5), while not affecting the number of false positives.

Table 5 Carbon source utilization with respiration as FBA objective function.

Overall, once the FBA objective was changed from biomass to respiration, all E. coli models exhibited a statistically significant relationship between model predictions and experimental growth phenotypes (chi-squared test statistic yields p < 0.05) for both aerobic (>88% accurate) and anaerobic (>89% accurate) conditions (Table 5). The carbon sources M-Inositol and tyramine still led to false negative predictions for all E. coli models examined under both aerobic and anaerobic conditions, which may indicate that missing reactions or gaps may exist in pathways for utilization of these carbon compounds. When considering both the aerobic and anaerobic conditions the overall accuracy for individual strain-specific models was iEco1301_UTI89 (95.3%), iEco1335_W3110 (94.7%), iEco1339_MG1655 (93.4%), iEco1345_Sakai (91.4%), iEco1344_EDL933 (90.8%), and iEco1288_CFT073 (88.8%).

Batch growth predictions

To further evaluate model predictions, dynamic FBA was used to predict time-courses (for substrate, product and cell concentrations), exponential growth rates (1/hr) and biomass yields (gDW cells/g glucose) for aerobic and anaerobic batch cultures in MOPS minimal media with the addition of glucose as the sole carbon and energy source. Batch culture experiments were conducted for each strain, and conversion factors for optical density to biomass, optical density to viable cell concentration, and biomass per viable cell values (Table 3) were determined for each strain in both anaerobic and aerobic conditions. These conversion values were then used to approximate starting biomass values used in the dynamic FBA simulations. For each E. coli strain, the maximum glucose uptake rates used for dynamic FBA were those reported for E. coli K-12 strain W3110 [73] (10 and 18.5 mmol glucose/gDW/h for aerobic and anaerobic conditions, respectively). The predicted growth rates and biomass yields from the model were then compared to experimental results of batch culture of each strain under anaerobic or aerobic conditions. For both aerobic or anaerobic growth conditions, the calculated growth rate (1/hr) for each E. coli strain was compared to the experimentally determined values (Table 6) and the agreement between in silico and experimental values was strong and significant (Pearson correlation test statistic yields p < 0.0002 for both aerobic and anaerobic conditions), yet when viewed separately for aerobic or anaerobic conditions, the correlation was not as strong and was not significant (Pearson correlation test statistic yields p < 0.37 for aerobic, and p < 0.45 for anaerobic conditions). In addition, the growth yields were calculated under aerobic or anaerobic growth conditions in MOPS minimal media for both aerobic and anaerobic growth conditions, and the in silico growth yields for each strain were compared to those determined experimentally (Figure 6) and the agreement between in silico and experimental values was strong and significant (Pearson correlation test statistic yields p < 0.0001 for both aerobic and anaerobic conditions).

Table 6 Comparison of experimental and in silico net specific growth rates (h-1).
Figure 6
figure 6

Comparison of in silico and experimentally determined growth characteristics. Strain specific batch growth values determined using MOPS minimal media with the addition of 0.2% glucose for (A) aerobic and (B) anaerobic growth conditions.

The maximum glucose uptake rate from one E. coli strain from each lineage K-12 (MG1655), EHEC (EDL933), UPEC (UTI89) was determined from experimental data for both aerobic and anaerobic conditions. These results revealed that the uptake rates for E. coli K-12 MG1655 were not similar to previously published values for E. coli K-12 W3110 (Table 6). When compared to the two pathogenic lineages, the two E. coli K-12 strains appear to have significantly higher glucose uptake rates in aerobic conditions but significantly lower glucose uptake rates in anaerobic conditions. The dynamic FBA simulations were repeated using the measured lineage-specific glucose uptake rates as parameters. The recalculated growth rates (Table 6), still showed a moderately strong correlation overall between in silico and experimental values (p < 0.0006), yet when viewed separately, resulted in a decreased correlation for aerobic conditions, and a increased correlation between in silico and experimental values for anaerobic conditions (Pearson correlation test statistic yields p < 0.00003). Figure 7 shows a phylogenetic tree of the strain relationships plotted along with the growth rate data, displaying the aerobic and anaerobic growth rates calculated without the uptake rate correction. Additionally we determined the amount of time needed to reach stationary phase experimentally and computationally using dynamic FBA for each strain in each condition. Evaluation of these results reveal that some of the pathogenic lineages attain final biomass in less time compared to the E. coli K-12 strains (Table 7).

Figure 7
figure 7

Maximum likelihood phylogeny of E. coli and S. typhimurium strains used in this study constructed using the concatenated nucleotide sequence for 7 housekeeping genes. In silico E. coli core and strain-specific experimentally determined growth rates (blue), growth yield (green), and time to attain final biomass during batch growth (red) during aerobic (A) or anaerobic conditions (B).

Table 7 Time (h) to reach final biomass values in batch growth under both aerobic and anaerobic conditions.

Analysis of reaction essentiality

To further explore the metabolic differences and similarities between all six E. coli strain-specific GEMs, we compared reaction essentiality predictions for in silico conditions simulating aerobic and anaerobic growth in glucose minimal media. The number of predicted essential reactions shared in common for all six E. coli strains and also the E. coli ancestral core in both aerobic and anaerobic growth conditions was determined (n = 280) (additional file 16), and their corresponding reactions were further classified by metabolic subsystem (Table 8). Additionally for all six E. coli strains and the E. coli ancestral core, there were 15 additional conserved essential reactions predicted to be required under anaerobic conditions and these involve reactions assigned to subsystems for anaplerotic reactions (1), citric acid cycle (1), cofactor and prosthetic group biosynthesis (5), glycolysis/gluconeogenesis (2), purine and pyrimidine biosynthesis (1), inner membrane transport (2), and outer membrane transport (1). In addition to the shared predicted essential reactions for all E. coli strains examined, there were two lineage-specific reactions predicted as essential under both aerobic and anaerobic conditions for both EHEC strains (iEco1344_EDL933, iEco1345_Sakai), and the corresponding reactions were for fumarate reductase and glycolate oxidase. For the E. coli ancestral core (iEco1053_core), there were five additional reactions predicted as essential that were not predicted for any of the six E. coli strains (additional file 16), and these involve reactions assigned to subsystems for alanine and aspartate metabolism, glutamate metabolism, inner membrane transport, glycolate exchange, and outer membrane porin transport.

Table 8 Subsystem classification for essential reactions predicted for all six E. coli strains under aerobic conditions (n = 282).

Discussion

This study describes the generation of GEMs representing the union (pangenome) and also the intersection (core) of all identifiable metabolic reactions contained in sixteen genomes of E. coli. We used the E. coli pan-GEM to rapidly construct six E. coli strain-specific GEMs. A comparison between model growth predictions and Biolog phenotypes measured in the laboratory demonstrated an accuracy of more than 88%, including those under anaerobic conditions Additional quantitative data was generated for each strain and used to validate the correlation between model predictions and experimental physiology of the strains in the laboratory. These new E. coli GEMs serve as a framework to examine genome-scale metabolic similarities and differences between strains in an evolutionary context with respect to the commensal, EHEC, and UPEC lineages.

The two E. coli K-12 strains (MG1655 and W3110) are widely used laboratory strains that are believed to have diverged from the same parental strain (strain EMG2 or WG1) approximately 50 years ago [77]. The sole identified metabolic differences between the two E. coli K-12 strains based on genome comparison include the gatA gene that is involved in galactitol transport, dcuA and dcuC involved in C4-dicarboxylate transport metabolism, and also tnaB thought to be involved in the utilization of tryptophan as a carbon and/or nitrogen source [77]. Of these four metabolic gene differences, only inactivation of gatA leads to a loss of a reaction in iEco1335_W3110, compared to iEco1339_MG1655 since dcuA, dcuC and tnaB have other isozymes. The gatA gene contains an insertion sequence (IS) element in E. coli W3110, which suggests a phenotypic loss for galactitol utilization as a carbon source, yet experimental data (Figure 5) reveals that the strain can still use this substrate as sole carbon source, indicating that other transporters may permit galactitol transport for E. coli W3110. Although the two E. coli K-12 strains (MG1655 and W3110) exhibited no differences in their GEMs, quantitative and strain-specific differences were observed during batch growth in minimal media with glucose as the sole carbon source. While in silico predictions for growth yield were similar for iEco1339_MG1655 and iEco1335_W3110, experimental data reveal that in both aerobic and anaerobic conditions, strain MG1655 had higher growth yields, higher growth rate, and attained the final biomass value in less time than strain W3110 (Figure 7). Therefore, although the in silico models for these two strains are nearly indistinguishable, strain specific differences in complex traits such as biomass composition [78], ATP requirements, PO ratios, and glucose uptake rates may account for these experimental differences. Previous studies have shown that despite their nearly identical genomes and very similar growth patterns in a bioreactor, W3110 and MG1655 have many significant differences in their transcriptomes and proteomes. These include differential expression of pathways affecting central metabolism and the generation of precursor metabolites and energy [79] suggesting that future models for even these very similar strains will need to account for subtle genetic differences between strains to accurately predict phenotypic traits in simulated culture conditions.

Previous analyses of the E. coli pangenome estimated that on average each new E. coli genome sequence added about 176 unique genes to the pangenome [8, 9], and among these unique genes, we found each additional E. coli genome resulted in 27 metabolic gene additions corresponding to about 2 new metabolic reactions and 20 isozymes suitable for inclusion in the pan-GEM (Figure 2). Clearly some of the metabolic differences between E. coli strains are due to the addition of genes with new metabolic activity. However, our ability to add new reactions to the metabolic reconstructions is severely limited by the paucity of experimental characterization of the metabolic genes, proteins, and reactions unique to pathogenic strains. Since the strain-specific portions of the genomes remain largely uncharacterized, our current understanding of the metabolic functions they encode is dominated by the presence and absence of genes encoding functions represented in the iEco1339_MG1655 GEM. Many of the genes included in this model are not universally conserved among the genomes we examined; resulting in strain-specific GEMs with an average of 70 fewer genes than iEco1339_MG1655 (Table 9). This observation is also consistent with draft GEMs generated using the Model SEED [80] where the GEM for E. coli MG1655 contained more genes (>60) and reactions (>460) than the draft GEMs for all four pathogenic E. coli strains examined in this work (data not shown).

Table 9 Number of strain-specific orthologous genes in common with those contained in iEco1339_MG1655

Although carbon source utilization has become a standard method to assess the validity of computational metabolic model predictions, this study was the first to examine this procedure under anaerobic conditions. Initially, the accuracy of predictions for carbon source utilization during anaerobic conditions was less than those determined during aerobic conditions. We account this difference to comparisons between Biolog carbon source assays, which examine the ability of a microbial strain to generate energy from each sole carbon source, to in silico analysis that determines growth as a positive flux value for the biomass reaction. One possible explanation for experimental and in silico data discrepancies may be that a microbial strain may be able to generate energy from a given carbon source, but that the carbon source is not suitable to sustain growth (i.e. generate a positive biomass value). Therefore, rather than maximize the objective value for the biomass equation, we added two reactions to monitor the ability to generate energy through electron transfer to quinones, and in many cases this analysis resolved discrepancies between in silico predictions and experimental data, especially for anaerobic conditions. Although this methodology of examining carbon source utilization seems trivial, validation for accurate carbon source utilization is important for modeling complex environments such as those encountered in a host, as 31 of the 76 carbon sources tested here were used to simulate the conditions reflecting invasion of a human cell to study S. typhimurium LT2 infection [69]. Therefore, the validation of these strain-specific metabolic models for carbon source utilization will prove useful for future computational modeling of pathogenic E. coli strains in conditions encountered in the gastrointestinal tract or in other locations such as the urinary tract in mammalian hosts.

With the generation of the first GEMs for pathogenic E. coli strains, two EHEC strains and two UPEC strains, properties of these genome-scale metabolic networks were investigated to identify differences that may play a role in human disease. We analyzed two E. coli O157:H7 strains associated with foodborne outbreaks, strain EDL933 isolated from ground beef in the U.S in 1982 and strain Sakai isolated from contaminated radish sprouts that sickened thousands in Japan in 1996. Strains CFT073 and UTI89, which cause human disease outside of the intestine, were isolated from patients with acute urinary tract infections. A comparison of reaction deletions between the EHEC and UPEC metabolic networks reveals that the EHEC strains have more missing genes corresponding to reactions for inner membrane transport in comparison to the UPEC strains. In addition, the reaction deletions that occur in both pathogenic lineages relative to E. coli K-12 strains are mainly associated with genes involved in lipopolysaccharide biosynthesis/recycling and alternate carbon utilization. It seems likely that some of these missing reactions are the result of acquisition of genes during the evolution of the K-12 lineage. Perhaps some of the reactions missing from both pathogen lineages arise from parallel deletions arising from selective pressures common to both pathogens.

Batch growth experiments were conducted to compare growth yields, growth rates, and the amount of time to attain final biomass among strains. We were surprised that EDL933, Sakai and CFT073 have significantly higher growth rates than MG1655 during aerobic growth conditions yet the in silico predictions reveal little to no differences. We sought to determine if strain-specific glucose uptake rates may improve in silico growth rate predictions. Experimentally determined glucose uptake rates were actually lower for EDL933 and CFT073 than for MG1655, and did not improve in silico predictions. The growth yield values we measured in the laboratory also showed significantly (student's t-test statistic yields p < 0.05) higher yields for EDL933 and CFT073 than the two K-12 strains, but in silico predictions showed only minor strain-to-strain variations. Dynamic FBA using the strain specific E. coli GEMs predicts a similar growth rate from all models including the model for the ancestral core of E. coli. Yet the actual growth rates determined experimentally vary significantly between strains suggesting that our models are not accounting for some strain-specific factors such as oxygen uptake rates, biomass composition, ATP requirement parameters, or additional uncharacterized reactions. The length of time required to attain final biomass was significantly (student's t-test statistic yields p < 0.05) shorter for the four pathogens suggesting that they may be more efficient at biomass production during glucose catabolism, and dynamic FBA analysis accurately predicted this phenotypic difference among the strains.

In anaerobic batch growth conditions there were also differences between strains. All pathogenic strains have higher growth rates than the K-12 strains. The FBA predictions for EHEC strains both reflect this phenotype, but the in silico growth rate predictions for the UPEC strains did not reflect this trend. The experimentally determined glucose uptake rates are higher for both pathogenic lineages than K-12, and these organism-specific parameters improved the FBA predictions. The growth yields determined experimentally are significantly (student's t-test statistic yields p < 0.005) higher for the four pathogens than the K-12 strains. The length of time required to attain final biomass predicted by FBA and determined experimentally was significantly (student's t-test statistic yields p < 0.05) shorter for the EHEC strains than the K-12 strains. Overall, for anaerobic glucose catabolism, all four pathogens appear to grow better than both E. coli K-12 strains.

Even though the metabolic networks of each E. coli strain differ, there were relatively few strain-to-strain differences in reactions predicted as essential for the two growth conditions examined. While there were some identified for all strains that were unique for anaerobic growth in comparison to aerobic, there were relatively few differences between all strains. The two reactions (fumarate reductase and glycolate oxidase) predicted as essential for the E. coli O157:H7 strains, play essential metabolic roles for glycolate recycling and the reoxidation of menaquinol, and represent new targets for control strategies that may help to prevent and treat human EHEC illness.

The comparison of the pan- and core-GEMs reveals that a substantial fraction of the reactions in our current pan-GEM are also in the ancestral core-GEM (92%). However, our knowledge of the detailed biochemistry of the pangenome is likely incomplete since many of the genes in other E. coli strains have unknown functions. One reason why the number of reactions in the core- and pan-GEMs are so similar is because the genes that have been well-characterized biochemically in E. coli tend to be the genes that are conserved and likely ancestral. While the pathogenic E. coli strains are of great interest medically, they are not typically the focus of intense biochemical study to uncover the functions of their novel metabolic genes.

Overall, when data for aerobic conditions is viewed phylogenetically (Figure 7A), there is no clear trend specific to the two pathogenic lineages, yet it appears that E. coli CFT073 has evolved with a similar growth rate in comparison to the E. coli ancestral core predictions, where as all other strains have evolved with higher growth rates and yields (Figure 7A).

In contrast, in anaerobic conditions (Figure 7B), higher growth yields and faster batch growth performance were observed for both EHEC E. coli strains (EDL933 and Sakai), and the insight derived from E. coli ancestral core in silico predictions suggest that the UPEC and K-12 lineages have evolved with less efficient anaerobic glucose catabolism then the EHEC lineage. One possible explanation for this behavior may be that the K-12 and UPEC strains do not routinely encounter the selective pressure from anaerobic conditions, whereas the EHEC strains may have evolved for improved growth in anaerobic conditions enabling their growth in both bovine and mammalian GI tracts, thus suggesting that many EHEC strains may have a better-suited anaerobic metabolism for glucose utilization. These findings suggest that E. coli K-12 strains could be engineered to be more efficient for anaerobic batch growth and that other E. coli strains not examined in this work may yield similar results, yet additional studies are warranted to examine more E. coli strain-specific GEMs, quantitative parameters, and catabolism of additional substrates other than glucose.

Conclusions

Here we have presented an update to the E. coli K-12 MG1655 GEM and an extensive new collection of GEMs for five E. coli strains including the first for two pathogenic lineages. These models have been validated through experimental data for aerobic and anaerobic conditions. This work demonstrated a new approach for validation of carbon source utilization, yielding accuracies of >88% for aerobic and anaerobic conditions for all six E. coli strains examined. In addition new lineage-specific quantitative data were generated and led to validation of the correlation between in silico predictions and experimental batch culture data for glucose catabolism during aerobic and anaerobic growth conditions. Thus, the iEco1339_MG1655, iEco1335_W3110, iEco1344_EDL933, iEco1345_Sakai, iEco1288_CFT073, and iEco1301_UTI89 GEMs provide new suitable platforms for computing cellular phenotypes in conditions reflecting those encountered in mammalian hosts such as the intestine or urinary tracts and for further integration of high throughput data generated from these bacterial strains during the course of infection in animal models.

Distinctive lineage-specific differences in the GEMs were identified and reveal that the main delineating metabolic factors between pathogenic and commensal E. coli strains are due to numerous gene/reaction deletions and not additions, and this observation was consistent with the number of genes and reactions contained in draft GEMs for all six E. coli strains generated using Model SEED. Historically many researchers have noticed that some pathogenic E. coli strains grow faster in comparison to commensal strains such as K-12, yet this phenotype has remained unexplained. These strain-specific models offer new tools for further investigation to determine precisely what combination of gene/reaction deletions account for the faster and more efficient biomass production observed experimentally for some of the pathogenic strains, thus providing new insight for bioengineering of industrial E. coli strains.

The generation of an E. coli pan-GEM (iEco1712_pan) consisting of all metabolic genes and reactions from 16 E. coli genomes, represents a new framework to rapidly generate additional E. coli strain/lineage-specific GEMs consisting of > 1,200 genes and >2,000 metabolic reactions. Finally, this study is the first to use a "paleo systems biology" approach to generate a GEM for an ancestral core of E. coli (iEco1053_core) providing the first insight to metabolic traits of an E. coli relative that may have existed ~10 mya, and demonstrated the use of an ancestral model to examine a closely related phylogenetic group of E. coli strains in the context of evolution.