Introduction

Metabolic engineering aims at the proper rewiring of cell metabolism to construct genetically engineered strains that can serve as robust cell factories for a variety of purposes, including the biosynthesis of target substances [1]. Extensive studies have been conducted in this field to develop methods for efficiently producing suitable natural compounds by using either native cells or heterologous hosts [2, 3]. Systems metabolic engineering employs the concepts and capabilities of systems biology, synthetic biology, and evolutionary engineering at the systems level. It uses approaches from these disciplines and combines them with standard metabolic engineering techniques to facilitate the development of high-performance strains [4,5,6,7]. Metabolic systems biology plays a significant role in systems metabolic engineering because it incorporates a systems-level perspective on cellular metabolic functionalities [8,9,10,11]. Using metabolic systems biology, scholars can integrate omics data with results from genome-scale computational simulations to improve metabolic engineering techniques. These techniques can lead to the development of potentially productive and operationally optimized microbial strains [10,11,12,13].

The growth-coupled overproduction of (bio)chemicals is one of the most vital and practical objectives in systems metabolic engineering. Using this approach, synthesis of a desired compound can be guaranteed along with the reproduction of the engineered cell(s) [14, 15]. Genome-scale metabolic network reconstructions (GENREs) [16] and their relevant mathematical representatives (genome-scale metabolic models (GEMs)) have been developed for numerous microorganisms (e.g., Escherichia coli [17,18,19,20], Pseudomonas putida [21, 22], and Saccharomyces cerevisiae [23,24,25,26]). These tools are commonly used in computational systems biology for in silico production strain design. In particular, biased COnstraint-Based Reconstruction and Analysis (COBRA) computational techniques such as flux balance analysis (FBA) [27] and flux variability analysis (FVA) [28] are useful in analyzing GEMs [11, 12, 29, 30] (Additional file 1: Supplement A). Using COBRA, one can take advantage of the synergistic effects of a variety of basic elements including genes, gene products and metabolites to evaluate cells’ potential and make model-driven discoveries. Accordingly, in silico studies based on systems-level analyses inspire researchers to examine intervention strategies, including gene or reaction insertions, knockouts, and up- or down-regulations [31, 32]. For example, in several studies on gene and reaction knockouts, the candidates for the best combination of eliminations were identified [15, 33,34,35,36].

There are two basic conventional approaches for designing metabolic intervention strategies: top-down (e.g., OptKnock [33], OptGene [37], MoMAKnock [34], CiED [38]) and bottom-up (e.g., FSEOF [39], CosMos [40]) procedures [41, 42]. The top-down strategies are used to determine whether the potential interventions are advantageous and they iteratively search for the metabolic reaction network of interest until the optimal solutions are identified. The search space in the corresponding problems includes all combinations of a predefined number of reactions in a GEM. Due to the size of the developed and highly curated GEMs, this search space is extremely large and would explode with the cardinality of the combination. Thus, it would not be feasible to conduct an exhaustive exploration within a reasonable time frame.

Optimization techniques are commonly proposed to address this computational challenge. For example, OptKnock [33] is one of the most popular top-down frameworks. It uses bi-level optimization for in silico metabolic engineering. It aims to identify the appropriate sets of genes or reactions that, when knocked out, maximize the production rate of the desired biochemical coupled with biomass formation. To find an optimal solution for the growth-coupled production of the biochemical(s) of interest, OptReg [31] expands the capabilities of OptKnock by predicting appropriate up- or down-regulation of revealed crucial genes or reactions. RobustKnock [43] has been developed based on optimization techniques that guarantee the minimum production rate of the desired biochemical. Despite its novel approach, RobustKnock has not been widely used due to the difficulty of implementation.

The challenge in employing these optimization approaches is that the time required for finding an optimal solution grows exponentially with the cardinality of the combination. Worse, the solvers may fall into a deadlock situation and become trapped in an infinite loop. Several metaheuristic algorithms have been proposed to overcome this obstacle. These algorithms can pinpoint the suboptimal solutions within a reasonable time. For example, BAFBA [44] is a top-down metaheuristic method that deploys the bees algorithm [45] to find candidate gene knockouts and evaluate the results through FBA (Additional file 1: Supplement A).

Bottom-up approaches discover appropriate intervention strategies by comparing two flux distributions. One of these distributions relates to the wild-type, which aims to maximize the cell’s growth rate. The other distribution relates to the functional state, which takes into account the goal of the desired biochemical overproduction. Examples include the flux distribution comparison analysis (FDCA) algorithm [46] and OptForce [32]. Using OptForce, all coordinated reaction modifications contributing to target overproduction are identified based on significant differences between the two flux patterns (initial and desired) in the introduced network, calculated using FVA. FVA finds the boundaries of the reaction fluxes that can satisfy the optimality of the solution under steady-state flux analysis (Additional file 1: Supplement A).

In a nutshell, primitive top-down approaches use optimization methods to find an optimal solution at the cost of significant execution time. While top-down metaheuristic approaches require less computational resources, they are not guaranteed to find a globally optimal solution because the search space contains many local optima. On the other hand, bottom-up approaches can be used to find a set of potential solution candidates [14]. Despite various integrated computational and experimental studies, it is challenging to identify the most proper and operative alterations by only comparing the flux distributions of the wild-type to the ideally engineered states. Considering high order cardinalities and interventions [47] adds to the complexity of the problem.

State-of-the-art approaches have been developed to dramatically alleviate the computational challenges and significantly reduce the computational costs including (iteratively) pruning the search space [48, 49] and sequentially enumerating the smallest minimal cut sets (MCSs) in order to provide several solutions [50]. For example, Fast-SL properly explores a metabolic network of interest to find the most appropriate synthetic lethal reaction sets. Fast-SL improves the performance of a brute-force search algorithm by iteratively reducing the size of the search space, which substantially shortens the execution time [49]. MCSEnumerator is another novel method that attempts to find many solutions using MCSs aimed at the identification of either synthetic lethal sets or optimal strain design targets [50].

Calculating the MCSs in GEMs is a complex and challenging computational problem [51]. The scalability of MCSEnumerator algorithms paves the way for both theoretical and practical studies considering high-order simultaneous reaction interventions for strong growth-coupled product formation [52, 53]. However, for in silico strain design, the MCSEnumerator approach require predefining of the acceptable thresholds for growth and target product yields and this contributes to different drawbacks such as neglection of some appropriate suboptimal solutions [54].

In this paper, we present FastKnock as a next-generation knockout strategy algorithm that provides the user with all possible solutions for multiple gene and reaction knockouts to overproduce a (bio)chemical of interest. Unlike the MCSEnumerator approach, FastKnock does not rely on any special parameter settings and additional assumptions (except for predefining the maximum number of simultaneous reaction knockouts). We developed a delicate search and prune algorithm to accomplish this goal at a greatly reduced computational time and cost. Our method combines (and benefits from) both basic approaches to tackle the problems described above. It incorporates reaction knockouts to couple the biosynthesis of both primary (e.g., succinate, lactate, ethanol, etc.) and secondary metabolites (e.g., dodecanoic acid, polyketides such as erythromycin and terpenoids such as lycopene) with cell reproduction. It examines the GEM at the level of metabolic reactions while checking the corresponding genes to consider the gene dependency of the reactions.

The availability of all solutions allows us to systematically characterize and rank these strategies in accordance with some criteria including (a) substrate-specific productivity (SSP) [14, 15, 55, 56], (b) strength of growth coupling (SoGC), defined as the square of the product yield per unit substrate divided by the slope of the lower edge of the production curve [14, 15, 55, 56], (c) strain dynamic performance, which depends on yield, productivity, and titer [57, 58], and (d) other important indices reflecting environmental and operational considerations such as minimal production of undesired or toxic byproducts and the feasibility of CO2 biofixation. Some alternative criteria are discussed in [59]. Furthermore, it would be possible to evaluate the solutions and categorize them in the different major classes: potentially, weakly, directionally growth-coupled production (pGCP, wGCP, dGCP) and substrate-uptake coupled production (SUCP) raised in [60].

The article is structured as follows: Initially, the FastKnock algorithm is introduced. Subsequently, we present the outcomes of in silico experiments utilizing meticulously curated GEMs of E. coli. Finally, discussions and conclusions are articulated.

The proposed method

We developed the FastKnock algorithm, a versatile framework intended to enhance the production rate of a targeted metabolite within a cell while promoting growth. This desired metabolite may belong to either a primary or secondary category and can be of native or heterologous origin. Specifically, the algorithm can be applied to heterologous metabolites through the inclusion of the associated pathways into the GEM set.

In other words, FastKnock identifies reactions to be deleted from the network while ensuring that the flux of the biomass formation reaction remains above a specific cut-off (i.e., 1% of grWT, (Additional file 1: Supplement D) and maximizes the production of the desired substance(s) [61]. For practical applications, FastKnock can be utilized to identify subsets of network reactions that can be removed to significantly enhance the production of the desired biochemical. Specifically, FastKnock identifies the strains in which the production rate of the desired biochemical surpasses a predefined threshold in the base model (i.e., the model without any interventions). We refer to this threshold as Thchemical, defined as 5% of the maximum theoretical yield (i.e., the optimal production rate of the desired biochemical when it is considered the objective of the cell metabolism) in the base model. FastKnock, like other common approaches, employs preprocessing to reduce the size of metabolic model reactions and the search space. In the preprocessing phase (Additional file 1: Supplement C), a subset of reactions is identified and structurally excluded from the metabolic network to generate a reduced model denoted as Reduced_model. Additionally, the set of candidate reactions for deletion from the model is determined and denoted as Removable.

The search space of the exhaustive search includes all members of the power set of the Removable set with a particular maximum cardinality.

The search space grows exponentially as the size of the set increases. Therefore, conducting an exhaustive search and examining all subsets of reactions is highly time-consuming and infeasible. To address this challenge, our proposed algorithm utilizes information available only during the search procedure to dynamically narrow the search space—iteratively pruning the space and temporarily excluding certain reactions. This reduced search space is employed to identify knockout strategies, and we refer to it as the target space.

The FastKnock algorithm

Our proposed method aims to identify all solutions to a strain optimization problem (with a predefined maximum number of reaction deletions), enabling the growth-coupled overproduction of a metabolite (biochemical) of interest. Each solution represents a set of k reactions (i.e., a knockout strategy) in which the elimination of these reactions results in a new engineered strain, coupling the overproduction of the biochemical of interest with cell growth.

Testing whether a set of reactions is a proper solution is equivalent to solving an optimization problem in which the objective function is the growth of the cell and the elimination of reactions corresponds to modifying the associated constraints of the optimization problem (Additional file 1: Supplement F). By solving this optimization problem, we obtain the flux of all the reactions including the production rate of a desired biochemical. An appropriate solution (i.e., a knockout strategy) should satisfy the objective function along with providing a suitable production rate for the desired biochemical product.

To find all reaction subsets of size ≤ k, we employ a tree-based representation that encompasses all combinations of reactions with a maximum size of k, as outlined below. Figure 1 illustrates the overall procedure using a depth-first traversal tree. The root node at level zero corresponds to the base model in which no reaction is deleted (i.e., the reduced model). All sets of k reactions are placed in nodes of the tree in depth k (i.e., at the level k). The FastKnock procedure starts with investigating the elimination of a single arbitrary reaction r1 at level one. Whether knocking out r1 is a solution or not, we proceed to explore the simultaneous elimination of r1 and another reaction at level two. At each level, we consider only the reactions with nonzero flux, determined by the optimization problem solved in the parent node at the upper level (Additional file 1: Supplement F, part 2). The procedure of adding reactions with nonzero flux to the set of knockout reactions continues at lower levels of the tree until one of the two stopping conditions is met: a) we reach a leaf at level k (the predefined number of knockouts), or b) we reach a node guaranteed to have no solution in its subtree.

Fig. 1
figure 1

The traversal tree: All possible solutions are identified through a depth-first traversal of the tree. First, the identifyTargetSpace function is applied in the root node to the reduced wild-type network to determine the target space. Each reaction in this set is individually selected and removed from the network in Level 1. For each deleted reaction (or equally node) in Level 1, the identifyTargetSpace function is recalled to obtain the target space for the next level. For simplicity, we show only two levels of the traversal of the tree, which is enough to identify all single and double deletions

To check condition b in each node at level l < k, we determine whether the subtree may lack a solution by investigating the optimization problem. If the optimization problem already indicates an infeasible region at a node, adding more constraints in the subtree of the node would not lead to a proper solution (see Additional file 1: Supplement F).

The merit of the procedure is the technique of bounding the search by a) excluding the reactions with zero flux at each node from the target space of the node (Additional file 1: Supplement F, part 2) and b) checking the feasibility of reaching a solution before expanding the subtree of each node. If a reaction has zero flux based on the functional state of a node in the traversal tree, it is excluded from the target space of that node. However, in the children of that node, the functional states may change and the reaction can get nonzero flux. Thus, it might reappear in the search space when we explore the descendants at consequent levels. This dynamic and effective pruning of the search space enhances the efficiency of the algorithm.

Algorithm 1 represents the definition of a node in the tree, as well as, the main procedure of the FastKnock algorithm. Each instance of the Node contains the model, the set of the removed reactions, the search space, and the target space for the next level (Fig. 1). Specifically, at each node X of the tree at level L, we investigate a set of L reactions (deleted_rxns) to determine (a) whether X is a solution and (b) the new target space, which is the set of all reactions that could potentially be added to deleted_rxns for investigation at the next level.

Determining the target space at each node is critical, and it allows us to avoid the combinatorial explosion of the tree that would inevitably result from an exhaustive search. In particular, while we investigate drastically fewer subsets of reactions at the children nodes in Level L + 1, our analysis guarantees that FastKnock will find every candidate solution (Additional file 1: Supplement F).

In Algorithm 1, the traversal of the tree shown in Fig. 1 is represented by a set of queues: queue1 to queuetarget_level. Each queue contains a set of nodes. At each moment during the execution of the algorithm, queue l contains all children of a certain node at level l-1 being investigated. In this way, the subtrees are gradually constructed and removed (pruned).

Algorithm 1:

The FastKnock main procedure

figure a

The main algorithm consists of three functions: identifyTargetSpace, constructSubTree, and traverseTree. For each node, we compute a target space and a flux distribution using the identifyTargetSpace function. This function temporarily narrows the search space for the whole subtree of the node. The subtree of a node is constructed using the constructSubTree function. The traverseTree function recursively navigates the tree, based on a depth-first traversal.

We elaborate on these functions in the following subsections. Firstly, we determine the target space and subsequently describe the search procedure—detailing how the traversal tree is partially constructed and traversed. In our implementation, we enhanced the quality of the obtained solutions by ensuring a minimal chemical production rate (Additional file 1: Supplement I) and increased the speed of the algorithm through parallel processing (Additional file 1: Supplement G).

Identifying the target space

At steady state, a specific flux range for each reaction r is obtained (minFluxr ≤ fr ≤ maxFluxr), which leads to the optimal cellular objective (e.g., maximizing the biomass formation flux). Knocking out a reaction r is implemented by setting the allowable flux range [62] of the reaction to zero (i.e., lbr = ubr = 0 in the optimization problem of Equations a.1 and a.5 in Additional file 1: Supplement A). Note that when a reaction is reversible (i.e., the obtained flux range of a reaction includes zero minFluxr ≤ 0 ≤ maxFluxr), knocking out that reaction alone has no effect on the optimal linear objective value of the network in FBA (Additional file 1: Supplement F).

Here, the main idea is to prune the target space by considering only the set of reactions with nonzero flux values. This approach significantly reduces the size of the target space and thus reduces the execution time of the algorithm.

We denote reactions that lack a zero value in their obtained flux range as Rxns+ in each node of the tree:

\({Rxns}^{+}=\{r \in Rxns ~|~{minFlux}_{r}>0 ~~~or~~~ {maxFlux}_{r}<0\}.\)

The target space of each node, which is the set of reactions that could be appropriate for deletion, is obtained using the identifyTargetSpace function (Algorithm 2). The search operation at each node is limited to Rxns+  ∩ Removable, as shown in Line 6 of Algorithm 2.

It is worth mentioning that by any manipulation of the model, the fluxes of other reactions may change. Therefore, the functional states (i.e., flux distributions) should be analyzed repeatedly after each modification (i.e., after each reaction knockout) using FBA to identify the reactions that carry nonzero flux in the network (modelX) (Lines 4–5). The flux_dist variable of the node is updated at Line 4. The intersection of these reactions and the Removable set construct the target space of node X in Line 6.

Algorithm 2:

Identifying target space for each node

figure b

The search procedure

Here, we introduce a depth-first search procedure based on the traversal tree shown in Fig. 1. Each node of the tree has its own subtree, which is traversed before moving on to its sibling nodes. This depth-first search procedure is implemented using the traverseTree function of Algorithm 3.

In each call, the traverseTree function visits a certain node X (i.e., the first node of the queuelevel) and, if needed, calls the constructSubTree function to create the corresponding subtree of the node (Algorithm 4). The constructSubTree function creates the children nodes of X, which is a set of nodes that are placed in level = X.level + 1. For each child, deleted_rxns is initialized by adding one of the reactions in X.target_space to the X.deleted_rxns.

It is clear that the order of the knocked-out reactions is not important. In FastKnock, repetitive permutations of the reactions are ignored using a checkedlevel queue for each level of the tree. Generally, N levels are considered for simultaneously knocking out N reactions from the cell. Precisely, the reaction selected for the ith level is not allowed in the (i + 1)th to Nth levels. To generate all combinations of these reactions, the checkedL queue is used at level L. At level L, by deleting a reaction r from the target space, r is added to the checkedL. This excludes the reaction from the target space of the subsequent levels.

Algorithm 3:

Traversing the tree

figure c

Algorithm 4:

Constructing subtrees of the traversal tree

figure d

A traversal example

To illustrate the formation of the traversal tree, a sample node of Fig. 1 is explained here. Consider node X = {r1, r4} representing a double knockout of the reactions r1 and r4. Deletion of the reaction r1 as a single reaction knockout strategy has been checked in the parent node {r1} beforehand. Also, double knockout of the reactions r1 and r2 and triple knockout of {r1, r2, r3}, {r1, r2, r4}, and {r1, r2, r6} have been checked in the sibling node {r1, r2} and its children nodes before visiting node X. Visiting node X corresponds to checking the removal of {r1, r4} as a potential knockout strategy. Afterward, its subtree is generated to investigate the simultaneous removal of all the subsets of the removable reactions along with r1 and r4.

Naively, for each reaction in the removable set, we should generate a child node for X (obviously except for the reactions r1 and r4). As mentioned in the root node of Fig. 1 in this example, the set of removable reactions is supposed to be {r1, r2, r3, r4, r5, r6}. In a very simple search procedure, node X would have four child nodes (i.e., {r1, r4, r2}, {r1, r4, r3}, {r1, r4, r5}, {r1, r4, r6}). Generally in an exhaustive search, for each node, we may have too many children nodes and such a branching factor leads to a large search space and hence an excessive runtime.

In FastKnock, the size of the target space determines the number of children nodes of X, which is limited to Rxns+  ∩ Removable, where Rxns+ consists of nonzero flux reactions (suppose {r2, r3, r7} for node X). Because the reaction r2 is checked in the subtree of the sibling node {r1, r2} (see checkedL2 = {r1, r2, r4} in node X), and the reaction r7 does not exist in the removable set of the model, the target space of node X contains only r3. In this way, the search space is drastically narrowed down by generating a limited number of children.

In this example, the reaction r5 does not exist in the Rxns+ of node X, due to its zero flux. It means that the node {r1, r4, r5} will not be added as a child of X, because it produces the same conditions as exist in node X (i.e., the same target space that results in a duplicate node). As discussed in Part 2 of Supplement F, no feasible solution would be missed because of this search space reduction (See Additional file 1: Supplement F).

It should be noted that the target space is temporarily reduced and its size may increase in the descendant nodes. In the node {r1, r4, r3}, the set of nonzero flux reactions could include any of the reactions in the model.

Co-knockout of the reactions

For practical applications, one important feature of FastKnock is that it can optionally consider genes as the basis of candidate reactions for deletion. This is a realistic assumption because knocking out genes to remove a specific reaction often leads to removing a predetermined set of reactions that are simultaneously knocked out.

In fact, a reaction cannot be removed from a living cell while its genes are being manipulated in vivo. Therefore, the mapping of reactions to genes should be considered in the algorithm to reach realizable results. In other words, a reaction is knocked out from the network based on its associated gene rule. Furthermore, the clustering of reactions based on the associated gene rules could improve the efficiency of the search procedure for finding the appropriate targets.

In the simplest form of gene rules, a reaction could be removed by knocking out at least one gene from a set of genes (logical AND relation) or by simultaneously deleting a set of genes (logical OR relation). However, in general form, gene rules describe complex relationships between genes and reactions. Thus, well-known knockout strategies for in silico strain design are based on reactions or genes but do not simultaneously consider both of them.

For capturing the complexity of gene-reaction relationships, in this work, we label a set of reactions as co-knocked out if they are removed due to the elimination of a single gene. In the preprocessing phase of the proposed framework, for each reaction r, a set of reactions named Co_KnockedOutr is defined that contains all the reactions that are intrinsically removed by the deletion of a set of genes that should be knocked out for removing the reaction r. Supplement E elaborates a modified version of the proposed algorithm based on knocking out genes rather than reactions, which discusses different forms of gene rules (See Additional file 1: Supplement E).

Although the presented method enhances time efficiency, it can be excluded from the main method to obtain comparable results with the state-of-the-art reaction-based approaches. On the other hand, this technique can be incorporated as a preprocessing step in other metabolic engineering algorithms and in silico strain design approaches.

Results

We implemented the FastKnock algorithm using Python language programming (Version 2.7) and the COBRApy library (Version 0.15.4) [63]. We evaluated the performance of FastKnock using various examples, and we compared these results to OptKnock and MCSEnumerator approaches.

To assess FastKnock’s performance and demonstrate its capabilities while addressing potential limitations of other methods, such as the impact of model size and culture medium richness on method performance, we selected four highly-curated GEMs for E. coli (i.e., iJR904 [17], iAF1260 [18], iJO1366 [19], and iML1515 [20]) for our experiments. We investigated the excessive production of renowned metabolites (succinate, lactate, 2-oxoglutarate, and lycopene, functioning as both primary and secondary biological products) across various media types, including mineral and rich mediums, as diverse case studies.

We assessed the overproduction of the primary metabolites using these GEMs as wild-type models (referred to as Strain0 in the in-silico experiments), focusing on two mineral and one rich cultivation conditions. The first condition, CM1, involved iM9 mineral medium supplemented with glucose (a maximum allowable glucose uptake rate of 10 mmol.gDW−1 h−1) under aerobic conditions (a maximum allowable oxygen uptake rate of 15 mmol.gDW−1 h−1). The second condition, CM2, included iM9 mineral medium with the same glucose supplementation (a maximum allowable glucose uptake rate of 10 mmol.gDW−1 h−1) under anaerobic conditions (an oxygen uptake rate of 0 mmol.gDW−1.h−1). In a complex and rich environment, more inputs activate cellular functions, leading to the involvement of more pathways and reactions in the network. In order to further evaluate the exhaustive enumeration performance of the FastKnock algorithm in a rich cultivation condition, we conducted additional in silico experiments considering succinate overproduction in Luri-Bertani (LB) medium. The iLB medium constraints were determined based on [64, 65].

The secondary metabolite, lycopene, as a heterologous product is produced in E. coli only under aerobic conditions. We considered two strains for lycopene production. For the first recombinant strain (Strain1), the lycopene biosynthesis pathway (i.e., the methylerythritol phosphate (MEP) pathway [66]) is added to the wild-type E. coli model [39, 67, 68]. For the second recombinant strain (Strain2), some other modifications are applied based on [69]. This provides an intracellular pool of pyruvate as the important precursor of lycopene production [70]. Additional file 2: Tables S1 and S2 in Supplement J.I show the maximum theoretical yield for the biosynthesis of the metabolites (i.e., maximum of vchemical) and our threshold for their production (Thchemical = 0.05 \(\times\) vchemical).

Some results of the preprocessing phase is shown in Additional file 2: Table S3 of Supplement J.I, illustrating the number of reactions excluded from the search space before the main exploration procedure and before obtaining the removable reactions. The size of the search space is drastically reduced to 20% of all the reactions. In the Reduced_model, the blocked reactions and dead ends are removed [62]. Also, as described after the preprocessing phase, the search space is reduced iteratively and temporally during the search procedure of the FastKnock algorithm. This significantly reduces the number of linear programming problems (LPs) that must be solved. Specifically, compared to an exhaustive search, the reduction rates are 80%–85% for single knockouts, 96%–97.5% for double knockouts, 99.0–99.5% for triple knockouts, and above 99.8% for quadruple and quintuple knockouts (Table 1). The number of LPs is equal to the number of nodes in the traversal tree shown in Fig. 1, and it is independent of the target metabolite to be produced.

Table 1 The number of linear programming problems (LPs) solved by the FastKnock algorithm compared to an exhaustive search of the preprocessed search space (Strain0 in CM2 cultivation medium)

In comparison, in the exhaustive search the algorithm must check all the combinations of the reactions in the search space. For instance, iJR904 in CM2 has 208 reactions in its search space. For finding double-knockout results in the exhaustive search, the algorithm must check all the double combinations of the elements in the search space (c(208, 2) = 21,528). Due to its time complexity, the exhaustive approach is not feasible for high-order reaction knockouts; thus, we compared FastKnock to a simple exhaustive search method for single, double, or triple knockouts. Our experiments showed that a significant reduction in the number of LPs is critical because it allows us to investigate and find all possible solutions.

Table 2 presents the total number of solutions obtained (regarding CM2 cultivation medium) using the FastKnock algorithm. The results are reported in two cases: the maximum production rate (vmax) and the guaranteed production rate (vgrnt) as discussed in Supplement I.

Table 2 The number of solutions in iJR904 (Strain0 in CM2 cultivation medium)

We also compared our solutions to the results of the exhaustive search for single, double, and triple deletions for succinate production in iJR904 to verify the completeness of the FastKnock algorithm. Both approaches found two solutions for a single deletion. The exhaustive search for a double deletion found 398 solutions, of which only 58 solutions were true double deletions. The rest of the solutions were not acceptable because either (a) the combination of each single deletion solution and a zero-flux reaction was inappropriately considered as a double-deletion solution or (b) the elimination of a reaction in the co-knocked-out sets led to the removal of all the reactions in the set, while in the exhaustive search, the removal of each reaction in the set is counted as a separate solution. For triple deletions, the exhaustive search found 39,407 solutions, of which 887 were unique and acceptable. FastKnock found all the 887 solutions.

Table 3 presents the best solutions found for iJR904 GEM (See also Additional file 2: Tables S4-S10). Supplement J.II includes the results for the iAF1260 (Additional file 2: Tables S11-S17) and iJO1366 (Additional file 2: Tables S18-S27) GEMs as well. As an example, we found that the best result for succinate overproduction is obtained by deleting one reaction, ADHEr, which is knocked out by the deletion of the gene b1241. Consequently, the deletion of the b1241 gene also causes the deletion of the LCADi_copy2 reaction. In this situation, the growth rate is 0.16 (h−1) as shown in the “biomass formation rate” column. After the deletion of ADHEr, the succinate production can vary between 5.11 and 9.50 mmol.gDW−1 h−1, which is more than the considered 0.85 mmol.gDW−1 h−1 threshold; hence, an acceptable amount of succinate production is guaranteed. Figure 2 presents the production envelopes calculated for the best cases presented in Table 3.

Table 3 The guaranteed rate of succinate growth-coupled production in in iJR904 (Strain0 in CM2 cultivation medium)
Fig. 2
figure 2

Production envelopes for the best solutions presented in Table 3 regarding succinate production from single to quintuple reaction deletions in iJR904. Knocking out more genes improves growth coupling. In particular, with quadruple and quintuple knockouts, significant production is guaranteed for any growth rate

The analyses carried out with relatively older models, specifically iJR904, iAF1260, and iJO1366, were primarily focused on comparing the performance of FastKnock with both earlier methods (i.e., OptKnock) along with experimental studies and more recent approaches (i.e., MCSEnumerator) documented in the literature. As previously mentioned, additional tests were conducted to demonstrate that the effectiveness of the FastKnock method remains unaffected by the size of the model and the richness of the culture medium. These supplementary examinations included assessing succinate overproduction in medium CM2 using model iML1515 and investigating succinate overproduction in iLB rich environment under aerobic conditions using both iJR904 and iML1515. The maximum rates of succinate growth-coupled production associated with these supplementary examinations are presented in Tables 4, 5, 6.

Table 4 The maximum rates of succinate growth-coupled production in iML1515 (Strain0 in CM2 cultivation medium)
Table 5 The maximum rates of succinate growth-coupled production in iJR904 in rich medium (Strain0 in LB cultivation medium)
Table 6 The maximum rates of succinate growth-coupled production in iML1515 in rich medium (Strain0 in LB medium)

For practical applications, various evaluation indices, including product yield, SSP, and SoGC [55], and other important indices reflecting environmental and operational considerations, can be used to choose the most appropriate cases from the solutions found by FastKnock (Tables 7 and Table 8). In particular, the feasibility of CO2 biofixation and minimal production of undesired or toxic byproducts are also significant indexes for systems metabolic engineering purposes. For instance, an engineered strain that can simultaneously fix CO2 and produce a suitable biochemical might be preferred regarding environmental considerations. When all solutions are available, the analysis and identification of such appropriate cases is easily possible.

Table 7 The best solutions based on the desired evaluation indexes for succinate production under anaerobic condition (Strain0 in CM2 cultivation medium) in iJR904
Table 8 Comparison of FastKnock, OptKnock and experimental results reported in the literature for succinate production. The iJR904 model (Strain0) is used in the in silico experimentations (M9 cultivation medium)

Comparing FastKnock to OptKnock (case study: succinate overproduction in E. coli iJR904)

We analyzed FastKnock solutions in order to find the most appropriate outcomes based on three criteria, yield, SSP, and SoGC (Table 8). Additionally, the feasibility of CO2 biofixation is also examined and the relevant results are summarized, where a negative CO2 exchange flux represents a desirable CO2 uptake rate. We compared these best solutions obtained by FastKnock with the associated OptKnock results as well as experimental data available in the literature [71,72,73]. Note that OptKnock aims at, and terminates on, finding a single solution. Therefore, comparing it with FastKnock in terms of computational costs is not meaningful.

We found that a solution with the best production rate or an optimal solution of the optimization algorithms such as OptKnock does not necessarily bring the best SoGC and the other desired indexes. However, by identifying all the possible solutions for the problem, FastKnock allows a comprehensive analysis. For example, knocking out ADHEr, ATPS4r, and LDH_D is expected to lead to the best biomass formation rate (0.16 h−1) and the highest SoGC (3.01E-2 h−1), which is twice the best SoGC provided by OptKnock solutions while the other indices corresponding to this knockout are comparable with the best numbers shown in the table (i.e., a production rate of 8.90 vs. 12.24 mmol.gDW−1.h−1, a yield of 0.89 vs. 1.22, an SSP 1.42E-1 vs. 1.46E-1 h−1, and a CO2 exchange flux of −8.76 vs. −9.36 mmol.gDW−1.h−1). A relatively high value of SoGC can also be desirable from a dynamic perspective because it indicates that even under non-optimal conditions, the biosynthesis of the target biochemical is coupled with the growth of the production strain. This situation is usually encountered in batch and fed-batch cultivations in the logarithmic phase of growth.

A more striking example is the comparison between the PTAr, PYK, ATPS4r, and SUCD1i quadruple knockout identified by OptKnock with the two solutions with the best production rate (ADHEr, LDH_D, PFL, and THD2) and the best SoGC (ADHEr, LDH_D, HEX1, and THD2) identified by FastKnock. While the biomass formation rate of the FastKnock solutions (0.11, 0.13 h−1, respectively) are comparable with the OptKnock solution (0.16 h−1), the yield and SSP is an order of magnitude higher for FastKnock solutions. A serious issue with this OptKnock solution is the very low SoGC (1E-4 h−1), which indicates that the production rate would be hardly coupled with growth. In comparison, the predicted SoGC for FastKnock solutions are 2.85E-2 and 3.09E-2 h−1, respectively. Another disadvantage of OptKnock solution is a relatively high CO2 production rate of 9.03 mmol.gDW−1.h−1 while in the FastKnock solutions the CO2 exchange fluxes are −6.12 and −8.77 mmol.gDW−1.h−1, respectively.

Among the quintuple knockouts, the predicted SSP and SoGC for one of the FastKnock solutions (ADHEr, LDH_D, GLUDy, PFL, and THD2) are almost twice those of the OptKnock solution (ADHEr, LDH_D, PTAr, PYK, and GLCpts) while the other indices are comparable.

An important concern about OptKnock is possible false positive outcomes due to different scenarios. Firstly, false positives could be obtained due to the associated linear programming problem, focusing on maximizing the target reaction flux neglecting minimum possible production flux. In other words, OptKnock relies on FBA, potentially leading to false positives by not considering flux variabilities [43]. In contrast, FastKnock could guarantee the minimum production flux, regarding FVA. The second scenario is about the nature of the associated primal bi-level optimization problem, which is reformulated in the form of a single-level Mixed-Integer Linear Programming (MILP) problem. To solve the MILP problem, OptKnock utilizes the branch and bound method, which may generate false positives and even pose a risk of the algorithm getting trapped in an infinite loop. In contrast, FastKnock employs a different approach based on a search problem to explore the entire solution space. With appropriate evaluation criteria, unlike OptKnock, if it fails to provide a solution, it implies that no valid solution exists for the given criteria.

It is also important to note that, in some cases, false positives stem from limitations of the models due to incomplete knowledge of the genotype–phenotype relationships of the (micro)organism at hand in the process of model development. In this case, any in silico strain design approach intrinsically produces false positives [19].

Comparing FastKnock to MCSEnumerator (case study: ethanol overproduction in E. coli iAF1260)

As mentioned previously, MCSEnumerator is a novel method for metabolic engineering based on the identification of minimal cut sets [50]. This approach applies a filtering step to reduce the computation time, which allows the user to find thousands (but not all) of the most efficient knockout strategies in genome-scale metabolic models. MCSEnumerator can be used to find a large number of metabolic engineering interventions, but it has various drawbacks. In this section, we compare MCSEnumerator with FastKnock. To aid in this comparison, we consider the case study of ethanol production in E. coli iAF1260 GEM with an 18.5 mmol.gDW−1 h−1 glucose uptake rate under anaerobic conditions (iM9 medium) as presented in the MCSEnumerator publication.

We should discuss the effect of the MCSEnumerator thresholds on its solution set. It would not be feasible to apply MCSEnumerator using thresholds that are relaxed enough to find all the solutions (Supplement H). We illustrate this with an example in Fig. 3. The blue production envelope, which has the best SoGC value, is associated with a solution found by both MCSEnumerator and FastKnock. The associated solutions (with the red and green diagrams), which are the worst cases among the shown envelopes, were not found by MCSEnumerator because of the production threshold considered. This illustrates the efficiency of the primary filtration of the MCSEnumerator method. The starting point might not be the best factor for filtering appropriate solutions. For example, the minimum production rate based on the orange envelope is similar to the green envelope in Region Y3, which is below the threshold considered for ethanol production flux. Nevertheless, the orange envelope may still be associated with a proper solution due to its relatively high SoGC, but it was not found by MCSEnumerator.

Fig. 3
figure 3

Five exemplar production envelopes for strategies identified by FastKnock for ethanol production in iAF1260, which is partitioned into four regions based on the growth rate (x axis) and the production flux (y axis) as in [15]. The horizontal dashed line indicates the threshold for production rate as considered in [15], and the vertical dashed line indicates the growth rate threshold. SoGC(× 100), product yield (Yp/s) and SSP(× 10) of the quadruple knockout strategies are shown in the top right legend. Unlike FastKnock, MCSEnumerator finds none of these strategies except the one shown in blue

Moreover, the predefined thresholds may result in the situation where some solutions obtained by MCSEnumerator are not necessarily and genuinely minimal. This implies that an appropriate solution with a cardinality of 'n' might exist but goes undiscovered, while it may appear in some higher-order solutions (> n) that include irrelevant additional reactions.

While the MCSEnumerator algorithm and its modified versions may exhibit shorter execution times, the number of solutions they can provide, given certain settings, constitutes only a very small percentage of the total potential solutions. Therefore, comparing the MCSEnumerator and FastKnock algorithms based solely on execution time is not rational, as these algorithms neither yield the same output nor pursue the same objective.

Discussion

Overproduction of biochemicals of interest coupled with significant growth rates might be optimistic and may not always be easily achievable due to e.g., competing pathways in a metabolic network [43]. This can lead to weak coupling especially under suboptimal growth conditions. Alternatively, strong coupling requires that production must occur even without growth [14]. Specifically, product synthesis rate is said to be strongly coupled with biomass formation if the product yields of all steady-state flux vectors are equal to or larger than a predefined product yield threshold [15]. Accordingly, SoGC is defined as the square of the product yield per unit substrate divided by the slope of the lower edge of the production curve [55] (see Fig. 2).

SoGC is a non-linear objective function and thus OptKnock and most of the in silico strain design methods cannot be used to find knockouts with optimal SoGC. OptGene [37] is a heuristic approach that can be used to identify a single knockout strategy with optimal SoGC [55]. However, knocking out the single identified solution by OptGene may not be practically feasible e.g., due to the genes’ loci. Therefore, identification of all knockout strategies by FastKnock is desired and provides the expert experimentalists with the opportunity to choose from a short list of knockout strategies that are filtered for a relatively high SoGC, SSP, yield, etc. This shortlist can be investigated for advantageous solutions in terms of environmental considerations such as CO2 biofixation [71, 72], minimal production of undesired or toxic byproducts, practicality of knocking or silencing genes, etc. (Table 8) [6, 55, 73,74,75].

We proposed an efficient next-generation algorithm, FastKnock, which identifies all proper reaction or gene knockout strategies (with predefined maximum number of deletions) for the overproduction of a desired biochemical. We reached this goal by significantly pruning the search space without omitting any solutions. For example, in our experiments, FastKnock was required to explore only 1% of the search space in the pruned model when identifying all triple-knockout strategies. The rate of this reduction increases as more reactions are knocked out (e.g., about 0.1% for quadruple-knockout strategies and about 0.01% for quintuple-knockout strategies) (Table 1). This drastic reduction of the search space enables our novel FastKnock method to find the set of all possible solutions in a feasible time duration.

Finding the best and most suitable trade-off between cellular growth and the production of the desired biochemical is one of the key benefits of FastKnock results. Moreover, determining all possible solutions allows for the selection of the most appropriate strategy based on any desired evaluation index, including product yield, SSP, and SoGC (Tables 7 and 8). This is an important and useful feature of our search strategy, especially for practical applications [59].

We compared FastKnock to MCSEnumerator [50], which has been shown to find more efficient solutions than the MCS methods [76,77,78]. We found that the solutions identified by MCSEnumerator may not be minimal. Also, due to initial filtering, MCSEnumerator misses solutions that may be practically more appropriate than the best solutions it finds. In comparison, FastKnock identifies all minimal solutions, which can be mined later based on any desired criteria.

When all solutions are available, one interesting analysis that can be conducted is to identify the reactions or genes that are common among a relatively large number of solutions. For instance, in the case of iJR904, to produce succinate in iM9 under anaerobic conditions (CM2), about 70% of solutions include at least one of ADHEr or PFL reactions (Fig. 4). Moreover, when three or more reactions are to be deleted, the best results in terms of the succinate production rate include both ADHEr and PFL (Table 7). Collectively, this analysis suggests that ADHEr and PFL reactions support pathways that compete with succinate production, and these pathways are blocked when ADHEr and PFL are eliminated [79, 80]. Based on this analysis, we suggest using a heuristic for higher-level knockout combinations in which one or more reactions (e.g., ADHEr or PFL) are removed in searches for six or more knockouts. In this way, one would need to search for fewer reactions to knockout. We believe this heuristic would reduce the search space by an order of magnitude at the expense of losing not more than half of the solutions.

Fig. 4
figure 4

The rate of presence of the ADHEr and PFL reactions in all the possible solutions counted in Table 7 for succinate production

Conclusion

While in silico strain design results do not necessarily lead to in vivo overproduction, obtaining all possible knockout strategies is critical for determining the best practical and most efficient strategy. The FastKnock algorithm is a general framework that can be used to overproduce any metabolite. It is not limited by factors such as richness and complexity of the cultivation conditions or large size of the metabolic network of the strain of interest. FastKnock identifies strategies, if exist, with a production rate higher than the desired threshold determined by the user.