Artificial chemistries (Dittrich et al., 2001) are computational models of chemical systems and, in particular, of biochemical systems such as metabolic pathways. An artificial chemistry consists of a set of molecules, a set of reaction rules that produce new molecules from already existing molecules, and the definition of the dynamics of the system, which specifies the application conditions of the rules, the preference in their application, etc. (Rosselló and Valiente, 2005b).
A metabolic pathway can be regarded as a coordinated sequence of biochemical reactions and is often described in symbolic terms, as a succession of transformations of one set of substrate molecules into another set of product molecules (Rosselló and Valiente, 2004). Substrate and product must be compatible chemical graphs for a pathway between them to exist (Rosselló and Valiente, 2004, 2005a, 2005b).
Metabolic pathways are often represented as directed hypergraphs, with substrate and product molecules as nodes and biochemical reactions as hyperarcs. Since a chemical graph can represent the disjoint union of a set of molecules, however, the equivalent representation of artificial chemistries and, in particular, of metabolic pathways as directed graphs is more natural. An artificial chemistry defined by a set of chemical reaction graphs is thus represented as a directed second-order graph, with the chemical graphs that represent the sets of substrate and product molecules as vertices and the applications of the chemical reaction graphs, including information on atom mapping, as arcs.
Unfortunately, the size of the artificial chemistry defined by a set M of chemical graphs and a set R of chemical reaction graphs is often exponential in the size of M and R, and thus artificial chemistries are known for very small instances only, involving a few dozen molecules and biochemical reactions. Therefore, we consider in this paper the problem of obtaining a substantial portion of the artificial chemistry defined by a set of biochemical reactions while avoiding the complexity of reconstructing the whole artificial chemistry.
The constraints we impose on the reconstruction process are threefold:
- (1) The initial chemical graphs represent all sets of at most m metabolites among those involved in the set R of reactions, for some fixed, but arbitrary, m (in examples and applications in this paper we shall always take m = 2).
- (2) The reconstruction process is restricted to a fixed, but arbitrary, number k of derivation steps.
- (3) The initial and final sets of metabolites of every metabolic pathway belong to the set of initial chemical graphs.
While the first two constraints (on the size of the initial chemical graphs and the lengths of the metabolic pathways under inspection) are motivated by complexity considerations alone, the third constraint allows for directing the search of new metabolic pathways inside the artificial chemistry. That is, instead of building the artificial chemistry by applying the biochemical reactions in every possible way to each of the initial chemical graphs, we perform a bidirectional search by constructing forward metabolic pathways of length at most k starting in initial chemical graphs and backward metabolic pathways of length at most k ending in initial chemical graphs, and then gluing them to obtain all metabolic pathways of length at most 2k starting and ending in initial chemical graphs.
Given a set R of biochemical reactions and a number k of derivation steps, the detailed procedure for reconstructing all metabolic pathways of length up to 2k that use the metabolites and reactions in R and start and end in multi-molecules of at most m components is the following:
- First, we extract the set M of all chemical graphs representing sets of at most m metabolites appearing in substrates and products of the reactions in R. We call the elements of M the initial chemical graphs.
- Next, we identify all compatibility classes in M (maximal subsets of compatible initial chemical graphs). Biochemical reactions transform chemical graphs into compatible chemical graphs and, therefore, the origin and the end of a metabolic pathway will be compatible sets of metabolites. Thus, since we restrict ourselves to metabolic pathways starting and ending in initial chemical graphs, it suffices to search for metabolic pathways starting and ending in each compatibility class of initial chemical graphs.
- Then each compatibility class C in M is considered as a set of potential substrates $C_F^{(0)}$ and a set of potential products $C_R^{(0)}$ for the reactions in R.
- For every i = 1, …, k, the forward application of the reactions in R to the elements of $C_F^{(i-1)}$ produces a set of multi-molecules $C_F^{(i)}$, while the reverse application of these reactions to the molecules in $C_R^{(i-1)}$ produces a set of multi-molecules $C_R^{(i)}$.
- Any nonempty intersection of a set obtained by forward application and a set obtained by reverse application of reactions yields a new pathway between elements of C. To avoid repetitions, it is enough to check whether each $C_F^{(i)}$ intersects $C_R^{(i)}$ and $C_R^{(i-1)}$. More specifically:
- For i = 1, the forward application of the reactions in R to the molecules in $C_F^{(0)}$ produces a set $C_F^{(1)}$ of new molecules, and the reverse application of the reactions in R to the molecules in $C_R^{(0)}$ produces a set $C_R^{(1)}$ of new molecules. Then:
  - Every member of $C_F^{(1)} \cap C_R^{(0)}$ yields a new pathway $C_F^{(0)} \to C_F^{(1)} \cap C_R^{(0)}$ of length 1.
  - Every member of $C_F^{(1)} \cap C_R^{(1)}$ yields a new pathway $C_F^{(0)} \to C_F^{(1)} \cap C_R^{(1)} \to C_R^{(0)}$ of length 2.
- For i = 2, the forward application of the reactions in R to the molecules in $C_F^{(1)}$ produces a set $C_F^{(2)}$ of new molecules, and the reverse application of the reactions in R to the molecules in $C_R^{(1)}$ produces a set $C_R^{(2)}$ of new molecules. Then:
  - Every member of $C_F^{(2)} \cap C_R^{(1)}$ yields a new pathway of length 3
$$C_F^{\left( 0 \right)} \to C_F^{\left( 1 \right)} \to C_F^{\left( 2 \right)} \cap C_R^{\left( 1 \right)} \to C_R^{\left( 0 \right)}$$
.
  - Every member of $C_F^{(2)} \cap C_R^{(2)}$ yields a new pathway of length 4
$$C_F^{\left( 0 \right)} \to C_F^{\left( 1 \right)} \to C_F^{\left( 2 \right)} \cap C_R^{\left( 2 \right)} \to C_R^{\left( 1 \right)} \to C_R^{\left( 0 \right)}$$
.
- And, recursively, the forward application of the reactions in R to the molecules in $I_F = C_F^{(i-1)}$ produces a set $C_F = C_F^{(i)}$ of new molecules, and the reverse application of the reactions in R to the molecules in $I_R = C_R^{(i-1)}$ produces a set $C_R = C_R^{(i)}$ of new molecules. Then:
  - Every member of $C_F \cap I_R$ yields a new pathway of length 2i − 1
$$C_F^{\left( 0 \right)} \to ... \to {I_F} \to {C_F} \cap {I_R} \to ... \to C_R^{\left( 0 \right)}$$
.
  - Every member of $C_F \cap C_R$ yields a new pathway of length 2i
$$C_F^{\left( 0 \right)} \to ... \to {I_F} \to {C_F} \cap {C_R} \to {I_R} \to ... \to C_R^{\left( 0 \right)}$$
.
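The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: each metabolite is written as a single character, a multi-molecule is the sorted string of its metabolites, a reaction is a pair of such strings, and the names `step` and `pathways` as well as the two-reaction chemistry in the usage lines are hypothetical.

```python
from collections import Counter

def step(rules, mol):
    """All multi-molecules obtained from `mol` by applying one rule once."""
    have = Counter(mol)
    out = set()
    for lhs, rhs in rules:
        need = Counter(lhs)
        if all(have[x] >= n for x, n in need.items()):
            # remove the substrates, add the products, normalise to a sorted string
            out.add("".join(sorted((have - need + Counter(rhs)).elements())))
    return out

def pathways(C, rules, k):
    """Pathways of length at most 2k that start and end in C, as tuples."""
    back = [(rhs, lhs) for lhs, rhs in rules]   # rules read right to left
    fwd = [{(m,) for m in C}]   # fwd[i]: forward paths of length i starting in C
    rev = [{(m,) for m in C}]   # rev[i]: paths of length i ending in C
    found = set()
    for i in range(1, k + 1):
        fwd.append({p + (n,) for p in fwd[-1] for n in step(rules, p[-1])})
        # glue on C_F^(i) ∩ C_R^(i-1): pathways of length 2i - 1
        found |= {p + q[1:] for p in fwd[i] for q in rev[i - 1] if p[-1] == q[0]}
        rev.append({(n,) + q for q in rev[-1] for n in step(back, q[0])})
        # glue on C_F^(i) ∩ C_R^(i): pathways of length 2i
        found |= {p + q[1:] for p in fwd[i] for q in rev[i] if p[-1] == q[0]}
    return found

# Hypothetical chemistry x -> y, y -> z with C = {x, z}
print(pathways({"x", "z"}, [("x", "y"), ("y", "z")], k=1))
```

The sketch stores whole paths rather than only the sets $C_F^{(i)}$ and $C_R^{(i)}$, which keeps the gluing step explicit at the cost of memory; it is meant for toy instances only.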
The following result shows that in this way we obtain all metabolic pathways of length at most 2k under constraints (1) and (3) above.
Lemma 1.
For every i = 1, …, k, all metabolic pathways of length 2i−1 and 2i starting and ending in initial chemical graphs are obtained in the ith iterative step of the procedure explained above.
Proof: If
$${m_0} \to {m_1} \to ... \to {m_i} \to ... \to {m_{2i - 1}}$$
is a pathway with $m_0$ and $m_{2i-1}$ initial chemical graphs, then $m_j \in C_F^{(j)}$ for every j = 0, …, i and $m_{2i-1-l} \in C_R^{(l)}$ for every l = 0, …, i − 1, and hence, in particular, $m_i \in C_F^{(i)} \cap C_R^{(i-1)}$. Therefore, this path is obtained in the ith iterative step of the procedure explained above.
On the other hand, if
$${m_0} \to {m_1} \to ... \to {m_i} \to ... \to {m_{2i}}$$
is a pathway with $m_0$ and $m_{2i}$ initial chemical graphs, then $m_j \in C_F^{(j)}$ for every j = 0, …, i and $m_{2i-l} \in C_R^{(l)}$ for every l = 0, …, i, and hence, in particular, $m_i \in C_F^{(i)} \cap C_R^{(i)}$. Therefore, this path is also obtained in the ith iterative step of that procedure.
Example 1. Let a, b, c, d, e, f be metabolites such that b, d, e, f are compatible with each other, a is compatible with b + b and c is compatible with b + b + b. Consider the toy artificial chemistry given by the following reactions (where only the first four reactions are reversible):
$$\eqalign{ & a + b \leftrightarrow c,a \leftrightarrow d + e,b + d \leftrightarrow b + e,b + b \leftrightarrow d + f,\cr & c \to e + b + b,d + d \to a,a + f \to b + e + e \cr} $$
Let us look for metabolic pathways starting and ending with metabolites and pairs of metabolites a, …, f globally compatible with b + b + b. Then the set M of all initial chemical graphs can be identified with the set of monomials of total weight at most 2 over the alphabet {a, b, c, d, e, f } and the class C of the initial chemical graphs compatible with bbb (we omit henceforth the + sign for simplicity) is
$$C = \left\{ {c,ab,ad,ae,af} \right\}$$
. So, we are looking for metabolic pathways starting and ending in elements of this set C. The intermediate multi-molecules of these pathways will belong to the set of all multimolecules formed by metabolites a, b, c, d, e, f compatible with bbb: these are the multimolecules in C plus any combination of three metabolites b, d, e, f.
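The set M and the class C of this example can be recomputed with a short sketch. It assumes, purely for illustration, that each metabolite is written as a single character, that a multi-molecule is the sorted string of its metabolites, and that compatibility can be modelled by a hypothetical integer `WEIGHT` (the size of a metabolite in units of b, read off the compatibilities stated at the start of the example). The function names are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

# Hypothetical stand-in for compatibility: size of each metabolite in units of b
# (b, d, e, f are mutually compatible; a ~ b + b; c ~ b + b + b).
WEIGHT = {"a": 2, "b": 1, "c": 3, "d": 1, "e": 1, "f": 1}

def initial_graphs(metabolites, m):
    """All multisets of at most m metabolites, each written as a sorted string."""
    return {
        "".join(combo)
        for size in range(1, m + 1)
        for combo in combinations_with_replacement(sorted(metabolites), size)
    }

def compatibility_classes(graphs):
    """Group multi-molecules by total weight; equal weight is taken as compatible."""
    classes = defaultdict(set)
    for g in graphs:
        classes[sum(WEIGHT[x] for x in g)].add(g)
    return dict(classes)

M = initial_graphs("abcdef", m=2)
C = compatibility_classes(M)[3]   # everything in M compatible with bbb
print(len(M), sorted(C))          # 27 ['ab', 'ad', 'ae', 'af', 'c']
```

For real chemical graphs the weight would have to be replaced by a proper compatibility test on the underlying atoms; the integer is only adequate for this toy alphabet.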
Taking
$$C_F^{\left( 0 \right)} = C_R^{\left( 0 \right)} = C = \left\{ {c,ab,ad,ae,af} \right\}$$
, we obtain the following one-step derivations:
Notice that some elements of $C_F^{(1)}$ and $C_R^{(1)}$ no longer belong to M, as we warned.
Then
$$C_F^{\left( 1 \right)} = \left\{ {c,ab,bbe,bde,bee,dde,dee,def} \right\}$$
$$C_R^{\left( 1 \right)} = \left\{ {c,ab,bdd,bde,ddd,dde,ddf,dee,def} \right\}$$
and hence
$$C_F^{\left( 1 \right)} \cap C_R^{\left( 0 \right)} = \left\{ {ab,c} \right\},C_F^{\left( 1 \right)} \cap C_R^{\left( 1 \right)} = \left\{ {ab,c,bde,dde,dee,def} \right\}$$
. From these intersections, we deduce that all metabolic pathways of lengths 1 and 2 starting and ending in C are
$$\eqalign{ & c \to ab,ab \to c,c \to ab \to c,ab \to c \to ab,ab \to bde \to ab,\cr & ad \to dde \to ad,ad \to dde \to ae,ae \to dee \to ae,af \to def \to af \cr} $$
.
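The first step of this example can be rechecked mechanically. The following sketch encodes the seven reactions of the toy chemistry as pairs of strings (one character per metabolite, sorted strings as multi-molecules); this encoding and the names `step` and `layers` are assumptions made for illustration and are not part of the original procedure.

```python
from collections import Counter

REVERSIBLE = [("ab", "c"), ("a", "de"), ("bd", "be"), ("bb", "df")]
ONE_WAY = [("c", "bbe"), ("dd", "a"), ("af", "bee")]
FORWARD = REVERSIBLE + [(rhs, lhs) for lhs, rhs in REVERSIBLE] + ONE_WAY
BACKWARD = [(rhs, lhs) for lhs, rhs in FORWARD]   # the same rules read right to left

def step(rules, mol):
    """All multi-molecules obtained from `mol` by applying one rule once."""
    have = Counter(mol)
    out = set()
    for lhs, rhs in rules:
        need = Counter(lhs)
        if all(have[x] >= n for x, n in need.items()):
            out.add("".join(sorted((have - need + Counter(rhs)).elements())))
    return out

def layers(C, k):
    """Return the lists [C_F^(0), ..., C_F^(k)] and [C_R^(0), ..., C_R^(k)]."""
    F, R = [set(C)], [set(C)]
    for _ in range(k):
        F.append(set().union(*(step(FORWARD, m) for m in F[-1])))
        R.append(set().union(*(step(BACKWARD, m) for m in R[-1])))
    return F, R

C = {"c", "ab", "ad", "ae", "af"}
F, R = layers(C, 1)
# length 1: one forward step from C that lands in C_R^(0) = C
length_1 = sorted((m, n) for m in C for n in step(FORWARD, m) & R[0])
# length 2: one forward step into C_F^(1) ∩ C_R^(1), then one more step into C
length_2 = sorted((m, x, n) for m in C for x in step(FORWARD, m) & R[1]
                  for n in step(FORWARD, x) & C)
print(sorted(F[1] & R[0]), sorted(F[1] & R[1]))
print(length_1)
print(length_2)
```

Calling `layers(C, 2)` with the same helper gives the sets needed for the next step of the example.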
For k = 2, we obtain:
Then
$$C_F^{\left( 2 \right)} = \left\{ {c,ab,ad,ae,af,bbd,bbe,bdd,bde,bee,def} \right\}$$
$$C_R^{\left( 2 \right)} = \left\{ {c,ab,ad,ae,af,bbd,bbe,bdd,bde,bee} \right\}$$
and hence
$$C_F^{\left( 2 \right)} \cap C_R^{\left( 1 \right)} = \left\{ {c,ab,bdd,bde,def} \right\}$$
,
$$C_F^{\left( 2 \right)} \cap C_R^{\left( 2 \right)} = \left\{ {c,ab,ad,ae,af,bbd,bbe,bdd,bde,bee} \right\}$$
. From these intersections, we deduce that all metabolic pathways of lengths 3 and 4 starting and ending in C are
As can be seen in the previous example, the raw application of the procedure explained above generates all metabolic pathways of length up to 2k starting and ending in sets of at most m metabolites used by the reactions in R, but most of these metabolic pathways will be redundant, for instance because they are cyclic or because they do not contain any new multi-molecule that has not appeared in shorter metabolic pathways. Therefore, several reconstruction problems may be addressed in this context. In this work, we consider only three of them:
- (a) to produce all metabolic pathways of length up to 2k,
- (b) to produce all shortest metabolic pathways of length up to 2k,
- (c) to produce all minimal acyclic metabolic pathways of length up to 2k,

in all cases under restrictions (1) to (3) made explicit above.
Here, by a shortest metabolic pathway between metabolite sets I and F, we understand a metabolic pathway from I to F of shortest length among all metabolic pathways from I to F, and by a minimal acyclic metabolic pathway we understand a metabolic pathway that contains no directed cycles and no other, shorter metabolic pathways with intermediates in I or F. For instance, the shortest path derivation
$$ab \to c \to bbe \to def \to af$$
in Example 1 is acyclic but not minimal, because it contains the derivation c → bbe → def → af, while the minimal acyclic derivation
$$c \to bbe \to bbd \to ddf \to af$$
is not shortest, because there is a shorter derivation c → bbe → def → af from c to af.
We give our reconstruction algorithms in full pseudocode next. Algorithm 1 formalizes the procedure explained above.
The first three lines of this algorithm produce the different compatibility classes of initial chemical graphs. Then for each compatibility class C and for each i = 1, …, k:
- It receives the sets $I_F = C_F^{(i-1)}$ and $I_R = C_R^{(i-1)}$ of the results of all direct and reverse applications, respectively, of i − 1 consecutive rules in R to multi-molecules in C (when i = 1, $C_F^{(0)} = C$ and $C_R^{(0)} = C$), and it produces the sets $N_F = C_F^{(i)}$ and $N_R = C_R^{(i)}$ of the results of all direct and reverse applications, respectively, of rules in R to multi-molecules in $I_F$ and $I_R$, respectively. That is, the sets of the results of all direct and reverse applications, respectively, of i consecutive rules in R to multi-molecules in C.
- The lines starting with output call a procedure that outputs the list of all metabolic pathways of lengths 2i − 1 and 2i obtained so far. When i = 1:
  - the first output line gives all length 1 pathways $m \to m_f^{(1)}$, with m ∈ C,
  - the second output line gives all length 2 pathways $m \to m_r^{(1)} \to m'$, with m, m′ ∈ C.

  And when i > 1:
  - the first output line gives all length 2i − 1 pathways
$$m \to m_f^{\left( 1 \right)} \to ... \to m_f^{\left( {i - 1} \right)} \to m_f^{\left( i \right)} = m_r^{\left( {i - 1} \right)} \to m_r^{\left( {i - 2} \right)} \to ... \to m_r^{\left( 1 \right)} \to m'$$
with m, m′ ∈ C,
  - the second output line gives all length 2i pathways
$$m \to m_f^{\left( 1 \right)} \to ... \to m_f^{\left( {i - 1} \right)} \to m_f^{\left( i \right)} = m_r^{\left( i \right)} \to m_r^{\left( {i - 1} \right)} \to ... \to m_r^{\left( 1 \right)} \to m'$$
with m, m′ ∈ C.

Algorithm 1. Given a set R of biochemical reactions and a number k of derivation steps, obtain the set of all metabolic pathways of length up to 2k using the metabolites and reactions in R starting and ending in sets of at most m metabolites among those involved in the reactions in R.
Algorithm 2 produces a metabolic network (X, Y) containing all metabolic pathways up to a given length, where the vertex set X contains the initial and final metabolite sets together with all those new metabolite sets produced by the forward and reverse application of the given biochemical reactions, and the arc set Y consists of all direct derivations thus obtained.
Now, from the metabolic network (X, Y) obtained with the previous algorithm, the set of all shortest metabolic pathways of length up to 2k, using the metabolites and reactions in R starting and ending in sets of at most m metabolites among those involved in the reactions in R, can be obtained by running an all-pairs shortest path algorithm (Dijkstra, 1959; Floyd, 1962; Johnson, 1977; Takaoka, 1998) with each element of C as source vertex and each element of C as target vertex in turn.
Algorithm 2. Given a set R of biochemical reactions and a number k of derivation steps, obtain the metabolic network (X, Y) containing all metabolic pathways of length up to 2k, using the metabolites and reactions in R starting and ending in sets of at most m metabolites among those involved in the reactions in R.
Example 2. The toy artificial chemistry of Example 1, obtained from the class C = {c, ab, ad, ae, af} of the initial chemical graphs compatible with bbb by bidirectional search of metabolic pathways of length up to 4, is the following:
Then the enumeration of all-pairs shortest paths in (X, Y) starting and ending in the elements of C = {c, ab, ad, ae, af} produces the following derivations:
$$\eqalign{ & c \to ab,\cr & c \to bbe \to def \to af,\cr & ab \to c,\cr & ab \to c \to bbe \to def \to af,\cr & ad \to dde \to ae,\cr & af \to bee \to bde \to ab,\cr & af \to bee \to bde \to ab \to c \cr} $$
.
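The shortest derivations listed above can be rechecked with a small sketch. Since all arcs have unit length, a breadth-first search from each source is used here in place of the cited all-pairs algorithms; this substitution, the string encoding of multi-molecules, and the names `step`, `network` and `shortest_paths` are assumptions made for illustration. The network is rebuilt from the reactions of Example 1 with k = 2 by collecting every arc found in the forward and reverse steps.

```python
from collections import Counter, deque

REVERSIBLE = [("ab", "c"), ("a", "de"), ("bd", "be"), ("bb", "df")]
ONE_WAY = [("c", "bbe"), ("dd", "a"), ("af", "bee")]
FORWARD = REVERSIBLE + [(rhs, lhs) for lhs, rhs in REVERSIBLE] + ONE_WAY
BACKWARD = [(rhs, lhs) for lhs, rhs in FORWARD]

def step(rules, mol):
    """All multi-molecules obtained from `mol` by applying one rule once."""
    have = Counter(mol)
    out = set()
    for lhs, rhs in rules:
        need = Counter(lhs)
        if all(have[x] >= n for x, n in need.items()):
            out.add("".join(sorted((have - need + Counter(rhs)).elements())))
    return out

def network(C, k):
    """Arcs found by k forward steps from C and k reverse steps into C."""
    arcs, F, R = set(), set(C), set(C)
    for _ in range(k):
        fwd = {(m, n) for m in F for n in step(FORWARD, m)}
        rev = {(n, m) for m in R for n in step(BACKWARD, m)}
        arcs |= fwd | rev
        F = {n for (m, n) in fwd}
        R = {n for (n, m) in rev}
    return arcs

def shortest_paths(arcs, source, targets):
    """All shortest paths from `source` to each reachable target other than itself."""
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, set()).add(v)
    dist, preds, queue = {source: 0}, {source: []}, deque([source])
    while queue:
        u = queue.popleft()
        for v in sorted(succ.get(u, ())):
            if v not in dist:
                dist[v], preds[v] = dist[u] + 1, [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)   # another predecessor on a shortest path
    def unwind(v):
        return [[source]] if v == source else [p + [v] for u in preds[v] for p in unwind(u)]
    return {t: unwind(t) for t in targets if t in dist and t != source}

C = ["c", "ab", "ad", "ae", "af"]
Y = network(C, 2)
for s in C:
    for paths in shortest_paths(Y, s, C).values():
        for p in paths:
            print(" -> ".join(p))
```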
Algorithm 3 extracts the set of all minimal acyclic metabolic pathways of length up to 2k, using the metabolites and reactions in R starting and ending in sets of at most m metabolites among those involved in the reactions in R, from the metabolic network (X, Y) produced by Algorithm 2.
In this algorithm, each path of the form u → …→ υ is extended in all possible ways by arcs in Y of the form υ → w until reaching an element w ∈ C, where the test w ∉ p ensures the resulting paths are acyclic.
Algorithm 3. Given a metabolic network (X, Y) and a set C of initial and final metabolite sets, enumerate all minimal acyclic metabolic pathways contained in (X, Y) which start and end in metabolite sets from C.
where acyclic(C,E, υ, p) is defined as follows:
Example 3. In the metabolic network (X, Y) of Example 2, which corresponds to the toy artificial chemistry of Example 1, the enumeration of minimal acyclic paths starting and ending in the elements of C = {c, ab, ad, ae, af} produces the following derivations:
$$\eqalign{ & c \to ab,\cr & c \to bbe \to bbd \to ddf \to af,\cr & c \to bbe \to def \to af,\cr & ab \to c,\cr & ad \to dde \to ae,\cr & af \to bee \to bde \to ab,\cr & af \to bee \to bde \to bdd \to ab \cr} $$
.
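The enumeration of this example can be sketched as a depth-first extension of paths, following the description of Algorithm 3: a path is extended arc by arc, an extension is skipped when it would revisit a vertex, and the search stops as soon as an element of C is reached. The successor table `SUCC` below was recomputed by hand from the reactions of Example 1 with k = 2 and is an assumption of this sketch, as is the function name `minimal_acyclic`; it is not the authors' pseudocode.

```python
# Hypothetical successor table of the network (X, Y) for Example 1 with k = 2,
# recomputed from the seven reactions; one character per metabolite.
SUCC = {
    "c": ["ab", "bbe"], "ab": ["c", "bde"], "ad": ["dde"], "ae": ["dee"],
    "af": ["def", "bee"], "bbe": ["bbd", "def"], "bde": ["ab", "bee", "bdd"],
    "bee": ["bde"], "dde": ["ad", "ae"], "dee": ["ae"], "def": ["af", "bbe"],
    "bdd": ["ab", "bde"], "ddd": ["ad"], "ddf": ["af"], "bbd": ["ddf"],
}

def minimal_acyclic(succ, C):
    """Acyclic paths that start in C, end in C, and have no intermediate in C."""
    found = []

    def extend(path):
        for w in succ.get(path[-1], ()):
            if w in path:          # the test w ∉ p: never revisit a vertex
                continue
            if w in C:             # reached an element of C: the path is complete
                found.append(path + [w])
            else:                  # otherwise keep extending
                extend(path + [w])

    for u in sorted(C):
        extend([u])
    return found

C = {"c", "ab", "ad", "ae", "af"}
for p in minimal_acyclic(SUCC, C):
    print(" -> ".join(p))
```

No explicit bound on the path length is imposed here; in this small network the longest path found has length 4, so the bound 2k = 4 is respected without an extra test.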
Remark 1. Notice that the shortest path derivation ab → c → bbe → def → af is not minimal, and the minimal acyclic derivation c → bbe → bbd → ddf → af is not shortest.