## Abstract

### Background

A popular model for gene regulatory networks is the Boolean network model. In this paper, we propose an algorithm to perform an analysis of gene regulatory interactions using the Boolean network model and time-series data. Actually, the Boolean network is restricted in the sense that only a subset of all possible Boolean functions are considered. We explore some mathematical properties of the restricted Boolean networks in order to avoid the full search approach. The problem is modeled as a Constraint Satisfaction Problem (CSP) and CSP techniques are used to solve it.

### Results

We applied the proposed algorithm in two data sets. First, we used an artificial dataset obtained from a model for the budding yeast cell cycle. The second data set is derived from experiments performed using HeLa cells. The results show that some interactions can be fully or, at least, partially determined under the Boolean model considered.

### Conclusions

The algorithm proposed can be used as a first step for detection of gene/protein interactions. It is able to infer gene relationships from time-series data of gene expression, and this inference process can be aided by a *priori* knowledge available.

## Background

One of the goals of Systems Biology is to study the various cellular mechanisms and components. In many cases these mechanisms are complex, where some of the interactions between the proteins are still unknown. To represent these interactions it is common to use gene regulatory networks (GRN). There are several models of GRN, both discrete and continuous. The simplest discrete model was introduced by Kauffman [1] and its known as *Boolean network*. Later, this model was modified to express uncertainty giving rise to the *probabilistic Boolean network*[2, 3]. Friedman introduced *Bayesian networks*[4] as a probabilistic tool for the identification of regulatory data and showed that they can reproduce certain known regulatory relationships. Among the continuous models we can cite the *ordinary differential equations* which was suggested several decades ago [5]. For a more detailed review about models of gene regulatory networks see [6].

Models of gene regulatory networks help us study biological phenomena (e.g. cell cycle) and diseases (e.g. cancer). Therefore, revealing such networks, or at least some of its connections, is an important problem to address. The ability to uncover the mechanisms of GRN has been possible due to developments in high-throughput technologies, allowing scientists to perform analysis on the DNA and RNA levels. The most common type of data provided by these technologies are gene expression data (microarray). The biological systems are notoriously complex. Determining how the pieces of this puzzle come together to create living systems is a hard challenge known as *reverse engineering*, which is the process of elucidating the structure of a system by reasoning backwards from observations of its behavior [7]. However, in many cases, GRN cannot be precisely unraveled due to measurement noise and the limited number of data sets compared to the number of genes that are involved.

The most common approach to reverse engineering GRN is to use gene expression data. For example, Marshall et al. [8] considered the approach of using time-series data and probabilistic Boolean networks (PBNs) as a model of GRN. In fact, there are many other works about inferring PBNs [9–13].Computational algebra approaches were also proposed in [14, 15]. Jarrah et al. [16, 17] used *polynomial dynamical systems* for reverse engineering of GRN. One good survey for inferring GRN from time-series data can be found in [18]. Some algorithms use additional information from heterogeneous data sources, e.g. genome sequence and protein-DNA interaction data, to assist the inference process. Hecker et al. [19] presents a good review of GRN inference and data integration.

Usually, an inference algorithm aims to construct one single network which is believed to be the true network. The issue is that the inverse problem is ill-posed, meaning that several networks could explain (or generate) the data set given as the input for the algorithm. In fact, a study for validation of GRN inference procedures can be found in [20]. The problem becomes more complicated if we take into account the noise that may be present in the data and the small amount of samples. For this reason, our approach aims to analyze several networks that could explain the data. By analyzing the similarities among these networks, we propose a confidence measure of the regulatory relationship between the genes.

In this paper, we present an algorithm for analysis of gene interactions. Although this analysis is directly connected to the process of inference of gene regulatory networks, the main goal of this work is not the inference. The idea is that the algorithm could be used as a first step of an inference process, that is, a pre-processing of the data, in order to support an inference process. To perform the analysis, the algorithm generates a limited number of *consistent* networks (to be explained in the next section). Unlike any inference algorithm, our algorithm does not take these networks as the final result (the true network). It uses these networks to perform the analysis of gene interactions.

The algorithm is based on Boolean networks and time-series gene expression. Actually, the Boolean networks are called *restricted* in the sense that not all Boolean functions are allowed in the model. Restricting the network reduces the search space, which can be significant, since the inverse problem is very complex. This restricted model allows us to find constraints that turn our problem into what can be seen as a Constraint Satisfaction Problem (CSP) and CSP techniques can be used to find feasible solutions, that is, networks. The time-series data allows us to observe part of the dynamics of the system. These observations are used to generate the constraints of the CSP.

A challenge always presented in any gene regulatory model is its usefulness. It would be interesting if a model could help biological experiments in understanding gene interactions. The model here presented is capable of inferring some of these connections from time-series data of gene expressions, and this inference process is aided by all a *priori* knowledge available. What we envisage with our method is a model that points out which connections should be inspected in the wet lab that would constrain as many other connections as possible and consequently could facilitate some biological experiments.

## Methods

### Restricted Boolean network model

A Boolean network (BN) is defined by a set **X** = {*x*_{1}, *x*_{2},…,*x*_{
n
}} of *n* Boolean variables and a set **F** = {*f*_{1}, *f*_{2},…,*f*_{
n
}} of *n* Boolean functions. In the case of GRN the variables are called genes. Each gene *x*_{
i
}, *i* = 1,…,*n*, can assume only two possible values: 0 (OFF) or 1 (ON). The value of the gene *x*_{
i
} at time *t* + 1 is determined by genes at time *t* through a Boolean function . Given that, there are *k*_{
i
} genes assigned to gene *x*_{
i
}, and the mapping *j*_{
k
} : {1,…,*n*} → {1,…,*n*}, *k* = 1,…,*k*_{
i
} determines the “wiring” of *x*_{
i
}[21]. This way,

We assume that all genes are updated synchronously by the functions in **F** assigned to them and this process is repeated. The artificial synchrony simplifies computation while preserving the qualitative, generic properties of global network dynamics [22, 23]. A *state* of the network at time *t* is a binary vector *s*(*t*) = (*x*_{1}(*t*),…,*x*_{
n
}(*t*)). Therefore, the number of states is 2^{n}, labeled by *s*_{0}, *s*_{1},…,*s*_{2}^{n}–1. The dynamics of the network is represented by the transition between states. This model is deterministic given that there is a single Boolean function to regulate each gene. Because of the finite number of states and the deterministic behavior, some of the states may be visited cyclically. These states form what is known by the *attractor* of the BN. The states outside the attractor are called *transient* states. The transient states together with the corresponding attractor states form the *basin of attraction* of that attractor.

In the case of restricted Boolean networks, the regulatory relationships is represented by a matrix *A*_{n×n} using the following convention: *a*_{
ij
} = 1 for a positive regulation on gene *x*_{
i
} from gene *x*_{
j
}; *a*_{
ij
} = –1 for a negative regulation on *x*_{
i
} from *x*_{
j
}; For the remaining cases *a*_{
ij
} = 0. The Boolean function *f*_{
i
} is defined according to the matrix *A* and the values of the genes *x*_{
j
}, *j* = 1,…,*n*, at time *t*:

We call the summation ∑_{
j
}*a*_{
ij
}*x*_{
j
}(*t*) the *input* of *x*_{
i
} at time *t*. Besides the regulatory relationships of the matrix *A*, each gene can have a self-degradative behavior implying that its value is set to 0 whenever its input is null. Observe that not all Boolean functions can be represented using (2) and that is why the Boolean network is called “restricted”. It is also worth to notice here that each gene *x*_{
i
} depends only on the *i*-th row of *A*.

### Constraint satisfaction problem

A constraint satisfaction problem (CSP) is defined by a set of variables *X* = {*x*_{1}, *x*_{2},…,*x*_{
n
}}; a collection of finite sets ** D** = {

*D*

_{1},

*D*

_{2},…,

*D*

_{ n }}, where

*D*

_{ i }is the domain set for

*x*

_{ i }; and a collection of set of constraints

**= {**

*C**C*

_{1},

*C*

_{2},…,

*C*

_{ m }} restricting the values that all variables can assume simultaneously, where each set

*C*

_{ i }involves constraints of a subset of variables and specifies the allowable combinations of values for that subset. A solution of a CSP is an assignment, to every variable

*x*

_{ i }, a value from its domain

*D*

_{ i }such that all constraints in

*C*are satisfied [24, 25].

CSPs defined on finite domains are usually solved by search algorithms meaning systematically assigning possible values to variables and verifying whether all constraints may be satisfied or not. The most used techniques are variants of *backtracking*, *constraint propagation*, and *local search*[24–27]. In these search algorithms, the assignment process requires an order in which the variables are considered. In addition, after selecting a variable, it is necessary to decide the order in which a value is picked up from its domain (to assign to it). In fact, there are many heuristics for variable and value ordering [24, 27]. If one is interesting in generating a sample of uniform solutions, a good heuristic for variable and value ordering may be, firstly, select randomly a variable (with uniform probability), and, then choose randomly (again, with uniform probability) a value from its domain. There are many CSP solvers in the literature; the one we are using is from Gecode project (http://www.gecode.org) [27].

### Analysis of gene interactions using CSP

One important remark about the Boolean model presented by Equation (2) is that each row in the regulation matrix *A* is independent from each other. Using this property, instead of applying the CSP to find all the possible solutions for the *A* matrix, we can apply the CSP to find all the feasible rows *r*_{
i
} for each gene *x*_{
i
}. In this way, the time complexity of the algorithm is reduced by decreasing the number of possible combinations from 3^{n}^{2} (number of possible matrices) to *n* · 3^{n} (total number of possible rows). Thus, in this context, the gene interaction problem considered in this paper can be modeled as a set of *n* CSPs. For each gene *x*_{
i
} (*i* = 1,2, …,*n*), the problem *P*_{
i
} is defined by the set of variables *R*_{
i
} = {*a*_{i,1}, *a*_{i,2},…,*a*_{
i,n
}} (corresponding to the *n* entries of the *i*-th row of the regulation matrix); a collection of domain sets *D*_{
i
} = {*D*_{i,1}, *D*_{i,2},…,*D*_{
i,n
}} (each *D*_{
i,j
} = {–1,0,1}); and a set of constraints obtained by considering all successions of states in the time-series data (see next subsection).

In order to analyze the gene interactions, since there may be a combinatorial explosion in generating the rows, we consider some of them (a random sampling process) to perform the analysis. In fact, as the number of genes grows, the problem becomes intractable and there could be too many consistent networks for a given data set. We would like to highlight that we are not concerned in generate all the consistent networks, but a limited number of them in order to perform an analysis of the gene interactions.

In situations where a large number of genes is considered and a small number of time-series data is available, one can reduce the number of genes to perform the gene interaction analysis by employing clustering analysis [28–31] or feature selection algorithms [32], and/or building small subnetworks by using the paradigm of growing seed genes [11, 33]. Of course, one can still use prior knowledge by selecting a small number of genes involved in a specific biological process (e.g., cell cycle division, metabolic pathway).

### Constraints generation for the CSP

The algorithm was designed under the assumption that the gene expression data were generated by a biological system which can be modeled as a restricted Boolean network. Let **S** = {*S*(1), *S*(2),…,*S*(*m*)} be a set of *m* time-series gene expression profiles, where *S*(*t*) ∈ {0,1}^{n} for *t* = 1,…,*m*. The algorithm aims to analyze networks that produce the sequence

*S*(1) → *S*(2) → ⋯ → *S*(*m*) . (3)

When the network produces the time-series data we say that the network is *consistent* with the data.

Naturally, there may exist several consistent networks for a single sequence. That is, the inverse problem is a “one-to-many” or ill-posed problem, and this is very difficult to handle.

One naïve way to solve this ill-posed problem is to find all possible networks by a full search algorithm. In fact, Lau et al. [34] proposed a “smart” full search algorithm to enumerate all possible networks. Here, in this paper, we explore some mathematical properties of the restricted Boolean networks in order to avoid this full search approach.

The algorithm aims to analyze the interactions between the genes through the information provided by the time-series sequence, which can be seen as a state transition sequence of the corresponding BN. These time-series data and the restricted Boolean network model are used to generate the constraints of the CSP, as we show next.

#### First constraints set

The first set of constraints is generated by analyzing the sample in triplets, *S*(*t* – 1), *S*(*t*) and *S*(*t* + 1). An important point to notice here is that if two consecutive states *S*(*t* – 1) and *S*(*t*) differ only in one single gene *x*_{
k
}, then any gene *x*_{
i
} that had its value changed from *S*(*t*) to *S*(*t* + 1) is directly regulated by *x*_{
k
}. To illustrate this situation, consider the time-series data (Table 1). Looking at the time points *S*(1) and *S*(2) we observe that only *x*_{2} had its value changed (from 1 to 0). Now, looking at *S*(2) and *S*(3) we can see that *x*_{3} was turned to 1. Following the restricted Boolean network model, this change was caused, necessarily, by the gene *x*_{2}. In fact, *x*_{2} inhibits *x*_{3} at time *t* = 1 and it is self degraded at time *t* = 2, allowing *x*_{1} to activate *x*_{3} at time *t* = 3. Using this approach, we state the following proposition.

**Proposition 1.** *Let S*(*t* – 1), *S*(*t*) *and S*(*t* + 1) *be three consecutive states according to the restricted Boolean network model. If S*(*t* – 1) *and S*(*t*) *differ by a single gene x*_{
k
}*, then for each gene x*_{
i
} *such that x*_{
i
}(*t*) ≠ *x*_{
i
}(*t* + 1) *we have that x*_{
k
} *regulates x*_{
i
} *directly, that is, a*_{
ik
} ≠ 0.

*Proof*. Suppose that *S*(*t* – 1) and *S*(*t*) differ by a single gene *x*_{
k
}, and that there is at least one gene *x*_{
i
} such that *x*_{
i
}(*t*) ≠ *x*_{
i
}(*t* + 1). As *x*_{
i
}(*t*) ≠ *x*_{
i
}(*t* + 1), the summations ∑_{
j
}*a*_{
ij
}*x*_{
j
}(*t* – 1) and ∑_{
j
}*a*_{
ij
}*x*_{
j
}(*t*) have different signs. Given that *x*_{
k
} is the only gene possessing different values in *S*(*t* – 1) and *S*(*t*), this difference signal must have been caused by *x*_{
k
}. Therefore, *a*_{
ik
} ≠ 0.

#### Second constraints set

To generate the second set of constraints, the algorithm takes into account two consecutive states, *S*(*t*) and *S*(*t* + 1). There is one important observation here: only the active genes at time *t* can possibly regulate genes at time *t* + 1. This fact becomes clear when we look at Equation (2). The active genes can give us an insight of which genes are regulating other gene, although the type of the regulatory relationship can not be determined. However, the input given by the summation in Equation (2) can help us to determine the regulatory relationships. For example, if we observe that a gene *x*_{
i
} changes its value from 0 (at time *t*) to 1 (at time *t* + 1), we can deduce that the input for gene *x*_{
i
} is positive at time *t* and only the active genes at time *t* are responsible for this positive input. Following this logic, the algorithm generates all possible combinations of regulatory relationships using the active genes such that the input of gene *x*_{
i
} at time *t* is coherent to the values of *x*_{
i
} at time *t* + 1. More formally, we can state the following proposition.

**Proposition 2.** *For each gene i, the state transition from x*_{
i
}(*t*) *to x*_{
i
}(*t* + 1) *generates constraints for variables a*_{
ij
} *according to* Table 2.

*Proof*. Let us first prove the first constraint, that is, if *x*_{
i
}(*t*) = 1 and *x*_{
i
}(*t* + 1) =0, then . According to Equation (2), the only way to change the state of gene *x*_{
i
} from 1 (at time *t*) to 0 (at time *t* + 1) is if its input . If we just consider the genes *x*_{
j
} that are active at time *t*, that is, those which *x*_{
j
}(*t*) = 1, we can rewrite this constraint as . Analogously, one can prove the other constraints given in Table 2.

To exemplify, consider the data in Table 1 where *t* = 3, that is, *S*(3). At this time, there are two active genes, *x*_{1} and *x*_{3}. These genes are the only ones that can contribute to the sign of the input for each gene for the next time. If we look at gene *x*_{4}, we observe that its value turned from 0 to 1. According to Equation (2), the input must be positive in this case, that is, . Considering only the active genes at time *t* = 3, we must have *a*_{41} + a_{43} > 0. Therefore, neither *a*_{41} or a_{43} can take the value –1, only 1 or 0 (not both). The same procedure can be applied to all genes and then, the constraints for the CSP are generated.

#### Third constraints set

This set is generated by analyzing any two pairs of consecutive states in the time-series data. Let *t*_{1} and *t*_{2} be two time points in the time-series data:

⋯*S*(*t*_{1}) → *S*(*t*_{1} + 1) → ⋯ → *S*(*t*_{2}) → *S*(*t*_{2} + 1) ⋯.

Now, let us suppose that *S*(*t*_{1}) and *S*(*t*_{2}) are very similar. Hence, the difference between *S*(*t*_{1} + 1) and *S*(*t*_{2} + 1) must be caused by the differentially expressed genes of their predecessors. For instance, let us suppose that *S*(*t*_{1}) and *S*(*t*_{2}) differ in one single gene:

And the succession occurs as stated:

Therefore, the difference between *S*(*t*_{1} + 1) and *S*(*t*_{2} + 1) in this case must be caused by the change on *x*_{4}. In this step, the algorithm checks how each gene changed in the two pairs of consecutive states.

In our example, let us concentrate on gene *x*_{1}. It was inhibited in the first pair and had no change in the second pair. Let *I* be the total input generated by those genes with similar expression in *S*(*t*_{1}) and *S*(*t*_{2}), *M* be the input generated by *x*_{4} in *S*(*t*_{1}), and be the input generated by *x*_{4} in *S*(*t*_{2}). Therefore, to explain the changes of *x*_{1} in the two pairs, we must have:

If *a*_{
ij
} represents the influence of gene *x*_{
j
} over *x*_{
i
}, we can calculate I, *M* and as follows:

*M* = *a*_{14} · 0 = 0 and (8)

Henceforth,

This result implies that the entry *a*_{14} of the matrix must have value 1.

If *S*(*t*_{1}) and *S*(*t*_{2}) differ in more than one gene, we can still generate hypotheses of regulation. In fact, this step tries to construct a system of inequalities with the inputs of each gene for every combination of two consecutive pairs. More formally, we can state the following proposition.

**Proposition 3.** *Let t*_{
a
} *and t*_{
b
} *be two different time instants. The state transitions from x*_{
i
}(*t*_{
a
}) *to x*_{
i
}(*t*_{
a
} + 1) *and from x*_{
i
}(*t*_{
b
}) *to x*_{
i
}(*t*_{
b
} + 1) *generate constraints for variables a*_{
ij
} *according to* Table 3.

*Proof*. Let us first prove the first constraint, that is, if *x*_{
i
}(*t*_{
a
}) = 1, *x*_{
i
}(*t*_{
a
} + 1) = 0, *x*_{
i
}(*t*_{
b
}) = 0, and *x*_{
i
}(*t*_{
b
} + 1) = 1, then

Considering the state transition from *x*_{
i
}(*t*_{
a
}) = 1 to *x*_{
i
}(*t*_{
a
} + 1) =0, by Proposition 2, we have that . From set theory, we can write the index set of all active genes at time *t*_{
a
}, *A*(*t*_{
a
}) = {*j* : *x*_{
j
}(*t*_{
a
}) = 1}, as a union of two disjoints sets:

Hence,

Analogously, from the transition from *x*_{
i
}(*t*_{
b
}) = 0 to *x*_{
i
}(*t*_{
b
}) = 1, we have that

Now subtracting the last two inequalities, we have

The other constraints can be obtained in a similar way.

### Bar chart of connection frequencies

For a fixed row *r*_{
i
}, the algorithm generates a collection of consistent rows *R*_{
i
} = {*r*_{i1}, *r*_{i2},…,*r*_{
im
}} from the constraints generated by the time-series data and the CSP solver. Each consistent row has *n* entries , each one corresponding to a one type of connection (inhibition, no connection or activation) on gene *x*_{
i
} from gene *x*_{
j
}. Thus, we can estimate the frequency of all possible connections for the entry *a*_{
ij
} of the regulation matrix by computing the frequency of entries –1, 0 and 1 of all , for *k* = 1,2,…,*m*.

We can exhibit the frequency of different types of connections on gene *x*_{
i
} from gene *x*_{
j
}, by showing the estimated frequencies of –1, 0 and 1 for *a*_{
ij
} using a bar chart, as we will see in the next section. Evidently, for a fixed row *r*_{
i
}, *determined connections* on gene *x*_{
i
} will appear with frequency 100% in all rows *R*_{
i
}; while *partially determined connections* on gene *x*_{
i
} will have, at least, one type of connection (inhibition, no connection or activation) with frequency 0%; and, for *undetermined connections*, all relationships will have nonzero frequencies.

### Interactions rank

One way to validate our results is to use this bar chart. To do so, we rank the interactions found by the connection frequencies and compare the most relevant ones to known interactions found in the literature. By searching through the literature, the direction of some interactions could not be determined. For instance, in some cases we know that there is an interaction between two genes *x*_{
i
} and *x*_{
j
}, but we do not know whether *x*_{
i
} is activating/inhibiting *x*_{
j
} or vice-versa. Therefore, we rank undirected gene interactions by adding the frequency of different types of connections in both ways. For example, the interaction of genes *x*_{
i
} and *x*_{
j
} is ranked according to the equation:

where *inh*(*x*_{
i
} ← *x*_{
j
}) and *act*(*x*_{
i
} ← *x*_{
j
}) denote the estimated frequency of –1 (inhibition) and 1 (activation), respectively, for the entry *a*_{
ij
} obtained from the set of the rows in *R*_{
i
}.

For a set of *n* genes, we rank *n*^{2}/2 gene interactions. Typically, the interactions with the highest rank are used in order to search for interactions already known in the literature.

### Inducing a connection

An interesting application of the bar chart of connection frequencies would be the answer of the following question: which *non-determined* (partially determined or undetermined) connection, once determined, would constrain as many other connections as possible? In an experimental context, a method that can point which connection would aggregate more “knowledge” to the network if determined leads to an empirical construction of optimal GRNs.

To investigate this point, a simple exhaustive search was implemented in the space of consistent rows *R*_{
i
} for a gene *x*_{
i
}, according to the following steps:

1. Set a *non-determined* (partially determined or undetermined) connection *a*_{
ij
} of the set *R*_{
i
} as a determined connection with a value *v* of its domain;

2. Set ;

3. Construct the bar chart connection frequencies of ;

4. Compute *score*(*a*_{
ij
},*v*) = *score*(*x*_{
i
}) = ∑_{
j
} *determination*(*x*_{
i
},*x*_{
j
}) on the constrained set , where *determination*(*x*_{
i
}, *x*_{
j
}) is 0 if the connection *x*_{
i
} ← *x*_{
j
} is undetermined, 0.5 if it is partially determined, and 1 if it is completely determined;

5. Repeat Steps 1-4 for all non-determined connections and all their domain values. In the limit, 3^{g} will be tested, where *g* is the number of non-determined connections;

6. Chose *a*_{
ij
} = *v* that has the highest *score*.

By the end of these steps, we can say that connection *a*_{
ij
} = *v* determines as many other connections as possible for gene *x*_{
i
}.

## Results and discussion

### Budding yeast cell cycle model

We applied the algorithm in an artificial data set extracted from a model for the budding yeast (*Saccharomyces cerevisiae*) cell cycle. The model, proposed by Li et al. [35], is based on a network of eleven regulators as shown in Figure 1. The eleven genes *x*_{1},…,*x*_{11} are *Cln3*, *MBF*, *SBF*, *Cln1*, *Cdh1*, *Swi5*, *Cdc20*, *Clb5*, *Sic1*, *Clb1*, and *Mcm1*, respectively. The “cell-size” node was introduced just to indicate a checkpoint to start the cell-cycle process.

Considering the restricted Boolean network model, Li et al. [35] studied the dynamics of the network. They found that there are seven attractors, shown in Table 4. In this table, each row represents an attractor where the first column indicates the size of the basin of attraction. There is one big basin composed by 1,764 (or ≈ 86% of) states. According to Li et al. [35], the corresponding attractor is the biological G_{1} stationary state.

Biologically, the cell-cycle sequence starts when the cell commits to division by activating Cln3. To simulate the cell cycle, they started the process by “exciting” the G_{1} stationary state with the cell size signal, that is, inducing the gene Cln3 to an active state. Applying Equation (2) to simulate the process it was observed that the system goes back to the G_{1} stationary state. The temporal evolution of the states, presented in Table 5, follows the cell-cycle sequence, going from excited G_{1} state (Start) to the S phase, the G_{2} phase, the M phase, and finally to the stationary G_{1} state. This is the biological trajectory or pathway of the cell-cycle network. The states presented in Table 5 are used as the artificial time-series data to perform the analysis using our algorithm.

#### Results for the budding yeast artificial data

For each gene *x*_{
i
}, the algorithm generates a collection of consistent rows *R*_{
i
} using the time-series data (the 13 states presented in Table 5) to generate the constraints of the CSP. If we compute the frequency of the types of connections, we are able to assigning probabilities of connection for each pair of genes. In Figures 2 and 3 we show the frequency of different types of connections to each gene *x*_{
i
} from all other genes. From these figures, we can see that the algorithm was capable of identifying 11 determined connections and 13 partially determined connections. The results are shown in Figure 4. Note that, in this figure, the arrows do not necessarily indicate activation.

Using these frequencies, we rank the undirected gene interactions using Equation 11. As an example, the 10 highest ranks are present in Table 6.

To validate our results, we consider a variable number of highest gene interactions ranks, from 5 to 25, and verify how many of these are present in the yeast cell cycle network (Figure 1), which allow us to compute a true positive rate, shown in Figure 5. This figure shows a true positive rate between 75% and 100% for different quantities of predicted gene interactions.

### HeLa cells

The immortal HeLa cell line is one of the oldest and most widely distributed human cell line. These cells are derived from cervical cancer cells of an African-American woman named Henrietta Lacks, who died in 1951. We applied our algorithm in a data set provided by Whitfield et al. [36] where the gene expression during the human cell cycle was characterized using cDNA microarrays. We used one of the five experiments consisting of 48 samples representing approximately three cycles and selected 20 well-characterized cell cycle genes. According to [36], each gene is assigned to a cell cycle phase shown in Table 7. The expression profiles of the 20 genes presented a cyclical pattern through the three cell cycles. Since our algorithm deals with Boolean values, we had to discretize the gene expression. To this end, we simply computed, for each gene *x*_{
i
}, the mean *m*_{
i
} of the expression profile. Then, for the gene *x*_{
i
}, all the values exceeding the value of *m*_{
i
} were set to 1, and the remaining values were set to 0. After this operation, each gene preserved its cyclical pattern, now in the Boolean domain.

Another pre-processing step was to split the data into three data sets, one for each cycle. Considering the sample **S** = {*S*(1), *S*(2),…,*S*(48)}, we identified the binary state vectors that represent the attractors of the system. For example, the sequence of states *S*(1),…,*S*(5) are very similar and we consider them as equal states and represent them as a one singleton attractor (an attractor composed by a single state). The same approach was taken regarding the sequences of states *S*(16), *S*(17), *S*(18); *S*(30), *S*(31), *S*(32) and *S*(43),…,*S*(48). Although the binary states in the sequence may be not exactly the same, we are assuming that this difference is caused by the noise in the data. These singleton attractors are similar to the G_{1} stationary state of the budding yeast cell cycle model [35].

We identified three cell cycles *C*_{1} = *S*(6),…,*S*(17); *C*_{2} = *S*(19),…,*S*(31) and *C*_{3} = *S*(33),…,*S*(47). Therefore, to apply the algorithm on this data, we considered the three cell cycles separately. To analyze our results, for a fixed gene *x*_{
i
}, we considered the union of all sets of rows *R*_{
i
}, obtained from the application of the algorithm to each cell cycle, and then compute the connection frequencies.

#### Results for the HeLa cells

Comparing the three cycles present in the time-series data, we can see some effects of noise in the gene expression measurements. Supposedly, the three cycles should be equal, but there are minor differences among them. According to [36], the cells utilized in the microarray experiment, by the time of *C*_{3}, could not be in the same cycle phase, compromising the experiment. Therefore, we did not utilize the data from *C*_{3} in our analysis.

The work of Whitfield et al. [36] makes also possible to add some biological knowledge to our algorithm. If we consider that the genes of the initial phases do not interact with genes of later phases, we reduce the set of possible rows. We can add this biological knowledge by setting some values of each row *r*_{
i
} as 0 according to the information in Table 8. This way, we generate two sets of data. One without biological knowledge (named here as *ρ*_{0,i}) and another with biological knowledge (*ρ*_{1,i}).

The algorithm is independently executed using the cell cycles *C*_{1} and *C*_{2} as the input data, generating a set of 10,000 rows for each gene *x*_{
i
}. We take the union of the two set of rows obtained from both cell cycles and plotted a bar chart to observe the connection frequencies (partially shown in Figures 6 and 7, other charts are present in Additional file 1 and Additional file 2).

From the frequency analysis of all rows *r*_{
i
}, (*i* = 1,2,…,*n*), we rank the undirected interactions. Tables 9 and 10 show the 10 highest gene interactions ranks. To validate these interactions we sought information about them through the literature. The tables also show the reference, when the information was found.

In Figure 8, we plot the true positive rate of the predicted interactions using the rank of undirected interactions. With no biological considerations (*ρ*_{0,i}), the true positive rate stands between 17% and 30%.By adding some knowledge (*ρ*_{1,i}), the true positive increases and stands between 25% and 35%.

#### Inducing a connection in HeLa cells data

To illustrate the method of inducing a connection which most determine others, we arbitrarily chose gene CCNF in the HeLa cells data. Executing the steps necessary to this analysis, we found that if the connection CCNF← RRM2 was determined as an inhibition the connections on CCNF from genes E2F1,RFC4, DHFR, STK15, PTTG1, and RAD21, would be determined as well (Figure 9).

The second highest rank produced is the connection CCNF← RAD21 determined as no relationship. As these two genes are classified as members of the same cell cycle phase, and the experimental determination of “no relationship” between two genes is difficult, we do not consider this result relevant.

### Discussion

By looking at Figures 2, 3, 6 and 7, it is interesting to note that, in some cases, the frequency analysis of connections was capable of almost excluding one relationship possibility, transforming some undetermined connections into partially determined connections. These results show that the cell cycle pathway constrains some connections, therefore restricting the whole network [34].

We can attribute this phenomenon to the high dependency that the determination of a gene connection has on other connections. The proposed algorithm performs a search over the space of possibilities of the influence of a set of genes over a single gene. If one of these influences is *a priori* determined (or known), this result can bias other connections. For example, let us suppose that genes A and B have to produce a positive output over a gene C, according to some restriction imposed by the time-series data. If we already know that gene A has no relationship to gene C, gene B must have a positive relationship on gene C. Therefore, this high dependency on the determination of a gene connection over the network makes the use of Figures 2, 3, 6 and 7 very restricted. If we simply use a relationship with a high weight to be our “best guess” on the connection between two genes, this choice can constrain other relationships, leading the system to a more or less determined state, or even creating a connection in a network that is not consistent with the data.

Another fact to be pointed out is the importance of the inferred partially determined connections. Although these connections can not be directly used to construct a network like the determined connections, it can guide some biological experiments, since a partially determined connection states that at least one type of relationship between two genes is not possible.

We could use the connection frequencies generated to attribute a strength of connection to the relationship of a partially determined connection, e.g., in the yeast cell cycle, the interference of Clb1 on SBF can be stated as 80% (or a probability of 0.8) of being an inhibition. In fact, we use the frequencies in Figures 2, 3, 6 and 7 to compute the rank of undirected relationships (Equation 11).

Regarding the validation, in the yeast cell cycle data, which is artificially generated, the true positive rate (Figure 5) of predicted connections is very high (75%) when 25 connections are considered, and 100% for 5 and 10 connections. These results show that the proposed algorithm can be successfully applied in an artificial data, i.e., a data set without noise and good balance between time points and number of genes. Evidently, our rank procedure is constructed in a way that determined connections and partially determined connections would be benefited. Hence, as our algorithm correctly determines 13 directed connections in the yeast cell cycle model, its true positive rate for a small number of undirected predictions is high as well.

Considering the HeLa cells data, the results are not quite optimistic. The true positive rate (Figure 8) stands between 17% and 30% for the inference procedure without biological knowledge and 25% to 35% using biological knowledge. However, if we consider the small amount of time points (12 per cell cycle) and the number of genes (20 in this simplified version), the difficulty of obtaining a higher true positive rate is clear.

Evidently, the method here proposed can only be used to aid a wet lab experiment on finding gene interactions if considerations about the network size and amount of time-series data were made. In situations where a large set of *g* genes is investigated and only a small amount of time-series data is available, as in the HeLa cells data, we would recommend that a rank of the *r* first interactions, with *g* ≪ *r* <*g*^{2}/2, to reduce the set of possible gene interactions to be tested.

To the HeLa cells, we can also explain the low true positive rate by considering the 20 genes version of the network too simplified. Maybe our algorithm predicts interactions that are not directly observed in nature, but only through a series of interactions of genes not present in our network. Therefore, our validation procedure is compromised.

It is worth to notice that adding some biological knowledge the results are improved for the HeLa cells. This fact reinforces the need for an integrated work with biologists in a network inference process, as we show that even using little pieces of biological information we can improve the whole procedure.

Regarding the example of finding a connection which most determines others, we expect to exemplify here that this use of the algorithm could substantially aid biological experiments. It is also worth noticing that the connection found, CCNF← RRM2 as an inhibition, could make biological sense, as RRM2 is classified as a gene of the S phase and CCNF is a G_{2} phase gene. As the connection CCNF← RRM2, defined as an activation, is also well ranked, we can say that this relation is worth for an empirical test.

A closer look at the frequency analysis raises another interesting question: would not the network chosen by nature be easily detectable? Or even better: would not the utilized data be enough to constrain the connection frequencies into nature’s choice? We could answer this question by pointing out a truth that unequivocally distinguishes our model from nature’s choice: the chemical interactions between proteins. Evidently, some of the connections considered on many steps of the algorithm here presented cannot exist due to chemical incompatibilities. In some sense, nature has more information to constrain its network than we do.

## Conclusions

This paper proposes an algorithm to perform analyses for discovering gene regulatory interactions from time-series data under the Boolean network model and in the context of Constraint Satisfaction Problem (CSP). In fact, the inference of gene regulatory network is a one-to-many inverse problem in the sense that there may exist several networks consistent with the dataset. In order to analyze the gene interactions, we have generated several gene connections in consistent networks by using CSP solver techniques which in turn utilized constraints sets built from three algorithms provided by this work. We have applied our methodology to an artificial dataset that had been generated by a Boolean network that models the budding yeast cell cycle [35], and to an experimental dataset of HeLa cells [36]. By these applications, we have shown that our analyses could be a first step for detection of gene relationships with a high flexibility to include biological knowledge.

A challenge always presented in any gene regulatory model is its usefulness. It would be very interesting if a model could help biological experiments in understanding gene interactions. The model presented here together with the algorithm proposed is a first step to aid an inference process from time-series data of gene expression, and it can be improved by all a *priori* knowledge available. As it was made clear in the HeLa cells data, the use of biological knowledge can improve the efficiency of the proposed algorithm. For future steps, an interesting feature to be improved on our method is the ability to indicate which connection should be verified in the wet lab to help determine others. As exemplified in this work, this feature could lead to important contributions on wet lab experiments. To use this method in an empirical gene connection survey, we would recommend a search over all possible connections between all genes, and then proceed with the ranking process. Evidently, biological considerations over the highest ranks produced is heavily necessary.

However, there are other characteristics to be sought that could constrain the network towards nature’s choice. On one hand, so far, only constraints built from successive states are considered. Thus, constraints constructed from considering the whole trajectory (e.g., some kind of powers of the regulation matrix in order consider succession of more than two or three states) could help to obtain more precise solutions. In fact, although we have carefully generated uniform samples to build a set of solutions (to produce representations good enough for connection frequencies) by using an appropriate heuristic for variable and value ordering, it is important to keep in mind that in order to make a more precise frequency analysis, one needs the *consistent solutions* in the CSP context, meaning that, in our case, the solutions obtained from considering the whole trajectory. On the other hand, one feature not explored in this paper is the dynamics of the network. There are indications, as stated by Kauffman [23], that nature would prefer networks with a small amount of attractors - the gene pattern expression that leads the system to itself -and large basins of attraction - the set of gene pattern expressions that leads the system to an attractor. The network assembled by Li et al. [35] has these characteristics. Therefore, an analysis of connections computed only from networks with a few number of attractors - or other dynamical characteristic - could create a well established result. One naïve way to proceed is to build regulation matrices from the solutions of the CSPs subproblems (possible rows) and select the ones such that present the dynamical features described before (large basins of attraction and small number of attractors)

Concluding, the analysis presented here is a remarkable first step for the construction of a system to infer gene interactions. The true positive rate on the artificial data is excellent and, considering noises and lack of time points, the true positive rate for the experimental data is beyond expectation.

We understand that any inference procedure can not have success if it does not contain biological and computational expertise, therefore the future steps of this research tend to be centered on the difficulties of a wet lab, or its limitations.

## References

Kauffman SA: Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology. 1969, 22: 437-467. 10.1016/0022-5193(69)90015-0.

Shmulevich I, Dougherty ER, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 2002, 18 (2): 261-274. 10.1093/bioinformatics/18.2.261.

Zhang SQ, Ching WK, Ng MK, Akutsu T: Simulation study in Probabilistic Boolean Network models for genetic regulatory networks. International Journal on Data Mining and Bioinformatics. 2007, 1 (3): 217-240. 10.1504/IJDMB.2007.011610.

Friedman N, Linial M, Nachman I, Pe’er D: Using Bayesian networks to analyze expression data. Journal of Computational Biology. Journal of Computational Biology. 2000, 7: 601-620. 10.1089/106652700750050961.

Goodwin BC: Temporal Organization in Cells; A Dynamic Theory of Cellular Control Process. 1963, New York: Academic Press

Karlebach G, Shamir R: Modelling and analysis of gene regulatory networks. Nature. 2008, 9: 770-780.

Hartemink AJ: Reverse engineering gene regulatory networks. Nature. 2005, 5: 554-555.

Marshall S, Yu L, Xiao Y, Dougherty ER: Inference of a probabilistic Boolean network from a single observed temporal sequence. EURASIP Journal on Bioinformatics and Systems Biology. 2007, 32454-

Dougherty ER, Xiao Y: Design of Probabilistic Boolean Networks under the Requirement of Contextual Data Consistency. IEEE Transactions on Signal Processing. 2006, 54 (9): 3603-3613. 10.1109/TSP.2006.877641.

Xiao Y, Dougherty ER: Optimizing Consistency-Based Design of Context-Sensitive Gene Regulatory Networks. IEEE Transactions on Circuits and Systems I. 2006, 53 (11): 2431-2437. 10.1109/TCSI.2006.883883.

Hashimoto RF, Kim S, Shmulevich I, Zhang W, Bittner ML, Dougherty ER: Growing Genetic Regulatory Networks from Seed Genes. Bioinformatics. 2004, 20 (8): 1241-1247. 10.1093/bioinformatics/bth074.

Vahedi G, Ivanov I, Dougherty E: Inference of Boolean networks under constraint on bidirectional gene relationships. Systems Biology. 2009, IET, 3 (3): 191-202.

Liu W, Lahdesmaki H, Dougherty ER, Shmulevich I: Inference of Boolean Networks Using Sensitivity Regularization. EURASIP Journal on Bioinformatics and Systems Biology. 2008, 2008 (780541): 1-12. 10.1155/2008/780541.

Laubenbacher R, Stigler B: A computational algebra approach to the reverse engineering of gene regulatory networks. Journal of Theoretical Biology. 2004, 229 (4): 523-537. 10.1016/j.jtbi.2004.04.037.

Laubenbacher R, Jarrah ASS: Algebraic models of biochemical networks. Methods in Enzymology. 2009, 467: 163-196. full_text.

Jarrah A, Laubenbacher R, Stigler B, Stillman M: Reverse-engineering of polynomial dynamical systems. Advances in Applied Mathematics. 2007, 39: 477-489. 10.1016/j.aam.2006.08.004.

Stigler B, Jarrah A, Stillman M, Laubenbacher R: Reverse engineering of dynamic networks. Annals of the New York Academy of Sciences. 2007, 1115: 168-177. 10.1196/annals.1407.012.

Sima C, Hua J, Jung S: Inference of Gene Regulatory Networks Using Time-Series Data: A Survey. Current Genomics. 2009, 10 (14): 416-429. 10.2174/138920209789177610.

Hecker M, Lambeck S, Toepfer S, van Someren E, Guthke R: Gene regulatory network inference: Data integration in dynamic models - A review. BioSystems. 2009, 96: 86-103. 10.1016/j.biosystems.2008.12.004.

Dougherty ER: Validation of Inference Procedures for Gene Regulatory Networks. Current Genomics. 2007, 8 (6): 351-359. 10.2174/138920207783406505.

Shmulevich I, Dougherty ER, Zhang W: From Boolean to Probabilistic Boolean Networks as Models of Genetic Regulatory Networks. Proceedings of the IEEE. 2002, 90: 1778-1792. 10.1109/JPROC.2002.804686.

Huang S: Gene expression profiling, genetic networks, and cellular states: An integrating concept for tumorigenesis and drug discovery. Journal of Molecular Medicine. 1999, 77: 469-480. 10.1007/s001099900023.

Kauffman SA: The Origins of Order: Self-Organization and Selection in Evolution. 1993, USA: Oxford Univ. Press

Tsang E: Foundations of Constraint Satisfaction. 1993, Academic Press

Russell S, Norvig P: Artificial Intelligence: A Modern Approach. 2002, Prentice Hall, second

Tack G: Constraint Propagation - Models, Techniques, Implementation. PhD thesis. 2009, Saarland University

Schulte C, Tack G, Lagerkvist MZ: Modeling. 2010

Jiang D, Tang C, Zhang A: Cluster Analysis for Gene Expression Data: A Survey. IEEE TKDE. 2004, 16 (11): 1370-1386.

D’haeseleer P, Liang S, Somogyi R: Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000, 16 (8): 707-726. 10.1093/bioinformatics/16.8.707.

Dougherty ER, Barrera J, Brun M, Kim S, Junior RMC, Chen Y, Bittner ML, Trent JM: Inference from Clustering with Application to Gene-Expression Microarrays. Journal of Computational Biology. 2002, 9: 105-126. 10.1089/10665270252833217.

Zhou X, Wang X, Dougherty ER, Russ D, Suh E: Gene Clustering Based on Clusterwide Mutual Information. Journal of Computational Biology. 2004, 11: 147-161. 10.1089/106652704773416939.

Hashimoto RF, Dougherty ER, Brun M, Zhou Z, Bittner ML, Trent JM: Efficient Selection of Feature Sets Possessing High Coefficients of Determination Based on Incremental Determinations. Signal Processing. 2003, 83 (4): 695-712. 10.1016/S0165-1684(02)00468-1.

Peña J, Björkegren J, Tegnér J: Growing Bayesian network models of gene networks from seed genes. Bioinformatics. 2005, 21 (Suppl. 2): ii224-ii229.

Lau KY, Ganguli S, Tang C: Function Constrains Network Architecture and Dynamics: A Case Study on the Yeast Cell Cycle Boolean Network. Physics Review E. 2007, 75 (5): 1-9. 10.1103/PhysRevE.75.051907.

Li F, Long T, Lu T, Ouyang Q, Tang C: The Yeast Cell-Cycle Network is Robustly Designed. PNAS of the USA. 2004, 101 (14): 4781-4786. 10.1073/pnas.0305937101.

Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular biology of the cell. 2002, 13 (6): 1977-2000. 10.1091/mbc.02-02-0030..

Koff A, Giordano A, Desai D, Yamashita K, Harper JW, Elledge S, Nishimoto T, Morgan DO, Franza BR, Roberts JM: Formation and activation of a cyclin E-cdk2 complex during the G1 phase of the human cell cycle. Science. 1992, New York, N.Y., 257 (5077): 1689-1694. 10.1126/science.1388288.

Shanahan F, Seghezzi W, Parry D, Mahony D, Lees E: Cyclin E associates with BAF155 and BRG1, components of the mammalian SWI-SNF complex, and alters the ability of BRG1 to induce growth arrest. Molecular and Cellular Biology. 1999, 19 (2): 1460-1469.

Chan AK, Persad S, Litchfield DW, Wright JA: Ribonucleotide reductase R2 protein is phosphorylated at serine-20 by P34cdc2 kinase. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research. 1999, 1448 (3): 363-371. 10.1016/S0167-4889(98)00115-3.

DeGregori J, Kowalik T, JR N: Cellular targets for activation by the E2F1 transcription factor include DNA synthesis- and G1/S-regulatory genes. Molecular Cell Biology. 1995, 15 (8): 4215-4224.

Chiang SY, Azizkhan JC, Beerman TA: A comparison of DNA-binding drugs as inhibitors of E2F1-and Sp1-DNA complexes and associated gene expression. Biochemistry. 1998, 37 (9): 3109-3115. 10.1021/bi9721142.

Kong M, Barnes EA, Ollendorff V, Donoghue DJ: Cyclin F regulates the nuclear localization of cyclin B1 through a cyclin-cyclin interaction. The EMBO Journal. 2000, 19 (6): 1378-1388. 10.1093/emboj/19.6.1378.

Holt LJ, Krutchinsky AN, Morgan DO: Positive feedback sharpens the anaphase switch. Nature. 2008, 454 (7202): 353-357. 10.1038/nature07050.

Cai J, Gibbs E, Uhlmann F, Phillips B, Yao N, O’Donnell M, J H: A complex consisting of human replication factor C p40, p37, and p36 subunits is a DNA-dependent ATPase and an intermediate in the assembly of the holoenzyme. Journal of Biological Chemistry. 1997, 272 (30): 18974-18981. 10.1074/jbc.272.30.18974.

Ohta S, Shiomi Y, Sugimoto K, Obuse C, Tsurimoto T: A proteomics approach to identify proliferating cell nuclear antigen (PCNA)-binding proteins in human cell lysates. Identification of the human CHL12/RFCs2-5 complex as a novel PCNA-binding protein. Journal of Biological Chemistry. 2002, 277 (43): 40362-40367. 10.1074/jbc.M206194200.

## Acknowledgements

This work was supported by FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo), CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), and CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior).

This article has been published as part of *BMC Proceedings* Volume 5 Supplement 2, 2011: Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA'10). The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S2.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

RFH headed the research. All participated in the manuscript development. All authors read and approved the final manuscript. Carlos H A Higa and Vitor H P Louzada have contributed equally to this work.

## Electronic supplementary material

### 12919_2011_34_MOESM1_ESM.pdf

Additional file 1: **Bar charts for** *A*_{
0
} The bar charts for the 20 genes are available at: http://yeast.ime.usp.br/hela/additional_files1.zip The additional files1.zip (235.5 KB) contains charts in PDF format. (PDF 13 KB)

### 12919_2011_34_MOESM2_ESM.pdf

Additional file 2: **Bar charts for** *A*_{
1
} The bar charts for the 20 genes are available at: http://yeast.ime.usp.br/hela/additional_files2.zip The additional files2.zip (234.4 KB) contains charts in PDF format. (PDF 13 KB)

## Rights and permissions

**Open Access**
This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License (
https://creativecommons.org/licenses/by/2.0
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Higa, C.H., Louzada, V.H., Andrade, T.P. *et al.* Constraint-based analysis of gene interactions using restricted boolean networks and time-series data.
*BMC Proc* **5**
(Suppl 2), S5 (2011). https://doi.org/10.1186/1753-6561-5-S2-S5

Published:

DOI: https://doi.org/10.1186/1753-6561-5-S2-S5