Immuno-hybrid algorithm: a novel hybrid approach for GRN reconstruction

Bio-inspired algorithms are widely used to optimize the model parameters of GRN. In this paper, focus is given to develop improvised versions of bio-inspired algorithm for the specific problem of reconstruction of gene regulatory network. The approach is applied to the data set that was developed by the DNA microarray technology through biological experiments in the lab. This paper introduced a novel hybrid method, which combines the clonal selection algorithm and BFGS Quasi-Newton algorithm. The proposed approach implemented for real world E. coli data set and identified most of the relations. The results are also compared with the existing methods and proven to be efficient.


Introduction
It has been seen that, even though the biological systems are very complex, they exhibit very sophisticated behavior in the evolution with time and there are many regularities that have been observed, which leads to the scientific understanding of the biological processes and the behavior. Identification of the relationships between genes leads to better understanding of the complexities of the biological systems at molecular level. GRN is a network of interrelated genes participating in a biological process. Identification of the gene network provides an insight into the important genes that are participating in a bio-chemical process. The major applications of the gene regulatory network include identification of network topology, identification of central node in the network, identification of the sub networks, mutual dependency between genes and dependency between different biological conditions from a computational viewpoint. Recognition of central gene or gene set can be used to synthesize better drugs for the inhibition or activation of the biological processes and the methodology can be extended to drug design or to study the direct drug delivery mechanisms.
This topic of research has been very active for the last one decade. Various researchers have suggested techniques attempting to solve this problem with faster convergence and accuracy. Boolean network is a network of variables whose values are true or false. A gene network can be defined in terms of Boolean network as G(V, E) with a set of states X = {x i |i = 1…n} due to a set of Boolean operations: B = {b i |i = 1,…,n}; x i (t ? 1) = b i (x i1 ,…,x in ) where, x ij denotes the states of the nodes connected to vertex i; 'n' denotes the number of genes involved in the situation. Kauffman (1969) introduced Boolean network that consists of binary state variables for the construction of gene network. After this a number of Boolean methods were proposed by the researchers (Liang et al. 1998;Akutsu et al. 1999;Maucher et al. 2011). Due to the complexity of the biological systems, the data obtained through the experiments are uncertain and incomplete. The exact modeling of gene regulatory network is a difficult problem. It is possible to represent a relation with more than one function. To solve this problem an advancement of Boolean network is introduced, namely, probabilistic Boolean network. Based on this concept (Shmulevich et al. 2002;Liang and Han 2012) proposed methods for reconstruction of GRN. For all the Boolean models, binary states of the genes represent whether a gene is present or not in the gene network, but do not indicate the extent of its participation in the relation. This leads to lower accuracy in predictions.
Bayesian network is a probabilistic directed acyclic graph model based on the ideas proposed by Bayes et al. (1763). Joint probability distribution is used in the calculation of relationships in Bayesian network. Based on this modeling several standard methods are introduced (Yang et al. 2011;Tan and Mohamad 2012;Dondelinger et al. 2012). Gene network interactions are cyclic and non-linearly complex. Therefore, Bayesian networks may fail if such condition happens.
Artificial neural network is a tool for predicting gene network used by several researchers. Based on the recurrent neural network lots of works have been produced by the researchers (Vohradský 2001;Lee and Yang 2008;Noman et al. 2013). Major disadvantage of the neural network model is the increased complexity with respect to number of genes. Support vector machine is another machine-learning tool for predicting GRN (Kimura et al. 2009).
Reconstruction of gene network using the differential equation model is a popular approach. Apart from the coexpression models, differential equation models have the capability to describe the dynamic behavior of the biological system. S-system (Savageau 2000) is a well-accepted mathematical model for the chemical reactions. The system is based on the rate law of chemical reactions and considers entire biological system as chemical process. Optimizing the S-system variables is a crucial step in the gene network reconstruction. There are several bio-inspired algorithms used for the optimization purpose. Based on the evolution, the standard proposals are (Noman and Iba 2005;Huang et al. 2008;Mondal et al. 2010;Spieth et al. 2004;Kabir et al. 2008). Most of them are based on Genetic Algorithm theory. Apart from the evolution algorithms, there are successful proposals employing Cuckoo search (Jereesh and Govindan 2013a, b), particle swarm approach (Xu et al. 2007;Hsiao and Lee 2007;Yang et al. 2013) and artificial immune approaches Govindan 2013c, 2014).
There are many bio-inspired algorithms to solve the reconstruction problem of gene regulatory network. The convergence speed of heuristic algorithms is a serious issue because they are highly nonlinear. High computational complexity and low accuracy are the main issues that need to be tackled. The tradeoff between time and accuracy factors of the algorithms still poses challenges. A systematic study will not end in itself because the biological systems are very sophisticated and the processes are highly nonlinear. Any addition of new knowledge would thus lead to an incremental understanding of the natural processes from a scientific viewpoint. So, there is ample scope for improvements in science and technology in the future. This paper proposes a novel and innovative technique to enhance the performance of the gene related process representations and understanding. The methodology followed is network modeling and verification through computational techniques using gene expression.

Representation of gene network
Gene regulatory network generally can be represented as a graph G(V, E) in which vertices represent the gene and edges represent the relationship between them. The relationship between each gene represents overall performance of the gene network. A function or a set of functions determines the expression rate of each gene at a particular condition as described in Eq. (1).
where, n is the total number of genes participating in the gene network, and f is a mathematical function whose inputs are current states (X(t) = x 1 (t),…, x n (t)) of the genes and output is next state of genes. The updated concentration, x i of each gene i for time t ? 1 is defined as Eq. (2).
The main features used for the pattern identification of gene network are the inhibitory relations and excitatory relations between genes. Inhibitory relation represents the genes capability to reduce the concentration of other genes and excitatory relation represents the opposite actions. In this context, a gene can act as both inhibitory and excitatory agent for another gene. There is positive inhibition and negative inhibition. Positive inhibition represents how gene increases the rate of inhibition. Negative inhibition represents how gene decreases the inhibition rate. Similarly, there is positive and negative excitation values depending on the behavior of the gene. Thus, there can be the following possibilities for the gene relations.
• A gene can act as inhibitory for one gene and excitatory for another gene. • A gene can act as both inhibitory and excitatory for the other gene. • At a time, a gene cannot act as positive and negative inhibitor. • At a time, a gene cannot act as positive and negative excitatory.
The major decision variables for the gene network depend on the above-mentioned properties.

Evaluation measures
In the literature, various standard methods are used to evaluate the results in the area being pursued. Most of the literature provides evaluation based on the comparison between their research findings and the relations identified by the biologists. It is very difficult to evaluate the research findings since the problem is to find the relationships in the biological system. There is a chance that the identified relations through biological experiments are incomplete. In such situations, to evaluate properly, there is a need for artificially generated data. Artificially generated data are those that generated by the models of biological system. With the combination of biological real life data and artificial data, we evaluated performance of various proposals in the thesis. The output microarray values for the biological system and artificially simulated system are categorized into accepted values and the values provided by the models proposed are categorized as the computed values.
For the evaluation, computed values have been compared with the accepted values. In the thesis, two types of comparison are performed for the evaluation. The first one focuses on the comparison of relations accepted by biologists and that computed by models. TP (true positive), FN (false negative), TN (true negative) and FP (false positive) are used for the comparison based on the biological relations. The second is based on the mean squared error (MSE) computed between accepted data values (that are generated biologically or artificially) and computed microarray data values.
The important metrics used to evaluate the gene regulatory network reconstruction in literature are briefly presented in the following subsections.

Fitness function
Mean squared error (MSE) proposed by the Tominaga et al. (2000) is used as the fitness function for the optimization.
where x exp i;t , x cal i;t are the expression value of gene i at time t from the experimental and estimated (calculated) data, respectively. Here, N is the total number of genes and T is the time interval.

Sensitivity
This measurement calculates the fraction of accepted relations that are identified. Equation for the sensitivity is given as expression (4).
where TP is the true positive, that is, number of accepted relations identified, FN is the false negative, that is, number of accepted relations that are not identified. The percent sensitivity value is 100 if all the accepted relations are identified.

Specificity
This measurement calculates the fraction of unaccepted biological relations that were shown unaccepted. The equation for evaluating specificity is given as expression (5).
where TN is the true negative, that is, number of unaccepted relations shown unaccepted, FP is the false positive, that is, number of unaccepted relations that were identified as accepted. The specificity percent value is 100 if all the unaccepted relations are identified as unaccepted.

Balanced accuracy
Data related to bioinformatics is inadequate and imbalanced, hence, to measure the accuracy of the research, in the problem mentioned, a parameter called balanced accuracy is used. The RHS of Eq. (6) represents the expression for balanced accuracy. It is the mean of the sensitivity and specificity measures computed as follows.
Similar to all percentage problems, if the balanced accuracy increases the result will be more similar to reality and the values would lie between 0 and 100, 100 being close to reality.

S-system model
Most popular differential equation model is the S-system model, which is a well-accepted nonlinear differential equation modeling proposed by Savageau (Savageau 2000). In the S-system, the rate of change of concentration of gene xi is defined as in Eq. (7).
where G i,j and H i,j are excitatory and inhibitory coefficients, respectively. a i C 0 and b i C 0 are rate constants. G i,j and H i,j represent the relationship between each genes and x j (t) is the concentration of gene j expressed at time t.
S-system is a power law formalism, which is inspired from the chemical reaction processes. According to the rate law, the rate of change of concentration of each reactant is represented by the equation k(T)[C A ] x [C B ] y , where k(T) is the rate constant which is having a dependency with the temperature T. C A and C B express the concentration of the species A and B. The exponents x and y are reaction orders. Gene regulation interaction is a bio-chemical process that takes place in a living organism. This model demonstrates the chemical reaction that happens between genes. The total number of variables used in the S-system is 2N ? 2N 2 , where N is the number of genes in the reaction.
In this paper, S-system is the mathematical model used to model the GRN.

Immuno-hybrid based S-system model computation
This algorithm is a global-local optimization approach to solve the problem of gene regulatory network reconstruction using the DNA microarray data. This approach is a combination of clonal selection algorithm (Jereesh and Govindan 2013c) and BFGS Quasi-Newton algorithm (Dennis and More 1977). Clonal selection algorithm is a meta-heuristic algorithm used for the optimization problems. In the clonal selection algorithm maturation step is replaced with two operations. First one is the cloned mutation process, in which each clone of the selected antibodies are mutated using a mutation probability m. Diagrammatic representation of general cloned mutation process is depicted as in Fig. 1. Second one is the local weight updating process, in which each mutated antibody is updated using the BFGS Quasi-Newton method. In the second step, dynamic step size change is introduced as per the Eq. (8). Each antibody holds inherited properties from the parent and the divergence properties due to mutation. Detailed algorithm for the immuno-hybrid approach is given as Algorithm 1.
For the gene regulatory network construction, each antibody in the population stands for the parameters of the S-system. Identifying the optimal antibody solution is the problem specified here. Format of the vector representing individual antibody is given in Fig. 2. Fitness evaluation is done using Eq. (3). For fitness calculation, at first the individual antibody is modeled as a vector representing the solution parameters of the S-system model. Then, the differential equation of S-system is solved by Runge-kutta algorithm. Original microarray values and generated microarray values are used to compute fitness value using Eq. (3).
where l is the maturation rate, k is the local iteration factor; P is the population size and R f is the rank of individual based on the fitness value. 5. end for 6. return local optimal solution, X Fig. 3 Time-dynamics of the ten data sets of five-dimensional regulatory system BFGS Quasi-Newton formula where B k is an approximate hessian matrix,

Results and discussions
For comparing the efficiency of the proposed approaches in the paper, a well-known five-gene standard artificial network (Noman and Iba 2005;Kimura et al. 2008;Kabir et al. 2008) is identified. The simulated microarray data is generated using the Runge-kutta algorithm and S-system model. For the experimentation, ten sets of expression data with initial values in the interval [0, 1] are generated artificially.
The initial values used for the generation of artificial data are taken as per (Jereesh and Govindan 2013c). To average the performance over more data sets, we generated ten artificial data sets with the help of a model by Hlavacek and Savageau (1996) and the time dynamics of data sets are given in Fig. 3.
The proposed immuno-hybrid approach for optimizing the parameters of S-system is compared with the other approaches (Spieth et al. 2004;Jereesh and Govindan 2013a, b, c) in literature. For the artificial data, performance comparison with respect to the convergence is depicted in Fig. 4. It is evident that the immuno-hybrid approach outperforms all of the mentioned approaches. The immuno-hybrid approach using S-system converged after 0.9 9 10 5 fitness evaluations, which is a good improvement in speed performance, compared to other algorithms. SOS DNA repairing system in E. coli: real world experimental data Escherichia coli (E. coli) are bacterium residing in the lower intestine of warm-blooded organisms. This causes food poisoning occasionally. Most of the E. coli is harmless for the host and some will help to produce vitamin K2. This also helps for the proper digestion of the food. This bacterium is easy to reproduce within a time limit on laboratory conditions and important species in the study of molecular biology. Fig. 4 Performance of convergence. Comparison of errors obtained for memetic algorithm, cuckoo search using S-system, modified cuckoo search using S-system, clonal based approach using S-system model and immuno-hybrid approach using S-system SOS DNA repair system in E. coli (Sutton et al. 2000) is a famous real life data set, which is commonly used to evaluate the efficiency of gene regulatory network reconstruction methods. Figure 5 is a graphical representation of gene interactions following the damage of DNA. Mainly six major genes (uvrD, lexA, umuD, recA, polB and uvrA) are involved in the processing of DNA repair (Kimura et al. 2008(Kimura et al. , 2009Kabir et al. 2008;Hsiao and Lee 2007). LexA is a repressor gene, which inhibits the expression of other genes. Whenever DNA damage happens in E. coli, RecA identifies the damage and activates the processing of cleavage of LexA. Hence, the concentration of the LexA is reduced and leads to the excitation of other genes. After the clearance of damage, cleavage of LexA will be slow downed and stopped, and this leads to increased concentration of LexA. The LexA represses the other genes and advances to a balanced state. Construction of gene network allows predicting the roles of each of the genes in the DNA repairing system. SOS DNA repair system of E. coli data set is obtained from the experiments done by Uri Alon lab of Weizmann Institute of science; (website http://wws.weizmann.ac.il/ mcb/UriAlon/download/downloadable-data). There are 50 time-periods for the experiment in which 49 are used for the experimentation where the first time-period is at zeroth time and contains zero knowledge. Out of the eight genes we have selected, six important genes are specified. All the values in the expression have been normalized in the range of [0, 1].
Gene regulatory network for DNA repair system of E. coli identified by the immuno-hybrid approach is depicted as in Fig. 6. The immuno-hybrid based approach using S-system model identified eight relations, which was identified by the biologists (Table 1).

Conclusions
Biological systems behave differently in different conditions. To model such systems we need a dynamical modeling. A nonlinear differential equation modeling for the dynamic biological systems is a common approach. This paper proposed an evolutionary global-local hybrid algorithm for the optimization of gene regulatory network modeling. An algorithm called clonal selection based optimization algorithm is combined with the BFGS Quasi-Newton search method to develop immuno-hybrid algorithm. The randomness property, cloned mutated process and local weight updating property are the key factors for the immuno-hybrid approach. The proposed approaches predicted gene network for artificial data set and SOS DNA repair system more efficiently than many of the existing algorithms. The convergence speed and accuracy compared with the existing approaches are found to be superior. The immuno-hybrid approach provided still better performance figures of 89 and 70.5, respectively, for sensitivity and balanced accuracy. In addition, immune-hybrid approach identified eight valid relations, which is the highest among all of the other algorithms. Thus, it is demonstrated that a combination of differential modeling and hybrid optimization techniques can provide better reconstruction of gene network.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons. org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.