A membrane evolutionary algorithm for DNA sequence design in DNA computing

DNA sequence design has a crucial role in successful DNA computation, which has been proved to be an NP-hard (non-deterministic polynomial-time hard) problem. In this paper, a membrane evolutionary algorithm is proposed for the DNA sequence design problem. The results of computer experiments are reported

Membrane computing is an emergent branch of natural computing, first introduced by Paun [1]. This unconventional model of computation is a type of distributed parallel system, which is inspired by the structure and function of living cells. The devices of this model are called P systems. Roughly speaking, a P system consists of a cell-like membrane structure, in the compartments of which there are multisets of objects that evolve according to given rules in a synchronous, non-deterministic maximally parallel manner. Many different classes of such computing devices have already been investigated. Most of them are computationally universal, i.e., able to compute whatever a Turing machine can do [2][3][4], and are computationally efficient, i.e., able to trade space for time and in this way solve presumably intractable problems in a feasible time (e.g., [5][6][7][8][9]). Membrane computing is very attractive from a computational point of view because of its hierarchical structure and intrinsic parallelism.
Evolutionary algorithms are robust optimization and search methods inspired by evolutionary processes occur-ring in natural selection and molecular genetics. They are approximation algorithms that seek a trade-off between solution quality and computational costs. The main features of evolutionary algorithms are the representation and evaluation of individuals, population dynamics, and evolutionary operators, such as selection, crossover, and mutation [10]. Evolutionary algorithms exhibit an intrinsic parallelism derived from dealing with multiple individuals, show remarkable adaptability and flexibility to various applications, and provide good search capabilities and robust results [11].
Both P systems and evolutionary algorithms are natureinspired models and are used to solve complex problems; however, they are different with respect to the objects and rules used and in the computational strategies employed. While P systems represent a suitable formal framework for parallel-distributed computation, evolutionary algorithms are very effective for implementing different algorithms to solve numerous problems [11]. Thus, the possible interaction between P systems and evolutionary algorithms represents a fertile research field, as suggested by the list of 26 open problems in membrane computing [12]. The first attempt in this direction was by Nishida [13,14], who devel-oped membrane (evolutionary) algorithms to solve the traveling salesman problem. In those membrane algorithms, membrane structure was used together with ideas from genetic algorithms (cross-over and mutation operators), and the traveling salesman problem was solved. A membrane algorithm was also employed to solve the minimum storage problem [15]. The quantum-inspired evolutionary algorithm based on P systems was also developed to solve the knapsack problem and the radar emitter signals problem [16,17]. The similarities between distributed evolutionary algorithms and P systems have been analyzed and new variants of distributed evolutionary algorithms were suggested and applied for some continuous optimization problems [18]. In this paper, a membrane evolutionary algorithm is proposed to solve the DNA sequence design problem.

The DNA sequence design problem
DNA computing is a new paradigm of computation. DNA computing is attractive both theoretically and technically because of its intrinsic parallelism. DNA computing has been used to solve various computationally complex problems, such as the Hamiltonian problem [19], the SAT problem [20,21], the Steiner tree problem [22], the maximal clique problem [23], and the maximum independent set problem [24]. Because the problems solved by DNA computing are encoded by a DNA sequence, the design of DNA sequences is crucial for successful DNA computation. To make a set of DNA sequences effective in DNA computing, they must fulfill a number of combinatorial and thermodynamic constraints. Such a task is not easy to achieve, and it has been shown that designing a set of good DNA sequences is an NP-hard (non-deterministic polynomial-time hard) problem [25,26].
Various algorithms and methods have been proposed for reliable DNA sequence design. Marathe et al. [27] proposed a dynamic programming approach. Frutos et al. [28] developed a template strategy to select a great number of dissimilar sequences. Arita et al. [29] introduced a genetic algorithm into the DNA sequence design system and proposed a random generate-and-test algorithm. Tanaka et al. [30] applied simulated annealing to optimize the set of DNA sequences. Cui et al. [31] proposed DNA sequence design algorithm based on the PSO optimization. Wang et al. [32] developed the GA/SA algorithm for DNA sequence design. Qiu et al. [33] designed a hardware microprocessor to discover the DNA code under thermodynamic constraints. Dyachkov et al. [34] introduced the concept of a stem similarity function and discussed DNA codes based on stem similarity. Kawashimo et al. [35] presented a local search based algorithm for designing DNA short sequence sets satisfying thermodynamic constraints with minimum free energy criteria. Zhang et al. [36] proposed an invasive weed optimization algorithm to optimize encoding sequences.
Successful DNA computing relies heavily on designing or selecting proper DNA sequence to realize desired chemical reactions and solution extractions. To ensure that chemical reactions are controllable, some thermodynamic and combinatorial constraints must be satisfied in the design of DNA sequences.

Sequence design constraints
In the following context, (1 ) are used to denote DNA sequences with length n, and m is the cardinality of a set of DNA sequences. For convenience, DNA sequence x is oriented from the 5′ to 3′ end, and the reverse orientation is the 3′ to 5′ end. The Watson-Crick complement of a sequence x is denoted by x , while the reverse sequence of sequence x is denoted by x R .
(1) Thermodynamic constraint. T1 (melting temperature): Melting temperature is an important factor in the efficiency of a DNA reaction, and can be used to select DNA sequences with uniform melting temperatures. The nearest neighbor mode [37] was used to calculate melting temperature, and the evaluation function ( ) Tm F  of the melting temperature is defined as follows: where Tm tar is the target melting temperature, Tm gen (x i ) is the melting temperature of the generated sequence x i , R is the gas constant (1.987 cal mol −1 K −1 ), H° and S° are the enthalpy and the entropy, respectively, and C T is the salt concentration.
(2) Combinatorial constraints. C1 (similarity measure constraint): The similarity measure [38] is used to describe the degree of similarity of two DNA sequences, and ensures that each sequence is as unique as possible. The evaluation function Similarity ( ) F  of the similarity measure constraint is defined by eqs. (4) and (5).
where (−) g denotes g gabs, ( ) For more information, refer to [38]. C2 (H-measure constraint): The H-measure is akin to the similarity measure. The difference is that the H-measure compares two given sequences from opposite directions, while the similarity measure works in the same direction. The H-measure computes how many nucleotides are complementary between the given sequences to prevent cross hybridization of two DNA sequences. Like the similarity measure, the H-measure also uses the shift sequences. The evaluation function of the H-measure measure constraint is defined as follows: where ( ( ) , ( )) is the number of corresponding places where two nucleotides are the same. For more information, refer to [38]. C3 (hairpin structure constraint): Hairpin structure consists of a ring part and a stem part, which can hybridize itself by forming a loop. The measure of hairpin structure constraint calculates the probability to form a secondary structure. Generally, the hairpin structure is not desirable for DNA encoding. The equations are defined as follows [38]: where r is the minimum length to form a hairpin, pinlen denotes the minimum length of the stem. A hairpin structure is formed at position c for sequence x i , if more than half of the bases in the subsequence , , C4 (GC content constraint): The GC content is the percentage of G and C in a DNA sequence. It is important to arrange the GC content such that the chemical character is uniform. Thus, the evaluation function ( ) GC f  of the GC content constraint is described as follows: C5 (continuity constraint): Continuity is often used to describe the degree of successive occurrence of the same base in a sequence. In a DNA sequence, if the same base occurs continuously, the reaction is not well controllable because they may cause an unexpected biological structure. The formulations are defined as follows [38]: continuously in DNA sequence x, and t is the target.

Evaluation function
In DNA sequence optimization, the content can be naturally regarded as a constraint, not an objective function. So, more precise formulation of DNA sequence optimization is a constrained multi-objective optimization problem.
Using the weight-sum method, the multi-objective optimization for DNA encoding can be transformed into a single objective optimization problem. The fitness function can be described as follows: For simplicity, we set each weight i w to unity. In this paper, the constraint handling technology [39] is applied. For more details, refer to [30].

A membrane evolutionary algorithm for DNA sequence design
Before the membrane evolutionary algorithm is described in detail, some concepts related to string object P systems are briefly introduced.

A string object P systems
In this paper, several basic notions of membrane computing are introduced. For more information, refer to [1].
The membrane structure of a P system, shown in Figure  1, consists of several membranes arranged in a hierarchical structure inside a main membrane, called the skin. A membrane without any other membranes inside is said to be elementary. A space delimited by one membrane and its immediately lower membranes is called a region, and the region of an elementary membrane is the space delimited by it. Each region can contain a multiset of objects and a set of evolutionary rules, by which objects can evolve, and communication rules, by which objects can be moved between regions.
In string object P systems, the membranes can be marked with + or −, and this is interpreted as an "electrical charge", or with 0, and this means "neutral charge". We will denote the three cases by [ ] A string object P system with active membranes and priority relations among the rules from each region is defined as follows: (ii)  is a membrane structure consisting of n membranes (and hence the regions) injectively labeled with 1, 2, , , 1 n n   ; n is called the degree of the system  ;  Mutation rule M(s): This is similar to the uniform mutation operator of a genetic algorithm, which is a simple replacement with a randomly selected character within a given alphabet set of objects. It is described in detail as follows. Assign every bit as a mutation point in turn. For each mutation point, we select a random value between 0 and 1 and compare this with the mutation probability p m . If the random value is less than p m , a random character is substituted for the original character.
Mutation rules with replication work on string objects, are associated with membranes and depend on the label and the charge of membranes, but do not directly involve the membranes, in the sense that the membranes are neither taking part in the application of these rules nor are they modified by them. Moreover, one string can evolve to more than one string in that membrane using a rule of this type. Similar to the binary genetic algorithm, we adopted a simple crossover. Let n be the dimension of the string and p c be the crossover rate, now choose a pair of strings: and randomly generate a decimal r. If r < p c , apply a simple two-point crossover to them as follows. Generate two random integers pos1 and pos2 in the interval [1, n]. The components of two individuals between the numbers pos1 and pos2 will be exchanged. Then the new strings are generated as follows: Communication rule: An object is introduced in the membrane, possibly modified during this process; the polarization of the membrane can be modified, but not its label.
(d) Communication rule: An object is sent out of the membrane, which is possibly modified during this process; the polarization of the membrane can also be modified, but not its label.

 
In P systems, the computation is a sequence of transitions between the configurations starting from the initial configuration by applying the rules associated with regions in a non-deterministic maximally parallel manner. The compu-tation will halt in each region of the system when no rule can be applied. The result of the computation is collected in membrane i 0 when the system arrives at the final configuration. For more details about P systems information, refer to [2].

Membrane evolutionary algorithm for DNA sequence design
For convenience, we first provide some definitions and notations before constructing the P systems for DNA sequence design.  ,   and   are defined as DNA sequences with the same length over{ , , , } Formally, the new system with active membranes and priority relation is constructed as follows: The set 0 R   and the sets i R (1 i n   ), contain the following rules: Using this rule, two DNA sequences can be obtained. One is obtained by first mutating the initial one, and then by attaching two primes to its every symbol; the other is obtained by attaching a prime to each symbol of the initial one. (2) The prime is removed using this rule. Two new DNA sequences can be created using the rule as above and their two primes are also removed.
A DNA sequence is sent out of a region and the polarization of the membrane is translated from + to −. For this type of rule, priority is given as follows: for any 1  A DNA sequence is introduced into a region from the outside region and the polarization of the membrane is translated from − to 0. For this type of rule, priority is given as follows: for any 1 2 , then the corresponding rule has a higher priority than the rule , where f is the fitness function.
Using the rules of types (4) and (5), the worst and best DNA sequences in every region are sent to the outer region and adjacent inner region, respectively.
. For the above rules, there are priority relationships as follows: the rule of type (10) always has a higher priority than the rules of type (1); the rule of type (11) always has a higher priority than the rules of type (1).
The role of objects (0 5) is to cause the rules of types (1), (2) and (3), (4), (5) be used in turn, and wait two steps for the two worst DNA sequences to be deleted. Initially, there is a p 0 in every region (1 ) i i n   , and then only the rules of type (1) are used together with the rule of type (6), and p 0 is sent out of membrane i and becomes p 1 . Moreover, the polarization of each membrane i is unchanged; in the next step, the rules of types (2) and (3) can be used. At the same time, using the rule of type (7), the polarization of each membrane is translated from 0 to +, and p 1 is introduced in membrane i and becomes p 2 ; then the rules of type (4) can be used, which makes the polarization become −. Moreover, the rule of type (8) can also be used, which makes p 2 translate to p 3 . In the next step, the rules of types (5) and (9) can be used simultaneously; therefore, the polarization is translated from − to 0 and p 3 becomes p 4 . The rules of types (10) and (11) always have a higher priority than the rules of type (1); therefore, in the following two steps, the rules of types (10) and (11) must be used in turn, the polarization keeps unchanged and p 4 becomes p 5 , and then becomes p 0 again.
The objects d i are counters. Initially, d 0 is placed in membrane n. In each computation period, a rule of this type can be used, and there is only one rule of this type being used, which is used together with the rule of type (9). Therefore, the subscript of d i only adds one in each computation. (13) The objects i t have the similar role as the objects i p .
They make the rules of types (18) and (19) wait four steps, and then only the rules of types (18) and (19) are used in turn in each membrane When the rules of types (1), (2), (3), (4), and (5) are used in turn in membrane i , the rules of types (13), (14), (15) can be used simultaneously. Thus, after these four steps, the polarization of membrane i is translated from 0 to + and t 0 becomes t 4 . ( In this rule, the priority is given as follows: for any 1   , where f is the fitness function. The aim of the rules of types (18) and (19) is to delete the two worst DNA sequences in membrane i at that moment. After the polarization of membrane i′ becomes +, only the rules of type (18) can be used together with the rule of type (16), and at that moment there is only a rule of type (10) being used in membrane i. Therefore, the worst DNA sequence in membrane i is deleted. At the same time, the polarization of membrane i′ is translated from + to  , and t 4 becomes t 5 . In the next step, in membrane i′, only the rules of types (17) and (19) can be used, and in membrane i, the rule of type (11) can be used. Therefore, the second worst DNA sequence is deleted in membrane i, and the polarization of membrane i′ becomes 0 again and t 5 becomes t 0 again. At the same time, the polarization of membrane i is 0 and simultaneously p 5 evolves to p 0 .
From the previous explanation of the rules, we can easily see how this P system works. Therefore, the results of a computation with respect to this P system are all strings over { , , , } A G C T collected in membrane n, at the moment of the subscript i of d i adding up to a given value m from 0. Using this P system, the DNA sequence design problem can be solved and two better DNA sequence can be obtained in membrane n. The basic pseudocode of the membrane evolutionary algorithm is shown in Figure 2.
The procedure of the membrane evolutionary algorithm is shown as follows: Step 1: Specify membrane structure with n regions contained in the skin membrane; generate the initial strings in each region; Step 2: In each elementary membrane, the mutation rule and crossover rule are implemented simultaneously, and strings are evaluated by a fitness function; Step 3: Communication rules are used to exchange some information among the n regions or between each region; the best and worst strings are sent to the adjacent inner and outer regions, respectively; Step 4: Update each region simultaneously; delete the worst strings, and save the current best strings; Step 5: If the stopping condition is satisfied, then output the results; otherwise, return to step (2).

Algorithm parameters
In the simulation, the bases A, C, G and T are mapped to 0, 1, 2 and 3, respectively. m n-mer DNA sequences are connected one by one in the same direction to form an m*n-mer DNA sequences. We denote it as a string in the membrane system.
The novel algorithm based on the new membrane system constructed above for DNA sequence design is implemented with Matlab 7.0. The parameters of the algorithm used in our example are: the number of membranes is 20, the maximum iteration number is 1000, the probabilities of crossover and mutation rate are 0.7 and 0.03, respectively, the threshold value t of continuity is 2, and salt concentration is 0.1 mol/L. For hairpins, we assumed that hairpin formation requires at least six-base-pairings and a six-base loop.

Results and analyses
First, we compared the new algorithm with the multiobjective evolutionary algorithm from [38]. In [38], Shin et al. proposed a constrained multi-objective evolutionary algorithm to solve DNA sequences optimization for reliable DNA computing. Table 1 presents the sequences for Adleman's Hamilton problem in [38] and the sequences generated by our algorithm. The comparison results in terms of averages of fitness are shown in Figure 3.
From Table 1 and Figure 3, it is clear that our proposed  shown in Figure 4. From Figure 4, it can be seen that the DNA sequences generated by our membrane evolutionary algorithm performed better than the DNA sequences from [40], according to three criteria (Hairpin, H-measure, and Similarity), but not for Continuity.

Conclusions
In this paper, we propose a membrane evolutionary algorithm for solving the DNA sequences optimization problem, and apply it to produce good DNA sequences for DNA computing. The simulation results show that our algorithm is efficient in generating a set of high quality DNA sequences. Although this novel algorithm, based on membrane computing, for DNA sequence design looks simplistic, it has many advantages, such as simplicity, fast convergence, and theoretical elegance. The algorithm deserves to be further investigated, and can be modified to solve other hard optimization problems.
The DNA sequence design problem is important in DNA computing and biology. Further research will focus on more accurate model formulations, and the development of efficient algorithms based on dynamic P systems.