Background

The understanding of molecular conformations is one of the crucial issues in computational biology. The incorrect protein folding is associated with illnesses such as Alzheimer's disease, bovine spongiform encephalopathy and Creutzfeldt-Jakob disease. The biological functions of protein are determined by their dimensional folding structures, and their spatial structures are absolutely determined by their primary structures [1]. Traditional experimental methods of determining protein folding structure are expensive, such as X-ray crystallography and NMR spectroscopy. Because of the complexity of realistic protein, it is extremely difficult to make an analysis of the protein folding process.

Due to the polypeptide chain forms such large number of different spatial structure, it is still difficult to search for the global minimum energy conformations of proteins from its sequence of amino acids [2]. Therefore, the most important problem is how to establish a highly simplified but effective model which can reflect the relation between the free energy and tertiary structure of the protein. One of the simplified protein models is the hydrophobic-polar (HP) model which has been widely used to study protein structure and understand protein folding process. The HP-lattice model represents the amino acid chains of a protein using two types of residue, non-polar or hydrophobic (H) residue and polar (P) or hydrophilic residue are on the vertices of a simple cubic lattices[3]. The HP lattice-model abstracts the hydrophobic interaction process in protein folding by reducing a protein to a heteropolymer that represents a predetermined pattern of hydrophobicity in the protein[4]. The non-ploar amino acids are classified as hydrophobic and polar amino acids, which is used to force the formation of a compact hydrophobic core as observed in the real protein [5]. However, the HP lattice-model doesn't reveal all secrets of the protein, despite its simplicity. The main reason lies in that local interactions are neglected in the simplified models, while local interactions might be important for the local structure of the chains [6].

To reflect more realisticly the native attributes of proteins, Stillinger studied a similar AB off-lattice protein model in two dimensions[7]. In AB off-lattice model the 20 amino acids are also reduced to two classes, hydrophobic (A) and hydrophilic (B). For AA, BB and AB pairs respectively, there is an intramolecular mix of strong attraction, weak attraction, and weak repulsion, roughly analogous to the situation of the real proteins[8].The interactions considered in AB off-lattice model include both sequence independent local interactions and the sequence dependent Lennard-Jones term that favours the formation of a hydrophobic core[9]. Irback et al. extended a two dimension (2D) to a three dimension (3D) in the AB off-lattice model, which takes account of the torsional energy implicitly [6].

In recent years, many works were devoted to the optimal conformations with lowest energies in the AB off-lattice model[1012]. Because searching the whole conformational space of a protein has been proved to be NP-complete problem, it is necessary to introduce the heuristic optimization algorithm, such as the energy landscape paving minimizer (ELP)[9], the genetic tabu search algorithm (GATS)[10], the conformational space annealing (CSA)[13], the pruned-enriched-Rosenbluth method (PERM) [14] and the local adjust genetic algorithm (LAGAA) [15] etc. This paper describes a protein structure prediction method that is based on the AB off-lattice model in two dimension and three dimension, which combines the tabu search algorithm and local adjust simulation annealing. The new improved hybrid algorithm (SATS) is applied to find the spacial conformations with Fibonacci sequences and real proteins.

Methods

AB off-lattice model in 2D

The major difference between AB off-lattice model [7] and HP lattice model [3] is the folding angle of the model. The AB off-lattice model deals with only two types of amino acids, to be called hydrophobic A and hydrophilic B, which can be used to represent 20 kinds of amino acids. They are linked together by rigid unit-length bonds to form linear unoriented polymers that reside in the form of two dimensions. In 2D AB off-lattice model the angle of the two bonds which connect three amino acid residues can change freely, as Figure 1 illustrates [7], the configuration of any n-mer is specified by the n − 2 angles of bend θ2,..., θ n− 1 at each of the nonterminal residues. θ i = 0 corresponds to linearity of successive bonds.

Figure 1
figure 1

A schematic diagram of a generic 9-mer in 2D. The n-mer is specified by the n - 2 angles of bend θ2,..., θ n− 1.

The AB off-lattice model postulates that two kinds of interactions including the intramolecular potential energy for each molecule: backbone bend potentials V1 and nonbonded interactions V2. Residue species along the backbone can be conveniently encoded by a set of binary variables ξ i . If ξ i = 1, the ith residue is A; if ξ i =-1, it is B. Hence, the total energy function for any n-mer chain is expressed as follows [7]:

Φ = i = 2 n - 1 V 1 ( θ i ) + i = 1 n - 2 j = i + 2 n V 2 ( r i j , ξ i , ξ j )
(1)

Where the distance r ij between monomer i and j is denoted as functions of the intervening angles[7]:

r i j = { [ i = 2 j - 1 c o s [ l = i + 1 k θ 1 ] ] 2 + [ i = 2 j - 1 s i n [ l = i + 1 k θ 1 ] ] 2 } 1 / 2
(2)

The V1 is a simple trigonometric form:

V 1 θ i = 1 / 4 ( 1 - cos θ i )
(3)

The V2 is non-bonded interactions with a species-dependent coefficient:

V 2 ( r i j , ξ i , ξ j ) = 4 ( r i j - 12 - C ( ξ i , ξ j ) r i j - 6 )
(4)

where

C ( ξ i , ξ j ) = + 1 A A + 1 / 2 B B - 1 / 2 A B
(5)

For any N -residue protein, the N − 2 angles of bend should be found when the potential-energy of the 2D AB off-lattice model obtains the minimum energy, which is based on the thermodynamic hypothesis formulated by Anfinsen: the natural structure of a protein in its physiological environment is the one in which the free energy of the whole system is lowest[1]. Thereby, the protein folding problem can be defined:

min { Φ θ 2 , . . . , θ n - 1 } θ i ε - π , π
(6)

AB off-lattice model in 3D

The 3D AB off-lattice model also consists of hydrophobic (A) residues (σ i = +1) and hydrophilic (B) residues (σ i = − 1), and the energy function is given by [6]

E = - k 1 i = 1 N - 2 b ^ i b ^ i + 1 - k 2 i = 1 N - 3 b ^ i b ^ i + 2 + i = 1 N - 2 j = i + 2 N E I J ( r i j ; σ i , σ j )
(7)

Where b ^ i is defined as the bond vector between the monomers i and i + 1 with unit length:

b ^ i b ^ i + 1 = cos θ i b ^ i b ^ i + 2 = cos α i
(8)

As Figure 2 illustrates[6], the N -residue molecule can be described by the N − 1 bond vectors b ^ i or by N − 2 bond angles θ i and N − 3 torsional angles α i , and these two angles are the degrees of freedom of the 3D off-lattice model.

Figure 2
figure 2

A schematic diagram of protein folding in 3D. The N -mer can be specified by the N − 1 bond vectors b ^ i or by N − 2 bond angles θ i and N − 3 torsional angles α i .

The Eucledian distance r ij between sites i and j is associated with the N − 2 bond angles and N − 3 torsional angles. The species-dependent global interactions are given by the Lennard-Jones potential[6]:

E L J ( r i j ; σ i , σ j ) = 4 C ( σ i , σ j ) ( 1 / r i j 12 - 1 / r i j 6 )
(9)

Where σ1, . . ., σ n is a binary string:

σ i = { 1 A - 1 B
(10)

If σ i = 1, the ith reside is A; if σ i = − 1, the ith reside is B, and the formation of hydrophobic core depends on the C(σ i , σ j ). In addition, the strength of species-independent local interactions is reflected by the parameters K1 and K2. The parameter (K1, K2) were tested again and again by using different values in [6], and finally Irback found that the spacial structure is more stability when the parameter (K1, K2) was set to ( 1, 0.5).

As same as, the N − 2 bond angles θ i and N − 3 torsional angles could be found by computing the lowest energy of residue sequences. So the protein folding problem of the 3D AB off-lattice model can be specified as:

min { E θ 2 , . . . , θ n - 1 ; α 3 , . . . , α n - 1 } θ i , α i ε - π , π
(11)

Improved strategies

Tabu search, created by Glover [16], is a meta-heuristic search method. It can be used for complex mathematical optimization and combinatorial optimization problems. Tabu search uses a local or neighborhood search procedure to iteratively move from one potential solution to an improved solution in the neighborhood until some stopping criterion has been satisfied. Local search procedures often become stuck in poor-scoring areas or areas where scores plateau. In order to avoid these pitfalls and explore regions of the search space that would be left unexplored by other local search procedures, tabu search carefully explores the neighborhood of each solution as the search progresses. The solutions admitted to the new neighborhood, are determined through the use of the memory structures. Adaptive memory helps the search process to avoid local optima and explores the solution space economically and effectively without getting trapped into cycles [17]. These memory structures form what is known as the tabu list, a set of rules and banned solutions used to filter which solutions will be admitted to the neighborhood to be explored by the search. To enhance the efficiency, the following strategies are used in the algorithm for predicting the protein folding structure.

Encoding

It is very important for the algorithm how to encode the individual, because the different encoding will affect the effectiveness and performance of searching for the whole spatial structure. The solutions of individual encoding then are often binary coded. This encoding, however, is not well suited for protein folding structure prediction problem. Instead, for an N-residue long chain, the individual can be expressed as h i = { θ 2 i , θ 3 i , , θ n - 1 i } and h i = { θ 2 i , θ 3 i , , θ n - 1 i , α 3 i , α 4 i , , α n - 1 i } in the 2D AB off-lattice model and the 3D AB off-lattice model respectively. Encoding in this way is enabled by the fact that the optimization is performed for the protein amino acids chain.

Annealing mechanism

Simulated Annealing (SA) [18]is a probabilistic method for the global optimization problem of searching an approximation to the global optimum of a cost function. Just as the cooling process of solid shows, the solid stays in a disorder state at the beginning with a high temperature, and coming to more and more order when the temperature drops lower and lower till to the frozen state [19]. The core mechanism of SA is the Metropolis Criterion which is used to decide whether the new state should be accepted. The acceptance probability function depends on the energy E and temperature T. If the change in energy is negative the new state is accepted. If the change in energy is positive it is accepted by the certain probability given by the Boltzmann factor. That is to say, the good and bad solution both can be accepted with a probability to avoid becoming trapped in a local optimum. Annealing algorithm simulates the process described above, the algorithm starts with a give parameter called start temperature, and terminates when the temperature drops to zero or the global optimized solution is founded. In this paper, the cooling schedule is a simple linear equation which is the same as [20] (T i+1 = σT i , 0 ≤ σ ≤ 1, When σ inclines to 1, the temperature declines only slowly).

Adaptive neighborhood generation

Neighborhood choosing is a key activity for tabu searching. In order to improve the performance of the algorithm, an optimized neighborhood choosing strategy as the following described is adopted. In our strategy, the neighborhood is related to the current annealing state and initial temperature. With the change of the current annealing state and the decreasing of the temperature, the neighborhood is more and more small. To illustrate the relation between annealing state and neighborhood, a linearity combination can be defined:

δ N e i b o r = s c a l e * C u r S t a t e / I n i t S t a t e
(12)

Where scale represents the original neighborhood of the tabu search, CurState is current annealing state, and InitState represents the initial annealing state. So the neighborhood gradually narrows in the process of annealing.

Local adjustment strategy

Figure 3 presents the schematic diagram of local adjustment for the tetramer. If there is a local minimum solution found, how to get a further optimize solution depend on it, is a worth considering problem. In many cases, the global minimum solution is near to the local minimum solution. Base on this conclusion, a local adjust strategy is adopted to enhance the performance of the algorithm. Give φlmin), φgmin) and φ(θ), standing for a gotten local minimum solution, the global minimum solution to be founded and a current solution to generate off-spring. So the vector φlmin) - φ(θ) shows the direction for the off-spring of φ(θ) to move towards. As the off-spring moves, the possibility of finding global minimum solution will increase. Aiming to make a simple description, we define:

Figure 3
figure 3

A schematic diagram of local adjustment.

φ θ = f θ x , θ y , θ z
(13)

The vector φlmin(θ) − φ(θ) constrains the off-spring of φ(θ) in a cube area, if φgmin) is lying in this area, the possibility of finding it will be increasing.

The algorithm

The SATS algorithm is based on the AB off-lattice Model. Just as the followed process illustrates, the algorithm generates a hypotheses list by using the same initial conformation mechanism in [21] (The idea is as follows: Pick out all A-monomers and place them in certain spots in the space, and all B-monomers wrap the hydrophobic core.), and calculates every individual's energy of the list by the AB off-lattice model and stores the individual with the best energy as a temp best solution. Then, start to descend the temperature, during this period, a new list with individuals in a small scale around the individuals of the hypotheses list is produced as neighborhood list. After calculate out the individual energy, the neighborhood list is rear-ranged by the energy of the individual. Select several top individuals of the neighbor-hood list as candidates and use deprecated principle to judge whether to add it to the tabu list or not. As the tabu list refreshed, the local adjust principle is adopted to optimize the elements of the list aimed to find the possible best solutions. If the current temperature is lower than the given stop condition, terminate the algorithm and output the founded best solution. The steps of SATS algorithm are as following.

SATS Begin

Parameter:

neighbourhood_size

candidate_size

initial_temperature

end_temperature

descend_rate

Process:

Step 1: Create hypotheses using the initial conformation mechanism;

Step 2: Generate neighborhood solutions;

Step 3: Calculate the energy value by the off-lattice Model;

Step 4: Select candidate solutions from neighborhood;

Step 5: Use deprecated principle to accept the solution and add it to the Tabu List;

Step 6: Apply the local adjustment strategy;

Step 7: Descend the temperature, if the temperature is greater than the end temperture then go to Step 2;

Step 8: Output the Result.

SATS End.

Results and discussion

The SATS algorithm has been implemented with Python in Windows 7. There are two parts of experiment for searching the optimum energy conformation in the AB off-lattice model.

Results for Fibonacci sequences

In the experiments of protein folding structure prediction, the Fibonacci sequences are usually selected as the experiment data to test the performance of the optimization algorithm. The Fibonacci sequences can be defined:

s ( n ) = A n = 0 B n = 1 s ( n - 2 ) + s ( n - 1 ) n > 1
(14)

For comparison, the first part adopts the same short Fibonacci sequences with 3 <= N <= 5 in Ref. [7]. Table 1 lists the lowest energy values of thirty-six short Fibonacci sequences calculated by our SATS, and we believe that these spacial conformation are the ground states, which are entirely identical to those of Stillinger [7]. We also use two same test sequences in [8] to obtain their minimal energies and secondary structures, as Figure 4 illustrates. It is seen from Figure 4 that "AABABB" belongs to α-helix and "AAABAA" belongs to β-sheet.

Table 1 The minimum energies obtained by SATS for the short Fibonacci sequences.
Figure 4
figure 4

The schematic diagram of helix and sheet. (a)The lowest energy of AABABB is -1.36198.(b) The lowest energy of AAABAA is -3.69750.).

Then we test our algorithm on four Finbonacci sequences with 13 ≤ n ≤ 55. Firstly, let us discuss the situation of the 2D AB off-lattice model. The methods used to search lowest energy states include neural networks [7], the pruned-enriched-Rosenbluth method [14], the particle swarm optimization (E EPSO ) [12]. Table 2 lists the lowest energies obtained by our SATS along with those of above three methods for comparison. Of course, with the increase of the sequence length, the search time will longer. It is obvious that our lowest energy values are agree with those of EPSO for n = 13 and n = 21. For other cases, our energy values are lower than the energy values obtained by other three methods, especially for the sequence with n = 55, the result of SATS is obviously improved.

Table 2 The minimum energies obtained by different algorithm for Fibonacci sequences with 13 ≤ N ≤ 55 in 2D.

In order to further analyze the effect of the SATS algorithm, we also give the lowest energy conformations of four Finbonacci sequences in Figure 5. The hydrophobic (A) is represented by the red ball, and the hydrophilic (B) is represented by the while ball. As is known to all, a hydrophobic cores always are surrounded by hydrophilic residues for a realistic protein chain. It is observed from four conformations that n = 13 forms the single hydrophobic core, while other three conformations only form the particle clusters. This shows that the 2D AB off-lattice model can show some important attributes of the real protein to certain extent, and it still should be improved.

Figure 5
figure 5

The 2D lowest energy conformations of the four Fibonacci sequences obtained by SATS algorithm. (a) n = 13; (b) n = 21; (c) n = 34; (d) n = 55. (The red balls represent hydrophobic A monomers, white balls represent hydrophilic B monomers.

In the case of 3D AB off-lattice model, it also shows that the results obtained with SATS are better than the results of the previous algorithms. The methods used to search lowest energy states in 3D off-lattice model include the annealing contour Monte Carlo (ACMC) algorithm [22], the energy landscape paving minimizer ELP [9], the conformational space annealing (CSA) algorithm [13] and the local adjust genetic algorithm studied by us in [23]. Table 3 lists the lowest energies obtained by different methods. For n = 13, the energy value of SATS is agree with that of ACMC, but is lower than energy values obtained by other three method. For n = 21, our energy value is agree with those of ELP and LAGA, but is slightly lower than that of CSA and much lower than that of ACMC. For n = 34 and n = 55, our results are the lowest energy in all results obtained by different methods, which shows that SATS has better performance for long sequence. Similar to 2D prediction, Figure 6 depicts the 3D lowest energy conformations with the same way in [21] (Where the hydrophobic (A) is denoted by the red ball, the hydrophilic (B) is denoted by the grey ball). Each of four sequences can fold into a single hydrophobic core flanked by the hydrophilic residues, compared with 2D off-lattice model, the 3D AB off-lattice model is more approach to realistic protein folding structure.

Table 3 The minimum energies obtained by different algorithm for Fibonacci sequences with 13 ≤ N ≤ 55 in 3D.
Figure 6
figure 6

The 3D lowest energy conformations for the four Fibonacci sequences obtained by SATS. (a) n = 13; (b) n = 21; (c) n = 34; (d) n = 55. (The red balls represent hydrophobic A monomers, white balls represent hydrophilic B monomers.

The curves in the Figure 7 are about the minimum energies of 2-dimensional model and 3-dimensional model obtained by different algorithm. From Figure 7, we can find that our methods show much better performance with increasing sequence length.

Figure 7
figure 7

The energy curves obtained by different algorithms. (a) The energy curves in the 2D model;(b) The energy curves in the 3D model.

Results for real protein sequences

The second part deals with some real protein sequences from the Protein Data Bank (PDB). The K-D method [24] is adopted to distinguish between hydrophobic residues and hydro-philic residues. In short, I, V, L, P, C, M, A and G are belong to hydrophobic residues (A), while D, E, F, H, K, N, Q, R, S, T, W and Y are belong to hydrophilic residues (B). Table 4 lists the same four short real protein studied by Ref.[12]. Table 5 shows the minimum energy obtained by four different algorithms. The results of SATS are lower than those of other three algorithms, especially for 1EDN. This shows that SATS is effective to predict the folding structure of the real protein. Figure 8 shows the lowest energy configurations of four real protein sequence obtained by the SATS method, corresponding to the lowest energy values in Table 5. It can be seen from the Figure 8, the configurations of the 1EDP and 1BXL form the hydrophobic core respectively, the other two protein sequences only formed several particle clusters.

Table 4 The four short sequences of real protein.
Table 5 The lowest optimum energies of the short real protein sequences.
Figure 8
figure 8

The lowest energy conformation for the four real protein sequences obtained by SATS. (a)1BXP; (b) 1BXL; (c) 1EDP; (d) 1EDN. (The red balls represent hydrophobic A monomers, the white balls represent hydrophilic B monomers.

Table 6 lists the same two long real proteins of 1AGT and 1AHO studied by Ref.[11]. For 1AHO, the result of SATS is much better than those of PSO [11] and SA [25]. For 1AGT, the result is better than the result of SA while is slightly worse than the result of PSO. Figure 9 shows the information about sequence of two real proteins from PDB, and depicts the lowest energy conformations obtained by SATS. In these two conformations, the hydrophobic monomers also form several clusters of hydrophobic residues. This experiment also demonstrates the AB off-lattice model can reflect some characteristics of the real protein.

Table 6 The lowest optimum energies of the two long real protein sequences.
Figure 9
figure 9

The lowest energy conformation of 1AGT and 1AHO. (The red balls represent hydrophobic A monomers, the white balls represent hydrophilic B monomers.).

Conclusions

This paper has shown that protein folding conformation based on only anfinsen's thermodynamic hypothesis can be feasible by SATS method which combines simulated annealing algorithm and tabu search algorithm. In order to verify the efficiency of the algorithm, 2D off-lattice model and 3D off-lattice model are both adopted by using Fibonacci sequences and real protein sequences respectively. In addition, local adjust strategy is used to improved the accuracy and speed of searching the protein native state. It is obvious that some of our results for lowest energy are better than those of other methods. Therefore, SATS is more effective in solving the protein folding structure problem. In the future, the one of most important work is how to make the algorithm more effective and accuracy for real protein sequence prediction in 3D space. Besides, the AB off-lattice model only considers two kinds of residues and two kinds of interaction energy, so it cannot reflect more important properties of the real protein. Therefore, we should study other models to explore the more interaction energy of protein amino acids.