Introduction

Since the introduction of the HP lattice model [1], heuristic search algorithms for a variety of lattice models have been proposed and have proven useful for exploring the relationship between a primary amino acid sequence and its native folded structure, particularly in the protein folding problem (PFP) and protein structure prediction (PSP). The main purpose of the HP lattice model is to understand the physicochemical principles of protein folding by searching for the lowest free-energy conformation of a protein.

Despite the difference in modeling accuracy, both high-resolution and low-resolution models can contribute to an understanding of the protein structure obtained from experiments, such as NMR and crystallography. Moreover, they have various applications in protein modification, protein-ligand and protein-protein interactions [2]. Table 1 summarizes the relationship between modeling accuracy and the related applications.

Table 1 The relationship between modeling accuracy and the related application.

To improve the modeling accuracy, several lattice models have been developed and proposed. The present study visually compares four popular lattice models: the 2D square and triangular lattice models, the 3D cubic lattice model, and the face-centered cubic (FCC) lattice model. The protein structures obtained from the four modeling types were compared with reported 'real' biological protein structures. As Figure 1 shows, the 2D triangular lattice model can give better structure modeling and prediction for proteins with short primary amino acid sequences.

Figure 1

Four different types of lattice model for visual comparison, taking the protein with PDB id 1A0Ma as the example. (a) Real protein structure; (b) and (c) are the 2D square and triangular lattice model simulation results, in which black-filled dots indicate hydrophobic amino acids and white dots denote hydrophilic amino acids. (d) and (e) are the 3D cubic and face-centered cubic (FCC) lattice model simulation results from CPSP-tools [3], in which green balls indicate hydrophobic amino acids and gray balls indicate hydrophilic amino acids.

In solving this prediction problem, Hart and Istrail [4] first gave a 1/4 (25%) approximation for the 2D square lattice and a 3/8 (38%) approximation for the 3D cubic lattice. Agarwala et al. [5] gave a 6/11 (54%) approximation for the 2D triangular lattice, which is consistent with our experimental results.

Many researchers have focused on the square lattice model because it has many associated benchmarks, a large amount of data accumulated over the years, and results available for comparison across different strategies and modeling methods. By contrast, little work has been done on the 2D triangular lattice model. In this paper, we propose a genetic algorithm with an elite-based reproduction strategy (ERS-GA). Based on ERS-GA, this study further develops a hybrid of hill-climbing and genetic algorithm (HHGA) for protein structure prediction on the 2D triangular lattice. Experiments were conducted to validate the effectiveness of this method.

The remainder of this paper is structured as follows: Section II gives the preliminaries and the definition of the protein structure prediction problem in the HP 2D triangular lattice model. Section III describes the methodology used in the study. The comparison of results is presented and discussed in Section IV followed by the conclusion in Section V.

Preliminaries

Proteins play fundamental and crucial roles in nearly all biological processes, such as enzymatic catalysis, signal transduction, DNA and RNA synthesis, and embryonic development. It has been a long-standing goal of molecular biology to predict the tertiary structure of a protein from its primary amino acid sequence [6, 7]. This paper focuses on ab initio modeling, within which the 2D HP triangular lattice model is thought to be the best two-dimensional model for protein structure prediction at present.

HP lattice model

The HP lattice model [1] is the most frequently used model. It is based on the observation that the hydrophobic interaction between amino acid residues is the driving force for protein folding and for the development of the native state in proteins [8]. In this model, each amino acid is classified based on its hydrophobicity as an H (hydrophobic or non-polar) or a P (hydrophilic or polar). The HP lattice model represents an HP protein sequence as a self-avoiding walk (SAW) on the lattice, and favors conformations with low free energy as determined by the H-H interactions. The energy of a given conformation is defined by the number of topological neighboring (TN) contacts between H's that are not adjacent in the sequence. Figure 2 shows an example for the 2D triangular lattice model.

Figure 2

An optimal conformation for the sequence (HP)2PH(HP)2(PH)2HP(PH)2 in a 2D triangular lattice model. The black filled dots denote hydrophobic amino acids and the red open circles denote hydrophilic amino acids. Each H-H contact in the conformation is assigned an energy value of -1, so the minimum free energy corresponds to the maximum number of H-H contacts. The conformation shown has 15 H-H contacts (energy = -15).

Calculation of free energy

The free energy of a protein can be calculated by the following formulae [9]:

(1)
(2)

where the parameter

(3)
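One formulation consistent with the definitions in the text, in which every topological contact between two H residues that are not adjacent in the sequence contributes -1, can be sketched as follows. The symbols \Delta r_{ij} and \epsilon_{ij} are assumed notation and need not match equations (1)-(3) of [9] term by term.

```latex
E = \sum_{i=1}^{n-1} \sum_{j=i+2}^{n} \Delta r_{ij}\, \epsilon_{ij},
\qquad
\Delta r_{ij} =
\begin{cases}
1 & \text{if residues } i \text{ and } j \text{ occupy neighboring lattice points},\\
0 & \text{otherwise},
\end{cases}
\qquad
\epsilon_{ij} =
\begin{cases}
-1 & \text{if residues } i \text{ and } j \text{ are both H},\\
0 & \text{otherwise}.
\end{cases}
```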

Protein folding can then be transformed into an optimization problem for the conformation with minimal free energy. Formally, given an HP sequence s = s_1 s_2 … s_n, find a conformation of s with minimum energy. That is, the problem is to find c* ∈ C(s) such that E(c*) = min{E(c) | c ∈ C(s)}, where C(s) is the set of all valid conformations for s [10].
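As an illustration of the objective E(c), the following minimal sketch counts H-H contacts for a conformation given as a list of lattice points. The axial coordinate convention in MOVES and the function name hp_energy are assumptions made for this example; they are not taken from the formulae of [9].

```python
# Assumed axial coordinates for the six unit moves of the 2D triangular lattice.
MOVES = {"R": (1, 0), "L": (-1, 0), "RU": (0, 1),
         "LD": (0, -1), "LU": (-1, 1), "RD": (1, -1)}


def hp_energy(sequence, coords):
    """Return minus the number of H-H contacts between non-consecutive residues."""
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not a self-avoiding walk")
    index_at = {point: i for i, point in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES.values():
            j = index_at.get((x + dx, y + dy))
            # j > i + 1 counts each pair once and skips chain neighbors.
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1
    return energy


# Three residues folded into a triangle: residues 0 and 2 touch.
triangle = [(0, 0), (1, 0), (0, 1)]
print(hp_energy("HHH", triangle))
```

Minimizing this value over all self-avoiding walks of the sequence corresponds to the search for c* described above.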

Triangular lattice model

A significant drawback of the cubic lattice [5] is that two residues at an even distance from each other in the primary sequence cannot be in topological contact when the protein is embedded in this lattice. The same holds on the square lattice: two amino acids in contact in any folding must be an odd distance apart in the protein sequence [5]. To address this issue, Joel et al. [11] introduced the 2D triangular lattice model. As Figure 3 shows, each lattice point has six neighbors in the two-dimensional triangular lattice. Since each residue other than the first and the last has two covalent neighbors, a residue at a lattice point can be in topological contact with at most four other residues. Thus, each such residue is involved in up to four H-H contacts [11].
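The parity restriction can be checked by brute force on short chains. The sketch below enumerates all self-avoiding walks of six residues on each lattice and records the sequence separations at which topological contacts occur. The move sets use an assumed axial coordinate system, and contact_separations is a hypothetical helper written for this illustration.

```python
SQUARE = [(1, 0), (-1, 0), (0, 1), (0, -1)]
# The triangular lattice adds two diagonal neighbors in axial coordinates.
TRIANGULAR = SQUARE + [(1, -1), (-1, 1)]


def contact_separations(moves, n):
    """Collect j - i over all contacts between non-consecutive residues
    in every n-residue self-avoiding walk that starts at the origin."""
    seen = set()

    def extend(path):
        if len(path) == n:
            for i in range(n):
                for j in range(i + 2, n):
                    dx = path[j][0] - path[i][0]
                    dy = path[j][1] - path[i][1]
                    if (dx, dy) in moves:
                        seen.add(j - i)
            return
        x, y = path[-1]
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in path:  # keep the walk self-avoiding
                extend(path + [nxt])

    extend([(0, 0)])
    return seen


print(sorted(contact_separations(SQUARE, 6)))      # only odd separations
print(sorted(contact_separations(TRIANGULAR, 6)))  # even separations appear too
```

On the square lattice only odd separations appear, whereas on the triangular lattice residues i and i+2 can already be in contact by closing a triangle.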

Figure 3

The 2D triangular lattice model neighbors of vertex (x, y).

With the unit vectors obtained from the triangular lattice, it is much easier to model protein conformation on a two-dimensional triangular lattice without exhibiting the parity problem [5]. However, finding an optimal conformation as a self-avoiding walk on the lattice is NP-complete [12]. To tackle this problem, several heuristic search algorithms [13-18] have been developed for various lattice models. Backofen and Will [21] utilized advanced techniques such as constraint programming to calculate all optimal side-chain structures of a given sequence, and proved their optimality [3]. Further, Böckenhauer et al. [15] extended the library by implementing the 2D triangular lattice and the pull move set for triangular lattice models.

In this paper, we develop an effective hybrid of local search and genetic algorithm (GA) to solve this problem. Its performance is examined and compared with the results in [15]. More details about the proposed algorithm are presented in the next section.

Methods

This paper introduces the elite-based reproduction strategy into GA, yielding ERS-GA. Further, we propose a hybrid of hill-climbing and ERS-GA, called HHGA, for protein structure prediction on the 2D triangular lattice. The proposed HHGA is, in essence, a combination of a global search algorithm with a local search operator. That is, HHGA works within the framework of ERS-GA and adopts hill-climbing to enhance its exploitation capability. Figures 4 and 5 show the flow charts of the proposed ERS-GA and HHGA. The following subsections describe the operators of ERS-GA and HHGA.

Figure 4

Flowchart of the elite-based reproduction strategy (ERS).

Figure 5

Flowchart of the hybrid of hill-climbing and genetic algorithms (HHGA)

Initialization

For an input amino acid sequence of length n, a candidate conformation in the 2D triangular lattice [11, 14] is encoded as a chromosome in the form of a string of length (n – 1) over symbols {L, R, LU, LD, RU, RD}, denoting the fold directions left, right, left-up, left-down, right-up and right-down, respectively. An initial population is generated randomly in the (n – 1) dimensional space within a predetermined range. In this paper, population size was set at 200 empirically.
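The encoding can be sketched as follows. The mapping from the six symbols to lattice steps uses an assumed axial coordinate system, and the helper names are hypothetical. How walks that revisit a lattice point are treated during initialization is not stated in the text, so this sketch only provides a check and does not filter the population.

```python
import random

# Assumed axial-coordinate step for each fold direction symbol.
MOVES = {"L": (-1, 0), "R": (1, 0), "LU": (-1, 1),
         "LD": (0, -1), "RU": (0, 1), "RD": (1, -1)}
SYMBOLS = tuple(MOVES)


def decode(chromosome, start=(0, 0)):
    """Turn a string of n - 1 fold directions into n lattice points."""
    coords = [start]
    for symbol in chromosome:
        dx, dy = MOVES[symbol]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def is_self_avoiding(coords):
    """True if no lattice point is visited more than once."""
    return len(set(coords)) == len(coords)


def initial_population(n, size, rng):
    """Generate `size` random chromosomes of length n - 1."""
    return [[rng.choice(SYMBOLS) for _ in range(n - 1)] for _ in range(size)]


rng = random.Random(0)
population = initial_population(n=20, size=200, rng=rng)
valid = sum(is_self_avoiding(decode(c)) for c in population)
print(len(population), "chromosomes,", valid, "of them self-avoiding")
```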

Each chromosome in the population needs to be evaluated for its fitness. Here we directly use equation (2) of free energy as the fitness function. The goal for an optimization algorithm like HHGA is to minimize the fitness value, namely, the free energy. The evaluated chromosomes are sorted according to their fitness values. This sorted population serves as the basis of the subsequent reproduction process.

Elite-Based Reproduction Strategy (ERS)

Reproduction is a process in which the information of candidate solutions is modified and copied, depending upon their fitness values. The reproduction in GA consists of selection, crossover, and mutation. For the ERS-GA and HHGA, this study adopts the elite-based reproduction strategy, which keeps the top half of the population for the next generation and generates offspring by performing crossover and mutation on the second half of the population [19]. In the experiments, this study uses two-point crossover with a crossover rate of 0.8 and uniform mutation with a mutation rate of 0.4.
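One generation of this strategy might look like the sketch below. Several details are assumptions made for illustration, since the text does not specify them: parents from the second half are paired at random, the mutation rate is applied per chromosome, and a mutation re-draws a single fold direction. The function names are hypothetical.

```python
import random

SYMBOLS = ("L", "R", "LU", "LD", "RU", "RD")


def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points of parents a and b."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def uniform_mutation(chromosome, rng):
    """Re-draw one randomly chosen fold direction uniformly from SYMBOLS."""
    mutant = list(chromosome)
    mutant[rng.randrange(len(mutant))] = rng.choice(SYMBOLS)
    return mutant


def ers_generation(population, fitness, rng, cx_rate=0.8, mut_rate=0.4):
    """Keep the better half unchanged; recombine and mutate the other half."""
    ranked = sorted(population, key=fitness)  # lower fitness (energy) is better
    half = len(ranked) // 2
    elites = [list(c) for c in ranked[:half]]
    parents = [list(c) for c in ranked[half:]]
    rng.shuffle(parents)  # assumed random pairing
    offspring = []
    for k in range(0, len(parents) - 1, 2):
        a, b = parents[k], parents[k + 1]
        if rng.random() < cx_rate:
            a, b = two_point_crossover(a, b, rng)
        offspring.extend([a, b])
    if len(parents) % 2:  # an unpaired parent is carried over as is
        offspring.append(parents[-1])
    offspring = [uniform_mutation(c, rng) if rng.random() < mut_rate else c
                 for c in offspring]
    return elites + offspring
```

With a population of 200, each call keeps 100 elites and produces 100 offspring, so the population size stays constant across generations.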

Local search

Two local search operators are proposed for the protein structure prediction problem. First, given the current solution, local search I generates a neighboring solution in a way similar to the mutation operation, i.e., by randomly changing the fold direction of a residue. If the fitness value of the neighbor is better than that of the current solution, the neighbor is accepted and replaces the current one.

In local search II, the neighboring solutions are generated in a way similar to the crossover operation. That is, five neighbors are created by changing the direction of the second segment after the crossover point, with rotation angles of 60°, 120°, 180°, 240° and 300°, respectively. If any of the five folding directions leads to a better fitness than the original direction, this neighbor replaces the current solution.
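Both operators can be sketched as simple hill-climbing steps. The following is a minimal illustration under stated assumptions: fold directions are absolute and ordered counter-clockwise, walks that are not self-avoiding receive an infinite penalty, local search II rotates every move after the cut point, and when several rotations improve the fitness the best one is kept. None of these details are fixed by the text, and the function names are hypothetical.

```python
import random

# Assumed axial-coordinate steps, listed counter-clockwise in 60 degree steps.
MOVES = {"R": (1, 0), "RU": (0, 1), "LU": (-1, 1),
         "L": (-1, 0), "LD": (0, -1), "RD": (1, -1)}
CCW = tuple(MOVES)


def fitness(sequence, chromosome):
    """Free energy of the decoded walk; inf if it is not self-avoiding."""
    coords = [(0, 0)]
    for symbol in chromosome:
        dx, dy = MOVES[symbol]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    if len(set(coords)) != len(coords):
        return float("inf")  # assumed penalty for invalid walks
    index_at = {p: i for i, p in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES.values():
            j = index_at.get((x + dx, y + dy))
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1
    return energy


def local_search_1(sequence, chromosome, rng):
    """Mutation-like step: re-draw one direction, accept only if better."""
    neighbor = list(chromosome)
    neighbor[rng.randrange(len(neighbor))] = rng.choice(CCW)
    if fitness(sequence, neighbor) < fitness(sequence, chromosome):
        return neighbor
    return list(chromosome)


def local_search_2(sequence, chromosome, rng):
    """Crossover-like step: rotate the segment after a random cut point."""
    cut = rng.randrange(1, len(chromosome))
    best, best_fit = list(chromosome), fitness(sequence, chromosome)
    for steps in range(1, 6):  # 60, 120, 180, 240 and 300 degrees
        tail = [CCW[(CCW.index(s) + steps) % 6] for s in chromosome[cut:]]
        neighbor = list(chromosome[:cut]) + tail
        neighbor_fit = fitness(sequence, neighbor)
        if neighbor_fit < best_fit:
            best, best_fit = neighbor, neighbor_fit
    return best


# A straight three-residue chain is folded into a triangle by a 120 degree turn.
print(local_search_2("HHH", ["R", "R"], random.Random(0)))
```

Because a neighbor is accepted only when it strictly improves the fitness, neither operator can make the current solution worse.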

Termination condition

A genetic algorithm requires a termination condition to stop the evolutionary process and return the final result. In this study, the experiments ran ERS-GA and HHGA for a maximum of 200 generations. The best chromosome of the population is then returned as the final result.

Numerical Results

Table 2 lists the eight benchmark sequences in our experiments. These sequences have been used for the 2D square HP model [20]; however, in the 2D triangular HP model the minimum energy of these benchmarks was still unknown. The comparison with previous studies provided a means of demonstrating the effectiveness of the method described here.

Table 2 The benchmarks for the 2D triangular lattice HP model.

The experiments were conducted in two steps. First, ERS-GA was used to predict the protein structure to evaluate the efficacy of this method. Tables 3 and 4 summarize the results and compare them with prior work. According to the results in Table 3, the proposed ERS-GA significantly outperforms simple genetic algorithm (SGA) and hybrid genetic algorithm (HGA).

Table 3 Comparison of the proposed approach with the simple genetic algorithm (SGA) and hybrid genetic algorithm (HGA).
Table 4 Comparison of a hybrid of hill-climbing and GA (HHGA) with the tabu search (TS).

Next, the HHGA integrates the hill-climbing local search into the ERS-GA approach for performance improvement. Table 5 shows that this hybrid algorithm, i.e., HHGA, can effectively enhance the performance and performs comparably with the tabu search proposed in [15]. This comparative outcome demonstrates that HHGA is competitive with the state-of-the-art method in protein structure prediction. Figure 6 plots the structures obtained from HHGA for eight protein sequences.

Table 5 Comparison of ERS-GA with HHGA in free energy obtained (Mean/Best) and average running time.
Figure 6

(a) to (h) Results of the structure of eight protein sequences.

Table 5 further presents the comparison of the ERS-GA with the HHGA, where each algorithm was run 30 times. The average running time was measured on Intel i7-920 machines. The experimental results show that HHGA achieves better solution quality, i.e., lower energy, than ERS-GA does on all the benchmarks. This validates the effectiveness of the local search in HHGA. On the other hand, HHGA gains this advantage at the cost of running time.

Conclusions

In the ab initio technique, the lattice model is one of the most frequently used methods in protein structure prediction. From visual comparison, it was found that the 2D triangular lattice model can yield better structure modeling and prediction for proteins with short primary amino acid sequences. Nevertheless, the 2D triangular lattice model has rarely been used in protein structure prediction.

This paper has highlighted this interesting issue and provides a short introduction to the working method for 2D triangular lattice models. Furthermore, the paper proposes the genetic algorithm with elite-based reproduction strategy (ERS-GA) and a hybrid of hill-climbing and genetic algorithms (HHGA) for protein structure prediction on the 2D triangular lattice. The simulation results show that ERS-GA and HHGA can successfully be applied to the problem of protein structure prediction. The satisfactory simulation results validate the effectiveness of the proposed algorithms; in addition, they demonstrate that the 2D triangular lattice model is promising for protein structure prediction.