Background

Determination of protein structure is important for understanding protein functions [1]. The classical techniques for protein structure determination include X-ray crystallography, nuclear magnetic resonance, and electron microscopy. These determination techniques, however, often suffer from the limitations in both expensive costs and long determination period, leading to the ever-increasing gap between the number of known protein sequences and that of solved protein structures [2]. Computational approaches to protein structure prediction from sequences are becoming increasingly important to narrow down the gap [3].

The protein structure prediction approaches can be categorized into two families, namely, template-based modeling [410] and ab initio approaches [1115]. Recently the predicted contacts have also been shown to be invaluable to protein structure prediction [1621]. Unlike the template-based modeling approaches, the ab initio prediction approaches work without requirements of known similar protein structures. Briefly speaking, most ab initio prediction approaches are based on the hypothesis that the native structure of a protein should be the highly-populated one with sufficiently low energy; thus, ab initio approaches usually perform conformation search to find a structural conformation with sufficiently low energy. For example, Rosetta employs the Monte Carlo technique to search conformations assembled from fragments of known structures, and finally reports the centroid of a large cluster of low-energy conformations [11].

For the ab initio prediction approaches, one of the key issues is designing an effective energy function [11, 14, 15]. Typically, an energy function is a weighted-sum of multiple energy terms. The energy terms characterize specific structural features, especially the interplay between local and global interactions of residues. For example, the hydrophobic interaction term was designed to capture the observed tendency of non-polar residues to aggregate in aqueous solution and exclude water molecules. Van der Waals force term is the sum of the attractive or repulsive forces among residues. Hydrogen bonding term describes the electromagnetic attractive interaction between polar molecules in which hydrogen is bound to highly electronegative atom oxygen in the carboxyl [1]. In Rosetta, a total of 13 energy terms were used at the residue level, and over 140 terms were used at the full-atom level; therefore, it is important to find the optimal weighting of so many energy terms [11]. This study focuses on designing an optimal weighting of the 13 energy terms used in Rosetta.

Ideally, an effective energy function is expected to be able to distinguish the native structure from non-native conformations (called decoys), and could drive as much as possible initial conformations to the native-like one during the conformation search process. To achieve these two objectives, a widely-used strategy for designing energy function is to maximize the correlation between energy and quality of decoys [22]. Here the quality of a decoy refers to the structural similarity between the decoy with the native structure, which is measured using root mean square deviation (RMSD) of backbone atoms in this study. Inspired by the idea of “funnel-shaped free energy surface”, Levitt et al. proposed a funnel sculpting technique to construct energy functions that allow the conformation search procedure to easily “roll” into the native structure from a random starting conformation [23]. In another study, Shell et al. attempted to smooth energy function to make the energy landscape a funnel [24].

In this study, we present a framework that constructs effective energy function for protein structure prediction. Our framework, called BARS, consists of two procedures, i.e., broadening attraction-basin where the native structure lies in the energy landscape (hereafter denoted as native attraction-basin), and reverse sampling. The underlying rationale is that the larger the attraction-basin is, the more likely for the Monte Carlo procedure to find the native structure. To explore the attraction basin, we performed reverse sampling starting from the native structure. Subsequently, we tuned the weights of energy terms to broaden the native attraction-basin and thus acquired an improved energy function. We showed that both the possibility of successful search and the quality of predicted conformation increase when using the improved energy function.

The manuscript is organized as follows: “Methods” section describes the whole framework of our method, and the linear program model to optimize protein energy weights as well. “Results and discussion” section lists experimental results of the optimized energy function. In “Conclusion” section, we will discuss some limitations of our method and possible future works.

Methods

To construct an effective energy function, our BARS framework alternates two procedures, i.e., for an energy function, we first explored the native attraction-basin in the corresponding energy landscape using reverse sampling, and then improved the energy function through broadening the attraction-basin. These two procedures were alternated until the energy function changes sufficiently small between successive iterations as below.

The details of the two procedures are described as follows.

Exploring the native attraction-basin using reverse sampling

When applying Monte Carlo technique to search the structural conformation with the lowest energy, the ab initio approaches might finally end with success if starting from some initial structural conformations, and might end with failure if starting from other initial structural conformations. An initial structural conformation is said to lie in the native attraction-basin if the conformation can finally evolve into the native structure during the conformation search process.

To explore the native attraction-basin under a specific energy function, we propose a technique called reverse Monte Carlo sampling. Here, the term “reverse” comes from the fact that the sampling process works essentially reverse to the general Monte Carlo technique used for conformation search. Specifically, the general Monte Carlo search procedure starts from a random initial conformation and moves towards the native structure, during which the energy of conformation is reducing. For each inter-mediate structural conformation, a perturbation is made to generate a new conformation. Some popular perturbation techniques include fragment replacing used by Rosetta [11] and torsion angle sampling used by FALCON [13]. The newly-generated conformation is accepted if it has lower energy relative to the original conformation; otherwise the new conformation will be rejected with a probability according to the Metropolis-Hasting rule [25] (Fig. 1). To emphasize the difference between the general Monte Carlo search technique and the sampling technique used in this study, we denoted the former one as forward Monte Carlo technique.

Fig. 1
figure 1

Conformation searching process using the Monte Carlo technique. The protein structure prediction approaches usually employ the Monte Carlo technique to search the conformation with the lowest energy. An execution of conformation search will generate a path of conformations, e.g., the lines in blue and yellow. Here the energy landscape was drawn using 1000 decoys of protein 1ctfA: we calculated the RMSD among all possible pairs of decoys, and then performed principal component analysis of the generated RMSD matrix [23]. The x and y axises represent the first and second principal components, respectively. Decoy energy was calculated using score3 of Rosetta

On the contrary to the forward Monte Carlo search technique, our reverse Monte Carlo sampling procedure starts from the native structure and moves towards the edge of the native attraction-basin, during which the energy of conformations is increasing (Fig. 2). The reverse sampling process ends at a conformation if any perturbation of this conformation cannot leads to increase of energy. Intuitively, this conformation lies at the edge of the native attraction-basin and thus is denoted as edge point conformation in this study.

Fig. 2
figure 2

Reverse Monte Carlo sampling and tuning weights of energy terms. Reverse Monte Carlo sampling starts from the native structure and moves towards the edge of the native attraction-basin, during which the energy of conformations is increasing. Here the solid line shows a path of conformations generated by reverse Monte Carlo sampling S0S1S2S3. S3 represents an edge point conformation as its perturbation neighbors, e.g., \(S_{3}^{1}\) and \(S_{3}^{2}\), have lower energy than S3. The dash line shows these conformations after tuning weights of energy terms, where S3 is no longer the edge point conformation as one of its perturbation neighbors, \(S_{3}^{2}\) has larger energy than S3. Note that during weight-tuning, the monotonicity of energy is maintained within the native attraction-basin

Formally, each execution of reverse sampling will generate a path of conformations P=S0S1S2→...→Sn, where S0 denotes the native structure, Si(1≤in) denotes the ith inter-mediate conformations along the reverse sampling path, and Sn denotes the final edge point conformation. The energy of the inter-mediation conformations increases along the path, suggesting the monotonicity of energy within the native attraction-basin. The RMSD between S0 and Sn is calculated as a rough measure of the radius of native attraction-basin. For the edge point conformation Sn, we perform m times of perturbation, thus acquiring m perturbation neighbors of Sn, denoted as \(N(S_{n})=\left \{S_{n}^{(1)}, S_{n}^{(2)},..., S_{n}^{(m)}\right \}\) (Fig. 2). In our study, m is set as 1000.

Broadening the native attraction-basin by smoothening the edge point conformations

Intuitively, if we can “smoothen” the edge point conformations, the native attraction-basin will be broadened since the reverse sampling process will not be stuck at these edge point conformations. We accomplished the “smoothening” operation through tuning weights of energy terms such that the energy of Sn is decreased to be less than at least one of its perturbation neighbors \(S_{n}^{(i)} (1 \le i \le m)\). To make the native structure still the lowest one under the new energy function, we imposed a constraint on weight-tuning such that after tuning weights, the new energy of Si should be lower than that of the conformation Si+1. In other words, the monotonicity of energy are maintained within the attraction-basin, and thus the shape of the native attraction-basin will also be maintained. Figure 2 shows how the conformation path changes after improving energy function.

We designed a linear program to calculate the optimal weights that satisfy this constraint.

$$ \begin{aligned} &\text{min}\qquad ||W-W_{0}|| \\ s.t. W \cdot {E}_{i} &\leq W\cdot {E}_{i+1}, 0\leq i\leq n-1 \end{aligned} $$
(1)
$$ W \cdot {E}_{n} \leq \frac{1}{m}\sum\limits_{j=1}^{m} W\cdot {E}_{n}^{(j)} $$
(2)
$$ W \geq 0 $$
(3)
$$ |W| = |{W}_{0}| $$
(4)

Here the vector W denotes the weights of energy terms, and W0 denotes the original weights before tuning. For an inter-mediate conformation Si in the reverse sampling path, Ei denotes the vector of its energy terms, i.e. \(E_{i}= < e_{i}^{(1)},e_{i}^{(2)},\ldots,e_{i}^{(13)} >\), where \(e_{i}^{(k)}\) represents the k-th energy term. The objective of the linear program is to find a new weighting scheme with change as small as possible. Formula (1) describes a constraint that the original relative order of Si and Si+1, i.e., the monotonicity of energy within the native attraction-basin, should be kept after tuning weights. Formula (2) was designed to “smoothen” the edge point conformation, i.e. at least one of the m perturbation neighbors of the edge point conformation has a higher energy; thus, Sn is no longer an edge point conformation under the new energy function.

Results and discussion

Data set

We evaluated the BARS framework on Test101 dataset that contains a total of 101 benchmark proteins. The criteria for selecting these proteins are: i) The length of these proteins are less than 120 amino acids as the energy terms used in this study were designed for small proteins [11]. ii) These proteins cover most SCOP superfamilies as energy functions differ with protein class. We used three proteins as representatives of three SCOP classes (Table 1) to explain the working process of the BARS framework and summarized experimental results on Test101 dataset in Additional file 1: Table S1.

Table 1 Three benchmark proteins used in the study

Analysis of evolution of energy functions

As mentioned above, the BARS framework alternates the broadening attraction-basin and reverse sampling steps to gradually improve energy function. In our experiment, the initial weights of energy terms were set as the weights used by Rosetta in the scoring function score3.

Table 2 shows how the energy functions evolve as iteration proceeds for protein 1ctfA. From this table, it can be observed that the weights of energy terms are almost fixed after 6 iterations. In addition, although the difference between consecutive iterations are not very large as expected, the final weights are quite different from the initial ones (Manhattan distance: 15.16).

Table 2 Weights of energy terms during the iteration process of BARS

We also compared the final weights for proteins in different classes. As shown in Table 3, the final weights exhibited considerable difference for proteins from different classes. For example, the weight of Env term in 1iieA (class A) is 5.69, about 4 times larger than that of 1ctfA (class D), and over 2.5 times larger than that of 1iloA (class C). This is consistent with the fact that the environment local geometrical term is more important for all- α proteins, since local residue-residue interactions dominate the helix formation process. In addition, 1ctfA (class D) can be distinguished from the other two proteins at the sheet term: the final weight of this term is 3.48 for 1ctfA, much larger than that of 1iloA (1.90) and 1iieA (1.14). This is also reasonable as α + β proteins usually contain anti-parallel β-sheets, whereas α/ β proteins contains β- α- β motifs. Taken together, the table supports the view point that different energy terms are emphasized for proteins in different classes.

Table 3 The optimized weights of energy terms calculated by BARS for protein 1ctfA (after 6 iterations), 1iloA (after 5 iterations), 1iieA (after 5 iterations)

Broadening attraction-basin as iteration proceeds

We further investigated whether the native attraction-basin could be broadened after smoothening the edge point conformations. To measure the size of the native attraction-basin, we performed reverse sampling for 50 times, thus acquiring 50 edge point conformations. The mean RMSD between these edge point conformations and the native structure is calculated and used to measure the size of the native attraction-basin. Intuitively the calculated mean RMSD can be treated as radius of the native attraction-basin.

As shown in Fig. 3, the mean RMSD was 6 Å initially, and increased to nearly 14 Å at the final iteration. This clearly suggested that the attraction basin was really significantly enlarged as iteration proceeds.

Fig. 3
figure 3

Broadening the native attraction-basin of protein 1ctfA as iteration proceeds. Here, x-axis denotes the number of iterations, and y-axis denotes the RMSD between the native structure and the edge point conformations. For the weighting scheme at each iteration, a total of 50 paths were generated, thus yielding 50 edge point conformations. The mean RMSD increases as iteration proceeds, suggesting that the native attraction-basin was broadened

Protein structure prediction using the improved energy function

We further investigated whether the improved energy function could facilitate protein structure prediction or not. For this aim, we compared the predicted structures by running Rosetta with different energy functions. Specifically, for each iteration step of BARS, we used the corresponding weighting scheme of energy terms to construct an energy function, and then run Rosetta using this energy function to generate 1000 decoys. We investigated two aspects of these predicted decoys: i)Quality of final prediction results: Among these decoys, a clustering procedure was executed and the centroid of the largest cluster was reported as the final prediction. Then we analyzed RMSD of the final prediction with the native structure. ii)Good decoy ratio: Among these generated decoys, we calculated the ratio of “good” decoys. Here, we adopted the widely-used criterion that for small proteins, a decoy is called good decoy if its RMSD to the native structure is less than 6 Å [13].

As shown in Fig. 4, the quality of the final prediction results improved as iteration proceeded. Taking protein 1iloA as an example, the final prediction had a RMSD of 2.7 Å when using the original weighting scheme (Fig. 5, left panel). In contrast, when using the optimized weighting scheme, the quality of the final prediction structure improved (RMSD: 1.3 Å, Fig. 5, right panel). We repeated this experiment on the 101 benchmark proteins in Test101 set and observed that for 82 proteins, the quality of final prediction structure improved (Fig. 6 and Additional file 1: Table S1). For example, the quality of predicted structure for protein 1pchA was low (RMSD: 11.721 Å) when using the original energy weight; in contrast, when using the optimized weighting scheme, the predicted structure significantly improved (RMSD: 3.417 Å).

Fig. 4
figure 4

Quality of the final prediction for protein 1ctfA, 1iloA, and 1iieA. At each iteration step, a total of 1000 decoys were generated by Rosetta with corresponding weights of energy terms. We run clustering procedure for the 1000 decoys and finally selected the centroid of the largest cluster as the best decoy. The RMSD of the best decoys reduces as iteration proceeds, suggesting that the quality of prediction results increases

Fig. 5
figure 5

Predicted structures for protein 1iloA. Left panel: the predicted structures using the initial weights (RMSD: 2.7 Å). Middle panel: native structure. Right panel: the predicted structures using the optimized weights (RMSD: 1.3 Å). Thus, the optimized weights help improve the quality of predicted structures

Fig. 6
figure 6

Quality of predicted structures using the original weights and the optimized weights of energy terms on 101 benchmark proteins of the Test101 dataset. For each benchmark protein, we predicted structures using both original weights and optimized weights of energy terms, and showed RMSD of the predicted structures as x-axis and y-axis, respectively. Most proteins fall below the diagonal line, suggesting that when using the optimized weights of energy terms, the predicted structures usually have smaller RMSD

Besides the quality of the final prediction, the good decoy ratio also increased considerably (Fig. 7). For example, if using the initial weights for protein 1iloA, only 11 decoys among the 1000 decoys were good decoys. In contrast, over 200 good decoys were generated when using the optimized weighting scheme. Taken together, these results clearly suggest that using the energy function constructed with BARS, the general Monte Carlo search procedure significantly improved in both success possibility and the quality of prediction results. This improvement should also be attributed to the broadened native attraction-basin.

Fig. 7
figure 7

Good decoy ratio increases as iteration proceeds. At each iteration step, a total of 1000 decoys were generated by Rosetta using corresponding weights of energy terms. Here a decoy is called “good decoy" if it has a RMSD less than 6 Å to the native structure. The figure suggests that the “good decoy ratio" significantly increases, e.g. the ratio increases from 0.01 to over 0.2 for protein 1iloA. Thus, Rosetta can generate high-quality decoys more efficiently

Application range of the constructed energy functions

Application range is one of the key issues of energy functions. An ideal energy function is expected to be applicable on a large amount of proteins rather than a single protein. To examine the application range of the energy functions acquired using BARS, we run Rosetta with the energy function acquired from protein 1ctfA on other seven benchmark proteins in the same class to 1ctfA. As shown in Table 4, on all of the seven proteins, the predicted structures have high quality (RMSD less than 7 Å). More importantly, on five out of the seven proteins, the prediction results using the energy function acquired from 1ctfA are much better than those predicted using the original weights.

Table 4 Quality of predicted structures using the original weights of energy terms and the weights acquired from protein 1ctfA

Similarly, we run Rosetta with the energy function acquired from protein 1iieA on other 20 benchmark proteins. As shown in Table 5, on 19 out of the 20 benchmark proteins, the predicted structures have RMSD less than 5 Å. On 18 out of these benchmark proteins, the predicted structures using the weights acquired from 1iieA are better than those predicted using the original weights.

Table 5 Quality of predicted structures using the original weights of energy terms and the weights acquired from protein 1iieA

In summary, these results demonstrate the wide application range of the energy functions acquired using the BARS framework. A reasonable explanation of this wide application range is that proteins in a class might share similar folding process; thus, the energy function optimized on a certain protein is also applicable for other proteins in the same class.

Conclusion

In this study we report the BARS framework for constructing effective energy functions. The framework attempts to improve energy function gradually such that the native attraction-basin was broadened. During this process, a reverse Monte Carlo sampling strategy was proposed to explore the native attraction-basin. Extensive experimental results demonstrate both effectiveness and wide application range of the constructed energy functions.

It has been reported that protein folding is a hierarchical process. According to this observation, Rosetta employs a multi-step prediction strategy. In particular, Rosetta first uses score function score0 with only hydrophobic core terms, then uses score2/score5 with secondary structure terms, and finally uses score3 to incorporate a total of 13 energy terms [11]. This study focuses on the optimization of weights for the third step. How to design effective energy functions for the first and second steps remains as one of our future works. We also noticed that on 19 out of the 101 benchmark proteins, the quality of prediction results using the optimized weighting scheme were low. How to design better energy functions for these proteins is another future work. It should be pointed out that the RMSD deviation of the 50 edge point conformations is large at iteration 5 for protein 1ctfA (Fig. 3). This might imply the irregular shape of the native attraction-basin. We will investigate this issue in future work.

The application of the BARS framework is not limited to protein structure prediction. Constructing an effective scoring function is usually the first important step for optimization problems in various domains such as RNA structure prediction, natural language processing, etc. How to optimally combine multiple terms into a scoring function is a challenging task. Our BARS framework should greatly facilitate designing effective scoring functions for a large variety of problems.