1 Introduction

One of the long-standing challenges in computational biology is to fold proteins of given amino acid sequences into native functional three-dimensional structures of experimental accuracy. A reliable protein structure prediction method is urgently needed because it is far cheaper to sequence the entire genome of a species (<$10,000) [1] than to determine the structure of a single protein (~$100,000) [2]. As a result, the number of sequences generated from genome sequencing projects outpaces the growth of structures solved by experimental techniques by orders of magnitude. It is considered practically impossible to solve the structures of millions of proteins by experimental techniques, and the fact that not all protein structures can be solved by existing experimental techniques further exacerbates the challenge. For example, X-ray crystallography requires high-quality crystals that are not always possible to obtain, while the Nuclear Magnetic Resonance (NMR) technique is currently limited to small proteins.

The most influential event in the structure prediction community is the biennial CASP meeting (Critical Assessment of Structure Prediction techniques) [3]. At two-year intervals since 1994, sequences whose structures are soon to be solved are collected from structural biologists and distributed to computational biologists for prediction. Predicted structures are then compared to experimental solutions, and results from this comparison are reported at the biennial CASP meeting. The most effective structure prediction techniques highlighted by CASP include fragment-based assembly [4], profile and/or threading-based fold recognition [5–18], consensus and meta-server methods [12, 19–22], and template assembly [23]. While encouraging progress has been made, the overall pace of advancement since the first CASP remains slow [24]. The most successful techniques in structure prediction (e.g. ROSETTA [4] and TASSER [23]) appear to converge to a unified approach of mixing and matching known native structures either in whole (template-based modeling) or in part (fragment assembly) [24, 25]. The convergence of methods highlights the need for innovative techniques to break the impasse in protein structure prediction.

The CASP meeting has had a profound positive impact on the community by promoting the winners (the best predictors), regardless of the methods and databases employed. However, an unintended consequence of the performance-oriented evaluation is that it favors incremental changes to existing proven techniques that have been perfected over the years, rather than novel methods that are potentially game changing but not yet comparable in accuracy to the mature and proven techniques. It rewards the methods that employ the largest databases and supercomputing power and perform the relatively easier task of re-ranking models predicted by other methods, rather than the challenging task of structure prediction. The purpose of this review is to draw attention to alternative approaches in protein structure prediction with the hope of preventing their premature termination. To limit our scope, we will focus on recent trends and several emerging “ab initio” approaches that are not fragment based. Focusing on fragment-free approaches in this review is not an attempt to diminish the historical or future importance of the fragment-based approach but to stimulate new ideas to help solve this challenging problem.

2 Physics-based approaches

Most proteins fold into unique thermodynamically stable structures. The stability of the folded structures and the ability of proteins to perform a wide range of functional activities are determined by solvent-mediated physical interactions between the amino acid residues of the proteins. In principle, such physical interactions can be obtained by solving quantum mechanical equations. However, sufficiently accurate quantum–mechanical simulations of the large-scale motion of proteins are not yet possible because of the large number of complicated interactions in such systems (protein and water molecules). As a result, these interactions are usually approximated by empirical molecular mechanics force fields.

2.1 Molecular mechanics force fields

Molecular mechanics force fields are typically obtained by the combination of quantum mechanical calculations of small peptide fragments and empirical fitting of experimental data [26–28]. Earlier development of force fields focused on dynamics and free-energy simulations of proteins around their native conformations [29–31]. Direct ab initio folding simulations from random coils are hampered not only by the insufficient accuracy of molecular mechanics force fields but also by the astronomically large conformational space of polypeptide chains. Currently, typical molecular dynamics simulations last for a few hundred nanoseconds, compared to actual folding times ranging from microseconds to seconds. Thus, most folding studies in explicit water molecules are limited to small peptides or very small proteins [32, 33]. One milestone study was a microsecond folding simulation of the 36-residue villin headpiece starting from an unfolded conformation by Duan and Kollman [34]. Although the presence of water molecules can smooth the free-energy landscape [35], molecular dynamics simulations of low-resolution protein structures with explicit solvent models have had mixed outcomes, improving the structural accuracy for some proteins but not others [36–39]. In particular, a large-scale study of 75 proteins each with 729 near-native structures [40] indicates that molecular dynamics simulations with explicit solvent molecules started from near-native structures move further away from their respective native conformations. The results underscore the need for further improvement in the force fields and the approaches.

The performance with explicit water molecules described above does not justify the significant increase in computing time needed to include them. As a result, most studies in structure prediction employed simplified implicit solvation models (for reviews see e.g. [41–43]). While most studies are limited to short peptides and small proteins [32, 44–49], some successes for high-resolution ab initio predictions are noteworthy. Simmerling et al. [50], Pitera and Swope [51], and Duan et al. [52] all achieved high-resolution prediction of a 20-residue Trp-cage peptide with various versions of the AMBER force field and a generalized Born (GB) solvation model [53]. Duan et al. folded villin headpiece to less than 0.5 Å Cα-root-mean-squared distance (RMSD) from its native structure [54, 55]. Pande et al. folded villin headpiece to about 1.7 Å of the root-mean-squared of the inter-residue Cα–Cα distance matrix (dRMS) from its native structure [56] and further developed a method for automatically constructing Markov state models to capture the thermodynamics and kinetics of folding [57]. Duan et al. also reached 2.0 Å RMSD for both three-helix bundles of the 47-residue albumin binding domain and the 60-residue B domain of protein A (BdpA) [58], and 1.3 Å for a 28-residue designed alpha/beta protein (FSD) [59]. Figure 1 shows the best folded structure achieved during folding simulation when compared to the native structure of BdpA. The lowest-RMSD sampled conformations are less than 1.0 Å RMSD from the native structure. It should be pointed out that most of these are small helical proteins. Ab initio folding of proteins of mixed secondary structures and medium size remains a challenging endeavor. Nevertheless, the successful folding of small proteins to sub-angstrom Cα-RMSD by the ab initio approach is encouraging. It suggests that, with improved force fields, folding proteins to their native states with experimental accuracy should be possible in the not-too-distant future.
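The two accuracy measures quoted above differ in one practical respect: Cα-RMSD requires an optimal superposition of the model onto the native structure, whereas dRMS compares inter-residue distance matrices and needs no superposition. The following is a minimal sketch of dRMS, assuming Cα coordinates are supplied as plain (x, y, z) tuples in Å; the function name `drms` and the three-atom toy coordinates are illustrative and are not taken from the cited studies.

```python
import math
from itertools import combinations


def drms(coords_a, coords_b):
    """Root-mean-square difference between two C-alpha distance matrices.

    coords_a, coords_b: equal-length sequences of (x, y, z) positions in Å.
    No superposition is needed because only internal distances are compared.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("conformations must have the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms")
    sq_sum = 0.0
    n_pairs = 0
    # Loop over every unique residue pair (i < j).
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        sq_sum += (d_a - d_b) ** 2
        n_pairs += 1
    return math.sqrt(sq_sum / n_pairs)


# Toy example (hypothetical coordinates): a straight and a bent three-atom chain.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(round(drms(native, model), 3))
```

Because dRMS depends only on internal distances, it is unchanged by rigid translation or rotation of either structure, but it also cannot distinguish a structure from its mirror image.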

Fig. 1

Comparison between simulated structures (magenta) and NMR structure of BdpA (green). a The best folded structure with 0.8 Å RMSD (Cα only) from MD folding simulation of the truncated BdpA. b The best folded structure with 1.3 Å RMSD from the Replica Exchange MD of the full-length BdpA. Adapted from Fig. 2 of Ref. [58]

Recently, Dill and his coworkers [60] made a blind prediction of six CASP 7 targets based on AMBER 96 with an implicit GB/SA (Solvent Accessible surface area) model of solvation and a sampling technique called zipping and assembly [61]. They found that the accuracy of their method is about the average accuracy of other knowledge-based techniques. This is encouraging, considering that the method does not utilize any predicted secondary structures or fragments/templates from known protein structures. Their study will likely re-energize the physics-based approaches that were participants in early CASP experiments (e.g. [62–64]) and are currently overshadowed by knowledge-based or mixed approaches. However, in order to increase the competitive edge of the physics-based approach over a knowledge-based one, it is clear that there is a need for further optimization of physics-based force fields and/or solvation models. For example, Jagielska et al. showed that protein models can be refined closer to their native structures using an AMBER force field with optimized relative weights [65]. Krieger et al. [66] re-tuned AMBER parameters by minimizing the deviations from 50 high-resolution protein crystal structures. Lin et al. found that a hydrophobic potential of mean force is more useful than the commonly used solvent accessible surface area for native structure discrimination [67]. Progress has been made in the development of an efficient PB (Poisson-Boltzmann)/SA method that enabled MD simulations of proteins [68]. Because a force field–based approach relies on continuum solvent models to treat the solvation effect, the overall accuracy and effectiveness of the approach requires advancement in both. One area that may require additional effort is an efficient approach to treat the ionic effect, including an accurate model of salt bridges in proteins.

Most existing physics-based molecular mechanics force fields treat electrostatic interactions between atoms as interactions between a collection of fixed point charges. In reality, they are anisotropic and polarizable. As a result, there is a significant effort in the development of polarizable force fields [69–78]. Polarizability is handled by many different approaches including fluctuating charges, induced dipoles, Drude oscillators and distributed multipoles. Yet, despite the effort in development, applications of polarizable force fields are limited to their own validation and a few dynamics simulations of proteins [79]. As the development of polarizable force fields continues [79], their application to structure prediction (structure refinement, in particular) will likely commence soon.

2.2 Quantum mechanics and mixed QM/MM

A more fundamental approach is to treat atomic interactions quantum mechanically. Most existing applications of quantum mechanics (QM) to proteins are a hybrid approach in which QM and molecular mechanics (MM) are applied to treat different portions of a system (QM/MM) [80, 81]. Typically, a small portion of a system (e.g. the active site of an enzyme [82]) is treated quantum mechanically and is coupled to the remaining portion that is treated classically for efficient conformational sampling. Applications of QM to entire proteins became possible with the development of linear scaling techniques [83, 84] and were found to be useful for refining experimental structures [85–87]. In 2001, Liu et al. [88] demonstrated that it is possible to simulate, for 350 ps, a system where the entire protein crambin is represented on the semi-empirical quantum–mechanical level and water molecules are modeled at the MM level. The simulation of the protein crambin provides a more accurate description of structural detail than regular MM simulations, when compared to the high-resolution X-ray structure. Zhu et al. [89] further showed that the gas-phase and solution structures of non-natural beta- and mixed alpha/beta-peptides can be predicted by an approximate density functional method for peptides coupled with an MM model for the solvent. Renfrew also found that quantum mechanics allows a more accurate placement of side chains [90]. More recently, a new approach was proposed where valence and core electrons are treated at the QM and MM levels, respectively [91–93]. The resulting X-Pol model has been used to simulate the protein BPTI in water for 50 ps. These studies highlight the potential utility of QM/MM in protein structure prediction as computing power further improves. These ab initio physics-based approaches, however, are several orders of magnitude slower than molecular dynamics based on molecular mechanics force fields. They may prevail one day when GPU (Graphics processing unit) parallel processing [94–96] and specific hardware for molecular dynamics simulations [97] become mature techniques accessible to most researchers.

One of the most successful applications of quantum calculations to protein structures is their ability to make highly accurate structure prediction from NMR chemical shifts [98–100]. Several groups have achieved a 2.0 Å or better resolution for predicted protein structures by employing fragment-based structure prediction techniques with NMR chemical shifts as the only experimental restraints [101–107].

3 Knowledge-based potentials

While purely physics-based approaches may have the potential to achieve accurate protein structure prediction in the future, it makes practical sense to take advantage of known sequence and structural information, as appropriate, for aiding protein structure prediction. Knowledge-based information can be employed to derive restraints in order to achieve a significant reduction in the conformational sampling space; knowledge-based (free) energy functions have been applied rather successfully to discriminate the native conformations from non-native ones. Here, we will limit our discussion to all-atom knowledge-based energy functions because they are required for high-resolution structure prediction and are usually more accurate than residue-level knowledge-based energy functions.

3.1 All-atom distance-dependent potentials

A knowledge-based or statistical energy function is obtained directly from statistical analysis of known experimental protein structures [108, 109]. Unlike physics-based energy functions, an all-atom statistical energy function is a potential of mean force and, thus, allows direct and efficient evaluation of the free energy involved in folding and binding of proteins. Developing distance-dependent statistical energy functions at the atomic level is a relatively new, under-explored approach, compared to distance-dependent all-atom physics-based force fields [26–28]. Although the residue-level distance-dependent potential was developed by Sippl in 1990 [110], the first all-atom distance-dependent statistical potential was not obtained until 1998 by Samudrala and Moult [111]. Only a few more have been developed since [112–120].

Different statistical energy functions differ in the reference states employed to estimate the expected number of atomic pairs at a given distance in the absence of any interaction. Samudrala and Moult used a conditional probability function [111], while Lu and Skolnick employed a quasi-chemical approximation [113]. The common approximation behind the two methods is the “uniform density” reference state [108] that statistically averages over the observed state for the distance dependence [110]. Zhou and Zhou proposed to employ uniformly distributed points in a finite-size sphere for the reference state (Distance-scaled Finite Ideal-gas REference state, DFIRE) [114], which led to an approximate analytical expression for the distance dependence. Shen et al. further refined the analytical expression to account for varied protein sizes, which led to the DOPE (Discrete Optimized Potential Energy) energy function [115]. Cheng employed a free-rotating and self-avoiding chain model as the reference state to account for the effect of the covalently bonded backbone [120]. The difference between these two new techniques and DFIRE is typically small [120–122].
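The role of the reference state can be made concrete with the generic inverse-Boltzmann relation u(r) = −kT ln[N_obs(r)/N_exp(r)], where N_exp(r) is the pair count expected without interactions. The sketch below assumes a power-law reference state normalized at the cutoff bin, N_exp(r) = N_obs(r_cut)(r/r_cut)^α, with equal-width distance bins. The exponent `alpha`, the function name and the toy counts are hypothetical illustration parameters; the fitted values used by the published potentials are not given in this review. An exponent of 2 corresponds to non-interacting points in an infinite uniform system, where the shell volume grows as r².

```python
import math


def pair_potential(counts, radii, alpha, kT=1.0):
    """Inverse-Boltzmann potential for one atom-type pair.

    counts: observed pair counts per distance bin (equal bin widths assumed)
    radii:  bin mid-point distances in Å, ascending; the last bin is the cutoff
    alpha:  exponent of the assumed power-law reference state
    kT:     energy unit (results are returned in units of kT when kT=1)
    """
    if len(counts) != len(radii) or not counts:
        raise ValueError("counts and radii must be non-empty and equal in length")
    r_cut, n_cut = radii[-1], counts[-1]
    if n_cut <= 0:
        raise ValueError("the cutoff bin must contain observations")
    energies = []
    for n_obs, r in zip(counts, radii):
        # Expected count if the pair did not interact (reference state).
        n_exp = n_cut * (r / r_cut) ** alpha
        if n_obs <= 0:
            energies.append(float("inf"))  # never observed: treated as forbidden
        else:
            energies.append(-kT * math.log(n_obs / n_exp))
    return energies


# Toy counts: the 4 Å bin holds twice the pairs expected for alpha = 2,
# so it receives a favorable energy of -ln(2) kT; the other bins are neutral.
radii = [2.0, 4.0, 6.0, 8.0]
counts = [4, 32, 36, 64]
print(pair_potential(counts, radii, alpha=2.0))
```

With this normalization the energy is zero at the cutoff by construction, so changing the reference state (here, `alpha`) shifts the relative weight of short-range and long-range bins.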

The relatively slow development of all-atom knowledge-based energy functions is largely because a statistical energy function is not considered to be theoretically rigorous [123–125] and is thought to be useful for coarse-grained models only. Moreover, an all-atom statistical potential is often suspected to be less reliable than an all-atom physics-based energy function. However, all-atom statistical energy functions have been found to be comparable to, or more accurate than, physics-based energy functions in loop selections [126], restoring partially denatured segments with secondary structures [127], and refining near-native structures [128]. In restoring partially denatured segments [127], both explicit and implicit solvation physics-based force fields were less successful than the DFIRE energy function [114] together with a clustering method. Moreover, specific interactions obtained from a statistical approach are directly comparable to quantum calculations. Morozov et al. [129] showed an excellent agreement between a statistical hydrogen-bonding potential and quantum mechanical calculations. Gillis et al. [130] illustrated that statistical descriptions of cation–π and amino–π interactions have a significant correlation with quantum calculations at the Hartree–Fock and the second-order Møller–Plesset perturbation theory levels. The correlation coefficient is 0.96. By comparison, the correlation coefficient between quantum calculations and the results from the physics-based energy function CHARMM [27] is 0.89. In addition, Zhou et al. showed that a DFIRE-based statistical potential has some characteristics of a physics-based energy function in terms of database independence and transferability [131–134]. These studies indicate that statistical energy functions are a valuable counterpart to physics-based energy functions, even at the detailed atomic level. Thus, all-atom knowledge-based energy functions will likely play increasingly active roles in structure prediction beyond ranking decoy structures. For example, Yang and Zhou employed an improved version of DFIRE (DFIRE 2.0) based on finer grids for ab initio folding of terminal segments with secondary structures [122].

3.2 All-atom orientation-dependent potentials

Specific folding and binding of proteins rely on specific interactions. Evidence is abundant that many interactions are more specific and orientation dependent than what is described by existing statistical energy functions. The most well-studied specific interaction for protein folding is the hydrogen-bonding interaction [135]. The hydrogen-bonding interaction is commonly described as an individual physical or statistical term in many empirical functions for proteins (e.g. Refs. [23, 136–138]). However, hydrogen bonding is only a special case of polar–polar interaction. The interaction between polar atoms that are not hydrogen-bonded should be orientation dependent as well. There is evidence that this orientation dependence plays an important role in the formation of α-helices and β-sheets [139–142]. Additionally, the interaction between polar and non-polar atoms is likely orientation dependent because the hydrophobic effect is caused by the re-orientation of water molecules (polar atoms) near a hydrophobic surface [143]. The orientation dependence described above is part of the physics-based approach through electrostatic interactions, but is not yet accounted for by statistical energy functions. Recent advances in statistical orientation-dependent potentials focused on coarse-grained models [130, 144–147], rather than a systematic treatment of polar interactions on an atomic level.

Recently, Yang and Zhou introduced a dipolar DFIRE (dDFIRE) that treats polar atoms separately from non-polar atoms [148]. In this method, each polar atom is no longer approximated as a point but as a point with a direction. The directions of polar atoms are defined by the covalent bond vectors between heavy atoms. If a polar atom (e.g. main chain oxygen) is bonded with only one heavy atom, the direction of the polar atom is determined by the bond vector. If a polar atom (e.g. main chain nitrogen) is bonded with two heavy atoms, the direction of the polar atom is determined by the sum of the two bond vectors. Polar atoms bonded with three heavy atoms (e.g. backbone nitrogen of residue proline) are approximated as non-polar atoms. Figure 2 displays all defined directions of polar atoms in the 20 amino acid residues. Once the directions of polar atoms are defined, orientation-dependent polar interactions can be extracted from known protein structures based on the distance and orientation angles of physical interactions of dipoles. Application of the dDFIRE energy function to ab initio refolding of protein terminal segments with secondary structure elements indicates that hydrogen-bonded interactions alone are not enough to make high-resolution prediction of segment structures with secondary structure elements [148]. Specific interactions between polar atoms and between polar and non-polar atoms all contribute significantly to the prediction accuracy of the structure of a terminal segment. An all-atom orientation-dependent knowledge-based energy function has also been extracted with a rigid block approximation in the absence of distance dependence and found to be useful for side chain modeling [149–151].
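The rule for assigning a direction to a polar atom can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the published implementation: the function name `polar_direction` is hypothetical, each bond vector is assumed to point from the bonded heavy atom toward the polar atom (the text above does not specify the sign convention), and the result is normalized to unit length for convenience.

```python
import math


def polar_direction(atom, heavy_neighbors):
    """Unit direction of a polar atom, or None if it is treated as non-polar.

    atom:            (x, y, z) position of the polar atom
    heavy_neighbors: positions of the covalently bonded heavy atoms
    One neighbor:    direction is the single bond vector.
    Two neighbors:   direction is the sum of the two bond vectors.
    Otherwise:       no direction (e.g. three neighbors, as for proline N).
    """
    if len(heavy_neighbors) not in (1, 2):
        return None
    total = [0.0, 0.0, 0.0]
    for nb in heavy_neighbors:
        for k in range(3):
            # Assumed sign convention: vector from neighbor to polar atom.
            total[k] += atom[k] - nb[k]
    norm = math.sqrt(sum(c * c for c in total))
    if norm == 0.0:
        return None  # degenerate geometry: bond vectors cancel exactly
    return tuple(c / norm for c in total)


# Carbonyl-like oxygen with one bonded heavy atom (hypothetical coordinates).
print(polar_direction((0.0, 0.0, 1.2), [(0.0, 0.0, 0.0)]))
# Amide-like nitrogen with two bonded heavy atoms (hypothetical coordinates).
print(polar_direction((0.0, 0.0, 0.0), [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0)]))
```

Given such directions, the angles between each direction and the inter-atomic vector can be binned together with the distance when collecting statistics from known structures.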

Fig. 2

Directions of all polar atoms for the main chain (top left) and the side chains of all amino acid residues. One diagram, sometimes, shows several residues with similar side chain structures for polar atoms (e.g. –OH/SH group in Thr, Ser, Cys and Tyr)

4 Conformational sampling

In addition to the lack of an accurate energy function, another bottleneck facing protein structure prediction is conformational space sampling [152]. This can be illustrated by the fact that from CASP 6 to CASP 8, some reasonable predictions were made for free-modeling targets with fewer than 100 residues but none for proteins with more than 100 residues [153]. Because several review articles provided an excellent overview of conformational sampling techniques [154–159] and a comprehensive review would require a separate article, we will only highlight a few newly developed sampling techniques that were implemented for protein folding and/or structure prediction. In particular, we will not discuss coarse-grained models [160–162] in this review as they have become a commonly used tool for speeding up sampling.

4.1 Barrier crossing/flattening techniques

Efficient sampling of protein conformational space is challenging because the energy landscape of proteins has numerous barriers that prevent proteins from moving freely from one conformational state to another. How to efficiently cross these energy barriers is the aim of many sampling techniques. They can be generally classified into methods modifying the potential energy landscape such as umbrella sampling [163] and accelerated molecular dynamics [164, 165], methods employing a generalized ensemble of the system (multiple copies) such as replica exchange [166] and parallel tempering [167], and combinations of the two techniques such as simulated tempering [168, 169]. These three approaches have been substantially improved and/or implemented for protein structure prediction and folding in recent studies [154–157, 159]. A Grow-to-Fit method that reduces energy barriers due to side chain packing has been developed for the assignment of protein side chains using molecular mechanics force fields [170]. Among more recent examples, an improved accelerated molecular dynamics method [171] demonstrated fast folding of Trp-cage and Trpzip2 [44, 172]. In this method, the energy surface is flattened to accelerate the barrier crossing process. Significantly faster convergence of the thermodynamic properties of Trpzip2 [173] was observed by coupling replica exchange simulations to a non-Boltzmann structure reservoir generated from a high-temperature simulation [174, 175]. Replica exchange simulations were optimized by replica quenching [176] and by reconstructing replica flow in the temperature ladder from first passage time [177]. Replica exchange simulations have also been combined with specific biased potentials such as a hydrogen-bonding bias potential [178], repulsive and side chain interactions [179] and a backbone-biased potential [180] for enhanced sampling. Enhanced sampling was also achieved by adaptive sampling of networks called Markov State Models [181]. Iteratively generating bias potentials targeting the density of states has been shown to enhance the sampling of Go-type models [182, 183]. Similar to replica exchange, a forced random walk in temperature space allows a single simulation trajectory to traverse a predetermined range of temperature to achieve accelerated sampling in MD simulations of small proteins with explicit solvent [184, 185]. A method has been proposed in which the simulation is initially performed at high temperature to sample the conformational space, which is then divided into smaller regions within which subsequent room-temperature simulations are performed [186, 187]. Quick convergence was also demonstrated by coupling the replica exchange method with a general bias potential that does not correlate with the native protein structure [188–190] and by performing an orthogonal space random walk [191]. Applications of these novel techniques are mostly limited to molecular mechanics force field simulations of peptides and/or a few small proteins, and a comprehensive comparison between different techniques is not yet available. Their effectiveness for larger proteins of realistic size and with knowledge-based energy functions is not known.
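To make the replica exchange idea concrete, the sketch below shows the standard Metropolis criterion for swapping the configurations of two replicas held at different temperatures: the swap is accepted with probability min{1, exp[(β_i − β_j)(E_i − E_j)]}, where β = 1/(k_B T). This is the textbook temperature-exchange rule and not the specific variant of any study cited above; the function names and the example energies are illustrative assumptions.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def swap_probability(energy_i, temp_i, energy_j, temp_j, k_b=K_B):
    """Metropolis acceptance probability for exchanging two replicas.

    energy_*: potential energies of the current configurations (kcal/mol)
    temp_*:   replica temperatures (K)
    """
    beta_i = 1.0 / (k_b * temp_i)
    beta_j = 1.0 / (k_b * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    # Always accept when the swap is favorable; otherwise accept with exp(delta).
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swap(energy_i, temp_i, energy_j, temp_j, rng=random):
    """Return True if the swap is accepted for one random draw."""
    return rng.random() < swap_probability(energy_i, temp_i, energy_j, temp_j)


# Hypothetical energies: the colder replica (300 K) holds the higher-energy
# configuration, so handing it the lower-energy one is always accepted.
print(swap_probability(-90.0, 300.0, -100.0, 350.0))
# The reverse situation is accepted only occasionally.
print(swap_probability(-100.0, 300.0, -90.0, 350.0))
```

The acceptance probability decays with the product of the inverse-temperature gap and the energy gap, which is why neighboring temperatures in the ladder must be spaced closely enough for their energy distributions to overlap.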

4.2 Local-guided/biased sampling

Another method to increase sampling efficiency is to restrict the conformational space to be sampled. The fragment-based approach was introduced as a technique to reduce the conformational space by focusing on sampling of known native local structures only. However, it has been found challenging to recognize structurally similar fragments or templates from a prebuilt structure/fragment library [25] because these structures are built using a preset threshold of structural or sequence similarity. As a result, these structures are similar but not identical to the structure of interest. Somewhat random imperfections in these fragments/templates make it difficult to design a universal energy function to recognize them and to make a correct assembly despite their imperfections. This adds more demands to the grand challenge of developing an accurate energy function for protein folding and structure prediction [192]. In addition, fragment rigidity may make it difficult to reach near-native structures for some proteins. Indeed, Hegler et al. found that under the same energy function, fragment-based sampling of larger proteins (>70 residues) encounters kinetic limitation that is not seen in unrestrictive molecular dynamics [193]. Kim et al. further showed that sampling is often limited by the inability to sample rarely occurring torsion angles of a few residues [194].

One approach to conformational sampling is to guide it by hierarchical folding pathways. Ozkan et al. predicted structures by zipping (local folding) and assembly [61]. This method involves independent folding of local structures and growth (zip) or coalescence (assemble) of these structures with other structures, and it achieved encouraging results in CASP [60]. DeBartolo et al. fixed secondary structure iteratively during Monte Carlo folding simulations [195] and further improved the technique with multiple sequence alignment for torsion angle sampling distribution with DOPE and other empirical energy functions including a collapse term [187]. For a benchmark of 12 small proteins, their method achieved higher accuracy for secondary structure prediction than sequence-based prediction, and the accuracy of their tertiary structure prediction is within 6 Å for 8 of 12 proteins [196]. Brunette and Brock proposed a model-based search that guides the sampling with partially folded models during simulations with the Rosetta energy function [197]. The proposed method did sample lower energy conformations than the simple Monte Carlo technique in Rosetta. However, the test is quite limited because in the absence of homologous structural fragments, both the proposed method and Rosetta performed poorly for 29 out of 32 test proteins, perhaps due to limited sampling in their experiments on homolog-free structure prediction.

A similar approach employs locally biased sampling. Hegler et al. showed improved sampling by a local energy term that is derived from local fragment sequence alignment and tested their technique in CASP 8 [193]. Chen et al. developed a move set for protein folding based on statistical knowledge of torsion angles [198]. Their test is limited to a native-contact biased model. Yang and Liu improved protein sampling by a genetic algorithm in discrete backbone dihedral angle space [199]. Zhao et al. sampled the backbone via local biases from a probabilistic, conditional random/neural fields model of the relation between protein sequences and backbone structures [200–202]. The performance of their method on CASP 8 targets is on a par with that of the other best predictors. Similarly, Boomsma et al. [203, 204] proposed a generative, probabilistic model for local structure sampling. Testing of the technique was limited to the ability to sample near-native conformations.

To summarize, the above studies on local-guided/biased sampling suggested significant potential. However, large-scale benchmark tests and optimized integration with a suitable energy function with an all-atom model for final packing are needed to further improve or confirm the accuracy of protein structure prediction.

4.3 Secondary structure and torsion angle restraints

Another approach for reducing conformational space is to employ predicted secondary structures (e.g. [4, 205–209]). However, predicted secondary structure is often represented by coarse-grained three states of helices, coils and strands because the accuracy of predicting more than three states is too low to be useful [210]. Restraints based on predicted secondary structures are limited to ideal shapes of helical and strand residues only because coil residues do not have a well-defined structure.

One way to avoid the limitation of predicted secondary structures is to predict backbone torsion angles. However, multistate torsion angles are as difficult as secondary structure to predict [211–215]. For example, Zimmermann and Hansmann [216] obtained a three-state prediction accuracy of 79%, the same level of accuracy as for secondary structure prediction [217]. Recently, Zhou et al. demonstrated that real-value backbone torsion angles could be predicted with reasonable accuracy [218–220]. One limitation of direct real-value angle prediction is that many predicted angles are located in sterically prohibited regions. This limitation was remedied by combining the advantage of multistate prediction (avoiding prohibited regions) with that of real-value prediction (continuous representation) [221]. This was done by making a two-state peak prediction first and then predicting the deviation from the predicted peak. The final method (SPINE XI) further refines the prediction by a conditional random field model and leads to an accurate prediction of real-value torsion angles that is close to the accuracy of angles derived from NMR chemical shifts with the methods TALOS [222] and TOPOS [107]. Multistate prediction derived from predicted real values by SPINE XI is even more accurate than predicted states from those methods dedicated to multistate prediction. For example, the three-state prediction accuracy based on a five-residue block of 8 consecutive torsion angles, as defined by the multistate predictor LOCUSTRA, is 81% by SPINE XI and 79% by LOCUSTRA [216].

Predicted real values of torsion angles serve as significantly more powerful restraints for fragment-free protein structure prediction than predicted secondary structure. Using a benchmark of 16 proteins and defining success as the ability to sample a structure with less than 6 Å RMSD from the native structure within the top 15 predicted structures, Faraggi et al. [221] showed that the number of successful cases increases from 6 with predicted secondary structure as restraints, to 10 with predicted real-value torsion angles as restraints for helical and strand residues only, and to 12 with predicted real-value torsion angles as restraints for all residues. The median RMSD values for these three cases are 6.3, 5.4 and 4.3 Å, respectively. Here, torsion angles are not restrained if they fall within the predicted error range, are restrained harmonically if they fall outside the predicted range but within twice that range, and are subjected to a constant penalty beyond twice the predicted range. This result demonstrates the importance of real-value prediction (a 67% increase in success rate and a 14% reduction in the median RMSD value) and of coil residue restraints (a further 20% increase in success rate and a 20% reduction in the median RMSD value) in structure prediction. In Fig. 3, the case of the SH3 domain protein (PDB ID: 1shf) is given to illustrate the importance of real-value torsion angles for sampling non-ideal beta strands.
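The three-regime restraint described above can be sketched as a piecewise penalty. This is a minimal illustration under stated assumptions, not the functional form used in [221]: the force constant `k`, the choice to measure the harmonic term from the edge of the error range, and the choice of the constant penalty so that the function is continuous at twice the error range are all assumptions made here.

```python
def angular_difference(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def torsion_restraint(angle, predicted, error, k=1.0):
    """Penalty for a torsion angle given its predicted value and predicted error range.

    k is a hypothetical force constant; angles and error are in degrees.
    """
    d = abs(angular_difference(angle, predicted))
    if d <= error:
        # Within the predicted error range: no restraint.
        return 0.0
    if d <= 2.0 * error:
        # Between one and two error ranges: harmonic in the excess deviation
        # (assumed form, measured from the edge of the error range).
        return k * (d - error) ** 2
    # Beyond twice the error range: constant penalty, set to the value of the
    # harmonic term at d = 2 * error so the function stays continuous (assumption).
    return k * error ** 2
```

With a predicted angle of -60 degrees and an error range of 20 degrees, a sampled angle of -50 degrees incurs no penalty, -30 degrees incurs `k * 10**2`, and 120 degrees incurs the capped value `k * 20**2`. Capping the penalty means that a badly mispredicted angle cannot dominate the total energy, which is one plausible motivation for the constant regime.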

Fig. 3

The best structures among the top 15 predicted structures obtained with predicted secondary structure as restraints (8.5 Å RMSD, a), with predicted real-value torsion angles for strand residues as restraints (6.9 Å RMSD, b), and with predicted real-value torsion angles for all residues as restraints (3.1 Å RMSD, c) are compared to the native structure (d) for the SH3 domain protein (PDB ID: 1shf). It is clear that only real-value prediction allows the sampling of bent strand conformations.

5 Summary and outlook

Some progress has been made towards ab initio prediction of protein structure by physics-based force fields. The progress, however, is limited to a few small helical or mixed helical and strand proteins. With intensive development of next-generation force fields and advances in computing power, there is hope that physics-based methods may emerge as a powerful tool for structure prediction. Meanwhile, the lack of progress in knowledge-based approaches for template-free modeling calls for fresh ideas. This review describes several trends in the recent literature: the development of physical, polarizable force fields and of specific, orientation-dependent, all-atom statistical energy functions, and the smoothing or reduction of sampling space via improved sampling techniques and local bias or restraints.

One noticeable trend is the increased use of molecular force fields coupled with solvation free energy for scoring or ranking near-native conformations generated from conformational sampling. This approach, however, neglects the contribution of entropy (dynamic motions) in stabilizing native conformations of proteins because typical molecular force fields characterize the energy rather than free-energy surface of proteins. A more effective scoring function would require re-training all force field parameters (van der Waals parameters and partial charges) to mimic the free-energy surface and allow a more accurate account of the effect of atomic movement on solvent dielectric [223, 224].

In summary, recent studies suggest that it is possible to reach near-native structures without borrowing native fragments or templates from other proteins. Although a fully flexible conformational search is one or more orders of magnitude slower than a rigid fragment-based search, it has the potential to reach the more accurate, high-resolution structures needed for function prediction and analysis. This fully flexible, continuous sampling approach, coupled with more specific and accurate energy functions, will likely lead to the next generation of methods in structure prediction.