Introduction

Although the multi-stage nature of protein folding has been confirmed experimentally [1], experimental research into the structure of early stage intermediates remains scant. In silico models are thus needed to supply adequate starting structures for folding simulations. Several authors have recently presented experimentally determined structures which they claim to correspond to early stage intermediates [2].

Many traditional in silico structure prediction methods depend on a set of starting structures, which are subjected to energy minimization algorithms in the hope of arriving at the native form of the analyzed protein. Preparation of such starting structures depends on the model in question, with important differences separating Darwinian and Boltzmann-based approaches [3]. The former model relies on preserving structural similarities as the chain undergoes evolutionary, sequential changes, while the latter treats folding as a spontaneous process triggered by the chain’s natural propensity to seek out its global free energy minimum. Determining starting structures is much easier in the Darwinian approach due to the existence of protein homologues whose structure is well known. Pasting together structural motifs from sequentially similar fragments yields useful structures which can then be subjected to energy minimization. The Boltzmann approach is far more challenging and typically relies on Monte Carlo simulations to stochastically select Φ and Ψ angles for a given chain. Some Boltzmann algorithms depend on databases which contain data on short peptide fragments (tripeptides or 9-peptides, as in the Rosetta package [4]).

The model presented in this paper attempts to simulate early stage intermediates by referring to a limited conformational subspace established within the bounds of the Ramachandran plot [5]. The subspace assumes the form of an elliptical path whose shape and placement are the result of geometric analysis of the polypeptide chain (Fig. 1). The path traverses all fragments of the map which correspond to specific structural motifs. Assuming that the theoretical model is correct, generation of the early stage intermediate may take on two forms. In the step back procedure the crystalline form undergoes changes intended to reverse the folding process, replacing the values of Φ and Ψ with corresponding pairs of early-stage angles (Φe and Ψe) belonging to the limited conformational subspace. From an algorithmic point of view, each pair of dihedral angles {Φ, Ψ} is matched to the early-stage pair {Φe, Ψe} which lies on the elliptical path and is closest to the original pair (see [5–9] for a more detailed description of this process and refer to Fig. 1). Applying this mapping to a nonredundant set of proteins yields a distribution of dihedral angles along the path. Given this distribution, the limited conformational subspace is partitioned into seven sections, each centered upon a distinct local probability maximum, labeled A through G. This process establishes the structural alphabet for the early stage intermediate (Fig. 2) [10].

Fig. 1

The step-back procedure assigns to each pair of angles {Φ, Ψ} a corresponding pair {Φe, Ψe} which lies on the elliptical path. Letter codes (A, B, C, D, E, F and G) denote local probability peaks (Fig. 2). The derivation of the elliptical path which represents the limited conformational subspace is further explained in [5, 11]. The figure shows three sample pairs of angles and their subspace counterparts

Fig. 2

Sample distribution of probability for a randomly chosen amino acid (histidine). a continuous distribution obtained using the step back algorithm. b discrete distribution obtained using the step forward method based on the contingency table. t(deg) is the offset (in degrees) along the elliptical path, from an arbitrary starting point in the middle of the lower right-hand corner of the Ramachandran plot

Applying the above algorithm to a nonredundant set of proteins produces a contingency table which lists the relations between structural motifs and peptide sequences (in our case, the base sequence is a tetrapeptide fragment). Given the number of possible tetrapeptide sequences (20^4) and structural motifs (7^4), the size of the contingency table is 160,000 × 2401. Each cell lists the probability of encountering a specific structural motif for a given tetrapeptide sequence (see Table 1) [10]. The other possible approach, called the step forward model, applies the contingency table directly to assign structural motifs to tetrapeptide fragments within the limited conformational subspace. This paper compares both approaches in order to highlight the most common misconceptions associated with the step forward procedure. The step back algorithm is treated as a baseline when determining the scope of simulation errors and inaccuracies [11]. It should be noted that the limited conformational subspace is not, in itself, a novel concept, as theoretical and experimental considerations have led others to suggest similar approaches [12]. The presented derivation based on the geometric model of the polypeptide chain [11] is merely one of many attempts to establish such a subspace.
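
The table dimensions quoted above follow from the alphabet sizes: 20 amino acids and 7 structural codes, each taken over four positions. The following is a minimal sketch of that arithmetic. The indexing scheme (reading a tetrapeptide as a base-20 number and a code string as a base-7 number) and the names `index_of`, `AMINO_ACIDS` and `CODES` are illustrative assumptions, not the layout used by the authors.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 one-letter residue codes
CODES = "ABCDEFG"                     # 7 structural codes of the early stage alphabet
K = 4                                 # tetrapeptide length

n_rows = len(AMINO_ACIDS) ** K  # number of tetrapeptide sequences
n_cols = len(CODES) ** K        # number of four-letter structural code strings


def index_of(word, alphabet):
    """Hypothetical indexing: read `word` as a number in base len(alphabet)."""
    idx = 0
    for ch in word:
        idx = idx * len(alphabet) + alphabet.index(ch)
    return idx


# Row and column of one cell, e.g. tetrapeptide HHHH with structural codes CCCC.
row = index_of("HHHH", AMINO_ACIDS)
col = index_of("CCCC", CODES)
```

Most of the 160,000 × 2401 cells would be empty for any realistic dataset, so a sparse mapping keyed by (tetrapeptide, code string) may be a more practical representation than a dense array.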

Table 1 Representative fragment of the contingency table generated by the step back algorithm

The effectiveness of the presented early stage intermediate generation method is verified on the basis of a set of protein chains extracted from the Protein Data Bank (PDB). The presented study complements the outcomes of these simulations as a crucial step in the protein folding process [13]. The goal of the early stage analysis step, presented here, is to assess the effectiveness of the proposed model when applied to raw amino acid sequences. This paper shows where the model succeeds and where it fails; it also explains the reasons behind simulation failures (at either stage of the folding process). The main aim of the early stage model is twofold: to deliver structural forms for further energy minimization procedures (computational interpretation) and to deliver structural forms which mimic the initial steps of the folding process (biological interpretation).

Materials and methods

Data

The testing subset of 250 protein chains was chosen at random from a nonredundant set of protein structures from the PDB. The nonredundant set, which also served as the teaching set, was selected from the PDB on the basis of data obtained in December 2011, using the BLASTClust tool to retain protein sequences with sequence identity not higher than 95 %. The testing subset comprises 1 % of the whole nonredundant database of proteins. The teaching set did not contain the proteins belonging to the testing subset.

Early stage model

As highlighted above, the in silico folding model applied in our work can be divided into two stages: the early stage (ES) and the late stage (LS) (see [13] for a thorough description of the model). The early stage is simulated by adopting a limited conformational subspace, corresponding to an elliptical path on the Ramachandran plot. For more information on how this subspace is derived refer to [13]. An important property of the presented elliptical path is that it traverses all areas of the plot which correspond to well defined secondary structural motifs (Fig. 1).

The step back procedure

The step back procedure relies on translating each pair of dihedral angles {Φ, Ψ} into its corresponding “image” which lies on the elliptical path (limited conformational subspace), using the least-distances rule (Fig. 1). The angles comprising this image will hereafter be denoted as {Φe, Ψe}. Performing these computations for a nonredundant set of proteins yields probability profiles which indicate the likelihood of encountering specific pairs of angles along the elliptical path. A sample distribution (for a randomly chosen amino acid) is visualized in Fig. 2. For each amino acid seven distinct probability peaks can be distinguished; these peaks are assigned letter codes (A through G). It should be noted that code C corresponds to an α-helical structure, code E represents a β-sheet while code G stands for a left-handed helix. Codes A, B and D all represent poorly ordered structures traditionally referred to as random coils (RC).

The values of {Φ, Ψ} (as they occur in the actual protein) are replaced with {Φe, Ψe} pairs belonging to the limited conformational subspace. Subsequently each point in the subspace is matched to a local probability peak and assigned a letter code, as shown in Fig. 2 [10].
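
The least-distances rule can be illustrated with a short sketch that samples an ellipse on the Ramachandran plot and returns the sampled point nearest to a given {Φ, Ψ} pair. The ellipse parameters below (`CENTER`, `SEMI_AXES`, `TILT_DEG`) are placeholders chosen only so that the code runs; the actual path is derived in the cited work and is not reproduced here. Treating the angles as periodic when measuring distance, and the 1-degree sampling step, are likewise assumptions.

```python
import math

# Hypothetical ellipse parameters in degrees; NOT the values of the published path.
CENTER = (0.0, 0.0)
SEMI_AXES = (150.0, 60.0)
TILT_DEG = -45.0


def ellipse_point(t_deg):
    """Return (phi, psi) on the placeholder ellipse for path parameter t (degrees)."""
    t = math.radians(t_deg)
    rot = math.radians(TILT_DEG)
    x = SEMI_AXES[0] * math.cos(t)
    y = SEMI_AXES[1] * math.sin(t)
    phi = CENTER[0] + x * math.cos(rot) - y * math.sin(rot)
    psi = CENTER[1] + x * math.sin(rot) + y * math.cos(rot)
    return phi, psi


def angular_diff(a, b):
    """Smallest difference between two angles, respecting 360-degree periodicity."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def step_back(phi, psi, step_deg=1.0):
    """Return (t, phi_e, psi_e): the sampled path point closest to (phi, psi)."""
    best_t, best_d2 = 0.0, float("inf")
    n_samples = int(round(360.0 / step_deg))
    for i in range(n_samples):
        t = i * step_deg
        phi_e, psi_e = ellipse_point(t)
        d2 = angular_diff(phi, phi_e) ** 2 + angular_diff(psi, psi_e) ** 2
        if d2 < best_d2:
            best_t, best_d2 = t, d2
    phi_e, psi_e = ellipse_point(best_t)
    return best_t, phi_e, psi_e
```

Under this sketch, a letter code would then be assigned by finding which of the seven zones along the path contains the returned offset `t`; the zone boundaries depend on the probability profiles and are not specified here.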

Contingency table

The procedure described above, when applied to a large number of proteins (nonredundant database), yields a contingency table whose rows (160,000 in all) correspond to tetrapeptide sequences while columns (2401) represent various combinations of structural codes. The tetrapeptide was taken as the basic unit of structure as it is the shortest fragment to which a specific (secondary) structural motif can be unambiguously assigned. The contingency table expresses the correspondence between tetrapeptide sequences and structural motifs occurring in the early stage intermediate. It can be directly exploited in structural simulations of known peptide chains (Table 1). In this study, the contingency table was created from the nonredundant database of protein structures with the chains belonging to the testing subset excluded from the teaching database. The aim of this modification was to avoid positively biased structure predictions by keeping the teaching set separate from the testing set.

Assignment of structural codes is performed in an overlapping fashion, as shown in Fig. 3. For each input sequence a consensus structure can be defined by taking the most frequently occurring code at a given position in each of the four overlapping structural chains. If no code fulfills this criterion, consensus is based upon the highest probability values in the contingency table.
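
The consensus rule described above can be sketched as a per-position vote over the overlapping tetrapeptide windows. In this sketch, `table` is a hypothetical lookup that maps each tetrapeptide to its most probable four-letter code string together with that probability; the data structure and the function name are assumptions made for illustration, and the tie-breaking step is one possible reading of the fallback to the highest probability values.

```python
from collections import Counter


def consensus_codes(sequence, table):
    """Build a consensus code string from overlapping tetrapeptide predictions.

    `table` maps a tetrapeptide to (four-letter code string, probability).
    """
    n = len(sequence)
    if n < 4:
        raise ValueError("sequence must contain at least four residues")
    votes = [[] for _ in range(n)]  # (code, probability) pairs per position
    for start in range(n - 3):
        codes, prob = table[sequence[start:start + 4]]
        for offset, code in enumerate(codes):
            votes[start + offset].append((code, prob))
    result = []
    for position_votes in votes:
        counts = Counter(code for code, _ in position_votes)
        top = max(counts.values())
        tied = [code for code, count in counts.items() if count == top]
        if len(tied) == 1:
            result.append(tied[0])
        else:
            # No single most frequent code: fall back to the highest probability.
            best = max((v for v in position_votes if v[0] in tied),
                       key=lambda v: v[1])
            result.append(best[0])
    return "".join(result)


# Toy table with made-up probabilities, for illustration only.
toy_table = {
    "ACDE": ("CCCC", 0.5),
    "CDEF": ("CCEE", 0.3),
    "DEFG": ("EEEE", 0.4),
}
consensus = consensus_codes("ACDEFG", toy_table)
```

Residues near the chain termini are covered by fewer than four windows, so their codes rest on fewer votes than those of interior residues.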

Fig. 3

Assignment of structural codes to an input sequence of amino acids

Statistical analysis

Convergence assessment of both algorithms (step back and step forward) has been performed by using Chi-square testing as well as analysis of RR (relative risk) [14], OR (odds ratio) [14] and D (distance) [14] parameters. Discrepancies between various structural models can be explained by analyzing the dependencies between correct (or incorrect) simulation results and the involvement of individual residues in interaction with external molecules (e.g., ligands or other proteins). A sample table which expresses these dependencies is shown below (Table 2). External interaction is defined as an engagement of particular residue in ligand (protein, ion, nucleic acid) complexation. This identification is based on PDBSum standards (the distance criterion—distance below 4 Å) [15]. A Chi-square test has been applied to assess the dependencies listed above. Values of the Chi-square statistics indicate dependencies (p < 0.05), which are treated as effects of external interactions upon the conformation of a given amino acid. All relevant calculations were performed using the Statistica package [16].
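
For a 2 × 2 table of the kind described above (prediction outcome versus involvement in external interactions), the chi-square statistic, odds ratio and relative risk can be computed as in the following sketch. The cell labels `a`, `b`, `c`, `d`, the uncorrected Pearson form of the test and the Wald-type 95 % confidence intervals are assumptions; the paper reports values obtained with the Statistica package, whose exact settings are not stated. The D (distance) parameter is omitted because its definition is not given in this section.

```python
import math


def two_by_two_stats(a, b, c, d):
    """Statistics for a 2x2 table.

    a, b: incorrectly / correctly predicted residues involved in interactions
    c, d: incorrectly / correctly predicted residues not involved in interactions
    All four counts must be positive.
    """
    n = a + b + c + d
    # Pearson chi-square for a 2x2 table (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))

    odds_ratio = (a * d) / (b * c)
    relative_risk = (a / (a + b)) / (c / (c + d))

    # Wald-type 95 % confidence intervals computed on the log scale.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    z = 1.96
    ci_or = (math.exp(math.log(odds_ratio) - z * se_log_or),
             math.exp(math.log(odds_ratio) + z * se_log_or))
    ci_rr = (math.exp(math.log(relative_risk) - z * se_log_rr),
             math.exp(math.log(relative_risk) + z * se_log_rr))
    return {
        "chi2": chi2, "p": p_value,
        "OR": odds_ratio, "OR_CI": ci_or,
        "RR": relative_risk, "RR_CI": ci_rr,
    }


# Made-up counts, for illustration only.
stats = two_by_two_stats(30, 70, 10, 90)
```

A confidence interval for OR or RR that excludes 1 points in the same direction as a chi-square p-value below 0.05, although the two criteria need not agree exactly for borderline tables.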

Table 2 Sample grouping of data with respect to the involvement of residues in external interactions (with ligands and/or other proteins) and the validity of structural code predictions generated using the presented algorithms (e.g., NNY is the number of residues for which structural codes have been incorrectly predicted and which form bonds with external molecules)

Results and discussion

Table 3 presents a summarized assessment of the accuracy of step forward structural predictions, compared with step back simulations. Of note are the large values along the diagonal, which indicate a high ratio of correct predictions (except for code B). Figure 4 contains an equivalent graphical representation of this data. Secondary structural motifs (code C: α-helix; code E: β-structure) are correctly modeled around 55 % of the time. Note the high ratio of correct predictions for codes A, B and D despite their relative scarcity in actual proteins. Code F, traditionally associated with β-like motifs, is also modeled with adequate accuracy (approximately 48 %).

Table 3 Frequency of structural code predictions for algorithms based on the contingency table. The table lists the similarities and differences in results obtained using the step back (treated as gold standard) and step forward approaches, for the entire protein dataset, grouped by individual structural codes

Fig. 4

Comparison of results for the entire testing dataset. The figure lists the aggregate frequency of correct and erroneous predictions for all amino acids. a normalized versus step back (Table 3). b normalized versus the step forward (Table 4)

Figure 4a indicates overestimation of code C. Very good results are obtained for codes E and F, which represent β-structural forms that are quite difficult to predict. Code G is also very promising, although it can be erroneously classified as code A. Figure 4b reveals erroneous recognition of codes A and G, which appear highly entropic (in the sense of information entropy) and share a similar likelihood. A positive feature is the result for code D, which seems to play an important role as the zone linking the α-helical and β-structural zones on the Ramachandran map.

Table 3 reveals overestimation of C-type (α-helical) structures compared to all other structures. Of note is the relative abundance of non-secondary structural motifs. Codes F and G are modeled with high accuracy (24.82 % and 39.87 % respectively). Given the relatively low frequency of A- and D-type motifs, even the obtained prediction values of 4.61 % and 5.97 % (respectively) should be considered satisfactory. The accuracy of prediction for individual amino acids is listed in Tables 3, 4 and Fig. 4. Both diagrams confirm that α-helical motifs are excessively favored with respect to other types of structures.

Table 4 Frequency (in percentage value) of correct structural code predictions for algorithms based on the contingency table. The table lists the similarities and differences in results obtained using the step back (treated as golden standard) and step forward approaches for the entire protein data set, grouped by individual structural codes

Characteristics of individual amino acids

To search for possible specificity of particular amino acids, the failure cases were analyzed for each residue type individually. High prediction accuracy is noted for ASN in zone E and ASP in zone G. PHE exhibits affinity for zone F, which is an important observation, as, according to research presented in [17], this zone may be associated with amyloidogenesis. One should also note the peculiar properties of CYS, which result from the relatively broad structural variations in this amino acid. While GLY does not appear in zone G as frequently as might be expected, HIS seems correctly related to zone A. Finally, the results for zone D (for all amino acids) appear particularly promising. The importance of this zone for structural modeling has been noted in [18].

As can be seen in Fig. 5 and Table 5, almost all amino acids show the best predictability for structural code C (α-helix). The second best predicted structural code is E (β-structure). ASP and PHE are the exceptions: the former shows high predictability for code G (left-handed α-helix) and the latter for code F (traditionally treated as β-structure, although distinguishing between the E and F structural forms seems to be important).

Fig. 5

Amino acids predictability. a correct predictions, b false predictions

Table 5 Accuracy (expressed in percentage values) of structural code predictions for individual amino acids (NA means that the given amino acid has not been observed in a particular zone)

Individual case studies

In order to ascertain the reasons behind erroneous predictions the authors have performed accuracy analyses for specific proteins. Table 6 lists the best- and worst-case scenarios identified in the course of this study. Helical proteins are distinguished from proteins with differentiated secondary structure because helical fragments differ in predictability from all other secondary structural forms.

Table 6 Best- and worst-case scenarios from the point of view of structural accordance between step back and step forward predictions. The best results are further subdivided into entirely helical and non-helical structures

The well predicted proteins of mainly helical structure vary considerably in polypeptide chain length, which suggests that prediction accuracy does not depend on the size of the molecule. Surprisingly, the proteins with differentiated secondary structure (rather large proteins with β-structural motifs) were predicted with even higher quantitative accuracy than the helical structures. This observation makes the model appear quite promising.

The lowest predictability was obtained for small proteins with a high proportion of random coil structures (almost entirely unstructured proteins such as 3C05 or 2RQW) and for large proteins of entirely β-structural form.

Helical proteins

The protein 2BA2 (PDB code) is a sample helical protein for which both presented approaches provide consistent predictions. Its native three dimensional structure, as well as the outcomes of step-forward and step-back simulations, are visualized in Fig. 6.

Fig. 6

Three dimensional structures of the protein 2BA2 (PDB code) which exhibits the highest structural prediction consistency among purely helical proteins. a the native structure obtained from the PDB, b the step back model, c the step forward model. Fragments forming α-helixes and loops in native structure of the protein (determined using the DSSP algorithm) are marked in cyan and magenta respectively in all three images.

Proteins with differentiated secondary structure motifs

The protein 1PCZ (PDB code) is representative of a class of proteins which contain α-helical and β-sheet motifs. In spite of this fact, this protein exhibits relatively high prediction accuracy, as illustrated in Fig. 7. Of note is the satisfactory accuracy of β-structure prediction; traditionally a difficult task in ab initio algorithms [19]. Both models (step back and step forward) appear to correctly identify the location of loops.

Fig. 7

Three dimensional structures of protein 1PCZ (PDB code) which exhibits the highest structural prediction consistency from among proteins not dominated by helical motifs. a the native structure obtained from the PDB, b the step back model, c the step forward model. Fragments forming α-helixes, β-sheets and loops in native structure of the protein (determined using the DSSP algorithm) are marked in cyan, red and magenta respectively in all three images

Proteins with poor prediction consistency

Among the analyzed proteins, one of the poorest ES prediction consistencies was noted for the protein 2RQW (PDB code), which has diverse structural characteristics. Visual inspection reveals significant variations between theoretical models for this protein: step back predictions differ greatly from step forward simulations (Fig. 8). The reason for this failure lies in the way in which Φ and Ψ angles are projected onto the conformational subspace: the step back model seeks the nearest point along the elliptical path, while the step forward approach adopts the coordinates of the relevant probability maximum corresponding to a given structural code. Thus, while the step back algorithm can be characterized as continuous, the step forward model is inherently discrete (of course, additional differences may result from the somewhat arbitrary assignment of structural motifs to sequence fragments, as previously discussed).

Fig. 8

Three dimensional structures of the protein 2RQW (PDB code), for which the step forward predictions were the most inaccurate in the study group. a the native structure obtained from the PDB, b the step back model, c the step forward model. Fragments forming α-helixes, β-sheets and loops in native structure of the protein (determined using the DSSP algorithm) are marked in cyan, red and magenta respectively in all three images

Seeking the reasons behind the observed differences

In order to explain the reasons behind the mismatched predictions provided by both theoretical models the authors have focused on the involvement of amino acids in external interactions (other than short-range interactions with immediate neighbors and steric effects). Since the presence of external molecules may affect the resulting conformation of the polypeptide chain, it is worthwhile to assess the link between prediction accuracy and the involvement of specific residues in interactions with ligands, ions or other proteins. The following tables illustrate prediction accuracy as a function of such involvement.

Applying chi-square criteria to the values listed in Tables 7 and 8 reveals a statistically significant association between the presence (or absence) of external ligands and the accuracy of theoretical predictions. Although the relation between the status of a residue (engagement in any external interaction) and the accuracy of its prediction appears to be significant, this is not the case for engagement in ion binding or nucleic acid complexation. For residues engaged in ligand binding, all parameters used to measure the dependency indicate a strong relation between residue status and prediction accuracy. A dependence was also found for residues engaged in protein-protein interactions. However, the OR and RR analysis suggests that, besides ligand binding, ion complexation also influences the status of a residue with respect to the presented analysis. A surprising result is the lack of correlation between nucleic acid complexation and the structural predictability of a residue under the presented method.

Table 7 Summarized view of the effect of external interactions on incorrect predictions (total number of residues analyzed: 56,836). Each cell lists (respectively) the number of residues involved in protein complexation (P-P), ligand complexation (L), ion complexation (I), nucleic acid complexation (NA) and any form of complexation (ALL)

Table 8 Results of statistical analysis showing the values of the chi-square test, OR, RR and D. In addition to the value of each parameter, the 95 % confidence interval for OR and the 95 % confidence interval for RR are given. Protein-protein interaction (P-P), ligand binding (L), ion complexation (I) and nucleic acid complexation (NA) were taken into consideration. ALL represents engagement in any form of external complexation

Steric clashes

It should be noted that both step back and step forward prediction results involve steric clashes (i.e., the chains loop back upon themselves or are packed too tightly). A special algorithm has been devised to resolve such problems by adjusting the values of Φe and Ψe angles within the limits imposed by the partitioning of the elliptical path into structural zones. Adjustments are performed in a hierarchical fashion, starting with zones A, B and D, then proceeding to zones F and G. Zones C and E are not affected. The convergence criterion is that no two atoms in the molecule may be brought closer than 4 Å to each other.
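
The convergence criterion can be illustrated with a simple pairwise distance check. This sketch only detects clashes; it does not reproduce the hierarchical angle adjustment, whose details are not given here. It assumes one representative point per residue (for example the Cα atom) and skips pairs that are adjacent in sequence, since consecutive Cα atoms are separated by less than 4 Å; both choices, as well as the names `find_clashes` and `min_separation`, are assumptions made for illustration.

```python
import math
from itertools import combinations

CLASH_CUTOFF = 4.0  # angstroms, the threshold quoted in the text


def find_clashes(coords, min_separation=2, cutoff=CLASH_CUTOFF):
    """Return index pairs (i, j) of points closer than `cutoff`.

    `coords` is a list of (x, y, z) tuples ordered along the chain.
    Pairs with j - i < min_separation are ignored (hypothetical rule that
    excludes sequence neighbours from the check).
    """
    clashes = []
    for (i, p), (j, q) in combinations(enumerate(coords), 2):
        if j - i < min_separation:
            continue
        if math.dist(p, q) < cutoff:
            clashes.append((i, j))
    return clashes


# Toy coordinates: the fourth point folds back onto the second one.
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.0, 0.0)]
toy_clashes = find_clashes(toy_chain)
```

An adjustment loop built on such a check would repeat until the returned list is empty, modifying angles only in the zones that the text permits to change.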

Conclusions and Discussion

The goal of our analysis was to pinpoint the greatest problems associated with protein structure prediction. The model presented in this work is intended to avoid, as far as possible, a random search for the initial structural forms used in further energy optimization procedures. The model also avoids techniques based on pasting together short polypeptide fragments previously recognized as preferable for a particular fragment. Defining the “consensus” sequence of structural codes in the overlapping system smooths the structural elements without reference to particular examples identified in proteins available in crystal form. The level of accordance obtained with the presented technique seems satisfactory, assuming that the detailed definition of the final structure results from the late stage folding process, which is able to introduce local corrections to the structure defined in the early stage. The elimination of clashes (as they appear in structures generated according to the presented model) while preserving the structural code is aimed at verifying the model, particularly with respect to the defined structural zones. Introducing a random search for Φ and Ψ angles (outside the elliptical path) could deprive the model of its heuristic character.

The main idea of the model is to make it possible to trace the folding process, that is, to monitor the steps which introduce a mismatch so that its source and conditions can be recognized. The main question concerning protein structure prediction is not “How to predict the correct structure?” but rather “Why do proteins fold the way they do?”. Our model is an attempt to find the answer to this question.

Structural forms of early stage folding intermediates are not available, although descriptions of some rare cases can be found in the literature [2]. A mechanistic model mimicking the folding process therefore seems to be required. The introduction of a limited conformational subspace was postulated earlier [12]. The elliptical path model (constructed on the basis of backbone geometry) limits the size of the conformational space so as to balance the amount of information carried by the amino acid sequence with the amount of information sufficient to predict the structure of the early stage intermediate. It was shown that such a balance is achieved for the presented model [8, 11]. The accordance obtained for the presented model (slightly below 50 %) is within the range of expectations. The high specificity of biological function requires highly differentiated structural motifs. The classification used for the construction of the contingency table does not take into account the status of a particular residue with respect to its participation in biological activity. The elliptical path was generated assuming the relaxed form of the backbone. The balance between specificity and the general model may be comparable in range to the level of predictability of the presented model. The construction of the late steps is assumed to be strongly related to the construction of specific structural motifs related to function. The “late stage model” seems to work well together with the “early stage model” presented in this paper [13, 20]. Results shown in Tables 6 and 7 may suggest the important role of external factors which limit the conformational freedom of the backbone, as was assumed in the presented model. The construction of a contingency table for residues not engaged in any external interaction is planned. The influence of external factors (ligands, ions, nucleic acids and complexed proteins) was analyzed in detail in [21–23].

So far, the presented model has been used to predict ES intermediates for the following proteins: lysozyme [6], ribonuclease [8], hemoglobin [7] and BPTI [9]. Early stage structures (generated with the use of structural codes) were fed into late stage (LS) model calculations. Results suggest the validity of the resulting structures; however, the accuracy of the presented ES model is difficult to estimate due to the very low number of experimentally determined structures [2]. The ES results can be validated only after simulating the folding process on the basis of the LS model [20]. The LS model produces a structure which is easily comparable with the crystal structures deposited in the PDB [15].

In spite of the above, comparative analysis of ES structures in the serpin family suggests that such proteins do indeed possess structural components facilitating biological function [24]. An important advantage of the proposed model is that it introduces a clear division of RC secondary structural motifs into several categories referenced by letters A, B, D and F. This distinction enables better differentiation of secondary structural characteristics. Another positive aspect of our approach is its relatively high accuracy in modeling such structures.

The early stage intermediate generation algorithm discussed here is a direct counterpart of fold recognition methods as defined in the CASP4 nomenclature [19]. In addition to predicting secondary structural characteristics for specific fragments of the polypeptide chain, it also models loops (fragments connecting different structural motifs) by distinguishing zones A, B, D, F and G. A distinct advantage of the presented approach is that it enables clear identification of the reasons behind incorrect predictions, whereas for Monte Carlo methods such analysis can only be performed at the final stage of simulation and does not enable the researcher to identify the origin of errors. In the context of our algorithm, erroneous predictions in the LS intermediate are found to correspond to the presence of external ligands which distort the structure of the polypeptide chain [21–23]. Such interactions are not to be confused with natural folding preferences of the polypeptide backbone (Φ and Ψ angles). The work also presents the impact of external interactions on the folding process for specific types of amino acids.

The appearance of the early stage intermediate in the literature suggests increased interest in the step-wise folding process [25–33]. A one-step folding model, which assumes that the 3D structure is generated solely from a known amino acid sequence, was shown to be impossible on the basis of information theory [34]. Analysis of early-stage geometrical motifs present in crystal structures suggests the reliability of the presented model despite its relatively low, yet satisfactory, level of structural predictability [35–38].