Statistical dictionaries for hypothetical in silico model of the early-stage intermediate in protein folding

The polypeptide chain folding process appears to be a multi-stage phenomenon. The scientific community has recently devoted much attention to early stages of this process, with numerous attempts at simulating them—either experimentally or in silico. This paper presents a comparative analysis of the predicted and observed results of folding simulations. The proposed technique, based on statistical dictionaries, yields a global accuracy of 57 %—a marked improvement over older approaches (with an accuracy of approximately 46 %).


Introduction
Ab initio protein structure prediction methods (new fold, Boltzmann-based) [1] strongly depend on initial structures. Optimization algorithms tend to produce conformations which either match or closely approach local minima instead of the protein's native form. Some progress in this regard can be observed by tracking the outcome of the CASP competition (http://www.predictioncenter.org). Experimental analysis indicates that protein folding involves multiple stages [2][3][4][5][6][7][8] and this observation is further reinforced by in silico models [9,10]. The analysis presented in this work assumes a two-stage process [11][12][13][14]. We will focus on the so-called Early Stage (ES) intermediate whose structure can be derived on the basis of a limited conformational subspace, restricting the allowed set of (u, w) angle pairs to an elliptical path on the Ramachandran plot. The rationale behind this restriction is extensively discussed in [15][16][17][18][19][20][21][22] and has been stipulated for many years [23].

Early stage model (ES)
The ES model assumes that the initial conformation of the polypeptide chain can be predicted on the basis of its backbone, neglecting side chain contributions. In our model the ES intermediate is expected to conform to the previously mentioned limited conformational subspace [12,15,16]. This subspace is represented by an elliptical path which traverses areas corresponding to well defined secondary structural motifs on the Ramachandran plot. Its shape and placement follow from analysis of the chain's backbone structure, expressed using pairs of V-angles, i.e. angles between planes corresponding to two adjacent peptide bonds. This second-order function delineates a path along which the curvature radius matches observed values (Fig. 1).
If each observed pair of (u, w) angles is projected onto the limited subspace using the minimum distance criterion, the distribution of the resulting pairs (u e , w e ) can be shown to exhibit seven distinct maxima (Fig. 2). The areas corresponding to each local maximum can be translated into a structural code, resulting in a structural alphabet which consists of seven letters (A-G). This alphabet enables us to express the predicted structure of the ES intermediate with the precision of limited conformational sub-space.

ES structure prediction
Once the structure of the polypeptide chain (as given by PDB) is denoted using the structural codes discussed above, it becomes possible to study the relation between residue sequences and structural codes. This relation can be expressed as a contingency table in which each sequence of amino acids corresponds to a given code with specific probability. Contingency tables can be used to predict the structure of input sequences. While constructing our structural alphabet we have applied the greatest probability criterion and selected tetrapeptide fragments as the basis of our contingency tables.
As already indicated, the ES intermediate structure can be predicted to within the nearest maximum of the limited conformational subspace. Further analysis based on information theory principles indicates that the quantity of information required to make this prediction corresponds closely to the quantity of information which is present in the polypeptide chain itself [16]. The accuracy of structural predictions based on tetrapeptide fragments and contingency tables has been discussed in [24]. In this paper we present a different code selection method, based on statistical dictionaries which permit us to take into account longer input sequences.

Statistical dictionaries
The newly implemented early-stage secondary structure prediction method is based on statistical dictionaries: we have assembled a dictionary of primary substrings and their corresponding secondary structures. In general, dictionary methods use a large set of items-words, translations, sequences of symbols etc. These methods are applied in many domains: text translation (the dictionary contains a number of phrases with the corresponding translations), speech synthesis, cryptography, etc. Dictionary methods depend on a large set of previously solved problems in order to find a solution to the problem at hand. Even if a direct solution is not present in the dictionary, the solver algorithm may find similar problems and use their solutions to generate a suitable answer. The presented method is based on the assumption that a sufficiently long substring of the primary structure always leads to the same secondary structure subsequence. The method consists of two stages: dictionary construction stage and prediction stage.
Comparing the presented technique with earlier approaches based on analysis of tetrapeptide fragments indicates that using statistical dictionaries produces a marked increase of accuracy (from 46 to 57 %), rendering our new method superior.

Databases
The dictionary is built on the basis of selected proteins from the PDB database. A nonredundant protein database was generated using the BlustClust tool (http://www.ncbi. nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html).
Following elimination of proteins whose degree of sequential similarity was greater than 95 % the database numbered 24820 proteins. The training set consisted of 24426 protein chains while the testing set consisted of 246 protein chains, selected to be dissimilar to chains in the training set. This is essential to ensure, that the prediction stage does not use information about chains from the testing set. Residues involved in interactions with external molecules were identified by measuring the distance between the external molecule and the protein under analysis [a cutoff distance of 2.9 Å was applied, in line with PDBSum standards (http://www.ebi.ac.uk/pdbsum)].

Statistical dictionaries
Each dictionary contains records composed of two elements: the primary subsequence and corresponding secondary structure for the middle element of the subsequence. Substrings are generated from the training set using a sliding window. Each chain of length n generates n pairs (substring, secondary structure class). For a given length l of the window, [l/2] additional neutral 'X' symbols are added at the beginning and end of the chain. The sliding window is then moved from left to right, generating pairs. The secondary structure class applies to the middle element in the window. Our implementation collected substrings up to 13 elements long. The dictionary uses a family of hash functions [25] to place all strings in a number of hash tables. Each hash table creates one subdictionary D i , i = 1, 3, 5, …, l max . Subdictionary D i contains strings of length i. Each record placed in a dictionary is composed of two elements: the primary string and a set of seven counters counting the occurrences of seven possible structural code classes (A, B, …, G) for the middle element of the primary string.
The prediction algorithm uses information from the dictionary built in the first stage. Each position of input string p s is analyzed. For each position, subdictionaries D i , i = 13, 11, …, 1 are used to match a substring extracted from p s , from position p s [k -2i] to p s [k ? 2i]. If a match is found, the corresponding best secondary structure class is retrieved from the dictionary. If an exact match is not found, another try is made to find an approximate match with one non-matching position. If not successful, a smaller value of i is taken. The last subdictionary, D 1 , contains all twenty possible elements so this algorithm always finds a match. Sequence p s is additionally padded with a sequence of [i/2] 'X' elements at the beginning and end, which is not shown in the code.

Evaluation measures for prediction of the 7-class structural alphabet
The evaluation formula is very simple and similar to the Q3 measure. For a given amino acid chain of length n, the observed structural code is denoted as S obs [1…n], and the predicted structural code as S pred [1…n]. The accuracy for this amino acid is computed as m/n, where m is the number of indexes i, for which S obs [i] = S pred [i] and n is the length of the chain. Accuracies for all 7 classes (A-G) of the structural alphabet have also been computed in a similar way. For each class only positions with S obs [i] equal to this class have been taken into account. If there were no elements of this class in the secondary structure, the accuracy for this class was assumed to be 0 % (which may be a bit misleading). The total accuracy for the whole testing set is defined as the arithmetic mean of accuracies for all chains. Total accuracies for 7 classes of the structural alphabet are computed analogously.

Comparative analysis
Predicted structural codes were compared with secondary structures determined by the DSSP algorithm for structures deposited in PDB [26,27]. The secondary structures were obtained from the online DSSP database (http://www.cmbi. ru.nl/dssp.html). Additionally, the prediction results were collated with prediction of secondary structures obtained by the SPINE X method [28,29] for the identical testing set of protein chains. The method distinguishes three secondary structure classes-helical (H), extended (E) and coils (C). In order to draw a comparison, such three groups of structures were created also for ES structural codes and DSSP structures. DSSP structures were grouped as follows-helical structures contain H (a-helix), G (helix-3) and I (helix-4), extended-B (b bridge) and E (strand), coils-T (turn), S (bend) and not classified. The same division was used by authors of SPINE X for evaluating predictions. The ES structural codes can be easily assigned to helical (C) and extended (E and F) structures. The four other codes create the third group but they cannot be identified with turns, bends and coils unambiguously.

Results
Results summarized in Table 1 present the overall accuracy of the structural code identification method discussed above. The aggregate value of 56.67 % compares favorably to results obtained using contingency tables which assign structural codes to tetrapeptides. Table 1 also shows the prediction accuracy for residue sets obtained by eliminating residues involved in external interactions (with ligands, other proteins or DNA/RNA chains). The differences between all four groups of results are negligible-the statistical dictionary method does not seem to favour noninteracting residues, while the contingency table method is substantially affected by eliminating residues engaged in ligand interaction as shown in [24]. In contrast, elimination of residues which interact with proteins and DNA/RNA does not alter the accuracy of predictions and both methods are quite similar in this scope. Results obtained using the maximum probability criterion are on the order of 46 % and seem affected by the status of each residue (i.e. whether it is involved in external interactions). As shown, this correlation is strongest for residues which bind external ligands and other proteins, whereas interaction with DNA/ RNA chains has a limited effect on prediction accuracy. The proposed method does not seem affected by such perturbations-whether due to methodological differences or to the relatively limited representation of interacting residues in the study set. The physical model assumes that the presence of external factors (such as ligands) may affect the local conformation of peptide bonds. Due to its highly specific nature of such distortions we should not expect the resulting conformation to match the ''standard'' structural form for a given sequence.
The improved accuracy of the statistical dictionary method (which takes into account fragments consisting of 1-13 amino acids) indicates that tetrapeptides are not sufficient for predicting the structure of the resulting chain. Restricting analysis to such short fragments effectively eliminates all nonstandard conformations, while taking into account longer chains may result in (correct) selection of structural forms which occur with lower probability.
Prediction accuracy for individual amino acids Table 2 presents the prediction accuracy for individual amino acids. The presented values (obtained using the statistical dictionary method) hint at specific correlations (Fig. 3).
Major differences can be observed for C-type structures (clockwise a-helix) and for cysteine. The presented method is less apt to propose a-helical forms for all residues except aspartic acid. D-and F-type structures are predicted with greater accuracy for most residues. Code D represents transitional structures which form the bridge between the a-helix and b-twist areas on the Ramachandran plot. Likewise, code F is adjacent to the b-twist area, aggregating forms with low negative values of u. The corresponding structures are generally deformed counterclockwise a-helixes. Analysis of such structures indicates that they represent important deviations from a and b forms: codes D and F are usually found at the ends of wellknown secondary motifs (D for a-helixes and F for b-twists respectively). Termination of such motifs produces a new structural class (see Fig. 3.5 in [14] ) which is very important from the point of view of determining the overall conformation of larger residue chains. The greater predictive accuracy of the statistical dictionary method should be viewed as a significant advantage in this regard. Another notable difference between the presented methods is the lower accuracy of the statistical dictionary method for cysteine residues (where only B-type structures are more accurately predicted than using the contingency table method). A decrease in accuracy is also observed for glycine (affecting 5 out of 7 structural codes), however the statistical dictionary method produces better results for G-type structures which are the most common conformation for this amino acid. The presented method is also less accurate with regard to B-type structures and-somewhat unexpectedly-C-type structures. Code C represents a clockwise helix which dominates the structure of many proteins. Results obtained using the older method suggest significant overrepresentation of helical fragments.

Individual prediction examples
For 2VBL the statistical dictionary method produced correct results in 92 % of cases. All a-helixes and b-twists were correctly predicted (Fig. 4), with incorrect structural codes occurring mainly at the ends of a-helixes. The contingency table (tetrapeptide) method achieved a much lower accuracy (51 %) with a marked overrepresentation of helical structures.
2JEK is an example of a protein for which the statistical dictionary method produces less accurate results than the contingency table method (12 % decrease in accuracy). The statistical dictionary method is less apt to propose helical structures, which form the majority of this protein (Fig. 5).
The final example is 2VAD for which the statistical dictionary method proved vastly superior to the contingency table method (85 vs. 35 %). This particular protein consists mainly of b-sheets; a structural motif for which the contingency table method produces poor results. Figure 6 highlights the differences between the outcome of each algorithm, with extended fragments corresponding to individual b-sheets. Another possible reason for the reduced accuracy of the contingency table method is the potential presence of a ligand, which distorts the protein's conformation.
Additional examples of structures predicted with particularly high or low accuracy are presented in Table 3.
Analysis of results listed in Table 3 confirms that the statistical dictionary method is less accurate when modeling helical structures. This is however, compensated for by its high accuracy with regards to b-twists and random coils (codes A, B, D and G), as confirmed by our analysis of 1CR9-L (immunoglobulin domain) and 1XAU-A (random coil).

Comparison with SPINE X method
The accuracy of secondary structure prediction is presented in the Table 4. The level of correct prediction of helical structures is especially high for ES prediction method (78.3 %), while the SPINE-X method overpredicts coils (helixes-36.5 % and coils-48.4). The extended

Discussion and conclusions
In conclusion, it should be noted that the proposed method provides significantly more accurate results than the contingency table method [24] with an overall accuracy of 57 %. This accuracy seems sufficient given that determining the final structure of the target protein requires another simulation step-the late stage (LS) intermediate, which accounts for pair-wise interactions between atoms, as well as interactions between the polypeptide chain and its environment [13,30,31] Blue, red and green fragments correspond to residues which form a-helixes, b-twists and loops respectively. Source: PyMOL each residue (compared to the tetrapeptide fragments, which form the basis of the contingency tables). This results in better prediction accuracy, particularly in the scope of D and F motifs which correspond to the terminal parts of a-helixes and b-twists respectively. Of note is the reduced accuracy in predicting cysteine and glycine conformations-this, however, can be alleviated by incorporating elements of the contingency table analysis algorithm into the proposed method. The further work assumes the analysis of non-redundant data base with \30 % sequence similarity. The comparative analysis of these two data base may deliver information about possible influence of homology sequence on the final prediction. Fig. 6 2VAD structure (A chain) a native structure derived from PDB, b structure obtained by projecting each (u, w) angle pair onto the elliptical path which represents the ES conformational subspace, c ES structure obtained using the statistical dictionary method, d ES structure obtained using the contingency table method. Blue, red and green fragments correspond to residues which form a-helixes, b-twists and loops respectively. Source: PyMOL The detailed analysis of (u, w) angles distribution additionally suggests the possible incorporation of the zone B to the zones E and/or F. Elimination of B of low probability observed for this zone may significantly improve the prediction reliability of the model. The discussion of the effect of ligand binding seems unrelated to the model under consideration. However the late stage model taking in consideration the interaction of folding polypeptide with the surrounding environment (water and ligands) seems to be significantly sensitive to the external molecules. This was the reason to distinguish the status of particular residue in respect to possible interaction influencing its conformation. The comparative analysis (Table 4) reveals much better prediction of random coil structures SPINE-X, however the others recognitions seem to be of similar efficiency.
Besides the methods based on theoretical calculations some experiments deliver valuable information about the ES steps of protein folding process. Experimental observations [for example hydrogen-exchange pulse-labelling mass-spectrometry method applied for large two-domain maltose binding protein (MBP; 370 residues)] suggest the presence of intermediate composed of segments that are distant which generate the immediate interaction and final collapse in the next steps of folding process [32]. However ab inito methods are limited to the proteins of domain-like size pf about 100-120 aa. This is why the experimental analysis of small molecules like RNase H (152 aa 1F21) may the perfect object for verification of theoretical methods simulating folding process and protein structure prediction [33].