Background

Accurate descriptions of the different non-covalent interactions involved in protein folding and stability are essential for a number of related problems. Potential energy functions based on such terms have been widely used to facilitate: fold recognition [13], homology modelling [4, 5], docking [6], ab-initio structure prediction [79], sequence design [10] and the analysis of protein folding kinetics [11, 12]. In each case, the purpose of the potential function is to discriminate between a variety of alternative conformations, selecting the most energetically favourable (assumed to be the most native) for further analysis [13]. Different potential energy functions have been defined at different levels of structural resolution [14]. At the atomic level, various pairwise inter-atom potentials (force-fields) have been developed from the detailed analysis of small, protein-like compounds. These include: ECEPP [15, 16], MM [17, 18], AMBER [19, 20], CHARMM [2123] and GROMOS [24]. Potential functions between distinct groups of atoms have also been defined, typically between pairs of residues [8, 2528] or idealised elements of secondary structure [9, 2934]. These 'potentials of mean force' (mean-fields) have the nature of free energies [27, 35], and may be derived by conformational averaging [7] or, more commonly, by empirical methods as described below.

There are two commonly used methods for deriving empirical potential energy functions [36]. The first method employs a statistical analysis of the observed 'interactions' [8, 25, 26, 37, 38]. In this method, the observed occurrence of a particular interaction is weighted by its expected occurrence in a given reference state [27, 39]. The resulting statistical interaction propensities can be either converted into energies using the Boltzmann distribution [8, 25, 26, 38] or log-odds scores [40, 41]. However, it has been shown that these two types of propensity are essentially the same [36]. In the second method, a potential function can be directly optimised in order to discriminate between native and near-native (decoy) structures [42]. This technique resembles machine learning, and has been applied in a variety of different ways, usually by maximising the discrimination between an average decoy and the native structure [4346]. Either of the above two methods may be applied to any feature of the protein structure that can be parameterised [9]. In the current work, we focus on the statistical analysis of residue interaction propensities. Previously, a variety of different methods have been applied to derive empirical residue-residue interaction potentials, often yielding remarkably consistent results [27]. However, the physical basis of the empirically derived potentials remains ambiguous [47]. Specifically, it has been shown that protein structures are inconsistent with the assumptions that underlie the use of the Boltzmann distribution [28, 48].

The major criticism of empirical residue-residue interaction potentials is that they ignore the protein/solvent boundary [27, 28, 48]. Consequently, there is an apparent attractive force between residues that co-segregate into the protein surface or core regions [28]. To address this, several groups have developed residue-specific environment potentials. These residue-specific environment potentials are usually correlated with hydrophobicity, measuring the extent to which each residue is buried in the protein core. In this way these single-body environment potentials capture information about the protein/solvent boundary. Such potentials have been combined with residue-residue interaction potentials: as a 'solvent correction factor' [49, 50], as an ad-hoc repulsive term [38], and using a Bayesian framework to avoid over-counting [40].

The above combination of two-body, residue-residue interaction potentials with single-body, residue-specific environment potentials raises the question as to which type of potential is the most specific for the native protein structure. To address this question, we separated statistical residue interaction propensities into two different types of score: a two-body, residue-residue 'contact-type' score, and a single-body, residue 'contact-count' score.

These two types of score can be expected to capture qualitatively different kinds of residue interaction propensities. The resulting propensities can be understood in terms of biophysical properties of protein structure. For example, the contact-type score can encode the fact that hydrophobic residues tend to interact with other hydrophobic residues in preference to hydrophilic residues. In contrast, the contact-count score can encode the fact that bulky hydrophobic residues tend to have more residue-residue interactions than small hydrophilic residues.

Here we report a comparison of two-body, residue-residue 'contact-type' scores and single-body, residue 'contact-count' score, as described below.

Results

Two different types of residue interaction propensity are studied here, contact-type and contact-count. The 'two-body' residue contact-type propensities are based on the distinct amino acid types of a pair of contacting residues. The 'single-body' residue contact-count propensities are based on the discrete number of residue-residue contacts made by each distinct residue type. These two different interaction propensities are captured by the contact-type and contact-count scoring matrices, respectively. An example contact-type scoring matrix is given in Table 1, and the scores for some residues in an example contact-count scoring matrix are shown in Figure 1. The scores in these matrices reflect the observed residue interaction propensities in a set of native structures, and are defined in comparison to simple, random models of residue interaction. (For details see the Methods Section.)

Table 1 An example of data from a contact-type scoring matrix
Figure 1
figure 1

An example of data from a contact-count scoring matrix. An example of some data from one of the contact-count scoring matrices used in this work. The scores are defined as described in the Methods Section. In this matrix residue-residue contacts were defined using a 12 Å distance threshold between C β atoms without filtering for short-range interactions. Scores are shown for selected residues as trends across the range of observed 'number of contacts'. Missing values for a residue indicate cells of the matrix that were removed due to lack of data.

Both types of scoring matrix were constructed using several different definitions of residue-residue interaction. Three different structural criteria were used to define residue-residue interaction. Firstly, we tested the effect of the choice of atomic interaction site, representing each residue by either the C α atom or the C β atom, or both. Secondly, the contact distance threshold was varied between 6 and 20 Å in increments of 1 Å, giving a total of 15 different distance cutoffs. Thirdly, we applied a sequence separation filter, either considering all interactions or only the long-range interactions. Long range interactions were defined as interactions between residues that are more than 10 residues apart in the protein sequence [51, 52]. The combination of these criteria gave a total of 3 (C α , C β or both) ×15 (distance cutoffs) ×2 (all or long range contacts) = 90 different residue-residue contact definitions.

The following three sections present the different contact-type and contact-count scoring matrices. First, the scoring matrices themselves are described, as they provide information on the nature of the captured residue interaction propensities. Second, the results of scoring native and 'fully-random' protein structures are presented. Third, the matrices are used to evaluate several sets of protein decoy structures.

In summary the results show that; i) the contact-count scores are much more specific than the contact type scores compared to random models of residue-residue interaction, ii) the C β -C β interaction captures the most specific residue interaction information compared to other atomic interaction sites, iii) both scores can identify 'unusual' proteins in the training set, iv) in contrast to point i, both scores perform equally well on the task of discriminating between decoy structures. The apparent contradiction between point i) and iv) will be returned to in the Discussion.

The magnitude of the scoring matrices

The 'mean absolute score' of a scoring matrix (MAS) was defined as the mean of the absolute value of the score in each cell of the matrix. The magnitude of MAS gives the degree to which the observed residue interaction propensities deviate from random. In other words, MAS measures the 'information content' of the observed interaction propensities encoded in the scoring matrix. The value of MAS would be equal to zero if residue interactions occurred at random, i.e. without any particular interaction propensities. The mean absolute score for each different contact-type and contact-count scoring matrix is shown in Figure 2, and are described in detail below.

Figure 2
figure 2

Magnitude of the contact-type/count scoring matrices. Each point gives the mean absolute score (MAS) of each cell in a particular residue interaction propensity scoring matrix. The different scoring matrices result from the different (given) residue-residue contact definitions used in matrix construction. The broken lines indicate the trend for the matrices with sequence separation filtering and the solid lines give the trend for the matrices without sequence separation filtering.

Contact-type

The sequence separation threshold has the biggest effect on the mean absolute score (MAS) of the contact-type scoring matrices (Figure 2). Without sequence separation filtering, the contact-type scoring matrices tend to have smaller values of MAS. This clearly shows the effect of including the inherently non-specific short-range contacts in the scoring matrix. The scoring matrices that include short-range contacts are 'more random', with respect to the observed contacts encoded in the matrix. A similar effect is seen with increasing contact distance threshold.

Contact-count

The values of the 'mean absolute scores' (MAS) of the contact-count matrices are consistently larger than those of the contact-type matrices (Figure 2). The number of elements in the contact-count scoring matrix may vary with the residue-residue contact definition used (Figure 3). However, the value of MAS is comparable between the different types of scoring matrix because MAS is the mean absolute score over all elements in the matrix. The comparison suggests that the 'number of contacts per residue type' is consistently more informative than the 'residue-residue contact-type', given any of the residue-residue contact definitions used here. Unlike the contact-type matrices, the contact-count matrices appear consistently more informative when short-range contacts are included.

Figure 3
figure 3

Size of the contact-count scoring matrices. Each point gives the number of different 'number of contact-bins' in the different contact-count scoring matrices. The broken lines indicate the trend for the matrices with sequence separation filtering and the solid lines give the trend for the matrices without sequence separation filtering (as described in the text). The plot shows how the 'number of contact-bins' varies with the contact definition.

Summary

The 'Mean Absolute Score' shows that contact-count matrices are consistently more specific than the contact-type matrices. At all the distance thresholds, either with or without the sequence separation filtering, the values of MAS are largest for the C β -C β contacts, then the C α -C β contacts, then the C α -C α contacts. This shows that the C β atom captures the specific side chain interactions more accurately than either of the other two definitions.

Scoring native protein structures

Each of the scoring matrices was derived from a data set of 3, 070 monomers (see the Methods Section for details). As a simple test, each of the native proteins was scored using the contact-type and contact-count scoring matrices which had the highest value of MAS (as described above). Using either the contact-type or the contact-count scoring matrices, there were some proteins that scored significantly worse than average. Examination of the 130 worst cases showed that they were caused by a few anomalies and annotation errors.

There were 86 very small proteins and protein fragments. These included, for example, the structure of single C α -helices and extended, coiled-coil proteins. There were 25 membrane associated proteins, including alpha helical and beta-barrel lipoproteins. There were 12 proteins that adopted an extended conformation in complex with either DNA or several large ligand groups. Another four structures were found to be C α only models, containing only the backbone and no side-chain information.

Interestingly, in this group we found 3 structures of protein subunits from oligomeric proteins. These cases were incorrectly annotated monomers in the data set. These subunits appear 'non-native' because they would make many additional residue contacts in the native oligomer. For this reason the artificially isolated subunit is effectively 'non-native' and scored badly as a result.

An examination of 130 randomly selected proteins from the data set showed only a few protein fragments and DNA binding proteins. There were several proteins found binding large ligand groups, but the relative extent of the ligand was small compared to the cases found above. There were no trans-membrane structures found in the random sample.

For the above reasons, these 130 cases were removed from the data set giving a total of 2, 940 monomers. The matrices were re-calculated over this new data set for use in the following sections.

Scoring 'decoy' structures

In this section we describe a realistic benchmark of score performance [53] using several standard sets of 'near-native' protein decoys [54]. Here the scores are used to evaluate the decoys with reference to the C α RMSD of the decoy to its corresponding native structure. The C α RMSD is used as an independent measure of decoy quality in order to evaluate the various scores.

Description of the decoy sets used

Nine different sets of decoys were used in the current work. The structures of the decoys were taken from the Decoys-R-us database [54]. Each decoy set uses a particular method to generate several 'near-native' protein structures using a given native protein structure. Some additional information for the different decoy sets is given in Table 2. The different methods include: energy minimisation (lmds and vhp-mcmd), homology modelling (hg-structal, ig-structal, and ig-structal-hires), systematic randomisation with subsequent filtering (4-state-reduced and lattice-ssfit), ab-initio (semfold) and de-novo methods (fisa and fisa-casp3).

Table 2 Summary of the nine different decoys sets from decoys-R-us

The relationship between decoy 'quality' and score

When assessing the relationship between decoy 'quality' and the residue interaction propensity score, several different measures of score performance are important [53]. Here we apply two different measures of score performance, collected from the decoys as described below. The first measure is the Spearman rank correlation coefficient (S). The value of S shows whether the interaction propensity score can accurately discriminate between decoys of varying quality. The second measure is the Z-score of the native structure compared to the decoys with respect to interaction propensity score (Z). A large and positive Z indicates a clear discrimination of the native conformation from that of the decoys using the interaction propensity score.

The nine different decoy sets were analysed separately, and each decoy was scored using the different residue contact-type and contact-count scoring matrices as described above. The two different measures of score performance described above (S and Z) were calculated for each protein. For each decoy set we always report the mean value of S and Z over all the proteins in the set, given a particular residue-residue contact definition. In the following sections we refer to the 'best' score for a decoy set as the contact definition that had the best mean performance (on S or Z) over all the proteins in the set.

The best values of S per decoy set

Focusing only on the best performing scoring matrices, we saw considerable variation between decoy sets. The best values of S per decoy set varied between 0 and 0.7 (Table 3). Four of the nine decoy sets showed very little correlation (S below 0.30). Another four had some correlation (S between 0.3 and 0.6) and only two of the nine showed a reasonable correlation between score and quality (S above 0.6).

Table 3 The best results for each of the nine different decoy sets

The best scoring contact-type and contact-count scoring matrices have very similar performance over the nine different decoy sets. The nine different contact-type and contact-count S values in Table 3 have a Spearman rank correlation coefficient of 0.95. This clearly shows that the contact-type and contact-count scores have equivalent performance on the discrimination task. In all cases, a strong (or weak) correlation using the contact-type scores implies a strong (or weak) correlation using the contact-count scores.

The best values of Z per decoy set

The best values of Z for contact-type and contact-count are less strongly correlated, having a Spearman rank correlation coefficient of 0.7 (Table 3). In addition, the best Z do not correlate well with S. For one case in particular (lattice-ssfit) weak S is accompanied by a large Z (Table 3).

In general, the best contact-type scoring matrices have worse Z than the best contact-count scoring matrices (Table 3). In one case in particular (fisa-casp3), the best contact-type Z is very small (0) and the contact-count Z is moderate (1.4). However, the difference in the Z between the two different score types is not significant (p = 0.1, df = 8).

In the above two paragraphs we described the relative performance of the best contact-type and contact-count scoring matrices. The important question of which residue-residue contact definitions give the 'best' performance of S and Z is addressed in the following paragraph.

Choosing a specific residue-residue contact definition

The choice of a specific residue-residue contact definition can have a large and significant effect on the results of the scoring matrices. The performance can vary, not just between count and type scoring matrices, but also between different decoy sets. For example, the best S for the contact-count score occurs at C β -C β 8 Å without sequence separation for the 4-state-reduced decoy set, but at C β -C β 14 Å without sequence separation for the fisa decoy set. Using these contact definitions the values of S are 0.45 and 0.35 for the two decoy sets, respectively. Exchanging the contact definitions in these two cases, S falls to 0.40 and 0.25, respectively.

The ultimate aim of a scoring function is to rank near-native protein decoys according to their similarity to the native structure. The performance of the scoring function on this task should be independent of the method used to generate the decoys. For this reason, it is informative to look at the overall performance of each different scoring matrix across all the different decoy sets. The mean values of S and Z for each different scoring matrix over each decoy set are presented in Figure 4.

Figure 4
figure 4

Mean performance. The mean value of S (upper) and Z (lower) for the contact-type (left) and contact-count (right) scoring matrices over the nine different decoy sets. The means were calculated as the mean of the mean value per decoy set, rather than the mean over the total set of proteins. The broken lines indicate the trend for the matrices with sequence separation filtering and the solid lines give the trend for the matrices without sequence separation filtering.

The best performance of the contact-type scoring matrices is obtained by defining residue-residue contact using C β -C β atoms with a distance threshold of 12 Å. This is obtained without sequence separation filtering, including short-range contacts. The best performance of the contact-count scoring matrices occurs at slightly longer distance threshold of 14 Å (Figure 4). Overall, the contact-count and contact-type matrices show a similar pattern of performance across different residue-residue contact definitions. The best Z are generally found using the contact-count scoring matrices. Both types of scoring matrix have a maximum in Z when using C β -C β atoms with a distance threshold of 8 Å. In addition, the Z of the contact-count scoring matrices is also high between 10 and 16 Å

Discussion

Early work on single-body 'residue environment' potentials was very promising [2, 3, 5563]. However, the effectiveness of these potentials has never been directly compared to two-body 'residue pair' potentials in detail. Here we do not try to combine the two different types of residue interaction propensity score, but rather to assess their independent contribution to scoring protein structures. The objective is to examine how much information is stored in the two types of measure and to compare their performance on the realistic task of ranking a set of decoy structures.

The magnitude of the scoring matrices

To address the question of which type of residue contact propensity score contains the most specific information about protein structure, we assessed the mean absolute score (MAS) of the cells in the different scoring matrices. The score in each cell measures the strength of a certain residue contact propensity. In this sense, magnitude of the MAS gives the degree of 'non-randomness' or information content of the given residue contact propensity. The MAS suggests that, whatever the residue-residue contact definition used, the 'single-body' residue contact-count propensities were stronger or more informative than the 'two-body' residue contact-type propensities.

As the residue-residue contact definition was changed, we observed changes in MAS that were consistent with previous observations [64]. The most informative contact-type and contact-count matrices were obtained using C β -C β contacts at 6 Å without sequence separation filtering and using C β -C β at 7 Å with sequence separation filtering, respectively. However, the pattern of change in MAS that occurred as a consequence of changing residue-residue contact definition were not seen in the score performance on the task of scoring protein decoys.

Scoring native protein structures

Scoring the data set of 3, 070 native proteins highlighted some problematic structures. Some of the worst scoring proteins in this set when using either the contact-type or the contact-count scoring matrices were all found to be membrane proteins. It is not surprising that the residue contact propensities derived from a data set of mostly globular proteins are not generally the same as the propensities seen in membrane proteins.

Further down the list of the worst scoring native proteins, we find some protein subunits of oligomeric proteins that were incorrectly annotated monomers. These subunits appear 'non-native' because they would make many additional residue-residue contacts in the native oligomer. The artificially isolated subunit is effectively 'non-native'.

Ranking near-native protein decoys

Firstly, we observed that the decoys in some sets cannot be successfully ranked by either the contact-type or the contact-count scores. These sets of decoys are all considered equally 'native' (or equally 'non-native') by the residue contact propensity scores, despite having a range of different RMSD values to the native structure [54]. We observed that these decoy sets lacked decoys in the range of 1 to 5 Å RMSD, having less than 25% of the decoys below 5 Å. This observation suggests that the scores might perform better on decoys that are closer to native.

Secondly, and perhaps more importantly, we observe that the two different kinds of score perform equally well on the different decoy sets (Figure 5). The contact-type and contact-count performance in terms of both rank correlation coefficient (S) or Z-score (Z) are both highly correlated. The correlation of the best performance is 0.97 for S and 0.67 for Z.

Figure 5
figure 5

Performance of the two different types of score. The best values of S and Z for the contact-type and contact-count scoring matrices. Each point represents one of the nine decoy sets studied in this work, encompassing the scores from 150 proteins and 244,794 decoys.

Finally, we observed that the specific residue-residue contact definition that gave the best performance varied between the different decoy sets studied. However, similar trends in performance were observed at any given residue-residue contact definition across all sets.

Several other groups have reported good performance on similar discrimination tasks using single-body residue burial terms. For example, in Godzik et. al. 1992 [60] it was reported that, in most cases, a burial term alone is a sufficient indicator of the native sequence compared to two- and three-body residue interaction terms. A Bayesian scoring function developed in Simons et. al. 1999 [9] suggested that residue burial scores have comparable performance to residue contact scores. Similarly, in Zhou et. al. 2004 [65] the authors concluded that the the residues solvent accessible surface area appears to be the most important among several different single-body terms tested. In addition, several groups have used a similar definition of residue contact-count as an approximation for burial [48, 50, 6567]).

The current work suggests that counting contacts between C β atoms using a distance threshold around 12 Å provides the most discriminative single body residue contact-count score (Figure 4). Similar observations have also been confirmed in the literature. For example, in Karchin et. al. 2004 [68] the best results were obtained with a 14 Å contact definition between C β atoms.

However, in a number of studies a distance threshold of 9 Å between C β atoms was used to count contacts [64, 69]. In one such study, it was stated that the 9 Å distance threshold used resulted in a slightly better performance than other cutoffs tested [69]. This difference may result from the specific count normalisation procedure applied in that work.

Only two different sequence separation filters were assessed in detail in this work, considering either all interactions or only the long-range interactions. Long range interactions were defined as interactions between residues that are more than 10 residues apart in the protein sequence. Results collected using alternative sequence separation thresholds of 5, 8 or 12 showed gave very little change in the scores collected. Using a sequence separation threshold of 2 showed scores that were roughly in between those of 0 and 10. It important to note that when scoring near-native decoys, sequence-separation filtering has very little effect on the performance of the score, as all decoys and the native protein have the same primary sequence.

Cooperativity in protein folding

It has long been suggested that pairwise potentials may not capture the inherent cooperativity of protein folding (for example see [14, 70]). Here we have presented results for the effectiveness of the contact-count score, suggesting that indeed higher order interactions are indeed important in protein structure. For example, it has been shown that contact-count can be estimated from a four-body residue-residue interaction potential [71]. However, the performance of such a four-body potential, assessed using the (SNAPP) score [72], is not significantly better than an equivalent two-body potential [73]. Despite this observation, four-body potentials are becoming much more commonly used as a way to better capture the cooperativity of protein interactions [74].

Future directions

The work presented here represents a basic comparison of contact-type and contact-count scores. There are several ways in which this basic work should be extended. However, it is important to note that the two scores compared in this work are far from optimal. It is known that distance dependent all-atom scores are more effective at discriminating between native and non-native protein structures [48, 69]. Developing the current work along these lines will be an important task for the future. In particular, it remains to be seen if the findings presented here at the residue level are consistent with observations at the atomic level.

We did not directly compare the statistical potentials derived in this work to similar potentials described by other authors in the literature. To extend the analysis presented here, our potentials should be compared directly with those in the literature (for a good example of this type of comparison see [75]) Additionally, a comparison of the important amino acid properties such as hydrophobicity and electrostatics should be performed.

In this work we did not address combinations of the two scores. The two types of potential studied perform equivalently, suggesting that they are based on a similar underlying principle. However, if a combination of scores improves the overall performance this would show that the scores carry different information. Although that an ideal scoring function should work in all possible cases, correlation between RMSD and score is usually only significant for RMSD below about 3 Å [76]. For this reason it would be useful to compare the scoring functions on decoys within specific ranges of RMSD from the native.

Conclusion

In this work we assessed the independent contribution of two different types of residue contact propensity to scoring protein structures. The main finding is that the contact-type and contact-count scores showed equivalent overall performance in the task of ranking protein decoys. Although the two different score types perform equivalently, the ability to automatically predict the number of contacts made by a residue [68, 77, 78] allows for a greater range of applications. In addition, a single-body term is amenable to an efficient dynamic programming method for alignment optimisation [3, 65, 79].

The work presented here represents our preliminary investigation of a multi-body potential for evaluation of protein structure. In future it should be possible to combine the contact-type and contact-count scores to better take into account the inherent cooperativity of protein folding.

Methods

The data set of native proteins and protein decoys

The non-redundant data set of monomers

The scoring matrices were derived from a non-redundant set of high-quality protein monomers. This set of 3, 070 proteins was selected using the following protocol. Only the monomers from the BioUnit section of the Protein Data Bank (PDB) [80, 81] were selected, excluding putative structures of dimers, trimers, and the other multi-subunit proteins. The resulting monomeric proteins were further filtered by size, having more than 20 amino acids, and by resolution, being better than 3 Å. Finally, the chains were made non-redundant at 30% sequence identity using BLASTClust [82]. The resulting set of 3, 070 monomers was used throughout this analysis.

The data set of 'near-native' protein decoys

Nine sets of protein decoys were taken from the Decoys R us database [54]. In total this data set included 244, 794 decoys derived from 150 native proteins.

Constructing the scoring matrices

Contact-Type

Given the fraction of residues of type x and of type y (P x and P y ), the probability of randomly observing a contact of type xy is,P xy = P x ·P y

where P xy is the probability of a 'random' contact of type xy (for example see Table 1). This formula is obtained by assuming that contacts are made between randomly selected pairs of residues, assuming statistical independence. In this way, we make no assumptions about the distribution of contacts within the protein, such as the distribution of the number of contacts per residue.

The observed and expected probabilities of a contact of type xy can be combined into a score using the log-odds ratio;

S x y = log ( P x y o b s P x y e x p ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aaSbaaSqaaiabdIha4jabdMha5bqabaGccqGH9aqpcyGGSbaBcqGGVbWBcqGGNbWzdaqadaqcfayaamaalaaabaGaemiuaa1aa0baaeaacqWG4baEcqWG5bqEaeaacqWGVbWBcqWGIbGycqWGZbWCaaaabaGaemiuaa1aa0baaeaacqWG4baEcqWG5bqEaeaacqWGLbqzcqWG4baEcqWGWbaCaaaaaaGccaGLOaGaayzkaaaaaa@48A7@
(2)

The magnitude of the score S xy gives a measure of how 'non-randomly' the pair xy occurs. The score is positive when xy is observed more often than expected and negative when xy is observed less often than expected.

Contact-Count

The contact-count scoring matrix is created in a similar way to the contact-type scoring matrix. However, instead of using the frequency P xy to denote the probability of a residue-residue contact between residue type x and type y, we use P xn to denote the probability of a residue of type x having exactly n residue-residue contacts. The probability P xn is defined as,P xn = N xn /N n

where N xn is the observed number of residues of type x having exactly n contacts and N n is the total number of residues with exactly n contacts (for example see Figure 1). The values of n were taken from those observed over all residues. The 'random' value of P xn is simply taken to be equal to P x , the fraction of residues of type x. Using this value assumes that there is no particular effect on the overall amino acid composition when filtered by a given number of contacts. Again the observed and expected probabilities can be combined as in Equation 2. Using this method, the magnitude of the score S xn can be easily interpreted as a measure of 'compositional bias' given a certain number of residue-residue contacts.

Undersampling

Certain residue-residue contact definitions could lead to low counts in the scoring matrices. For example, if the overall number of observed residue-residue contacts is low, certain contact-types may become rare. Similarly, if the number of different contact-counts spans a wide range, instances of a given residue type with a given contact-count may become rare. To address this issue of undersampling, if a cell of a scoring matrix was based on fewer than 5 observed or expected counts, that score was discarded. Those specific classes of contact were therefore ignored when scoring protein structures, being neither penalised nor rewarded.

Although a threshold of 5 observed or expected counts was used to filter undersampled classes in the results presented here, it should be noted that both the contact-type and contact-count scores appear very robust, showing only small changes in MAS when discarding cells with fewer than 50 observed or expected counts. The complete set of counts and scores for the contact-type and contact-count scoring matrices are included as additional files (see Additional file 1).