Background

The number of protein structure prediction servers has increased over the past years [1]. The use of many different methods to predict the structure of a protein is now state-of-the-art in protein structure prediction [2]. However, the number of available servers, combined with the number of models each returns, exceeds what a human researcher can reasonably inspect. Fortunately, structure prediction meta-servers address this problem: they gather models from various other servers and employ automated versions of processes successfully applied by human experts in order to deliver a correct prediction [1]. Since existing structure prediction servers are constantly upgraded while new servers appear, it is necessary to re-evaluate the fitness of the aforementioned expert processes.

The latest, 7th round of the Critical Assessment of Techniques for Protein Structure Prediction [3] has provided us with a fair amount of structure prediction server models. With the help of the Structure Prediction Meta Server [4], we have evaluated the servers returning these models using the same protocols as in previous Livebench experiments [5]; results are available at [6].

Standard evaluation methods take into account the first (top ranked) model of the prediction servers. The Meta Server assigns a new reliability score to each model using 3D-Jury [7]. This score can be used to re-rank the models and thus affect the evaluation results. The aim of the present work was to verify the continued applicability of this model ranking method, focusing on the version available on-line. We were interested in answering the following three questions: Can we use 3D-Jury to estimate model quality? Does 3D-Jury select a model more accurate than the choice of the generating server? Could the 3D-Jury score be used as a generic model reliability score?

Results and Discussion

3D-Jury score correlates with the number of correctly predicted residues

The correlation of the 3D-Jury score (Jscore) with model quality is of fundamental importance to the operation of the Meta Server. Therefore we first examined the correlation of the 3D-Jury score returned by the default on-line version of 3D-Jury, 3J1,A (see Methods: 3D-Jury operating modes), with the number of correctly predicted residues ($N_{C_\alpha \le 3.5\,\text{\AA}}$).

3D-Jury scores correlate with the number of correctly predicted residues ($N_{C_\alpha \le 3.5\,\text{\AA}}$): the correlation coefficient is 0.95. A linear model (LM1) is shown in Figure 1. The residual standard error, 20.15, is low enough to enable meaningful estimation of the number of correctly positioned residues.

Figure 1

Correlation of 3D-Jury score with the number of correctly predicted Cα atoms. $N_{C_\alpha \le 3.5\,\text{\AA}}$ – the number of Cα atoms predicted within 3.5 Å of their respective locations in the crystal structure; Jscore – 3J1,A score; solid green line – prediction of linear model LM1; blue longdash lines – confidence interval at 95% confidence level; blue dashed lines – prediction interval at 90% confidence level; blue dotdash lines – prediction interval at 95% confidence level; blue dotted lines – prediction interval at 99% confidence level; x – slope; the colour bar is the key to the approximate density of models. A linear model (LM1) was fitted to the 3D-Jury score vs. $N_{C_\alpha \le 3.5\,\text{\AA}}$ of 19,558 models. The residual standard error is 20.15. The 95% confidence interval as well as prediction intervals for 90%, 95% and 99% confidence levels are indicated on the figure. The vertical and horizontal histograms show the distributions of $N_{C_\alpha \le 3.5\,\text{\AA}}$ and 3D-Jury scores, respectively.

A better model (LM2) can be obtained by fitting to the [30, 100) 3D-Jury score range only. This range represents difficult targets. Figure 2 shows the linear model obtained. The residual error is 13.37, offering narrower, better prediction intervals for the number of correctly positioned residues.

Figure 2

Correlation of 3D-Jury score in the [30, 100) range with the number of correctly predicted Cα atoms. $N_{C_\alpha \le 3.5\,\text{\AA}}$ – the number of Cα atoms predicted within 3.5 Å of their respective locations in the crystal structure; Jscore – 3J1,A score; solid green line – prediction of linear model LM2; blue longdash lines – confidence interval at 95% confidence level; blue dashed lines – prediction interval at 90% confidence level; blue dotdash lines – prediction interval at 95% confidence level; blue dotted lines – prediction interval at 99% confidence level; x – slope; the colour bar is the key to the approximate density of models. A linear model (LM2) was fitted to the 3D-Jury score vs. $N_{C_\alpha \le 3.5\,\text{\AA}}$ of 6,710 models. The residual standard error is 13.37. The 95% confidence interval as well as prediction intervals for 90%, 95% and 99% confidence levels are indicated on the figure. The vertical and horizontal histograms show the distributions of $N_{C_\alpha \le 3.5\,\text{\AA}}$ and 3D-Jury scores, respectively. The 30 to 100 3D-Jury score range was chosen to represent difficult targets.

As an example of the use of LM2, assume that a model has a 3D-Jury score of 44.5. We can then expect 13 to 82 well positioned residues in this model at the 99% confidence level, and 21 to 74 at the 95% confidence level. For a score of 59 the 99% prediction interval for the number of correct residues is 26–94, while the 95% prediction interval is narrower: 34–86.
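The intervals quoted above come from the fitted model LM2, whose coefficients are not reproduced in the text. As a minimal sketch of how a prediction interval of this kind can be obtained from a simple linear regression, the Python code below fits a line to a handful of invented (Jscore, N) pairs. It is not the authors' R analysis: the data, the function names and the use of a normal quantile in place of the Student t quantile (adequate only for large samples such as the 6,710 models behind LM2) are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x.

    Returns (a, b, rse), where rse is the residual standard error.
    """
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sqrt(rss / (n - 2))


def prediction_interval(xs, ys, x0, level=0.95):
    """Approximate prediction interval for a new observation at x0.

    Assumption: a normal quantile replaces the Student t quantile,
    which is reasonable only when the sample is large.
    """
    a, b, rse = fit_line(xs, ys)
    n = len(xs)
    mx = mean(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * rse * sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)
    centre = a + b * x0
    return centre - half_width, centre + half_width


# Hypothetical (Jscore, N) pairs, invented for illustration only.
jscores = [30, 40, 50, 60, 70, 80, 90]
n_correct = [28, 45, 47, 66, 65, 85, 88]

low, high = prediction_interval(jscores, n_correct, x0=44.5, level=0.95)
print(f"95% prediction interval at Jscore 44.5: {low:.1f} to {high:.1f}")
```

Raising `level` from 0.95 to 0.99 widens the interval, which mirrors the difference between the 95% and 99% ranges quoted for LM2.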

A key to which residues are likely to be well-positioned is provided on the model-centred 3D-Jury page, accessible by selecting a model in the Model column of the main 3D-Jury page. Here, residues that are likely to be correctly positioned have a grey background at the corresponding positions of most of the other aligned models, forming a grey column.

3D-Jury improves overall server prediction results

We examined whether 3D-Jury could improve overall server performance by selecting a better model when multiple models are returned by a prediction server. We tested four operating modes of 3D-Jury: 3J1,A – one model of the default servers (a mode typical for on-line predictions); 3Ja,A – all models of the default servers; 3J1,C – one model of all servers; 3Ja,C – all models of all servers. We computed the MaxSub score (MaxS) [8] of 25,215 models for this analysis. Four 3D-Jury scores (Jscore) were also computed for each model, one for each of the four operating modes. The servers' choice of the best model was evaluated by summing the MaxS of the first model returned for each target. Each 3D-Jury variant's choice of the best model was evaluated by summing the MaxS of the model with the highest respective 3D-Jury score for each target. We also summed the highest MaxS for each target, giving an upper limit to possible improvements. Results for 3J1,A are presented in Table 1, column Q%. The order of the five model ranking approaches is revealed by the grand total of MaxS: 3Ja,C (20,006) > 3J1,C (19,983) > 3J1,A (19,690) > 3Ja,A (19,655) > first server model (19,039); the sum of MaxS over the highest scoring models is 20,718. Table 1, column $N_j$ shows the number of targets where 3J1,A made a better choice of the best model than the original server. In the case of pmodeller6 [9] and 3dpro [10], 3J1,A predicts more targets better, but its overall performance is slightly worse than that of the original servers. The reason is that the gains from 3J1,A's more numerous better choices were not large enough to offset the MaxSub score it lost on its worse choices. In the case of inub [11] and BasD [12] the situation is reversed: 3J1,A improved fewer targets, but the net improvement is positive. For many servers the improvement, or worsening, is marginal (e.g. phyre-2: 0.6%). Nevertheless, even in these cases there is room for a 4–5% improvement (Table 1, column Q%, values in parentheses). Moreover, it appears that for at least 14 targets every server fails to pick the best model.
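The three totals compared in this section (the server's first model, the model with the highest 3D-Jury score, and the best available model) can be sketched for a single server as follows. The record layout, the field names and the toy values are assumptions made for illustration and do not reflect the Meta Server's actual data model.

```python
from collections import defaultdict


def selection_totals(models):
    """Sum MaxSub scores under three model selection rules for one server.

    `models` is an iterable of dicts with the hypothetical keys
    'target', 'rank' (1 = the server's first model), 'jscore' and 'maxs'.
    """
    by_target = defaultdict(list)
    for model in models:
        by_target[model["target"]].append(model)

    totals = {"server_first": 0.0, "jury_top": 0.0, "best_possible": 0.0}
    for group in by_target.values():
        # The server's own choice: its top ranked model.
        totals["server_first"] += min(group, key=lambda m: m["rank"])["maxs"]
        # The 3D-Jury choice: the model with the highest Jscore.
        totals["jury_top"] += max(group, key=lambda m: m["jscore"])["maxs"]
        # Upper limit: the model with the highest MaxSub score.
        totals["best_possible"] += max(m["maxs"] for m in group)
    return totals


# Invented values for two targets, for illustration only.
toy_models = [
    {"target": "T1", "rank": 1, "jscore": 40, "maxs": 2.0},
    {"target": "T1", "rank": 2, "jscore": 55, "maxs": 3.5},
    {"target": "T2", "rank": 1, "jscore": 60, "maxs": 5.0},
    {"target": "T2", "rank": 2, "jscore": 62, "maxs": 4.0},
    {"target": "T2", "rank": 3, "jscore": 20, "maxs": 6.0},
]

print(selection_totals(toy_models))
```

In this toy set the re-ranking improves target T1 and worsens target T2, yet the summed MaxS is still higher than for the server's first models, which is the kind of net effect described above for inub and BasD.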

Table 1 Server prediction results improved by 3D-Jury. 3J1,A – the default on-line version of 3D-Jury, which uses one model of the default servers [7]; $N_s$ – number of targets better predicted (in terms of MaxSub score) by the server; $N_j$ – number of targets better predicted by 3J1,A, in parentheses: number of improvable targets, i.e. those with a suboptimal choice of the first model; Q%, $Q_{\%}^{max}$ – see Methods: Measures for comparing model selection methods. Servers are ordered by $N_j - N_s$ descending; three servers with $\sum MaxS_s = 0$ are not shown. Servers not improved by the re-ranking of models ($N_s > N_j$) are shown in italics. 3J1,A selects better models on the whole for 50 servers out of the 56 shown, considering either Q% or the number of targets. Re-ranking of models by 3D-Jury does not improve the performance of 6 servers.

3D-Jury scores as generic model reliability scores

In order to assess the advantage of using 3D-Jury scores as generic reliability scores we conducted a receiver operating characteristic (ROC) analysis adapted for CASP and Livebench [5] evaluation. The analysis shows how well a reliability score separates good models from bad ones, in terms of the average number of good models seen before encountering 1 to 11 bad models ($\overline{tp}$). We compared the 3D-Jury scores returned by the on-line version 3J1,A to the reliability scores of the original servers, when available. Results are shown in Table 2. The 3D-Jury score exceeds the original server score ($\overline{tp}_R$) in 27 cases and falls short of it in only 5 cases out of the 38 analysed. The exceptions are pmodeller6 [9], pcons6 [2], ffas03 [13], inub [11] and shub [11].

Table 2 3D-Jury receiver operating characteristic (ROC) analysis. $\overline{tp}_R$ – average number of true positive (tp) models in the [0 – 10] false positive (fp) range, using the reliability score provided by the server as the discrimination threshold; $\overline{tp}_J$ – average number of tp in the [0 – 10] fp range using the 3D-Jury score as the discrimination threshold; J0 – lowest 3D-Jury score before observing the first bad model; $tp_{J_0}$ – number of good models at or above the J0 score; $N_t$ – number of targets. The table shows results for the on-line default version of 3D-Jury: 3J1,A. Servers are ordered by $\overline{tp}_J$ descending. Missing $\overline{tp}_R$ values indicate servers that did not return reliability scores. Five servers with $\overline{tp}_R > \overline{tp}_J$ are shown in italics. In order to assess 3D-Jury scores (Jscore) as reliability scores, we performed a ROC analysis adapted for CASP and Livebench data, comparing Jscore to the reliability scores provided by the servers. In terms of the average number of true positive models ($\overline{tp}$), the 3D-Jury score exceeds the original server score in 27 cases and falls short of it in 5 cases out of the 38 analysed.

The J0 scores listed in Table 2 indicate the lowest 3D-Jury score seen before a bad model was encountered from the indicated server. In other words, no bad model above J0 score was seen in the test model set of the server. J0 scores are of practical value: they can be used as server-specific score thresholds, since a score above J0 is likely to indicate a good model.
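A server-specific J0 threshold of this kind can be read off a list of models sorted by 3D-Jury score. The sketch below assumes the input holds one (Jscore, MaxS) pair per target and that a model is bad when MaxS = 0, as defined in Methods. The function name and the toy values are hypothetical, and ties between a good and a bad model at the same score are not treated specially.

```python
def j0_threshold(models):
    """Return (J0, tp_J0) for a list of (jscore, maxs) pairs.

    J0 is the lowest Jscore seen before the first bad model (maxs == 0)
    when models are scanned from the highest Jscore downwards; tp_J0 is
    the number of good models at or above J0. Returns (None, 0) if the
    highest scoring model is already bad or the list is empty.
    """
    ranked = sorted(models, key=lambda m: m[0], reverse=True)
    j0, tp = None, 0
    for jscore, maxs in ranked:
        if maxs == 0:  # first bad model: stop scanning
            break
        j0, tp = jscore, tp + 1
    return j0, tp


# Invented (Jscore, MaxS) pairs for one server, for illustration only.
toy = [(120, 8.1), (95, 6.0), (70, 2.5), (52, 0.0), (48, 1.2)]
print(j0_threshold(toy))
```

For the toy list the scan stops at the model scoring 52, so the threshold is the score of the last good model seen before it.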

3D-Jury scoring of user models

In order to encourage model selection and refinement using 3D-Jury, we introduced a new feature: instant 3D-Jury scoring of user models. This feature, available for any completed job by selecting the job in the Queue and uploading a model, enables the user to score a set of models and obtain a ranking based on the 3D-Jury score. Pop-up hints and an on-line tutorial [14], available from the job page, offer help with this new feature.

Conclusion

In this report we present the evaluation of 3D-Jury [7] on models gathered in CASP7. We found good correlation between the 3D-Jury score and a model quality measure: the number of correctly predicted residues. This correlation can be used to predict important model features such as the number of correctly positioned residues. Using Figure 2, 3D-Jury scores can be translated to the estimated number of correctly predicted residues. We plan to upgrade the on-line 3D-Jury to provide the 90%, 95% and 99% prediction intervals for the number of correctly predicted residues automatically.

3D-Jury, in general, also appears to boost server predictions by identifying better models. Our results show that 3D-Jury performs best when all models of all servers are used to calculate the Jscore. This option, however, is not feasible in the Meta Server since many of the servers participating in CASP7 are not currently available on-line. Nevertheless, 3J1,A, the default provided on-line, is a reasonable choice. We found that 3D-Jury scores can be used as generic reliability scores, an especially important feature for models that are not provided with such values. We have also extracted server-specific 3D-Jury score thresholds to help identify reliable models. We report the release of a new Meta Server feature: instant 3D-Jury scoring of uploaded user models.

3D-Jury remains a valuable tool in the hands of protein structure modellers. Its ability to pinpoint the best server models is supported by the results of our analysis.

Methods

Test model set

In order to assess 3D-Jury we downloaded the complete set of server structure predictions from the Protein Structure Prediction Center [15]. Predictions from our partner servers (BasD [12], ffas03 [13], inub [11], mgenthreader [16], ORFeus-2 [17], pdbblast [18] and 3D-PSSM [19]) were added if missing.

Servers that predicted fewer than two targets and/or returned only one model for each target were excluded from the server model ranking tests (reported in Table 1). The resulting set contains 25,215 models for 85 targets from 59 servers, an average of about 5 models per server per target.

Models with Jscore = 0 were excluded from all correlation and regression analyses.

Server reliability scores (Rscore) that anti-correlate with model quality were multiplied by -1.

Model quality measures

MaxSub [8] score and $N_{C_\alpha \le 3.5\,\text{\AA}}$ (defined below) were used to measure the quality of models. MaxSub returns a score between 0.0 (incorrect prediction) and 1.0 (perfect prediction). In this study the score was multiplied by 10.0 as is customary on the 3D-Jury web pages [20]. We say that models with MaxS > 0 are good, while models with MaxS = 0 are bad.

$N_{C_\alpha \le 3.5\,\text{\AA}}$ is the number of Cα atoms that are predicted within 3.5 Å of their respective locations in the solved structure, as reported by the MaxSub tool [8] operating on the Cα atoms of the structures compared. We say that $N_{C_\alpha \le 3.5\,\text{\AA}}$ gives the number of correctly predicted residues.

3D-Jury model scoring

The 3D-Jury score of a model M is calculated by first comparing M to a set of other models available to the system for the same target. The way these other models are selected is a tunable parameter of 3D-Jury. M is compared to each selected model, and a pairwise similarity score ($S_{M,i}$, for pair i) is assigned that equals the number of corresponding Cα atoms that are within 3.5 Å of each other after optimal superposition of the structures represented by their Cα atoms. MaxSub [8] is used to carry out this step. If a pairwise similarity score falls below a certain cutoff value, it is set to zero. The 3D-Jury score (Jscore) of model M is the sum of its pairwise similarity scores divided by the number of these scores (n) plus 1 [7]: $Jscore_M = \frac{\sum_{i}^{n} S_{M,i}}{n + 1}$.
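The scoring formula can be illustrated with a few lines of Python. This is a minimal sketch, not the Meta Server implementation: it takes the pairwise similarity scores as given (in the real system they come from MaxSub superpositions), uses the cutoff of 40 mentioned under 3D-Jury parameters as its default, and assumes that scores set to zero still count towards n.

```python
def jury_score(pairwise_scores, cutoff=40):
    """Compute a 3D-Jury style score from pairwise similarity scores S_{M,i}.

    Scores below `cutoff` are set to zero. Assumption: zeroed scores are
    still counted in n, so the divisor is always len(pairwise_scores) + 1.
    """
    kept = [s if s >= cutoff else 0 for s in pairwise_scores]
    return sum(kept) / (len(kept) + 1)


# Invented pairwise scores of one model against four others.
example_scores = [80, 55, 30, 100]
print(jury_score(example_scores))
```

Here the score of 30 falls below the cutoff and contributes nothing, so the sum 80 + 55 + 100 = 235 is divided by n + 1 = 5.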

3D-Jury parameters

3D-Jury offers three tunable parameters: the list of servers to draw models from for pairwise score calculation; the method of server model selection (applicable in case of multiple available models, the name of the method is shown in italics): first model, most similar (in terms of S M,i ) one, or all models; and the pairwise similarity score cutoff [7]. In this analysis we used the publicly available BasD [12], ffas03 [13], inub [11], mgenthreader [16], ORFeus-2 [17], pdbblast [18] and 3D-PSSM [19] as default servers and a constant similarity cutoff of 40 in order to simulate regular on-line use of the service.

3D-Jury operating modes

The four operating modes of 3D-Jury used in this report are: 3J1,A – one model of the default servers (a mode typical for on-line predictions); 3Ja,A – all models of the default servers; 3J1,C – one model of all servers; 3Ja,C – all models of all servers.

Measures for comparing model selection methods

Q% – 3D-Jury vs. original server

$Q_{\%} = \left( \frac{\sum MaxS_j}{\sum MaxS_s} - 1 \right) \times 100$

$\sum MaxS_j$ – sum of MaxSub scores of the models selected by 3J1,A

$\sum MaxS_s$ – sum of MaxSub scores of the server's first models

$Q_{\%}^{max}$ – 'best model' vs. original server

$Q_{\%}^{max} = \left( \frac{\sum max(MaxS)}{\sum MaxS_s} - 1 \right) \times 100$

$\sum max(MaxS)$ – sum of the server's highest (best) MaxSub scores per target

$\sum MaxS_s$ – sum of MaxSub scores of the server's first models
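As a worked example of these measures, the sketch below applies the same ratio to the grand totals of MaxS reported in Results (19,690 for 3J1,A, 19,039 for the servers' first models and 20,718 for the best available models). The measures are defined per server, so applying them to grand totals is an illustration only, and the function name is hypothetical.

```python
def q_percent(sum_selected, sum_server_first):
    """Relative change, in percent, of a summed MaxSub score with respect
    to the sum over the server's first models."""
    if sum_server_first == 0:
        # Servers whose first models sum to zero cannot be compared this way.
        raise ValueError("sum of MaxS over first models must be non-zero")
    return (sum_selected / sum_server_first - 1) * 100


# Grand totals of MaxS quoted in Results.
q_jury = q_percent(19690, 19039)  # models chosen by 3J1,A
q_max = q_percent(20718, 19039)   # best available model per target

print(f"Q% = {q_jury:.2f}, Q%max = {q_max:.2f}")
```

On these totals the re-ranking by 3J1,A recovers a little over a third of the improvement that a perfect selection would give.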

Receiver operating characteristic (ROC) analysis

We performed a ROC analysis adapted for CASP and Livebench [18] model evaluation for each server. Server models were ordered by the original reliability score (Rscore, when available), or the 3D-Jury score (Jscore). The highest scoring models for each target were collected into separate sets M R and M J , corresponding to the Rscore or Jscore used for ordering. Models in both sets were ordered by their respective scores. Good models (MaxS > 0) were labelled positive, bad models (MaxS = 0) were labelled negative. Using Rscore or Jscore as the discrimination threshold, we plotted the number of true positives (tp) versus the number of false positives (fp) on the [0 – 10] fp range. This was to take into account the absolute number of targets predicted by the servers, focusing on the hardest targets. We used the number of true positives averaged over the [0 – 10] false positive range as a quality measure for the reliability scores, the higher values indicating better reliability scores.
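The quality measure described above can be sketched as follows. The code assumes that the input already holds the highest scoring model of each target as a (score, MaxS) pair, and that a server with fewer than 11 bad models keeps its final true positive count for the remaining false positive levels. Both the padding rule and the function name are assumptions, not details taken from the original analysis.

```python
def mean_tp(models, max_fp=10):
    """Average number of true positives over the [0, max_fp] false
    positive range.

    `models` is a list of (score, maxs) pairs; a model is good when
    maxs > 0 and bad when maxs == 0. Models are scanned from the highest
    score downwards, and tp_at_fp[k] records how many good models were
    seen before the (k + 1)-th bad one.
    """
    ranked = sorted(models, key=lambda m: m[0], reverse=True)
    tp = 0
    tp_at_fp = []
    for _, maxs in ranked:
        if maxs > 0:
            tp += 1
        else:
            tp_at_fp.append(tp)
            if len(tp_at_fp) > max_fp:
                break
    # Assumption: if the list ends early, later fp levels keep the final tp.
    while len(tp_at_fp) <= max_fp:
        tp_at_fp.append(tp)
    return sum(tp_at_fp) / (max_fp + 1)


# Invented (score, MaxS) pairs, for illustration only.
toy_roc = [(9, 3.0), (8, 1.5), (7, 0.0), (6, 2.0), (5, 0.0), (4, 0.0), (3, 4.0)]
print(mean_tp(toy_roc, max_fp=2))
```

Running the same function once with Rscore and once with Jscore as the ordering score gives the two averages that are compared for each server.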

Statistics and figures

Reported correlation coefficients are significant at the 5% significance level (95% confidence).

Statistics and figures were prepared using R [21].

Availability and requirements

Project name: Meta Server/3D-Jury

Project home page: http://meta.bioinfo.pl/

Operating system: Linux

Programming language: Perl

Other requirements: SQL server, web server, mail server, procmail

Licence: the web service is freely accessible to everybody