Introduction

The identification of antigen peptides that bind to major histocompatibility complex (MHC) molecules plays a crucial role in understanding the mechanisms of both humoral and adaptive immunity as well as in developing epitope-based vaccines. Two major types of MHC molecules are involved in the peptide-binding process. MHC class I molecules present endogenous antigens to CD8+ cytotoxic T cells. MHC class II molecules, on the other hand, present exogenously derived proteins through antigen-presenting cells to CD4+ helper T cells (Parham 2005). Generally, antigen peptides that bind to MHC class I molecules are approximately nine amino acid residues long (Bleek and Nathenson 1991). However, the peptide-binding groove of an MHC class II molecule is open at both ends, a property that makes it capable of accommodating longer peptides consisting of 10–30 residues (Castellino et al. 1997; Max et al. 1993; Sette et al. 1989).

Experimental determinations of the binding affinities of peptides to MHC molecules are time consuming and expensive. Therefore, considerable effort has been devoted to the development of computational tools for the identification of MHC-binding peptides (De Groot and Berzofsky 2004; Doytchinova et al. 2003; Flower 2004; Flower and Doytchinova 2002; Flower et al. 2002; Martin et al. 2003; Schirle et al. 2001). A host of computational methods has been developed for MHC class I prediction over the last two decades (De Groot et al. 2002; Flower et al. 2003; Martin et al. 2003; Nussbaum et al. 2003; Reche et al. 2002; Schirle et al. 2001). A comprehensive list of references can be found in a recent paper (Peters et al. 2006). The number of alleles covered by these methods is large, and the level of accuracy is relatively high.

Conversely, the situation for the prediction of MHC class II binding peptides is quite different. The variability in the peptide length complicates the prediction of peptide–MHC class II binding. The analyses of the binding motif and the structure of peptide–MHC class II complexes have suggested that a core of nine residues within a peptide is essential for peptide–MHC binding. Computational methods for MHC class II binding prediction include simple binding motifs (Borras-Cuesta et al. 2000; Rammensee et al. 1999; Singh and Raghava 2001), quantitative matrices (Bui et al. 2005; Peters and Sette 2005; Sturniolo et al. 1999), hidden Markov models (Kato et al. 2003; Noguchi et al. 2002), artificial neural networks (Brusic et al. 1998; Burden and Winkler 2005; Nielsen et al. 2003), iterative discriminant analysis (Mallios 1998, 2001), support vector machines/regression (Bhasin and Raghava 2004; Donnes and Elofsson 2002; Donnes and Kohlbacher 2006; Liu et al. 2006; Salomon and Flower 2006), the Gibbs sampler and its extension (Nielsen et al. 2007, 2004), partial least squares (Chang et al. 2006; Doytchinova and Flower 2003; Hattotuwagama et al. 2006), and other methods (Altiparmak et al. 2006; Chang et al. 2007; Cui et al. 2006, 2007; Doytchinova and Flower 2001; Hertz and Yanover 2006; Karpenko et al. 2005; Murugan and Dai 2005; Takahashi and Honda 2006; Tong et al. 2006; Wan et al. 2006). Because each method has its own strengths and weaknesses, it is hard for an immunologist to select a single method from the pool of existing predictors. Therefore, a system that produces reliable prediction through the integration of outcomes from major predictors is in clear need.

A consensus strategy for combining three human leukocyte antigen (HLA)-DR binding algorithms—SYFPEITHI (Rammensee et al. 1999), ProPred (Singh and Raghava 2001), and the iterative stepwise discriminant analysis meta-algorithm (Mallios 2001)—has been shown to be consistently best or second best (Mallios 2003) using sets of binding peptides in DRB1*0101 and DRB1*0401. In another integrative system, MULTIPRED (Zhang et al. 2005), the individual predictive engines implemented are hidden Markov models (HMMs) and artificial neural networks (ANNs). The system covers predictions of HLA protein binding peptides belonging to supertypes A2 and A3 (HLA class I) as well as DR (HLA class II). Users can choose either HMMs or ANNs as individual predictors. In addition, the system provides a mechanism that makes consensus prediction by combining the results from the two prediction methods. Significantly, recent work (Moise and De Groot 2006; Moutaftsi et al. 2006) has demonstrated the promise of an integrative approach for the computational identification of peptides with immunogenicity through the prediction of binding affinity to MHC class I molecules. Specifically, the computational prediction reduced the number of possible overlapping peptides by more than 85-fold, accelerating the discovery of 49 epitopes that account for ~95% of the immunome in a mouse model for vaccine development. The key step in their approach is a consensus prediction that combines four matrix-based epitope prediction algorithms (BIMAS: http://thr.cit.nih.gov/molbio/hla_bind/; Bui et al. 2005; Peters and Sette 2005; Udaka et al. 2000). Another recent integrative method for the MHC class I binding prediction uses the sum of the weighted votes from each individual predictor as a combined score for a peptide to make an improved prediction (Trost et al. 2007).

In the present work, a meta-predictive system based on a probabilistic approach (hereafter called the probabilistic meta-predictor, or PM predictor) is described. This method significantly improves on our earlier Naïve Bayesian meta-predictor (Huang et al. 2006, 2007), achieving faster training and higher performance through a consensus score that combines the predictive scores from the individual predictors. Like the previous integrative predictors (Huang et al. 2006, 2007; Trost et al. 2007), the framework presented in this work has the flexibility to incorporate an arbitrary number of predictors, provided that each produces a computed score correlated with the binding affinity.

To illustrate the basic framework of our PM predictor, the MHC class II binding prediction was taken as an example, although this approach can also be applied to MHC class I. Six individual predictors of MHC class II binding were selected based on their availability from the Internet. They are SVRMHC (Wan et al. 2006), ARB (Bui et al. 2005), RANKPEP (Reche et al. 2004), ProPred (Singh and Raghava 2001), Gibbs Sampler (Nielsen et al. 2004), and the LP model (Murugan and Dai 2005). The output from each of the first three methods for a peptide is a predictive score of binding affinity. Each of the latter three methods provides an allele-restricted position-specific scoring matrix (PSSM) that can be used to compute scores of the overlapping 9-mers of a peptide. The maximum score over these 9-mers is taken as the score of the peptide. Taking these scores of training peptides for a specific allele as inputs, we first estimate the probability distributions of the scores for both binding and nonbinding peptides in the training set for each individual predictor. Then, we combine these distributions in a probabilistic model to obtain the integrative predictor. The effectiveness of our model is examined with the use of MHC class II peptides from 13 HLA alleles and three mouse MHC alleles obtained from the Immune Epitope Database and Analysis Resource (IEDB; Peters et al. 2005). The computational analysis shows that the PM predictor consistently produces stable predictions and in general achieves statistically improved results in comparison with any individual predictor.

Materials and methods

Data set

The computational experiments were conducted using the data set available from the IEDB database (Peters et al. 2005). This data set comprises peptide data with IC50 binding affinities for the 13 HLA (human MHC) and three mouse MHC class II alleles. This data set was also used in the recent study (Nielsen et al. 2007) for quantitative prediction of MHC class II peptide binding. The details of the data set are provided in Table 1.

Table 1 The data set used in this study. Peptide data for the 13 HLA-DR and 3 mouse H2-IA alleles are downloaded from http://www.cbs.dtu.dk/suppl/immunology/NetMHCII/php

Choosing individual predictors

Any predictor that is capable of assigning a predictive score to a peptide sequence can be employed as an individual predictor in our system. The six methods listed below were selected in this study. The coverage of their predictions is summarized in Table 2.

Table 2 The coverage of predictions for the six methods

ARB predictions

The ARB predictions were obtained using a default parameter setting for the ARB web server (http://tools.immuneepitope.org/tools/matrix/iedb_input?matrixClass=II). Each peptide is assigned a predictive score.

SVRMHC predictions

The SVRMHC (Wan et al. 2006) predictions were obtained using a default parameter setting for the SVRMHC web server (http://svrmhc.umn.edu/SVRMHCdb). This is a support vector machine regression-based method that makes predictions of the exact binding affinity of the peptide. The server returns pIC50 prediction scores for each 9-mer within the query peptide, and the maximum score was assigned as the binding pIC50 prediction value for the query peptide.

RANKPEP predictions

The RANKPEP (Reche et al. 2004) predictions were made by submitting peptides to the web server (http://bio.dfci.harvard.edu/Tools/rankpep.html) with default parameters. This method predicts binding peptides based on the scores calculated from a PSSM. The PSSMs are not available publicly, but the server returns a predictive score for each peptide.

Gibbs Sampler predictions

The PSSMs were obtained by submitting the binding peptides with default parameter settings to the web server (http://www.cbs.dtu.dk/biotools/EasyGibbs/). Gibbs Sampler (Nielsen et al. 2004) is an advanced motif sampler method based on the Gibbs sampling technique, which efficiently samples the possible alignment space of binder sequences. For each alignment, a log-odds weight matrix is calculated for the identified binding core subsequences. This matrix serves as the PSSM for the computation of a score for a 9-mer. It should be noted that Gibbs Sampler requires only binding peptides (with IC50 < 500 nM) for the construction of PSSMs.

ProPred predictions

The PSSMs of ProPred (Singh and Raghava 2001) were obtained from its website (http://www.imtech.res.in/raghava/propred/page4.html). This predictor uses the quantitative matrices from 51 HLA-DR alleles for the prediction of MHC class II binding peptides. These matrices were generated from a pocket profile database previously described (Sturniolo et al. 1999) and covered the majority of human HLA-DR specificity. The matrices are the same as the ones in TEPITOPE (Sturniolo et al. 1999).

LP-top2 predictions

The LP method (Murugan and Dai 2005) was motivated by a text mining model. This LP-based iterative learning model enables the use of both binding and nonbinding peptides for the detection of the binding cores from a set of putative binding cores and for the construction of the predictor simultaneously. The outcome of this predictor is a PSSM that can be used to score a 9-mer. The PSSMs were obtained by training the algorithm (Murugan and Dai 2005) on the binding and nonbinding peptides of each allele. Binding peptides were identified with an IC50 binding threshold of 500 nM. LP-top2 was selected for this study among several variants of the LP method because of its superior performance.

In summary, each of the former three predictors discussed above returns a predictive score that corresponds to the actual binding affinity for a peptide, while each of the three latter predictors returns a PSSM of size 20 by 9 for a set of training peptides of a specific allele. This PSSM will be used to calculate the score of each amino acid at each position of a 9-mer. The final score of a peptide is the maximum score over all overlapping 9-mers in the peptide. The scores derived from the latter three methods are not the actual binding affinity of a peptide; however, their magnitudes should correlate with the strength of the binding.
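The PSSM-based scoring described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes that the score of a 9-mer is the sum of the matrix entries for its residues, and the function names and the toy matrix values are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CORE_LENGTH = 9


def score_core(core, pssm):
    """Sum the PSSM entries of a 9-mer (assumed additive scoring)."""
    return sum(pssm[residue][position] for position, residue in enumerate(core))


def score_peptide(peptide, pssm):
    """Return the maximum score over all overlapping 9-mers of the peptide."""
    if len(peptide) < CORE_LENGTH:
        raise ValueError("peptide must contain at least nine residues")
    starts = range(len(peptide) - CORE_LENGTH + 1)
    return max(score_core(peptide[i:i + CORE_LENGTH], pssm) for i in starts)


# Toy 20 x 9 matrix (hypothetical values): zero everywhere except a reward
# for leucine at core position 1 and lysine at core position 9.
toy_pssm = {aa: [0.0] * CORE_LENGTH for aa in AMINO_ACIDS}
toy_pssm["L"][0] = 2.0
toy_pssm["K"][8] = 1.5

# Only the 9-mer starting at the leucine places both rewarded residues.
example_score = score_peptide("AALGGGGGGGKAA", toy_pssm)
```

Taking the maximum over all windows reflects the idea that a single nine-residue core within the longer peptide is responsible for binding.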

Several online predictors were not included in our study for various reasons. The current version of SVMHC (Donnes and Elofsson 2002; Donnes and Kohlbacher 2006) also makes MHC class II binding predictions. However, it uses the same matrices published in TEPITOPE (Sturniolo et al. 1999). Because those matrices were also used in ProPred, SVMHC was not selected. MHCPred (Doytchinova and Flower 2003) has an online prediction service; however, it covers only three alleles in our data set. Therefore, it was not included. The web server of MULTIPRED (Zhang et al. 2005) predicts eight HLA-DR variants. Because it only covers five alleles in our data set, it was not selected. For the other methods mentioned in the “Introduction,” the exclusion from this study was mainly due to lack of access to either the online predictors or the programs.

Recently, a new online MHC class II binding predictor was released (Nielsen et al. 2007). This method uses a novel stabilization matrix alignment method that allows for direct prediction of binding affinity. A comprehensive computational study has shown that it outperformed the other state-of-the-art MHC class II quantitative prediction methods. Ideally, this method would be included as an individual predictor in our current study. However, because the goal of this work is to demonstrate the effectiveness of the integrative system rather than to identify the best individual predictor, its exclusion does not affect the main results of this study.

The PM predictor

The PM predictor is based on a probabilistic model that combines prediction scores of a peptide from each predictor into a consensus score. The consensus score depends on the probability distribution of scores. A threshold has to be determined so that peptides with consensus scores above this threshold are predicted as binding and peptides with consensus scores below this threshold are correspondingly predicted as nonbinding. Figure 1 illustrates the framework of the PM predictor method. The details of this predictor are described as follows.

Fig. 1 Illustration of the framework for building the PM predictor

Calculation of the consensus score

Given a peptide, predictive scores are assigned by each of the m predictors. If there are n peptides in a test set, then the total number of scores from all individual predictors is nm. To consolidate the results from the m predictors, a consensus score is defined for a peptide. This consensus score provides the likelihood of a peptide to be classified into the binding class or the nonbinding class according to the information obtained from the probability distribution of scores for each individual predictor. The consensus score over all predictors is defined as the product of the likelihoods. If the estimations of the probability distributions of the scores for binding and nonbinding peptides are accurate, then the consensus scores of the binding peptides and nonbinding peptides should be grouped into two distinct intervals. This grouping would allow for a simple prediction by using a prescribed threshold of the consensus score. The details of the calculation of the consensus scores are presented in the Appendix.
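The exact definition of the consensus score is given in the Appendix, which is not reproduced here. The sketch below is therefore only one plausible reading of "the product of the likelihoods": it assumes that each predictor contributes the ratio of the binder likelihood to the nonbinder likelihood of its score, that both likelihoods are normal densities (the assumption stated in the next subsection), and that the product is accumulated as a sum of logarithms for numerical stability. All names and parameter values are hypothetical.

```python
import math


def normal_logpdf(x, mean, std):
    """Log density of a normal distribution evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def consensus_log_score(scores, binder_params, nonbinder_params):
    """Sum over predictors of log[ p(score | binder) / p(score | nonbinder) ].

    This is the logarithm of a product of likelihood ratios (an assumed
    form, not necessarily the formula in the Appendix).
    """
    total = 0.0
    for s, (mb, sb), (mn, sn) in zip(scores, binder_params, nonbinder_params):
        total += normal_logpdf(s, mb, sb) - normal_logpdf(s, mn, sn)
    return total


# Hypothetical (mean, std) pairs for m = 2 predictors.
binder_params = [(2.0, 1.0), (0.8, 0.2)]
nonbinder_params = [(0.0, 1.0), (0.4, 0.2)]

high = consensus_log_score([2.0, 0.8], binder_params, nonbinder_params)
low = consensus_log_score([0.0, 0.4], binder_params, nonbinder_params)

# A peptide is called binding when its score exceeds a chosen threshold.
threshold = 0.0
predicted_binding = [score > threshold for score in (high, low)]
```

With well-separated score distributions, peptides that look like binders to most predictors receive large positive values and the rest receive negative values, which is what makes a single threshold sufficient.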

Estimating distributions of scores

In this study, an important assumption was made about the distributions of scores obtained from each predictor: scores from binding peptides and from nonbinding peptides each follow a normal distribution, and the two distributions have distinct means. The lowest and the highest 2.5% of the binder and nonbinder scores were dropped to exclude the influence of outliers on the estimation of the distribution parameters. Only the remaining 95% of the scores were used in the rest of the training procedure. With this assumption on the distributions, the estimation of the distribution parameters is straightforward. For each individual predictor, we calculated: (1) the mean and the standard deviation of scores for the binding peptides and (2) the mean and the standard deviation of scores for the nonbinding peptides. These parameters characterize the score distributions for each individual predictor.
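The trimming and parameter estimation can be sketched as below. The function names are hypothetical, and two details are assumptions because the text does not specify them: the number of scores dropped at each end is rounded down, and the sample (n − 1) standard deviation is used.

```python
import statistics


def trim_outliers(scores, fraction=0.025):
    """Drop the lowest and highest `fraction` of the sorted scores."""
    ordered = sorted(scores)
    k = int(len(ordered) * fraction)  # rounded down (assumption)
    return ordered[k:len(ordered) - k]


def fit_normal(scores, fraction=0.025):
    """Return (mean, standard deviation) of the trimmed scores."""
    kept = trim_outliers(scores, fraction)
    return statistics.mean(kept), statistics.stdev(kept)


# Hypothetical binder scores for one predictor: 38 regular values plus
# one extreme value at each end, 40 scores in total.
binder_scores = [float(v) for v in range(1, 39)] + [-100.0, 500.0]

kept_scores = trim_outliers(binder_scores)
binder_mean, binder_std = fit_normal(binder_scores)
```

The same two functions would be applied separately to the nonbinder scores, giving four parameters per predictor.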

The area under the receiver operating characteristic curve (Aroc; Swets 1988) over five different training and test sets (Nielsen et al. 2007) was used for the evaluation of the PM predictor. More specifically, the probabilities were first determined from peptides in the training set, and then the consensus score was computed for each peptide in the test set. The Aroc value was subsequently calculated for the test set. This procedure was iterated through all five different training and test sets, and the average Aroc value was computed.
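For reference, the Aroc of a test set can be computed directly from the scores, using the standard equivalence between the area under the ROC curve and the probability that a randomly chosen binder outscores a randomly chosen nonbinder (ties counted as one half). The sketch below uses hypothetical function names and toy scores and is not the evaluation code of the study.

```python
def aroc(binder_scores, nonbinder_scores):
    """Area under the ROC curve from pairwise comparisons of scores."""
    wins = 0.0
    for b in binder_scores:
        for n in nonbinder_scores:
            if b > n:
                wins += 1.0
            elif b == n:
                wins += 0.5
    return wins / (len(binder_scores) * len(nonbinder_scores))


def average_aroc(folds):
    """Mean Aroc over test folds; each fold is (binder_scores, nonbinder_scores)."""
    return sum(aroc(b, n) for b, n in folds) / len(folds)


# Toy consensus scores for two test folds (hypothetical values).
fold_a = ([0.9, 0.8, 0.4], [0.5, 0.3])   # 5 of 6 pairs ranked correctly
fold_b = ([0.7, 0.6], [0.2, 0.1])        # perfect separation

aroc_a = aroc(*fold_a)
mean_aroc = average_aroc([fold_a, fold_b])
```

The pairwise form is quadratic in the number of peptides, which is acceptable for data sets of this size; a rank-based implementation would be preferable for much larger sets.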

The majority voting algorithm

To compare the performance of different meta-predictors, the Majority Voting algorithm was also implemented. If a peptide is scored above a specified threshold \(\sigma_i\) by the ith predictor, the predictor casts a vote for that peptide to be binding; otherwise, the peptide is voted to be nonbinding. Once all m votes have been cast, the prediction is decided by the majority of the votes. If m is even, a rule for breaking ties is needed.

The Majority Voting algorithm used percentiles of the peptides' scores in the training folds as the thresholds of the individual predictors for testing. For each of the m predictors, the scores of peptides in the four training folds were sorted. The percentiles of the sorted scores were then used as the thresholds. Each top zth percentile vector, \(\sigma = (\sigma_1, \ldots, \sigma_m)\), yielded predictions for peptides in the test fold based on the majority votes. By varying z, an Aroc value can be calculated for the test fold. The average Aroc value calculated from the five different training and test folds was reported.
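The voting scheme can be sketched as follows. The function names are hypothetical, and two choices are assumptions that the text leaves open: how the top zth percentile is mapped to an index in the sorted training scores, and the tie-breaking rule (here a tie is resolved as nonbinding).

```python
def percentile_threshold(training_scores, z):
    """Score at the top z-th percentile of the training scores (0 < z <= 100)."""
    ordered = sorted(training_scores, reverse=True)
    index = int(len(ordered) * z / 100.0) - 1
    index = max(0, min(len(ordered) - 1, index))  # keep index in range
    return ordered[index]


def majority_vote(scores, thresholds):
    """Return True (binding) if more than half of the predictors vote binding."""
    votes = sum(1 for s, t in zip(scores, thresholds) if s > t)
    return 2 * votes > len(thresholds)  # ties count as nonbinding (assumption)


# Toy training scores for m = 3 predictors on different scales.
training = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [0.1, 0.2, 0.3, 0.4],
]

# Thresholds at the top 50th percentile of each predictor.
thresholds = [percentile_threshold(scores, 50) for scores in training]

call_a = majority_vote([4, 35, 0.1], thresholds)   # two of three votes
call_b = majority_vote([2, 35, 0.1], thresholds)   # one of three votes
```

Because each predictor is thresholded at the same percentile rather than the same raw value, predictors with different score scales can be combined without rescaling.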

The determination of the best threshold δ in the PM model and of the zth percentile vector \(\sigma = (\sigma_1, \ldots, \sigma_m)\) in the Majority Voting algorithm depends on the selected criterion. When an appropriate criterion is chosen, they can be optimized through a cross-validation procedure. For example, one may prefer a threshold that produces approximately equal sensitivity and specificity.

Results and discussion

The performance of the PM predictor was compared to those of the Majority Voting algorithm and the six individual predictors using the data set described above. We conducted 1,000 bootstrapping experiments for the Majority Voting algorithm and the PM predictor.
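The text does not state what was resampled in each bootstrapping experiment. The sketch below shows one common variant under explicit assumptions: the binder and nonbinder scores of a test set are resampled with replacement, separately for each class, and the mean and standard deviation of the resulting Aroc values are reported. Function names, the seed, and the toy scores are hypothetical.

```python
import random
import statistics


def aroc(binders, nonbinders):
    """Area under the ROC curve from pairwise comparisons (ties count 0.5)."""
    wins = sum(1.0 if b > n else 0.5 if b == n else 0.0
               for b in binders for n in nonbinders)
    return wins / (len(binders) * len(nonbinders))


def bootstrap_aroc(binders, nonbinders, replicates=1000, seed=0):
    """Mean and standard deviation of Aroc over bootstrap replicates."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    values = []
    for _ in range(replicates):
        # Resample each class separately so that neither class is ever empty.
        b = rng.choices(binders, k=len(binders))
        n = rng.choices(nonbinders, k=len(nonbinders))
        values.append(aroc(b, n))
    return statistics.mean(values), statistics.stdev(values)


# Toy consensus scores (hypothetical values).
binders = [0.9, 0.8, 0.7, 0.4]
nonbinders = [0.5, 0.3, 0.2, 0.1]

mean_value, std_value = bootstrap_aroc(binders, nonbinders)
```

A smaller standard deviation across replicates indicates that the Aroc estimate is less sensitive to the particular peptides in the test set.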

Table 3 summarizes the results obtained. In all cases, the performance was evaluated in terms of the Aroc value. The PM predictor demonstrated a higher accuracy than that of the Majority Voting algorithm for 14 out of 16 alleles used in the computational experiments. The only exceptions were the two mouse alleles H2-IAd and H2-IAs, for which the accuracies of the PM predictor are slightly lower than those of the Majority Voting algorithm. The average Aroc value (0.949) of the PM predictor over all tested alleles is slightly higher than that (0.936) given by the Majority Voting algorithm (p value 0.002 for the one-tailed t test). The average standard deviation (0.007) of the Aroc values in the 1,000 bootstrapping experiments for the PM predictor, over all tested alleles, was lower than that (0.011) of the Majority Voting algorithm (p value 3.00 × 10^−6 for the one-tailed t test), indicating greater robustness in the predictions of the former.

Table 3 Summary of the Aroc values for the six individual predictors and the two integrative methods

The PM predictor outperformed every individual predictor for 12 out of 16 alleles. For the two alleles DRB1*0404 and DRB4*0101, the Aroc values of the PM predictor are comparable with those of LP-top2 (0.991 vs 0.994 and 0.970 vs 0.975, respectively); for H2-IAd, the Aroc value is lower than that of LP-top2 (0.910 vs 0.945), and for H2-IAs, the Aroc value is identical to that of LP-top2.

The average Aroc value (0.949) of the PM predictor over all tested alleles is higher than those of the individual predictors: ARB (0.757, p value 1.75 × 10^−9), SVRMHC (0.678, p value 1.00 × 10^−4), RANKPEP (0.665, p value 1.57 × 10^−9), ProPred (0.735, p value 8.09 × 10^−10), Gibbs Sampler (0.851, p value 2.19 × 10^−5), and LP-top2 (0.913, p value 3.47 × 10^−2). All p values were derived from one-tailed t tests.

Because most of the online predictors do not allow for training of a new model based on training data submitted by the users, we did not obtain the Aroc values for individual predictors in a cross-validated fashion. The PSSMs of Gibbs Sampler and LP-top2 were obtained by training on the entire data set only once. Therefore, the Aroc values of the Gibbs Sampler and LP-top2 actually represent the training performance. Similarly, the ARB models (Bui et al. 2005) were trained using the quantitative binding data contained within the IEDB database. For this reason, higher Aroc values for the Gibbs Sampler, LP-top2, and ARB predictions were obtained. On the other hand, SVRMHC was trained on relatively small sets of quantitative peptide binding data contained within the AntiJen database (Toseland et al. 2005), and its performance could probably be improved if it were retrained on the data used here. The PSSMs for ProPred and the predictions of RANKPEP were obtained directly from the websites, and the peptides used for their training are not available. Therefore, the performances of the individual predictors presented in the current study should not be compared directly with one another. A rigorous comparison of the state-of-the-art methods can be found elsewhere (Nielsen et al. 2007).

Because the goal of our study is to construct an integrative system that outperforms the individual predictors, we first compared the integrative systems with and without a poorly performing predictor. MHC-BPS (Cui et al. 2006, 2007) was not included as an individual predictor in our system because of its low performance. However, we used this predictor to investigate how such a predictor would affect the performance of the integrative systems. The results are summarized in Table 4. The MHC-BPS web server (http://bidd.cz3.nus.edu.sg/mhc/) covers DRB1*0101, DRB1*0404, DRB1*0701, DRB1*0901, DRB1*1101, DRB1*1501, and DRB5*0101. The Aroc values are 0.470, 0.550, 0.555, 0.617, 0.562, 0.617, and 0.594, respectively, which yields an average Aroc value of 0.566, a fairly low figure compared with those of the other predictors used in this work. The addition of MHC-BPS to the six predictors already included in the integrative systems resulted in Aroc values of 0.930 and 0.943 for the Majority Voting algorithm and the PM predictor, respectively, which are slightly smaller than the corresponding Aroc values of 0.936 and 0.949 for the same two integrative predictors without MHC-BPS. This result implies that the performance of the integrative systems may not be affected drastically by an individual predictor with relatively low performance. Similar behavior is also observed in Tables 5 and 6, where the Gibbs Sampler and LP-top2 were removed from the system.

Table 4 Summary of the Aroc values for the seven individual predictors and the two integrative methods
Table 5 Summary of the Aroc values for the five individual predictors and the two integrative methods
Table 6 Summary of the Aroc values for the four individual predictors and the two integrative methods

We next investigated whether the improved performance of the two integrative predictors was driven by the overfitted Gibbs Sampler and LP-top2. Accordingly, we excluded these two predictors and evaluated the performance of the Majority Voting algorithm and the PM predictor. As before, we considered two cases: with and without MHC-BPS in the integrative systems. The results are summarized in Tables 5 and 6. The average Aroc value (0.799) of the PM predictor is slightly higher than that (0.761) of the Majority Voting algorithm, with a p value of 0.128. The average Aroc value (0.799) of the PM predictor is higher than those of ARB (0.757 with p value 5.3 × 10^−2), SVRMHC (0.678 with p value 2.9 × 10^−3), RANKPEP (0.665 with p value 6.9 × 10^−5), and ProPred (0.735 with p value 6.93 × 10^−3). With the inclusion of MHC-BPS, the Aroc values for both of the integrative systems decreased very slightly, from 0.799 to 0.794 for the PM predictor and from 0.761 to 0.759 for the Majority Voting algorithm. However, we still observed a higher average Aroc value for the PM predictor compared to those of the individual predictors. More precisely, the average Aroc value (0.794) of the PM predictor is higher than those of ARB (0.757 with p value 7.6 × 10^−2), SVRMHC (0.678 with p value 3.8 × 10^−3), MHC-BPS (0.566 with p value 1.2 × 10^−7), RANKPEP (0.665 with p value 1.2 × 10^−4), and ProPred (0.735 with p value 1.2 × 10^−2). Therefore, we conclude, from the results in Tables 3 to 6, that the PM predictor is reliable and, on average, performs better than every individual predictor included in the system. Although the Majority Voting algorithm has a slightly lower performance than that of the PM predictor, a similar conclusion holds.

A comparison with the meta-prediction method proposed recently by our group (Huang et al. 2006, 2007) was not included in this study because of the long training time that method requires when a relatively large number of individual methods are considered. However, a preliminary study using the three methods ProPred, Gibbs Sampler, and LP-top2 indicated that its performance was inferior to that of the PM predictor. Therefore, we conclude that the previous meta-predictor is not as competitive as the PM predictor when a large number of predictors are used in the system. As mentioned above, there are two algorithms (Mallios 2003; Zhang et al. 2005) that make consensus predictions for MHC class II binding. However, the Mallios algorithm is not available, and the prediction provided by the website of Zhang et al. only covers six HLA-DR alleles. Accordingly, we did not compare our model with these two algorithms. In addition, we did not include in this study the recently published method NetMHCII (Nielsen et al. 2007). It is reasonable to expect that including it would improve the performance of both integrative predictors, as NetMHCII was shown to outperform the other state-of-the-art MHC class II prediction methods.

When building the PM predictor, we assumed that the scores obtained from the individual predictors are normally distributed. This assumption greatly simplified the estimation of the parameters of the probability distributions of the predicted scores. However, this assumption may not be valid for the scores of some alleles. In such cases, the PM predictor may exhibit diminished performance. This limitation can be overcome by incorporating appropriate distributions in the framework of the PM predictor.

Although the efficacy of the integrated framework has only been demonstrated through an application on the MHC class II binding predictions, this method can be readily extended to the MHC class I binding prediction. Similarly, the framework of the integrative system proposed recently for MHC class I binding (Trost et al. 2007) can be used for the MHC class II binding prediction. The performance comparison of these two systems on both MHC class I and class II binding predictions may lead to the construction of further improved prediction systems.

Conclusions

A new probabilistic meta-predictor (PM predictor) for MHC class II binding peptides has been developed that is based on the integration of predictions obtained from different methods. Using six state-of-the-art predictors, consistently enhanced performance of the PM predictor has been demonstrated in comparison with the individual methods using a data set including peptides from 13 HLA-DR and three mouse alleles from the IEDB database. The results also indicate that future improvement can be made through the incorporation of a more accurate probability estimate for the scores obtained from each individual predictor.