Cytotoxic T lymphocytes (CTLs) play an important role in the vertebrate immune system. CTLs recognize pathogens via peptides presented on major histocompatibility complex (MHC) molecules. If the peptides originate from an infectious virus, a CTL response can be stimulated, leading to the elimination of virus-infected cells[1]. MHC-bound peptides are called epitopes, and they are usually composed of 8–20 amino acids. Epitope identification is an essential step toward synthetic vaccine development, since epitopes play an important role in the activation of the immune response[2]. Epitopes are traditionally identified by synthesizing a large number of nonapeptides and subsequently performing affinity assays. Peptides with high affinity for MHC proteins are considered potential epitopes. However, developing a new vaccine with these traditional methods is time-consuming and laborious. To avoid this bottleneck, computational methods can be applied to search for candidate peptides and identify promising new epitopes.

Because of the importance of vaccines for humans, we focus on human MHCs, which are referred to as human leukocyte antigens (HLAs). There are three classes of HLAs: I, II, and III. Epitopes presented on HLA class I molecules are recognized by CTLs. HLA class I proteins can be categorized into three types according to their genes: HLA-A, HLA-B, and HLA-C. A majority of previous studies have focused on the HLA-A*02:01 allele because it is the most frequent allele of the A2 supertype in the Northeast Asian and Caucasian populations[3]. Typically, an HLA-A*02:01 epitope consists of 8–10 amino acids, and many studies have focused on nonapeptides in particular, that is, epitopes that are 9 residues long[4–6]. Figure 1A shows the nonapeptide epitope LLFGYPVYV fitted inside the HLA-A*02:01 binding cleft, which consists of two α-helices and one β-sheet (from PDB entry 1DUZ[7]). Figure 1B shows the conformation of the nonapeptide epitope LLFGYPVYV.

Figure 1

Visualization of the HLA-nonapeptide complex. (A) Crystal structure of the LLFGYPVYV-HLA-A*02:01 complex resolved by X-ray diffraction (PDB entry 1DUZ[7]). (B) Conformation of the nonapeptide extracted from the complex.

Early epitope binding prediction algorithms were based on allele-specific motifs[8, 9]. For example, for the HLA-A*02:01 allele, positions 2 and 9 of nonapeptides were found to be the most important for binding. The residues at these two positions were defined as classical anchor residues, typically leucine, valine, or isoleucine, since the MHC molecule forms hydrophobic sites for the amino acids at these positions[10]. Additionally, the residues at positions 1, 3, and 7 were identified as secondary anchor residues. Tyrosine and phenylalanine were mainly preferred at positions 1 and 3[11, 12]. Position 7 was suggested to be an amphipathic position that favors amino acids with small hydrophobic side chains, such as valine and alanine[13]. Unknown peptides that matched such allele-specific motifs were then predicted to be epitopes.

As more data became available, statistical methods could be applied to calculating a positional scoring matrix. In the matrix, an element was defined individually for each position and specific amino acids, resulting in an L × 20 coefficient matrix where L is the length of the peptide. In general, the matrix is used under the assumption that each amino acid in a peptide sequence independently contributes to a certain binding energy according to an element included in the positional scoring matrix. Overall binding energy is estimated from the summation of binding energies from all positions. There are several methods based on such a positional scoring matrix: for example, BIMAS[14], RANKPEP[15], Gibbs sampler[16], ARB[17], SMM[18], and SMMPMBEC[19].
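The additive assumption behind a positional scoring matrix can be illustrated with a minimal Python sketch. The toy matrix, its coefficients, and the function name `score_peptide` are hypothetical and do not come from BIMAS, SMM, or any other published method; the sketch only shows that the predicted binding score is the sum of one coefficient per position.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def score_peptide(peptide: str, matrix: List[Dict[str, float]]) -> float:
    """Sum the position-specific coefficient of each residue in the peptide.

    matrix[i][a] is the coefficient of amino acid a at position i, so the
    matrix has L rows (peptide length) and 20 entries per row.
    """
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the number of matrix rows")
    return sum(row[residue] for row, residue in zip(matrix, peptide))


# Hypothetical 3 x 20 matrix: all coefficients are 0.0 except three
# illustrative values (not taken from any real scoring matrix).
toy_matrix = [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(3)]
toy_matrix[0]["L"] = -0.5
toy_matrix[1]["L"] = -1.5
toy_matrix[2]["V"] = -1.0

toy_score = score_peptide("LLV", toy_matrix)  # -0.5 + -1.5 + -1.0
```

Because each position contributes independently, such a model cannot represent interactions between positions, which is one motivation for the machine learning approaches discussed next.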

Currently, the most successful approaches to epitope prediction use machine learning algorithms. These algorithms require sufficiently large training datasets to yield reliable results. Fortunately, the Immune Epitope Database (IEDB)[20] provides more than 100,000 MHC binding data points related to T-cell epitopes from infectious pathogens, experimental pathogens, and self-antigens (autoantigens). IEDB encompasses patent data from biotechnological and pharmaceutical companies, as well as direct submissions from research programs and partners. Because these experimental data are reliable, their volume provides a sufficient basis for developing good predictive models. Although IEDB is not the only database that provides such information, it has more entries than other existing databases, such as SYFPEITHI[21], FIMM[22], MHCPEP[23], MHCBN[24], and AntiJen[25]. NetMHC[26], a predictor based on artificial neural networks, used data from both IEDB and SYFPEITHI and performed very well. SVRMHC[27], a predictor based on support vector regression (SVR), used data from AntiJen and relied on LIBSVM[28] for its SVR implementation. An epitope predictor based on a hidden Markov model also exists[29].

The allele-specific motif, positional scoring matrix, and machine learning-based methods generally use only sequence information. Almost none of these methods can clearly explain how the physicochemical properties of amino acids affect binding affinity. In some cases, such as rare alleles, there are not enough peptides for training. Therefore, three-dimensional (3D) structure-based methods have been developed[30–32] to uncover binding mechanisms and address all forces related to binding affinity. However, such methods are currently less reliable than data-driven methods[33], because they usually require a number of crystal structures of MHC-peptide complexes, which are still not available in large numbers.

Currently, more than 2,000 HLA alleles have been identified. Searching for epitopes that bind to a large number of those alleles would be computationally expensive and time-consuming. Therefore, the concept of allele supertypes was developed by clustering alleles into groups based on overlapping epitopes[34–38]. Within each supertype, most of the alleles should share the same epitopes. Such shared epitopes are called ‘promiscuous epitopes’, and they show great promise for vaccine development because of their potential for a high level of population coverage.

In this study, we have developed a novel epitope prediction method named EpicCapo. Peptides were encoded numerically by combining information on the peptide-MHC (pMHC) contact sites with amino acid pairwise contact potentials (AAPPs), accompanied by a support vector machine (SVM)[39]. Our method’s performance was evaluated by using benchmark datasets and then compared with other high performance methods. In addition, identification of candidates of promiscuous CTL epitopes for influenza A viruses was demonstrated using the proposed method.

The H1N1 and H5N1 strains of influenza A virus have caused lethal flu in humans, as seen in the epidemics of 2005–2009. Although inactivated influenza vaccination is beneficial, more effective vaccines are still needed, particularly for elderly adults, who are more susceptible to viral infections[40]. Identification of promiscuous CTL epitopes might help address this need by providing candidate peptides from viral proteins for vaccine development.

Results and discussion

Comparison of peptide-encoding schemes

We compared our peptide-encoding scheme (Section Peptide data encoding) with binary peptide-encoding and with four amino acid descriptors (Table 1). The comparison (Table 2) showed that EpicCapo performed better than the other schemes in the classification tasks. It achieved the highest average area under the curve (AUC; 0.882), followed by binary encoding (0.879), DPPS (0.878), FASGAI (0.874), z-scale (0.858), and ISA/ECI (0.796). All standard deviations were less than 0.01. A comparison of receiver operating characteristic (ROC) curves is shown in Figure 2.

Table 1 Amino acid descriptors acknowledged in this study
Table 2 Classification result of peptide-encoding schemes
Figure 2

ROC curves of peptide-encoding schemes evaluated on a test set.

Although EpicCapo used the largest number of features (M × K = 360), more than binary encoding (180), DPPS (90), FASGAI (54), z-scale (45), and ISA/ECI (18), we confirmed that its high performance was not due to the larger number of features. The training dataset was separated into 40 datasets corresponding to the 40 AAPPs, each consisting of 9 features. The classification functions were fitted to these datasets, and the AAPPs were then ranked by AUC. The results in Table 2 suggest that even with only the three top-ranked AAPPs (27 features in total), the classification performance is comparable to that obtained with all AAPPs. These three top-ranked AAPPs were MICC010101, SIMK990101, and SIMK990105 (see Additional file 1). They have previously been used to identify native-like protein structures[44, 45], and were also identified as important AAPPs in our accompanying experiments.

Classification results of benchmark datasets

We applied EpicCapo to benchmark datasets of 34 MHC-I alleles[46]. As shown in Table 3, among the existing methods NetMHC performed the best, ahead of ARB, SMM, and SMMPMBEC. When all 40 AAPPs (360 features) were used, the average AUCs of EpicCapo were lower than those of NetMHC (by 0.1%–3.4%) in 13 allele datasets and higher (by 0.1%–9.3%) in 21 allele datasets. Standard deviations were low for almost all alleles; only several alleles had standard deviations larger than 0.01. These standard deviations could be decreased if more data were available. To improve the performance of our method, we developed EpicCapo+ by selecting an appropriate subset of AAPPs. As seen in Table 3, the performance of EpicCapo+ was higher than that of EpicCapo and comparable with that of NetMHC. The overall performance of EpicCapo+ is significantly higher than that of other methods according to a paired t-test (two-tailed) comparison of average AUCs from all alleles. The IDs of the AAPPs used for estimating the predictive models of EpicCapo+ are shown in Additional file 2.

Table 3 Classification results of 34 allele datasets

Improved HLA-A-nonapeptide binding predictive models

In this experiment, EpicCapo+ was further developed into EpicCapo+REF to improve the predictive performance and to identify the positions of nonapeptides that are important in pMHC binding (Section Improving the performance of HLA-A-nonapeptide binding predictive models). The IDs of the AAPPs used in EpicCapo+REF are shown in Table 4 (for more details on the AAPPs, see Additional file 1). The most important AAPPs identified by EpicCapo+ were IDs 14 (MICC010101) and 28 (SIMK990105), which were selected for 13 out of 14 alleles. IDs 11 (KESO980102) and 26 (SIMK990103) were also considered important, because they were selected for 9 out of 14 alleles. In previous studies that used AAPPs in MHC-I epitope prediction, AAPP IDs 19 (MIYS960102) and 2 (BETM990101) proved to be important in peptide-MHC binding predictions[5, 47, 48]. In our study, however, BETM990101 was not selected for the AAPP subset of any allele, and MIYS960102 was chosen for only two alleles (A*02:03 and A*02:06). Schueler-Furman et al.[47] also tested KESO980102 and compared it with MIYS960102; however, there was no significant improvement in the predictive performance. It is therefore notable that MICC010101, SIMK990105, KESO980102, and SIMK990103 were important for generating better predictive models in our study.

Table 4 Optimal subsets of AAPPs and number of selected features identified by EpicCapo+REF using 14 HLA-A allele datasets

We further investigated the features generated from the selected subset of AAPPs. In our peptide-encoding scheme, nine features were generated from each AAPP, corresponding to the nine amino acid positions in the nonapeptide. Previous studies have indicated that not all positions are important in pMHC binding[4, 10–12]. Therefore, some features corresponding to specific positions could be removed to improve the predictive performance.

The Relief-F algorithm[49] was employed in our study to rank the features according to their importance in separating the nonbinding peptides from the binding ones. The ranking results showed that the ten top-ranked features corresponded to positions 9 and 2 in most of the alleles, followed by positions 3, 1, or 7 (see Additional file 3). As indicated in Tables 3 and 4, the overall AUC value of EpicCapo+REF was higher than that of EpicCapo+; however, it was still slightly lower than that of NetMHC for the A*01:01 and A*02:06 alleles. In summary, EpicCapo+REF performed better than the other methods, with an average AUC of 0.935. Table 4 also shows the number of features selected after applying the Relief-F algorithm. These numbers differed among alleles. For the A*01:01, A*02:02, and A*06:01 alleles, no features were removed. However, for the A*02:06, A*24:02, A*29:02, and A*68:02 alleles, 20 or more features were removed. Interestingly, features corresponding to positions 5 and 8, which have previously been considered not to contribute significantly to HLA binding potentials, were still included in some of the selected feature subsets. We therefore assume that features corresponding to different positions are not independent, and that features from all positions should be provided as input to estimate the model with the highest performance (see Additional file 3).

Candidates of promiscuous epitopes for the development of influenza A viral vaccines

Since EpicCapo+REF performed better than the other existing methods when tested with 14 HLA-A allele datasets, it was further used to find candidates of promiscuous epitopes in influenza A viral sequences. Epitopes from protein sequences of H1N1 (A/PR/8/34), H3N2 (A/Aichi/2/68), H1N1 (A/New York/4290/2009), and H5N1 (A/Hong Kong/483/97) were identified using EpicCapo+REF. The prediction results for all influenza A strains, categorized by allele, are shown in Table 5. All 14 alleles were assigned to supertype groups using the supertype classification defined by previous studies[34–37]. The A*01:01 and A*26:01 alleles were assigned to the A1 group. The A*29:02 allele was assigned to an unidentified group. As shown in Table 5, there was a small number of predicted positive peptides in the A1 supertype. For example, in the case of H1N1 (A/PR/8/34), only one peptide was identified as positive for the A*26:01 allele. In contrast, there were considerably higher numbers of predicted positive peptides in the A2, A24, and A3 supertypes. Even the A*29:02 allele, which was assigned to an unidentified group, had a higher number of predicted positive peptides than the alleles in the A1 group. When promiscuous epitopes were identified from the overlapping epitopes of the four influenza A viral strains (Additional file 4), the A1 group rarely shared peptides with other groups. As shown in Additional file 4, the A*01:01 allele shared only one peptide (YSHGTGTGY) with A*29:02, and the A*26:01 allele shared the peptide DTVNRTHQY with A*29:02 and A*68:01. Moreover, the A*29:02 allele also shared peptides with the A2 and A3 groups, e.g., SMELPSFGV and QTYDWTLNR, respectively (Additional file 4). Therefore, A*29:02 can be considered a special group that links A1, A2, and A3 together. Furthermore, Doytchinova et al.[38] assigned A*29:02 to the A3 group. However, we did not find overlapping epitopes from the four influenza A viral strains for the A*24:02 allele, which is assigned to the A24 group. This suggests that A*24:02 differs from the other alleles considered here, which might be why most previous studies assigned it separately to the A24 group[34–37]. As shown in Additional file 4, 51 (67.1%) of the 76 identified epitopes were immunologically validated as positive, whereas 9 (11.8%) were validated as negative. No evidence of immunological validation could be obtained for the remaining 16 peptides (21.1%). These results indicate that our method identifies epitopes with high accuracy, given that most of the identified epitopes are supported by immunological experimental evidence. Moreover, even the epitopes without such immunological evidence might be considered candidates for new vaccine development.

Table 5 Prediction results of EpicCapo+REF using four influenza A strains categorized by specific alleles

Our results are in agreement with the study by Uchida[50], which identified promiscuous epitopes from influenza A H1N1 (A/PR/8/34), H3N2 (A/Aichi/2/68), H1N1 (A/New York/4290/2009), and H5N1 (A/Hong Kong/483/97). Uchida found experimentally confirmed CTL epitopes in the A2 group, and the epitopes identified by EpicCapo+REF in the A2 group were consistent with them (Table 6). In addition, we found promising candidates of promiscuous epitopes for the A1 and A3 groups, as shown in Additional file 4.

Table 6 Comparison of epitopes identified by EpicCapo+REF with the broadly protective influenza A viral epitopes identified by Uchida[50]

Although the overall performance of EpicCapo+REF was high, the method has two limitations. First, the length of input peptides must be exactly 9. In a future study, we will extend EpicCapo+REF to peptides of length 8–11. Second, input peptides must not contain special or ambiguous amino acids. Examples of special amino acids are U (selenocysteine) and O (pyrrolysine). Examples of ambiguous amino acids are B (asparagine or aspartic acid), Z (glutamine or glutamic acid), and J (leucine or isoleucine). EpicCapo+REF is not applicable to these amino acids because they are not included in the AAPPs.
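These two input constraints can be checked before prediction. The following Python sketch is only an illustration of such a check; the function name `is_supported_nonapeptide` is hypothetical and is not part of EpicCapo+REF.

```python
# The 20 standard amino acids; U, O, B, Z, and J are deliberately excluded.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_supported_nonapeptide(peptide: str) -> bool:
    """Return True if the peptide has exactly 9 standard residues."""
    return len(peptide) == 9 and set(peptide.upper()) <= STANDARD_AMINO_ACIDS


ok = is_supported_nonapeptide("LLFGYPVYV")          # 9 standard residues
too_short = is_supported_nonapeptide("LLFGYPVY")    # only 8 residues
special = is_supported_nonapeptide("LLFGYPVYU")     # contains selenocysteine (U)
ambiguous = is_supported_nonapeptide("BLFGYPVYV")   # contains ambiguous code B
```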


In this study, we have developed a novel method for epitope prediction. Peptides were encoded numerically, combining information of pMHC contact sites and amino acid pairwise contact potentials, accompanied by an SVM for estimating the predictive model. Our method achieved high performance in testing with benchmark datasets. In addition, our study identified a number of candidates of promiscuous CTL epitopes from four influenza A viral strains, consistent with previously reported immunological experiments. This consistency in results strongly supports the accuracy of our method. We speculate that our techniques may be useful in identifying promising candidates of promiscuous epitopes for the development of new vaccines.


Peptide data encoding

We propose a novel peptide-encoding scheme for machine learning algorithms. This scheme uses information on pMHC contact sites retrieved from the international ImMunoGeneTics information system, IMGT[51], the allele-specific positional scoring matrices developed by SMMPMBEC[19], and the AAPPs from AAindex[52].

The reference pMHC contact sites retrieved from IMGT were extended with additional MHC positions. The added positions were determined by observing the pMHC contact sites in 189 selected crystal structures of HLA-nonapeptide complexes collected from IMGT entries specific to the MHC-I receptor type. Whenever a structure contained new contact positions, these were added to the reference pMHC contact sites. The modified pMHC contact sites therefore include more HLA-nonapeptide contact positions than the reference, which was derived from only 74 crystal structures of HLA-nonapeptide complexes[51]. Using the modified pMHC contact sites should provide more reliable prediction results. Additional file 5 shows the reference and added pMHC contact site positions. This information served as a binding template between the peptide and the MHC. In NetMHCpan[53], the reference pMHC contact sites were used to extract a pseudosequence representing the given MHC molecule, so that sequence information from both the peptide and the MHC was taken into account during prediction. However, the pairs of amino acids formed between the MHC molecule and the peptide were not considered. Therefore, to generate a more informative predictive model, we used information about the pairs of amino acids at the interface between an MHC molecule and a nonapeptide, represented by AAPPs. In addition, the allele-specific positional scoring matrices developed by SMMPMBEC were used in our study. These matrices indicate how strongly a given amino acid is preferred or avoided at a specific position. Like NetMHCpan, SMMPMBEC did not use AAPPs. We showed that a proper selection of AAPPs could lead to higher prediction performance. The encoded data can be further used in classification or regression tasks with machine learning algorithms. In this study, we demonstrated the feasibility of the classification task by using the SVM implemented in the R package kernlab[39].

Here, we propose a novel scheme for encoding nonapeptides into input vectors of the SVM. Suppose $E(a_1, a_2)$ is an AAPP for the amino acids $a_1$ and $a_2$. If two or more types of AAPPs are available, we denote the $k$th type of AAPP by $E_k(a_1, a_2)$. We also denote the $i$th amino acid of the nonapeptide $n$ and the $j$th amino acid of the HLA by $u_i(n)$ and $v_j$, respectively. To combine the position-specific amino acid scores of the nonapeptides with the AAPPs, we define a score $S_{k,i}(n)$ for the $i$th position and the $k$th type of AAPP as follows:

$$S_{k,i}(n) = T_i\bigl(u_i(n)\bigr) \cdot \frac{\sum_{j=1}^{L} \delta_{ij}\, E_k\bigl(u_i(n), v_j\bigr)}{\sum_{j=1}^{L} \delta_{ij}},$$

where $L$ is the length of the HLA protein, $T_i(a)$ is the $i$th position score of the amino acid $a$ for the nonapeptides as described by SMMPMBEC, and $\delta_{ij}$ is an indicator variable that takes the value 1 if the $i$th amino acid of the nonapeptide and the $j$th amino acid of the HLA contact each other, and 0 otherwise. Here, the positional scoring matrix $T_i(a)$ is trained on training data, multiplied by −1 to reverse the order of values (so that a high positive value denotes a high preference of the position for an amino acid), and scaled into the range of 1 to 10, since we need to avoid loss of information when $T_i(a)$ equals zero. In fact, any range that does not include zero can be used; in this study, it is the range of 1 to 10. The scaling of the positional scoring matrices is shown in Additional file 6. Note that $\sum_{j=1}^{L} \delta_{ij}$ is the number of contact sites for the $i$th amino acid of a nonapeptide (see Additional file 5). Intuitively, this score represents the average pair potential over the contact sites, weighted by the position-specific amino acid score for nonapeptides. Let $K$ be the number of AAPPs available, and let $M$ be the length of the peptide, set to 9 throughout this study. Using this scoring scheme, we transform a nonapeptide $n$ into an $M \times K$-dimensional numerical vector whose $(M(k-1) + i)$th element is $S_{k,i}(n)$. For example, an encoded nonapeptide consists of 9 features if one AAPP is used, and 360 features if 40 AAPPs are used. Figure 3 illustrates an example of the data-encoding scheme for the first position of the nonapeptide.
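As a minimal sketch of the score defined above, the following Python function computes one feature per peptide position and per AAPP and lays the features out in the order described (all positions for the first AAPP, then all positions for the second, and so on). The data structures are assumptions made for illustration: `contacts[i]` is a hypothetical list of the HLA indices in contact with peptide position i, `position_scores[i]` holds the already scaled position scores, and each AAPP is passed as a callable. The toy inputs use a peptide of length 2 and invented values, not real AAindex potentials or IMGT contact sites.

```python
from typing import Callable, Dict, List, Sequence


def encode_peptide(
    peptide: str,
    hla: str,
    contacts: Sequence[Sequence[int]],
    position_scores: Sequence[Dict[str, float]],
    potentials: Sequence[Callable[[str, str], float]],
) -> List[float]:
    """Encode a peptide as an (M x K)-dimensional feature vector.

    Each feature is the mean contact potential between peptide residue i and
    its HLA contact residues, multiplied by the position score of residue i.
    """
    features: List[float] = []
    for potential in potentials:              # k-th AAPP
        for i, residue in enumerate(peptide):  # i-th peptide position
            sites = contacts[i]
            if not sites:
                raise ValueError(f"position {i} has no contact sites")
            mean_potential = sum(potential(residue, hla[j]) for j in sites) / len(sites)
            features.append(position_scores[i][residue] * mean_potential)
    return features


# Toy inputs (all values are invented for illustration).
toy_hla = "ALV"
toy_contacts = [[0, 1], [2]]  # position 0 contacts HLA 0 and 1; position 1 contacts HLA 2
toy_scores = [{"A": 2.0, "V": 1.0}, {"A": 1.0, "V": 3.0}]
toy_potentials = [
    lambda a, b: 1.0 if a == b else 0.0,  # hypothetical potential 1
    lambda a, b: 2.0,                     # hypothetical potential 2
]

toy_features = encode_peptide("AV", toy_hla, toy_contacts, toy_scores, toy_potentials)
```

With a nonapeptide and 40 potentials, the same loop would yield 9 × 40 = 360 features, matching the dimensionality stated in the text.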

Figure 3

Our peptide data-encoding scheme, using the first position of a nonapeptide as an example.

Our peptide-encoding scheme was compared with binary peptide-encoding and with four amino acid descriptors (Table 1) using the dataset reported by Bi and colleagues (supplementary information for Table S2 in[54]). This dataset consists of 1,998 HLA-A*02:01-restricted nonapeptides with known quantitative affinities. The dataset was randomly partitioned into a training set of 1,500 nonapeptides for estimating predictive models using the SVM, and a test set of 498 nonapeptides for validating the models. For our peptide-encoding scheme, the positional scoring matrix was trained on an external dataset downloaded from IEDB, consisting of 500 nonapeptides restricted to the HLA-A*02:01 allele (Additional file 7). These nonapeptides were included in neither the training set nor the test set. For binary peptide-encoding, each amino acid was encoded as a binary vector of length 20, resulting in a vector of length 180 for a nonapeptide. For the amino acid descriptors, the length of an encoded vector equals M times the length of the descriptor vector. The performance of the data-encoding schemes was evaluated in classification tasks using 10-fold cross validation. Throughout our experiments, the parameter C (cost of constraint violation), epsilon, and the type of kernel used for the SVM were 1, 0.1, and the radial basis kernel, respectively. The class of each nonapeptide was determined by using an IC50 affinity cutoff of 500 nM: nonapeptides with an IC50 below 500 nM were considered binders, and all others non-binders. The study by Moutaftsi et al.[55] showed that 90% of epitopes that could stimulate CD8+ T cell responses bound to MHC with affinities lower than 500 nM. The predictive performance was evaluated using five measures: overall accuracy (ACC), sensitivity (sens), specificity (spec), F-score (F1), and area under the curve (AUC) of the receiver operating characteristic curve. ACC, sens, spec, and F1 are defined as

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{sens} = \frac{TP}{TP + FN},$$
$$\mathrm{spec} = \frac{TN}{FP + TN},$$
$$F_1 = \frac{2 \times TP}{2 \times TP + FN + FP},$$

where TP, FP, TN, and FN are the numbers of overall true positives, false positives, true negatives, and false negatives, respectively.
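The four count-based measures and the 500 nM labeling rule can be written as a short Python sketch. The function names and the confusion-matrix counts in the example are hypothetical and are not results from this study.

```python
from typing import Dict


def is_binder(ic50_nm: float, cutoff_nm: float = 500.0) -> bool:
    """Label a peptide as a binder if its IC50 is below the cutoff (in nM)."""
    return ic50_nm < cutoff_nm


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute ACC, sens, spec, and F1 from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "sens": tp / (tp + fn),
        "spec": tn / (fp + tn),
        "F1": 2 * tp / (2 * tp + fn + fp),
    }


# Hypothetical counts, chosen only to illustrate the formulas.
toy_metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

The AUC is not included here because it depends on the ranking of the continuous decision values rather than on a single confusion matrix.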

Validation of predictive models using benchmark datasets

The performance of EpicCapo was validated by using benchmark datasets of 34 MHC-I alleles provided by Peters et al.[46]. In this experiment, the positional scoring matrices were trained on the training data within the cross-validation procedure. Twenty iterations of 5-fold cross validation were conducted to evaluate the AUCs of EpicCapo. We compared the results of our method with those of ARB, NetMHC, SMM, and SMMPMBEC.

EpicCapo was further developed as EpicCapo+ by selecting AAPPs. Each encoded allele dataset was initially separated into 40 datasets according to the 40 AAPPs. The classification task was performed for each dataset to calculate AUC using the SVM and using the same parameters as EpicCapo. Then, the 40 datasets were ranked by AUC from highest to lowest. Next, the classification task was performed again by adding the datasets of AAPPs one by one based on their rank. Finally, the optimal subset of AAPPs that led to the highest AUC was identified for each allele. The average AUCs of all alleles as calculated from EpicCapo+ were compared with those from EpicCapo and other methods using paired t-tests (two-tailed). For each allele, the AUCs from 20 iterations of 5-fold cross validation of EpicCapo and EpicCapo+ were compared with the maximum AUC among other methods by using t-tests (one-tailed, significance level = 0.01).
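The ranked, one-by-one addition described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `evaluate_subset` stands in for training and cross-validating the SVM on the features of the given AAPPs, and the AAPP names and AUC values in the example are invented.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def select_by_ranked_addition(
    single_scores: Dict[str, float],
    evaluate_subset: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Rank items by their individual AUC, then grow the subset in rank order.

    Returns the prefix of the ranking with the highest subset AUC.
    """
    ranked = sorted(single_scores, key=single_scores.get, reverse=True)
    best_subset: List[str] = []
    best_auc = float("-inf")
    for size in range(1, len(ranked) + 1):
        subset = ranked[:size]
        auc = evaluate_subset(subset)
        if auc > best_auc:  # keep the first subset that reaches the maximum
            best_subset, best_auc = list(subset), auc
    return best_subset, best_auc


# Hypothetical single-AAPP AUCs and a stand-in evaluator keyed by subset size.
toy_single = {"AAPP_a": 0.80, "AAPP_b": 0.85, "AAPP_c": 0.70}
toy_subset_auc = {1: 0.85, 2: 0.88, 3: 0.86}

toy_subset, toy_auc = select_by_ranked_addition(
    toy_single, lambda subset: toy_subset_auc[len(subset)]
)
```

Because only prefixes of the ranking are evaluated, the procedure needs at most one model fit per AAPP after the initial ranking, rather than a search over all possible subsets.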

Improving the performance of HLA-A-nonapeptide binding predictive models

To increase the performance of our predictive models, the positional scoring matrices used in this experiment were trained on datasets containing a larger number of nonapeptides. These matrices are available at[56]. After encoding the 14 HLA-A allele datasets using the downloaded matrices, EpicCapo+ was applied again to identify the optimal subsets of AAPPs. We used the Relief-F algorithm[49] implemented in the machine learning software Weka[57] to perform feature selection, ranking the features according to their importance in discriminating MHC binder peptides from non-binders. The default parameters provided by Weka were used, and 5-fold cross validation was conducted to evaluate feature importance. The best feature subsets were constructed by adding the features one by one, from the top-ranked feature to the last, in the classification task using the SVM. The AUC gradually increased with the addition of features until it reached its highest value. Features after this point were considered irrelevant and ignored. We named this method, which combines EpicCapo+ with the Relief-F algorithm, EpicCapo+REF.

Identification of candidates of promiscuous epitopes

EpicCapo+REF was further tested to identify candidates of promiscuous epitopes, i.e., nonapeptides that were predicted to be MHC binders for various HLA alleles, from the protein sequences of four influenza A viral subtypes: H1N1 (A/PR/8/34), H3N2 (A/Aichi/2/68), H1N1 (A/New York/4290/2009), and H5N1 (A/Hong Kong/483/97). These protein sequences were downloaded from the NCBI website. The nonapeptides were generated from these sequences by using a nonamer sliding window. Next, all of the generated nonapeptides were used as inputs in EpicCapo+REF predictive models. These models were estimated by using 14 HLA-A allele datasets, and each model was specific for each allele type. The identified epitopes were validated by cross-checking with the results of immunological experiments.
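The sliding-window step and the grouping of predicted binders across alleles can be sketched in Python. The function names, the toy sequence, and the predicted binder sets are hypothetical, and the threshold `min_alleles=2` is an assumption made for illustration, since the text does not fix how many alleles a peptide must bind to count as promiscuous.

```python
from typing import Dict, Iterable, List


def nonamers(sequence: str, width: int = 9) -> List[str]:
    """Return all overlapping windows of the given width, in sequence order."""
    seq = sequence.upper()
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def promiscuous_candidates(
    binders_by_allele: Dict[str, Iterable[str]],
    min_alleles: int = 2,  # assumed threshold, not specified in the text
) -> Dict[str, List[str]]:
    """Map each peptide predicted to bind at least min_alleles alleles to those alleles."""
    alleles_by_peptide: Dict[str, List[str]] = {}
    for allele, peptides in binders_by_allele.items():
        for peptide in set(peptides):
            alleles_by_peptide.setdefault(peptide, []).append(allele)
    return {
        peptide: sorted(alleles)
        for peptide, alleles in alleles_by_peptide.items()
        if len(alleles) >= min_alleles
    }


# Toy 14-residue sequence: yields 14 - 9 + 1 = 6 nonamers.
toy_windows = nonamers("MKTIIALSYIFCLV")

# Invented per-allele predictions, used only to show the grouping step.
toy_predictions = {
    "A*02:01": ["AAAAAAAAA", "CCCCCCCCC"],
    "A*02:06": ["AAAAAAAAA"],
}
toy_candidates = promiscuous_candidates(toy_predictions)
```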