MaER: A New Ensemble Based Multiclass Classifier for Binding Activity Prediction of HLA Class II Proteins
Abstract
Human Leukocyte Antigen class II (HLA II) proteins are crucial for the activation of the adaptive immune response. A high rate of polymorphism has been observed in HLA class II molecules. Hence, the accurate prediction of HLA II-peptide interactions is a challenging task that can both improve the understanding of immunological processes and facilitate decision-making in vaccine design. During the last decade, various computational tools have been developed, which mainly focused on predicting the binding activity of the different HLA II isotypes (such as DP, DQ and DR) separately. This fact motivated us to address the prediction of isotype binding propensity as a multiclass classification task. To this end, we have analysed a binding affinity dataset, which contains the interactions of 27 HLA II proteins with 636 variable-length peptides, in order to prepare new multiclass datasets of strong and weak binding peptides. Thereafter, a new ensemble based multiclass classifier, called Meta EnsembleR (MaER), is proposed to predict the activity of weak/unknown binding peptides by integrating the results of various heterogeneous classifiers. It preprocesses the training and testing datasets by making feature subsets and bootstrap samples, and creates diverse datasets using principal component analysis, which are then used to train and test MaER. The performance of MaER with respect to other existing state-of-the-art classifiers has been estimated using validity measures, ROC curves and gain value analysis. Finally, a statistical test, the Friedman test, has been conducted to judge the statistical significance of the results produced by MaER.
Keywords
HLA class II proteins, Machine learning, MHC, Peptide binding, T cell epitopes
1 Introduction
T-cells [1] are specialized immune cells that play a crucial role in the activation of the adaptive immune system. Once the HLA II proteins have established a stable binding with the exogenous peptide antigens (T-cell epitopes), they are transported to the extracellular domain of the antigen-presenting cells (APCs). The T-cell receptors (TCRs) located on the surface of the T-cells interact with the HLA II-peptide complexes, forming trimeric complexes responsible for the activation of CD4+ T-cells. HLA II proteins are encoded in three different genetic loci, HLA-DP, HLA-DQ and HLA-DR (also called isotypes), and consist of two separate protein chains, \(\alpha \) and \(\beta \) [2]. They contain an open-ended binding cleft which allows peptides to be accommodated in multiple binding frames [3]. This significantly increases the complexity of the binding prediction problem.
Different computational techniques have been developed to predict HLA class II binding. Among them, sequence-based and structure-based approaches are the most popular. Sequence-based methods include matrix models [4], binding motif recognition [5], artificial neural networks [6], quantitative matrices [7], hidden Markov models [8], support vector machines [9] and QSAR-based methods [10]. Structure-based methods involve threading algorithms [11], peptide docking [12] and molecular dynamics [13]. Sophisticated methods, such as an iterative meta-search algorithm [14] and ant colony search [15], have been developed to resolve the variable-length problem of HLA class II binding prediction. Apart from this, some recent approaches have also significantly outperformed more traditional methods [16, 17, 18].
Table 1. Statistics of the dataset used for MaER

2 Materials and Methods
2.1 Preparation of Dataset
The classification of peptides into multiple isotypes is defined with respect to a threshold value \(k\). A lower value of \(k\) increases the number of peptides defined as strong HLA binders. The statistics on the effect of varying \(k\), both in terms of isotype ratio and percentage of strong binders, are given in Table 1 and have been taken into account in the choice of \(k\). Ultimately, the value of \(k\) has been set to 0.15, in order both to maintain a comparable ratio among the isotypes and to give a balanced definition of strong and weak binding peptides.
Since the classification technique requires the same number of features for each peptide, a common length of 15 amino acids (AAs) is adopted. In the homogenized dataset, the edge AAs of peptides longer than 15 AAs are trimmed. The trimming is based on a careful comparative analysis of the less conserved residues within the original peptides. In order to represent the entire pool of 636 peptides in numerical form, 40 high-quality AA indices (HQI40) [20, 30, 31] are used. Therefore, each peptide is represented by a feature vector of length 15\(\times \)40=600. The block diagram of the experiment, including data generation, is given in Fig. 1(a).
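The encoding described above can be sketched as follows. This is only an illustration under stated assumptions: the actual HQI40 index values are not reproduced here, so a random placeholder table (`INDEX_TABLE`) stands in for them, and the names `encode_peptide`, `INDEX_TABLE` and `AA_ROW` are hypothetical. The trimming of longer peptides is not shown, since it depends on the comparative analysis mentioned in the text.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
N_INDICES = 40        # number of indices in HQI40
PEPTIDE_LENGTH = 15   # common peptide length adopted in the text

# Placeholder table: one row per amino acid, one column per index.
# ASSUMPTION: random numbers stand in for the real HQI40 values.
rng = np.random.default_rng(0)
INDEX_TABLE = rng.random((len(AMINO_ACIDS), N_INDICES))
AA_ROW = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_peptide(peptide: str) -> np.ndarray:
    """Concatenate the 40 index values of each residue into one vector."""
    if len(peptide) != PEPTIDE_LENGTH:
        raise ValueError(f"expected {PEPTIDE_LENGTH} residues, got {len(peptide)}")
    return np.concatenate([INDEX_TABLE[AA_ROW[aa]] for aa in peptide])


features = encode_peptide("ACDEFGHIKLMNPQR")
print(features.shape)  # (600,)
```

Each residue contributes a contiguous block of 40 values, so positions 0 to 39 of the vector describe the first residue, positions 40 to 79 the second, and so on.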
2.2 The Proposed Meta EnsembleR
Meta EnsembleR (MaER) is an ensemble based classifier that uses four heterogeneous base classifiers: support vector machine, naive Bayes, decision tree and \(K\)-nearest neighbor. It creates diverse sets of training points by preparing different non-overlapping sets of features. In order to describe MaER, some notation is introduced here. Let \(\mathcal {P}\) be a matrix of size \(n \times M\), which contains the values of the \(M\) input attributes or features for each training instance, and let \(\mathcal {Q}\) be a one dimensional column vector containing the output attribute of each training instance in \(\hbar \). Therefore, \(\hbar \) can be expressed as the horizontal concatenation of \(\mathcal {P}\) and \(\mathcal {Q}\), i.e., \(\hbar = [\mathcal {P} \mathcal {Q}]\). Also let \(\mathcal {F}\) = \(\lbrace \mathcal {P}_1, \mathcal {P}_2,\ldots , \mathcal {P}_M \rbrace \) be the set of features (\(M\ge 4\)) and \(\mathcal {T}\) be the ensemble size. Thus, the training set consists of \(n\) labelled instances \(\hbar =\lbrace p_j, q_j\rbrace _{j=1}^n\), in which each instance \((p_j,q_j)\) is described by \(M\) input attributes or features and an output attribute, i.e., \(p_j \in \mathbb {R}^M\) and \(q_j \in \mathbb {R}\), where \(q_j\) takes a value from the label space \(\lbrace L_1,L_2,\ldots ,L_c \rbrace \). In a classification task, the goal is to use only the information from \(\hbar \) to construct a classifier which performs well on unseen data. Note that in MaER, the feature set \(\mathcal {F}\) = \(\lbrace \mathcal {P}_1, \mathcal {P}_2,\ldots , \mathcal {P}_M \rbrace \) is split into \(\mathcal {S}\) feature subsets, where \(\mathcal {S} \in [ 2, \lfloor \frac{M}{2} \rfloor ]\). Also, one classifier is randomly selected from the pool of classifiers for each ensemble member \(t = 1, 2,\ldots ,\mathcal {T}\). In order to construct the training and testing datasets for a classifier in the ensemble, the following steps are performed.

Step1: Randomly split \(\mathcal {F}\) into \(\mathcal {S}\) subsets, denoted \(\mathcal {S}_{s,t}\) for simplicity, where \(t\) indexes the ensemble member and \(s\) indexes the current attribute or feature subset. Since \(\mathcal {S} \in [ 2, \lfloor \frac{M}{2} \rfloor ]\), there are at least 2 subsets, each containing at least 2 features.

Step2: Repeat the following steps for each subset, i.e., for \(s = 1, 2,\ldots ,\mathcal {S}\).

(a) A new submatrix \(\mathcal {P}_{s,t}\) is constructed from the columns of matrix \(\mathcal {P}\) that correspond to the features in the current subset.

(b) From this new submatrix a bootstrap sample \(\mathcal {P}^{\prime }_{s,t}\) is drawn, where the sample size is generally smaller than the number of rows of \(\mathcal {P}_{s,t}\).

(c) Thereafter, PCA is applied to \(\mathcal {P}^{\prime }_{s,t}\) and the coefficients of all computed principal components are stored in a new matrix \(\mathcal {C}_{t,s}\).


Step3: In order to obtain a matrix whose size matches the number of features, arrange the matrices \(\mathcal {C}_{t,s}\) into a block diagonal sparse matrix \(\mathcal {D}_t\). Once the coefficients in \(\mathcal {C}_{t,s}\) are placed into the block diagonal sparse matrix \(\mathcal {D}_t\), the rows of \(\mathcal {D}_t\) are rearranged so that their order corresponds to that of the original attributes in \(\mathcal {F}\). During this rearrangement, columns containing only zero values are removed from the sparse matrix.

Step4: The rearranged rotation matrix \(\mathcal {D}^r_t\) is then used to form \([\mathcal {P}\mathcal {D}^r_t; \mathcal {Q}]\) and \([\mathcal {I}\mathcal {D}^r_t]\) as the training and test sets of the classifier, respectively, where \(\mathcal {I}\) is a given test sample.
Step5: In the testing phase, let \( MaER _{v,t}(\mathcal {I}\mathcal {D}^r_t)\) be the posterior probability produced by the classifier \( MaER _t\) for the hypothesis that \(\mathcal {I}\) belongs to class \(L_v\). The confidence for each class is then calculated as the average posterior probability over the base classifiers:
$$\begin{aligned} \mathcal {Q}_v (\mathcal {I})=\frac{1}{\mathcal {T}} \sum _{t=1}^{\mathcal {T}} MaER _{v,t} (\mathcal {I}\mathcal {D}^r_t), \quad \text {where } v=1,2,\ldots ,c \end{aligned} \tag{3}$$
Thereafter, \(\mathcal {I}\) is assigned to the class with the largest confidence. Note that all five steps are repeated for \(t= 1,2,\ldots ,\mathcal {T}\).
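The five steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_rotation_matrix` and `average_posteriors`, the bootstrap fraction of 0.75 and the toy data are all hypothetical, PCA is computed through a singular value decomposition, and the posterior probabilities are hand-written stand-ins for the outputs of trained base classifiers.

```python
import numpy as np


def build_rotation_matrix(P, n_subsets, rng, sample_fraction=0.75):
    """Steps 1-3: feature split, bootstrap, PCA, block-diagonal assembly."""
    n, M = P.shape
    # Step 1: random split of the M feature indices into disjoint subsets.
    subsets = np.array_split(rng.permutation(M), n_subsets)
    D = np.zeros((M, M))
    for idx in subsets:
        # Step 2(a): columns of P that belong to this feature subset.
        P_s = P[:, idx]
        # Step 2(b): bootstrap sample, smaller than the full set of rows.
        rows = rng.integers(0, n, size=max(2, int(sample_fraction * n)))
        P_boot = P_s[rows]
        # Step 2(c): PCA coefficients are the right singular vectors
        # of the centred bootstrap sample (one column per component).
        centred = P_boot - P_boot.mean(axis=0)
        _, _, Vt = np.linalg.svd(centred, full_matrices=True)
        # Step 3: write the block at the positions of the original features,
        # so that rows follow the original attribute order. All components
        # are kept here, so no all-zero columns arise.
        D[np.ix_(idx, idx)] = Vt.T
    return D


def average_posteriors(posteriors):
    """Eq. (3): mean over classifiers of an array shaped (T, n_samples, c)."""
    return np.mean(posteriors, axis=0)


rng = np.random.default_rng(0)
P = rng.random((30, 8))                 # toy data: n = 30 instances, M = 8
D = build_rotation_matrix(P, n_subsets=3, rng=rng)
rotated_train = P @ D                   # Step 4: rotated training inputs

# Step 5 with made-up posteriors from T = 3 classifiers for one test sample.
posteriors = np.array([
    [[0.6, 0.3, 0.1]],
    [[0.2, 0.5, 0.3]],
    [[0.7, 0.2, 0.1]],
])
confidence = average_posteriors(posteriors)
predicted = confidence.argmax(axis=1)   # index of the most confident class
```

In a full ensemble, a fresh rotation matrix would be drawn for every \(t\), a randomly chosen base classifier would be trained on the rotated training set, and its class posteriors on the rotated test sample would fill one slice of the `posteriors` array.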
It should be noted that, owing to the random feature subdivision, the selected classifier in MaER receives new training and testing sets in each iteration, which helps to diversify the ensemble of classifiers and thereby improve the classification results. MaER is applied to predict the multiclass binding activity of HLA class II proteins, i.e., DP, DQ and DR, simultaneously. For this purpose, based on the threshold value \(k\)=0.15, the multiclass dataset of strong binding peptides is used to train MaER, which then predicts the classes of the weak binding peptides. Note that the training and test datasets are normalized so that each input feature lies in the range [0,1]. The flowchart of MaER is shown in Fig. 1(b).
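The normalization to [0,1] mentioned above can be illustrated with a simple min-max scaling. The text does not state how the scaling parameters are obtained, so this sketch assumes that they are computed on the training set and reused for the test set, and that test values outside the training range are clipped; the function names and the toy numbers are hypothetical.

```python
import numpy as np


def fit_min_max(train):
    """Per-feature minimum and range, computed on the training data only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features are mapped to 0
    return lo, span


def apply_min_max(data, lo, span):
    """Scale with the training statistics; clipping is an assumption."""
    return np.clip((data - lo) / span, 0.0, 1.0)


train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

lo, span = fit_min_max(train)
train_scaled = apply_min_max(train, lo, span)
test_scaled = apply_min_max(test, lo, span)
```

Reusing the training statistics keeps the test data out of the preprocessing, which is the usual choice when the test set plays the role of unseen data.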
3 Results and Discussions
Table 2. Performance comparison of the MaER based HLA II-peptide binding predictor with other classifiers in terms of average Accuracy, Precision, Recall, F-measure, MCC and AUC values

Algorithm  Accuracy (%)  Precision or PPV (%)  Recall or Sensitivity (%)  F-measure (%)  MCC   AUC
MaER       86.84         81.81                 78.59                      79.70          0.70  0.92
SVM        84.99         77.75                 77.53                      77.52          0.66  0.89
DT         75.22         65.88                 59.31                      60.24          0.43  0.78
NB         67.19         49.81                 50.29                      49.94          0.25  0.58
KNN        84.47         76.28                 75.98                      75.80          0.64  0.89
Table 2 reports the performance of all classifiers, including MaER, on HLA II-peptide binding prediction. The results show that MaER performs better than the other classifiers on every reported measure. They also suggest that MaER can be used as a computational tool for discovering multiclass HLA class II binding epitopes, which is of great importance in vaccinology. The best ROC curves of all the classifiers are shown in Fig. 2. The curves produced by MaER for DP, DQ and DR give average AUC values of 0.92, 0.93 and 0.91, respectively. Moreover, the MaER classifier achieves average gain values of 2.18 %, 15.45 %, 29.25 % and 2.81 % over the SVM, DT, NB and KNN classifiers, respectively. Finally, a Friedman test has been conducted on the average accuracy values of all classifiers to judge the statistical significance of the predicted results. The test produced a Chi-square value of 86.81 and a p-value of 0.126\(\times 10^{-5}\) at the \(\alpha \)= 0.05 significance level. The results provide strong evidence for rejecting the null hypothesis, which means that there is a significant difference among the results produced by the various classifiers, with MaER producing the best results among them.
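The text does not give the formula for the gain values. As a check, the reported figures are consistent with the relative improvement in average accuracy, \(100 \times (a_{MaER} - a_{other}) / a_{other}\), computed from the accuracies in Table 2. The short sketch below makes that arithmetic explicit; the interpretation of gain as relative accuracy improvement is an assumption inferred from the numbers, and `relative_gain` is a hypothetical helper name.

```python
# Average accuracies (%) as reported in Table 2.
accuracy = {"MaER": 86.84, "SVM": 84.99, "DT": 75.22, "NB": 67.19, "KNN": 84.47}


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


gains = {
    name: round(relative_gain(accuracy["MaER"], acc), 2)
    for name, acc in accuracy.items()
    if name != "MaER"
}
print(gains)  # {'SVM': 2.18, 'DT': 15.45, 'NB': 29.25, 'KNN': 2.81}
```

For example, the gain over naive Bayes is \(100 \times (86.84 - 67.19)/67.19 \approx 29.25\), which matches the largest value quoted in the text.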
4 Conclusions
In this article, the binding prediction of HLA II isotypes is treated as a multiclass problem. For this purpose, new multiclass training and testing datasets of HLA II proteins have been prepared by analysing a raw binding affinity dataset of 27 HLA II proteins and 636 peptides. Thereafter, an ensemble based multiclass classifier, named Meta EnsembleR (MaER), has been developed for this problem. MaER avoids the weaknesses of a single classifier and improves the prediction performance by integrating the outputs of multiple heterogeneous classifiers. It preprocesses the original training and testing datasets by making feature subsets and bootstrap samples, and creates diverse datasets using principal component analysis. The efficacy of MaER has been demonstrated in comparison with support vector machine, decision tree, naive Bayes and \(K\)-nearest neighbor classifiers on newly generated test data in terms of average accuracy, precision, recall, F-measure, MCC, area under the ROC curve (AUC) and gain values. MaER achieves its maximum gain of \(29.25\,\%\) over the naive Bayes classifier. Finally, the statistical significance of the results produced by MaER has been confirmed by the Friedman test.
The MaER method could be particularly beneficial wherever information about HLA isotype propensity and coverage is crucial. One example is the design of peptide-based vaccines, where the identification of epitopes able to interact with different HLA isotypes is a crucial factor for vaccine efficacy and population coverage. Another is the study of autoimmune diseases, where the detection of self epitopes showing extensive cross-reactivity with several HLA isotypes is a central issue. Apart from this, MaER can be used to find potential markers in gene expression data [32, 33].
Acknowledgments
This work was supported by grants from the Polish National Science Centre (2014/15/B/ST6/05082 and UMO-2013/09/B/NZ2/00121), COST BM1405 EU action and the European Union Seventh Framework Program (FP7/2007-2013) under grant agreement no. 246016.
References
1. Flower, D.R. (ed.): Bioinformatics for Vaccinology. Wiley-Blackwell, Oxford (2008)
2. Janeway, C.A., Travers, P., Walport, M., Capra, J.D.: Immunobiology: The Immune System in Health and Disease. Garland Publications, New York (1999)
3. Rudensky, A., Janeway, C.A.: Studies on naturally processed peptides associated with MHC class II molecules. Chem. Immunol. 57, 134–351 (1993)
4. Sturniolo, T., Bono, E., Ding, J., Raddrizzani, L., Tuereci, O., Sahin, U., Braxenthaler, M., Gallazzi, F., Protti, M.P., Sinigaglia, F., Hammer, J.: Generation of tissue-specific and promiscuous HLA ligand databases using DNA microarrays and virtual HLA class II matrices. Nat. Biotech. 17(6), 555–561 (1999)
5. Sette, A., Buus, S., Appella, E., Smith, J.A., Chesnut, R., Miles, C., Colon, S.M., Grey, H.M.: Prediction of major histocompatibility complex binding regions of protein antigens by sequence pattern analysis. Proc. National Acad. Sci. 86, 3296–3300 (1989)
6. Brusic, V., Rudy, G., Honeyman, M., Hammer, J., Harrison, L.: Prediction of MHC class II-binding peptides using an evolutionary algorithm and artificial neural network. Bioinformatics 14(2), 121–130 (1998)
7. Hammer, J., Bono, E., Gallazzi, F., Belunis, C., Nagy, Z., Sinigaglia, F.: Precise prediction of major histocompatibility complex class II-peptide interaction based on peptide side chain scanning. J. Exp. Med. 180, 2353–2358 (1994)
8. Noguchi, H., Kato, R., Hanai, T., Matsubara, Y., Honda, H., Brusic, V., Kobayashi, T.: Hidden Markov model-based prediction of antigenic peptides that interact with MHC class II molecules. J. Biosci. Bioeng. 94(3), 264–270 (2002)
9. Wan, J., Liu, W., Xu, Q., Ren, Y., Flower, D.R., Li, T.: SVRMHC prediction server for MHC-binding peptides. BMC Bioinform. 7, 463 (2006)
10. Dimitrov, I., Garnev, P., Flower, D.R., Doytchinova, I.: Peptide binding to the HLA-DRB1 supertype: a proteochemometric analysis. J. Med. Chem. 45(1), 236–243 (2010)
11. Adrian, P.E., Rajaseger, G., Mathura, V.S., Sakharkar, M., Kangueane, P.: Types of inter-atomic interactions at the MHC-peptide interface: identifying commonality from accumulated data. BMC Struct. Biol. 2, 2 (2002)
12. Atanasova, M., Dimitrov, I., Flower, D.R., Doytchinova, I.: MHC class II binding prediction by molecular docking. Mol. Inf. 30, 368–375 (2011)
13. Doytchinova, I.D., Petkov, P., Dimitrov, I., Atanasova, M., Flower, D.R.: HLA-DP2 binding prediction by molecular dynamics simulations. Protein Sci. 20, 1918–1928 (2011)
14. Mallios, R.R.: Predicting class II MHC peptide multi-level binding with an iterative stepwise discriminant analysis meta-algorithm. Bioinformatics 17(10), 942–948 (2001)
15. Karpenko, O., Shi, J., Dai, Y.: Prediction of MHC class II binders using the ant colony search strategy. Artif. Intell. Med. 35, 147–156 (2005)
16. Salomon, J., Flower, D.R.: Predicting class II MHC-peptide binding: a kernel based approach using similarity scores. BMC Bioinform. 7, 501 (2006)
17. Zhang, W., Liu, J., Niu, Y.: Quantitative prediction of MHC-II binding affinity using particle swarm optimization. Artif. Intell. Med. 50(2), 127–132 (2010)
18. Bhowmick, S.S., Saha, I., Mazzocco, G., Maulik, U., Rato, L., Bhattacharjee, D., Plewczynski, D.: Application of RotaSVM for HLA class II protein-peptide interaction prediction. In: Proceedings of the Fifth International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS 2014), pp. 178–185 (2014)
19. Bhowmick, S.S., Saha, I., Rato, L., Bhattacharjee, D.: RotaSVM: a new ensemble classifier. Adv. Intell. Syst. Comput. 227, 47–57 (2013)
20. Saha, I., Mazzocco, G., Plewczynski, D.: Consensus classification of human leukocyte antigen class II proteins. Immunogenetics 65, 97–105 (2013)
21. Pio, G., Malerba, D., D’Elia, D., Ceci, M.: Integrating microRNA target predictions for the discovery of gene regulatory networks: a semi-supervised ensemble learning approach. BMC Bioinform. 15(Suppl 1), S4 (2014)
22. Marbach, D., Costello, J.C., Kuffner, R., et al.: Wisdom of crowds for robust gene network inference. Nat. Methods 9, 796–804 (2012)
23. Saha, I., Zubek, J., Klingstrom, T., Forsberg, S., Wikander, J., Kierczak, M., Maulik, U., Plewczynski, D.: Ensemble learning prediction of protein-protein interactions using proteins functional annotations. Mol. BioSyst. 10, 820–830 (2014)
24. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
25. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, California (1993)
26. John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345 (1995)
27. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theor. 13(1), 21–27 (1967)
28. Friedman, M.: A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 11, 86–92 (1940)
29. Greenbaum, J., Sidney, J., Chung, J., Brander, C., Peters, B., Sette, A.: Functional classification of class II human leukocyte antigen (HLA) molecules reveals seven different supertypes and a surprising degree of repertoire sharing across supertypes. Immunogenetics 63(6), 325–335 (2011)
30. Saha, I., Maulik, U., Bandyopadhyay, S., Plewczynski, D.: Fuzzy clustering of physicochemical and biochemical properties of amino acids. Amino Acid 43(2), 583–594 (2011)
31. Plewczynski, D., Basu, S., Saha, I.: AMS 4.0: consensus prediction of post-translational modifications in protein sequences. Amino Acid 43(2), 573–582 (2012)
32. Saha, I., Maulik, U., Bandyopadhyay, S., Plewczynski, D.: Improvement of new automatic differential fuzzy clustering using SVM classifier for microarray analysis. Expert Syst. Appl. 38(12), 15122–15133 (2011)
33. Saha, I., Plewczynski, D., Maulik, U., Bandyopadhyay, S.: Improved differential evolution for microarray analysis. Int. J. Data Min. Bioinform. 6(1), 86–103 (2012)