Background

Pathogen peptide fragments are displayed on the surface of professional antigen-presenting cells by Major Histocompatibility Complex (MHC) class II molecules. Such peptide fragments are known as epitopes. When helper T cells recognize epitopes bound to MHC-II, an adaptive immune response against the specific pathogen can be triggered. Computational prediction of MHC-II binding peptides can accelerate the development of vaccines and immunotherapies by identifying a narrow set of epitope candidates for further testing. Prediction of MHC-II epitopes is particularly challenging because the open binding cleft of the MHC-II molecule allows epitopes to extend beyond the peptide binding groove; therefore, the molecule is capable of accommodating peptides of variable length [1]. The binding core of the MHC-II is approximately nine amino acids long [2]; however, complete epitopes can vary from 9 to 25 amino acids in length [3], and the molecule can even bind whole proteins [4]. In addition, successful computational prediction relies on a sufficiently large set of high-quality training data, and obtaining a large dataset for MHC-II epitope prediction can be difficult.

Machine-learning methods like artificial neural networks (ANNs) and support vector machines (SVMs) are classification techniques that have been successfully applied to predict MHC-II binding [5, 6]. These methods, however, have some limitations. The biggest limitation of SVMs lies in the choice of the kernel; the best kernel for a given problem remains an open research question [7]. SVMs deliver a unique solution because the optimization problem is convex, whereas ANNs have multiple solutions associated with local minima, which makes them less robust. The sparse representation (SR) approach proposed in this paper for peptide binding classification relies on the naturally selective nature of the solution of an $\ell_1$-minimization problem [8]. This method overcomes the limitations above: no model selection is needed to differentiate between the two classes, in contrast to the need to test different kernel functions in the SVM approach when searching for the separating hyperplane with the largest margin between classes. Furthermore, the use of the $\ell_1$ norm makes the method robust to outliers in the data used for classification [8], discarding bad training samples and allowing noisy data to be handled.

The $\ell_1$ norm of a vector $x \in \mathbb{R}^n$ is defined as the sum of the absolute values of its components, i.e.,

$$\|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|. \tag{1}$$

Convex relaxation approaches based on the $\ell_1$ norm have been proven to promote sparse solutions (i.e., solutions with few nonzero elements) to linear systems of equations with high probability. The work in the area of compressed sensing, initiated in late 2004 by Emmanuel Candès, Justin Romberg and Terence Tao, and independently by David Donoho [8], encouraged the development of fast solvers capable of finding sparse solutions using the $\ell_1$ norm as a regularizer. Applications in science and technology have shown promising results in signal reconstruction, image processing, inverse problems, and data analysis, among others. Finding sparse solutions has brought practical benefits such as the need for fewer antennas in remote sensing, fewer measurements in geophysical surveys, and more precise identification of genes.

The goal of this project is twofold: first, to develop a classifier based on the SR approach for epitopes of variable length that separates the observations of the two classes (binders/non-binders) with the largest possible margin while minimizing the training error; second, to evaluate epitope encoding techniques for binding prediction.

Methods

MHC molecules are extremely polymorphic, with different alleles and thousands of epitopes identified in humans and other vertebrates [9]. To obtain a testing dataset that is varied in terms of alleles and number of entries, we selected two mouse alleles (H2-IAb and H2-IAd) and three human alleles (HLA-DRB1*0101, HLA-DPA1*0103/DPB1*0201 and HLA-DRB1*0401). These alleles have been used previously in computational experiments [5, 9, 10]. Peptide sequences and their binding affinities for the selected alleles were collected from the Immune Epitope Database and Analysis Resource (IEDB) [11] (Table 1). This database contains data on antibody and T cell epitopes for humans, non-human primates, rodents, and other animal species. We removed duplicated epitopes and unnatural peptides with more than 75% alanine. To further evaluate the prediction performance and robustness of our algorithm, we generated Receiver Operating Characteristic (ROC) curves, distinguishing binders from non-binders using different cut-off points of the half maximal inhibitory concentration (IC$_{50}$) for each epitope, as shown in Table 1.
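
As an illustration, the duplicate and poly-alanine filtering step described above could be implemented along the following lines. This is a minimal sketch in Python; the CSV layout and the column names "sequence" and "ic50" are assumptions for illustration, not the actual IEDB export format.

```python
# Minimal sketch of the epitope filtering described above.
# Assumes a CSV export with columns "sequence" and "ic50"; the actual
# IEDB export format and column names may differ.
import csv

def load_and_filter(path, max_ala_fraction=0.75):
    seen, kept = set(), []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            seq = row["sequence"].upper()
            if seq in seen:                      # drop duplicated epitopes
                continue
            ala_fraction = seq.count("A") / len(seq)
            if ala_fraction > max_ala_fraction:  # drop unnatural poly-alanine peptides
                continue
            seen.add(seq)
            kept.append((seq, float(row["ic50"])))
    return kept
```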

Table 1 Peptide sequences and their binding affinities

Data

Encoding scheme

The most common amino acid encoding is the binary scheme, in which each residue is represented by a 20-bit vector with 19 bits set to zero and one bit set to one. Property encoding, on the other hand, represents each residue by a vector of one or more amino acid properties. Property encoding has two main advantages over binary encoding. First, physicochemical properties play an important role in biomolecular recognition, so this type of encoding is more informative. Second, property encoding mitigates the problem of variable peptide lengths. To test the reliability of property encoding, we compared classical binary encoding against two property encoding methods, 11-factor encoding and divided physicochemical property scores (DPPS). The 11-factor encoding is calculated from physicochemical properties of amino acids as described in [12]; the properties were obtained from general physicochemical properties of amino acids and a number of properties identified in 3-D quantitative structure-activity relationship (QSAR) analysis [13]. The DPPS scheme was proposed in [14]; the DPPS descriptor was obtained by applying principal component analysis (PCA) to thousands of amino acid structural and property parameters, yielding score vectors that capture significant nonbonding properties of each of the 20 amino acids.
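
For concreteness, the two encoding styles could be sketched as follows. This is a hedged sketch: `PROPERTY_TABLE` holds placeholder values, not the published 11-factor or DPPS descriptors, and all names are ours.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def binary_encode(residue):
    """Binary scheme: a 20-bit vector with a single bit set to 1."""
    vec = np.zeros(len(AMINO_ACIDS))
    vec[AMINO_ACIDS.index(residue)] = 1.0
    return vec

# Placeholder property table: in practice each residue would map to its
# 11-factor [12, 13] or DPPS [14] descriptor taken from the literature.
rng = np.random.default_rng(0)
PROPERTY_TABLE = {aa: rng.normal(size=11) for aa in AMINO_ACIDS}

def property_encode(residue):
    """Property scheme: a vector of physicochemical descriptors."""
    return PROPERTY_TABLE[residue]
```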

We represented every epitope of length $n$ as a single vector $v$ of 10 or 11 components (for the DPPS and 11-factor encoding schemes, respectively) by summing the 11-factor or DPPS descriptor vectors $x_i$ of its amino acids:

$$v = \sum_{i=1}^{n} x_i. \tag{2}$$

Thus, every vector correlates directly with the physicochemical properties of the amino acids, allowing the prediction of MHC class II-peptide interactions. Additionally, this representation mitigates the problem of variable peptide lengths, since every epitope is represented by a vector of size 10 or 11.
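
A minimal sketch of equation (2), assuming a per-residue property table like the placeholder one above:

```python
def encode_epitope(sequence, property_table):
    """Equation (2): sum the per-residue descriptor vectors, so that every
    epitope, regardless of its length, maps to a vector of fixed size."""
    return sum(property_table[aa] for aa in sequence.upper())

# Peptides of different lengths yield feature vectors of the same size:
# v_short = encode_epitope("GILGFVFTL", PROPERTY_TABLE)                 # length 9
# v_long  = encode_epitope("PKYVKQNTLKLATGMRNVPEKQT", PROPERTY_TABLE)   # length 23
```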

Classification via sparse representation

We applied the selective nature of sparse representation to perform classification. As presented in [8], $\ell_1$-minimization techniques provide a satisfactory way to solve sparse representation problems, and we propose a classifier based on the solution of an $\ell_1$-minimization problem. A supervised learning system performing classification is commonly called a classifier.

Formally, given an input dataset $W = \{w_1,\dots,w_n\}$, a set of labels/classes $T = \{t_1,\dots,t_n\}$, and a training dataset $D = \{(x_i,t_i): i = 1,\dots,n\}$, where $t_i$ is the label/class associated with the sample $x_i$, a classifier is a mapping from $W$ to $T$ that assigns the correct label $t \in T$ to a given input $w \in W$, that is, $F(w,D) = t$.

Let us consider a training dataset $\{(x_i, t_i): i = 1,\dots,n\}$, $x_i \in \mathbb{R}^d$, $t_i \in \{1,2,\dots,N\}$, where $n$ is the number of samples and $N$ the number of classes. The vector $x_i \in \mathbb{R}^d$ represents the $i$-th sample (containing, for instance, gene expression values or other features), and $t_i$ denotes its corresponding label (in our case, binding or non-binding). Assume that $d < n$, that is, the length of each sample is less than the number of elements in the training dataset.

The sparse representation problem is formulated as follows: for a testing sample $y \in \mathbb{R}^d$, find the sparsest vector $c = [c_1, c_2, \dots, c_n]^T$ such that

$$y = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n. \tag{3}$$

Equation (3) states that we express the vector $y$ as a linear combination of the collection $\{x_1, x_2, \dots, x_n\}$. Using matrix notation, equation (3) can be posed as the underdetermined linear system of equations

$$y = Ac, \tag{4}$$

where the matrix $A \in \mathbb{R}^{d \times n}$ is constructed such that the $j$-th column corresponds to sample $x_j$, and $c = (c_1,\dots,c_n)^T$. Since we look for a sparse vector $c$, equation (3) states that the test sample $y$ is a linear combination of only a few training samples. We are interested in the sparsest solution of the system of linear equations in (4). In order to find such a sparse solution, we solve the following $\ell_1$-optimization problem

$$\min_{c,\,e} \; \lambda \|c\|_1 + \tfrac{1}{2}\, e^T e \quad \text{subject to} \quad Ac - e = y. \tag{5}$$

In [8], a novel optimization algorithm based on an iterative smooth convex relaxation methodology was proposed to solve problem (5). One advantage of this formulation is its robustness with respect to noise, missing data, and outliers (a well-known property of the $\ell_1$ norm when used to regularize an inverse problem). An additional advantage is that no model selection is needed, because the selective nature of the sparse representation captures the level of membership of a given input in each of the classes. In the following, we describe how to decide the class of a given input after obtaining its sparse representation. The approach consists of associating the nonzero components of $c$ with the columns of $A$ corresponding to training samples of the same class. First, let $\Omega_k$ denote the set of indices given by

$$\Omega_k = \{\, j : \text{training sample } x_j \text{ has label } t_j = k \,\}. \tag{6}$$

Therefore,

$$\Omega_1 \,\dot{\cup}\, \Omega_2 \,\dot{\cup}\, \cdots \,\dot{\cup}\, \Omega_N = \{1, 2, \dots, n\}, \tag{7}$$

that is, the collection of index sets $\{\Omega_i\}_{i=1}^{N}$ forms a partition of the set $\{1,2,\dots,n\}$, where $n$ is the number of samples in the training dataset.

Then we define the discriminant functions by

$$g_k(y) = \| y - A c_k \|_2, \quad k = 1,\dots,N, \tag{8}$$

where $A c_k$ is defined by

$$A c_k = \sum_{j \in \Omega_k} c_j x_j. \tag{9}$$

Notice that the function $g_k$ in (8) measures the error obtained when the testing sample $y$ is represented using only the training samples of class $k$. Finally, we classify $y$ into the category with the smallest approximation error. That is, we compute

$$g_s(y) = \min\{\, g_1(y), g_2(y), \dots, g_N(y) \,\}, \tag{10}$$

and conclude that the testing sample $y$ has label $t = s$. In this manner, we identify the class of the test sample $y$ based on how effectively the coefficients associated with the training samples of each class recreate $y$.
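
The full classification procedure in equations (5)-(10) can be sketched as follows. This is a hedged sketch, not the authors' implementation: the paper solves problem (5) with the solver of [8], whereas the sketch below substitutes scikit-learn's generic `Lasso` solver, which minimizes the equivalent unconstrained objective $\lambda\|c\|_1 + \tfrac{1}{2}\|Ac - y\|_2^2$ (up to a rescaling of $\lambda$); all variable names are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso  # generic l1 solver used as a stand-in for [8]

def sparse_representation_classify(A, labels, y, lam=0.01):
    """Classify a test sample y given the training matrix A (d x n, one
    column per training sample) and the per-column class labels.

    Sketch of equations (5)-(10); the solver of [8] is replaced by Lasso,
    whose objective matches problem (5) up to a rescaling of lambda.
    """
    # Problem (5) with e = Ac - y:  min_c  lam*||c||_1 + 1/2 ||Ac - y||_2^2
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    lasso.fit(A, y)
    c = lasso.coef_

    # Discriminant functions (8)-(9): per-class reconstruction error.
    errors = {}
    for k in np.unique(labels):
        omega_k = (labels == k)                    # index set Omega_k, equation (6)
        residual = y - A[:, omega_k] @ c[omega_k]  # y - A c_k, equation (9)
        errors[k] = np.linalg.norm(residual)       # g_k(y), equation (8)

    # Decision rule (10): assign the class with the smallest error.
    return min(errors, key=errors.get)
```

A test epitope encoded with equation (2) would then be classified by calling this routine with the matrix of encoded training epitopes and their binder/non-binder labels.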

Support vector machines (SVM)

We compared the results of our proposed classification method with the well-known SVM strategy, which has been widely used in pattern recognition and machine learning applications. SVMs are a family of supervised learning methods commonly used for classification and regression analysis. The original SVM algorithm was proposed by Vladimir Vapnik, and the current standard formulation was proposed by Corinna Cortes and Vladimir Vapnik [15]. A standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, which makes the SVM a non-probabilistic binary linear classifier. Intuitively, an SVM model is a representation of the samples as points in space, mapped so that the samples of the separate categories are divided by a clear gap that is as wide as possible.

Slow training is a possible drawback of SVM approaches, because SVMs are trained by solving quadratic programming problems in which the number of variables equals the number of samples in the training dataset; when a large amount of training data is available, the training process can become slow. More information about the different SVM strategies for classification problems is given in [16]. Here we use the SVM implementation available in MATLAB as part of the Statistics Toolbox, and report the results for the best setup found (using radial basis functions) after an appropriate parameter tuning stage (model selection).
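
For readers working outside MATLAB, an analogous setup could look like the sketch below. It is not the authors' implementation; the parameter grid and names are illustrative assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# An RBF-kernel SVM whose C and gamma are tuned by grid search
# (the "model selection" stage described above).
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1, 1]}
model_selection = GridSearchCV(svm, param_grid, cv=5)
# model_selection.fit(X_train, y_train); model_selection.predict(X_test)
```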

Evaluation of method performance

To evaluate the prediction performance and robustness of our algorithm, we performed a 10-fold (n-fold) cross-validation. An illustration of the 10-fold cross-validation partition process is shown in Figure 1. In n-fold cross-validation, all binding and non-binding epitopes were mixed and then divided equally into n parts, keeping the same distribution of binders and non-binders in each part. Then n-1 parts were merged into a training dataset while the remaining part was used as the testing dataset. This process was repeated n = 10 times and the average performance was computed. We measured sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC) for every fold and then took the average (Avg), as shown in Table 2. In addition, we performed a ROC curve analysis using different IC$_{50}$ thresholds.
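
A sketch of the stratified 10-fold procedure and the per-fold metrics, assuming a feature matrix `X` (one encoded epitope per row), binary labels `y` (0 = non-binder, 1 = binder), and a `classify` routine such as the sparse representation classifier sketched earlier; all names are ours.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def cross_validate(X, y, classify, n_splits=10):
    """Stratified n-fold CV preserving the binder/non-binder ratio per fold."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    per_fold = []
    for train_idx, test_idx in folds.split(X, y):
        A = X[train_idx].T                       # training samples as columns
        preds = [classify(A, y[train_idx], x) for x in X[test_idx]]
        tn, fp, fn, tp = confusion_matrix(y[test_idx], preds, labels=[0, 1]).ravel()
        per_fold.append({
            "Sn": tp / (tp + fn),                    # sensitivity
            "Sp": tn / (tn + fp),                    # specificity
            "Acc": (tp + tn) / (tp + tn + fp + fn),  # accuracy
            "MCC": matthews_corrcoef(y[test_idx], preds),
        })
    # Average each metric over the folds (the "Avg" reported in Table 2).
    return {m: float(np.mean([f[m] for f in per_fold])) for m in per_fold[0]}
```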

Figure 1

A 10-fold cross validation partition example.

Table 2 Average results for 10-fold cross-validation with $\ell_1$-minimization and SVM

We examined the association between cutoff value, encoding factor, and method for every allele after 10-fold cross-validation using logistic regression analysis. The results show that sensitivity and specificity are statistically associated with the three predictors in most cases, whereas no significant associations were observed for accuracy. The results are shown in Table 3.

Table 3 Logistic regression p-values with cutoff value, encoding factor and predictive method as predictors

Results

Prediction accuracy

Techniques for predicting MHC binding include ANNs (NetMHCpan [6] and NN-Align [17]), position-specific scoring matrices (PSSMs) (RANKPEP [18]), and amino acid pairwise contact potentials used as input vectors for SVM (EpicCapo [10]). These methods have typical prediction accuracies of roughly 70-90% [19]. Overall, our binding prediction accuracies are comparable to these reported values. Table 2 shows a comparison of the three encoding schemes for two alleles. While our method consistently favors DPPS encoding for the alleles tested, SVM shows slightly better accuracy with 11-factor encoding. The experiments performed indicate that the physicochemical properties of amino acids are more informative for predicting MHC-II binding peptides. These results are also consistent with the MCC obtained for binary encoding, which yielded negative or zero scores on various occasions, implying that those predictions were no better than random.

ROC curves analysis

We also applied ROC analysis to examine the performance of the $\ell_1$-minimization and SVM classifiers. A ROC graph is a plot with the false positive rate (1 - specificity) on the x axis and the true positive rate (sensitivity) on the y axis. The point (0,1) corresponds to the perfect classifier: it classifies all positive and negative cases correctly. The point (0,0) represents a classifier that predicts all cases to be negative, while the point (1,0) corresponds to a classifier that is incorrect for all cases. The ROC curves were calculated using the thresholds shown in Table 1 to distinguish binders from non-binders.

In Figures 2, 3 and 4 we present the corresponding ROC curves for the sparse representation and SVM methods, calculated using different IC$_{50}$ cutoff values. The area under the ROC curve (AUC) provides a measure of overall prediction accuracy: an AUC value of 0.5 indicates random guessing, while values close to 1 indicate excellent predictive capability. AUC values were computed using the trapezoidal rule for numerical integration. With DPPS encoding, the $\ell_1$-minimization method for predicting epitopes on the H2-IAd and HLA-DPA1*0103/DPB1*0201 molecules yielded AUC values of 0.729 and 0.764, respectively, higher than any of the AUC values obtained by SVM with the same encoding scheme. However, with 11-factor encoding, the AUC obtained by SVM for molecule HLA-DPA1*0103/DPB1*0201 was 0.806, higher than any AUC obtained by $\ell_1$-minimization. Tables 4, 5, 6, 7 and 8 show the values of sensitivity (Sn), specificity (Sp) and accuracy (Acc) for each of the IC$_{50}$ cutoff points, for both the 11-factor and DPPS encoding schemes.
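
The ROC/AUC computation described above can be sketched as follows. The threshold sweep and trapezoidal integration follow the text; the choice of a continuous decision score (for the sparse representation classifier one could, for example, use the difference of the two discriminant functions) is our assumption, since the text does not specify it.

```python
import numpy as np

def roc_points(scores, is_binder):
    """Sweep decision thresholds and collect (FPR, TPR) pairs.
    Higher scores are assumed to indicate binders."""
    scores = np.asarray(scores, dtype=float)
    is_binder = np.asarray(is_binder, dtype=bool)
    points = [(0.0, 0.0)]
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        tpr = np.sum(pred & is_binder) / np.sum(is_binder)    # sensitivity
        fpr = np.sum(pred & ~is_binder) / np.sum(~is_binder)  # 1 - specificity
        points.append((fpr, tpr))
    return points

def auc_trapezoidal(points):
    """Area under the ROC curve computed with the trapezoidal rule."""
    fpr, tpr = zip(*sorted(points))
    return float(np.trapz(tpr, fpr))
```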

Figure 2

11-factor encoding. ROCs for sparse representation and SVM.

Figure 3

DPPS encoding. ROCs for sparse representation and SVM.

Figure 4

Binary encoding. ROCs for sparse representation and SVM.

Table 4 11-factor and DPPS encoding for H2-IAb
Table 5 11-factor and DPPS encoding for H2-IAd
Table 6 11-factor and DPPS encoding for HLA-DPA1*0103/DPB1*0201
Table 7 11-factor and DPPS encoding for HLA-DRB1*0101
Table 8 11-factor and DPPS encoding for HLA-DRB1*0401

Discussion

Table 2 gives the results of our $\ell_1$-minimization algorithm and the SVM predictions based on independent evaluation sets for the three epitope encoding methods. The experiments performed with our method revealed average binder accuracies in the range of 70-88% for the alleles used, similar to the predictive accuracies reported elsewhere [19].

With DPPS encoding, the $\ell_1$-minimization method delivered higher AUC values than any of those obtained by SVM with the same encoding scheme. However, with 11-factor encoding, the AUC obtained by SVM was higher than any AUC obtained by $\ell_1$-minimization. These results imply that different properties of amino acids are significant in the association process between the MHC-II molecule and the epitope, leading to higher prediction performance. This has a biological interpretation, since nonbonding effects such as electrostatic, van der Waals and hydrophobic interactions and hydrogen bonding play central roles in peptide-MHC interactions [14]. Hence, physicochemical properties of amino acids should be considered when encoding epitopes for prediction. Since the $\ell_1$-minimization approach proposed here involves no model selection, it requires a more informative way of presenting the data to the algorithm; in this case, we conclude that DPPS encoding is more appropriate because it relates directly to peptide-MHC association. In Figure 5 we show the accuracy of the sparse representation and SVM methods with DPPS encoding for different IC$_{50}$ cutoff values. On the other hand, once the best choice of kernel has been made (model selection), SVM can handle less informative encoding schemes. We hypothesize that if more information were available in the encoding scheme of epitopes, our sparse representation algorithm could achieve higher performance.

Figure 5

DPPS encoding accuracy. Comparison of predictive accuracy.

Conclusions

The proposed $\ell_1$-minimization algorithm produces accurate classification of MHC class II epitopes, with sensitivity, specificity and accuracy comparable to those of SVM approaches. We studied the algorithm's performance for peptide binding classification and compared it with SVM for a collection of both human and mouse alleles. Our methodology relies on the naturally selective nature of sparse representation to perform classification, so no model selection is involved; with regard to robustness to outliers, our classification enabled us to discard bad training samples and handle noisy data [8, 20]. This contrasts with the need to test different kernel functions in the SVM approach when searching for the separating hyperplane with the largest margin between classes. Our methodology involves a very simple learning stage and the use of the $\ell_1$-minimization solver first proposed in [8]. For the set of alleles studied in this work, we found the DPPS encoding scheme to be efficient in conjunction with the proposed methodology for peptide binding classification.