Introduction

The Immune Epitope Database (IEDB: http://www.iedb.org) contains data related to antibody and T cell epitopes for humans, non-human primates, rodents, and other animal species (Vita et al. 2010). This system records a large amount of information about the molecular structure and the experimental conditions (c_ij) under which different i-th molecules were determined to be immune epitopes or not. With the availability of these types of databases (Gao and Kurgan 2014), epitope prediction using computational methods has emerged as a promising approach for developing peptide-based vaccines. Such techniques allow screening of large numbers of possible immune-active peptides in order to find those likely to induce an immune response to a particular cell type, providing a fast and cost-effective way to identify potential candidates for vaccine development (Du et al. 2007; Chen et al. 2007).

Quantitative structure–activity/property relationship (QSAR/QSPR) methods make it possible to transform molecular structures into numeric molecular descriptors (λ_i) and to find relationships between these structures and their biological activity. Consequently, these techniques are widely used today to predict the properties of complex molecular systems, including peptides, proteins, RNAs, drug-protein complexes, and protein–protein complexes (see, e.g., Bermúdez et al. 1999; Agüero-Chapín et al. 2005; Du et al. 2005; Galindo et al. 2006; Chou and Shen 2008; Du et al. 2008a, b; Prado–Prado et al. 2008; Chou 2009; Du et al. 2009; Rodríguez-Soca et al. 2009; Viña et al. 2009; Wei et al. 2009; Toropov et al. 2012; Toropova et al. 2015). Likewise, QSAR/QSPR methods have been successfully used in immunoinformatics to predict the propensity of different molecular structures to play different roles in immunological processes (see, e.g., Doytchinova et al. 2004; Estrada et al. 2004; Gerberick et al. 2004; Xiao and Segal 2005; Bhasin et al. 2006; Barh et al. 2010; Bremel and Homan 2010; Díez-Rivero et al. 2010; Roberts and Patlewicz 2010; Bi et al. 2011; Martínez-Naves et al. 2011; Tenorio-Borroto et al. 2012; Fagerberg et al. 2013; Patlewicz et al. 2013).

On the other hand, perturbation theory comprises methods that add “small” variation terms to the mathematical description of problems with known solutions in order to find an appropriate solution for related problems with no known solutions. Accordingly, this theory has been widely used in all branches of knowledge, including bio-molecular sciences. An interesting review of this topic is given by González-Díaz et al. (2013a). In the same work, the authors also formulated a general-purpose perturbation theory for multiple-boundary QSAR/QSPR problems. Subsequently, this new modeling method was applied by González-Díaz et al. (2014) to develop an electronegativity QSPR-perturbation model for B-epitopes reported in IEDB that is able to predict the probability of occurrence of an epitope after a perturbation in the peptide sequence (m_i), source organism (so), host organism (ho), immunological process (ip), and experimental technique (tq) used.

In principle, there are more than 1600 different molecular descriptors (λ_i) that may be generalized and used to solve QSPR problems in chemical structures (Todeschini and Consonni 2008). In the present study, three different physicochemical molecular properties for peptide sequences reported in IEDB were calculated in order to develop three different QSPR models able to predict the efficiency of a new peptide as a B-epitope given perturbations in m_i, so, ho, ip, and tq. The statistical parameters of the models were compared with the results achieved by the model developed by González-Díaz et al. (2014).

Materials and Methods

Calculation of Molecular Descriptors for Peptides

The same database recently utilized by González-Díaz et al. (2014) was used in the present study. The data contain variations in >50,000 peptides determined in experimental assays with boundary conditions involving >500 source organisms, >50 host organisms, >10 biological processes, and >30 experimental techniques (González-Díaz et al. 2014). The calculation of the molecular descriptors was implemented in the program MARCH-INSIDE (González-Díaz et al. 2007), which uses a Markov Chain method to calculate the k-th mean values of different physicochemical molecular properties ^kλ(m_i) for the i-th molecules (m_i). These ^kλ(m_i) values are calculated as an average of the atomic properties (λ_i) for all atoms in the peptide molecule and their neighbors placed at a topological distance d ≤ k. The parameter k, called the parameter of the Markov Chain, is the natural power to which the Markov matrix is raised. In this work, the average values of the atomic polarizabilities ^kα(m_i), partition coefficients ^kP(m_i), and polarities ^kPol(m_i) for all δ_i atoms connected to the i-th atom (i → j) and their neighbors placed at a distance d ≤ 5 were calculated for all peptides (González-Díaz et al. 2013b):

$$ {}^{k}\lambda \left( {m_{i} } \right) = \frac{1}{6}\sum\limits_{k = 0}^{5} {{}^{k}\lambda_{j} } = \frac{1}{6}\sum\limits_{k = 0}^{5} {\sum\limits_{i \to j}^{{\delta_{i} }} {p_{k} \left( {\lambda_{j} } \right) \cdot \lambda_{j} } } $$
(1)

The probabilities p_k(λ_j) for the atomic properties in question were calculated using a Markov Chain model for the gradual effects of the neighboring atoms at different distances in the molecular backbone, as explained in detail in González-Díaz et al. (2013b).
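The averaging in Eq. 1 can be illustrated with a minimal Python sketch. It is not the MARCH-INSIDE implementation: the text does not give the form of the Markov matrix, so the sketch assumes that the probability of moving from atom i to a connected atom j (or staying on i) is proportional to λ_j, and that the initial probabilities are proportional to the atomic property values. All function names are hypothetical, and the property values are assumed to be strictly positive.

```python
def transition_matrix(adjacency, prop):
    """Row-stochastic matrix; ASSUMPTION: p(i -> j) is proportional to prop[j]
    for atoms j bonded to i, with atom i itself included."""
    n = len(prop)
    rows = []
    for i in range(n):
        weights = [prop[j] if (adjacency[i][j] or i == j) else 0.0
                   for j in range(n)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def step(vec, matrix):
    """One Markov step: multiply a probability row vector by the matrix."""
    n = len(vec)
    return [sum(vec[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def k_mean_values(adjacency, prop, k_max=5):
    """Mean property value for each k = 0..k_max (probability-weighted sum)."""
    matrix = transition_matrix(adjacency, prop)
    total = sum(prop)
    # ASSUMPTION: initial (k = 0) probabilities proportional to the property.
    probs = [x / total for x in prop]
    values = []
    for _ in range(k_max + 1):
        values.append(sum(p * x for p, x in zip(probs, prop)))
        probs = step(probs, matrix)
    return values


def mean_descriptor(adjacency, prop, k_max=5):
    """Average over k = 0..k_max, i.e. the 1/6 * sum in Eq. 1 for k_max = 5."""
    values = k_mean_values(adjacency, prop, k_max)
    return sum(values) / len(values)


# Hypothetical three-atom chain with illustrative property values.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
descriptor = mean_descriptor(chain, [1.0, 2.0, 3.0])
```

Because each row of the matrix sums to one, every k-th value is a weighted mean of the atomic properties and therefore stays between the smallest and largest atomic value.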

Derivation of the QSPR-perturbation Models

In a recent work, González-Díaz et al. (2014) applied perturbation theory to the QSPR peptide prediction problem and formulated an electronegativity QSPR-perturbation model able to predict the probability of occurrence of a B-epitope after a variation in the structure and/or the boundary conditions of a peptide of reference. In the present work, three new QSPR-perturbation models for the prediction of B-epitopes reported in IEDB were developed using different types of molecular descriptors λ(m_i) to codify structural information: atomic polarizability, partition coefficient, and polarity. Because the theoretical foundations and the construction of this type of model have been explained in detail before (González-Díaz et al. 2014), only the general equation is presented here:

$$ \lambda(\varepsilon_{ij})_{\text{new}} = {}^{\prime}c_{0} \cdot \lambda(\varepsilon_{qr})_{\text{ref}} + \sum_{j=1}^{4} {}^{\prime}d_{ij} \cdot \Delta\Delta\lambda_{ijqr} + {}^{\prime}e_{0} $$
(2)

Here, in line with González-Díaz et al. (2014), λ(ε_ij)new is the efficiency function as epitope of a new peptide obtained after a change in the structure and/or the boundary conditions c_j ≡ (c_0, c_1, c_2, c_3, …, c_n) of a peptide of reference. The set of boundary conditions used here is the same as that reported in IEDB: c_0 = the specific peptide; c_1 = the organism that expresses the peptide (so_j); c_2 = the host organism exposed to the peptide (ho_j); c_3 = the immunological process (ip_j); and c_4 = the experimental technique (tq_j). The variable λ(ε_qr)ref refers to a known efficiency function as epitope of a peptide of reference experimentally determined under a set of c_j boundary conditions. The function λ(ε_ij) was defined as a discrete-valued function for classification purposes: λ(ε_ij) = 1 for epitopes reported under the conditions c_j and λ(ε_ij) = 0 otherwise. The values c_0 and d_ij are the coefficients obtained for the linear discriminant analysis (LDA) classification functions. The variational perturbation terms ΔΔλ_ijqr account for the deviation of the molecular descriptors of all amino acids in the sequence of the new peptide both with respect to the peptide of reference and with respect to all boundary conditions. The constant e_0 represents the independent term of the model (González-Díaz et al. 2014). The expanded formula of the models is given below:

$$ \lambda(\varepsilon_{ij})_{\text{new}} = {}^{\prime}c_{0} \cdot \lambda(\varepsilon_{qr})_{\text{ref}} + \sum_{j=1}^{4} {}^{\prime}d_{ij} \cdot \left( (\lambda_{i} - \lambda_{j}) - (\lambda_{q} - \lambda_{r}) \right) + {}^{\prime}e_{0} $$
(3)

Statistical Analysis

An LDA was carried out using the STATISTICA 6.0 software (StatSoft.Inc. 2002). In the absence of a true external data set, the original data set was randomly divided into two series: a training series for model development and a cross-validation series for model validation (75 and 25 % of the data set, respectively). A forward stepwise strategy was used for variable selection, and the statistical significance of the models was determined by calculating the canonical correlation coefficient (R_c) and the U-statistic. The accuracy, specificity, and sensitivity for the training and cross-validation series were also examined (Hill and Lewicki 2006). In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test, and the jackknife test (Chou and Zhang 1995). Of these three, the jackknife test is deemed the least arbitrary because it always yields a unique result for a given benchmark dataset, as elaborated in Chou (2011). Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., Zhang et al. 2008; Esmaeili et al. 2010; Mohabatkar 2010; Sahu and Panda 2010; Khosravian et al. 2013; Mohabatkar et al. 2013). However, to reduce the computational time, the independent dataset test was adopted in this study.
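The validation scheme described above can be sketched in a few lines of Python. This is an illustration only: the study used STATISTICA, and the function names, the random seed, and the 0/1 label encoding are assumptions made here. The last helper uses the standard identity for a two-group discriminant analysis with a single canonical function, U = 1 − R_c², which is not stated in the text.

```python
import random


def split_train_validation(cases, train_fraction=0.75, seed=0):
    """Randomly split cases into training and cross-validation series.
    The seed is an arbitrary choice made for reproducibility."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def classification_metrics(observed, predicted):
    """Sensitivity, specificity and accuracy for 0/1 labels
    (1 = epitope, 0 = non-epitope). Assumes both classes are present."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


def wilks_lambda_two_groups(canonical_r):
    """Wilks' lambda for a single discriminant function: U = 1 - Rc^2."""
    return 1.0 - canonical_r ** 2


# Illustrative use with made-up labels, not data from the study.
train, validation = split_train_validation(range(100))
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                 [1, 1, 0, 0, 0, 1, 0, 1])
```

A fixed seed keeps the split reproducible, which makes it easier to compare models trained on different descriptors against the same training and cross-validation series.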

Results and Discussion

In the present work, three different QSPR-perturbation models were developed, one for each class of molecular descriptor calculated with the software MARCH-INSIDE: atomic polarizability (α), partition coefficient (P), and polarity (Pol). The following were the best QSPR-perturbation models found:

Polarizability-perturbation model:

$$ \begin{aligned} \lambda(\varepsilon_{ij})_{\text{new}} = {} & -4.683 \cdot \lambda(\varepsilon_{ij})_{\text{ref}} - 44.099 \cdot \Delta\alpha_{\text{seq}} + 2.666 \cdot \Delta\Delta\alpha_{\text{ho}} + 16.482 \cdot \Delta\Delta\alpha_{\text{so}} \\ & - 21.668 \cdot \Delta\Delta\alpha_{\text{ip}} + 47.096 \cdot \Delta\Delta\alpha_{\text{tq}} + 2.0103 \\ & N = 155169 \quad R_{c} = 0.91 \quad U = 0.18 \quad p < 0.01 \end{aligned} $$
(4)

Partition coefficient-perturbation model:

$$ \begin{aligned} \lambda(\varepsilon_{ij})_{\text{new}} = {} & -4.345 \cdot \lambda(\varepsilon_{ij})_{\text{ref}} - 98.689 \cdot \Delta P_{\text{seq}} + 7.741 \cdot \Delta\Delta P_{\text{ho}} + 30.378 \cdot \Delta\Delta P_{\text{so}} \\ & - 7.073 \cdot \Delta\Delta P_{\text{ip}} + 69.851 \cdot \Delta\Delta P_{\text{tq}} + 1.851 \\ & N = 155169 \quad R_{c} = 0.89 \quad U = 0.21 \quad p < 0.01 \end{aligned} $$
(5)

Polarity-perturbation model:

$$ \begin{aligned} \lambda(\varepsilon_{ij})_{\text{new}} = {} & -4.846 \cdot \lambda(\varepsilon_{ij})_{\text{ref}} - 708.845 \cdot \Delta Pol_{\text{seq}} + 37.565 \cdot \Delta\Delta Pol_{\text{ho}} + 206.803 \cdot \Delta\Delta Pol_{\text{so}} \\ & - 204.545 \cdot \Delta\Delta Pol_{\text{ip}} + 661.274 \cdot \Delta\Delta Pol_{\text{tq}} + 2.084 \\ & N = 155169 \quad R_{c} = 0.92 \quad U = 0.16 \quad p < 0.01 \end{aligned} $$
(6)

In these equations, N is the number of cases used to train the models, R_c is the canonical correlation coefficient, and U is Wilks' lambda or U-statistic. In line with González-Díaz et al. (2014), the output of the models, λ(ε_ij)new, is a real-valued function that scores the propensity with which a new peptide obtained after perturbation of the initial conditions acts as a B-epitope. The first input term, λ(ε_ij)ref, is the scoring function λ of the efficiency of the initial process ε_ij: λ(ε_ij)ref = 1 if the i-th peptide was experimentally demonstrated to be a B-epitope in the assay of reference (ref) carried out under the conditions c_j, and λ(ε_ij)ref = 0 otherwise. The perturbation terms Δλ_cj = λ(m_q)ref − λ(m_i)new are the differences between the peptide of reference and the new peptide in the mean value of the molecular property in question for all amino acids in the sequence. The independent variables ΔΔλ_cj = Δλ_cj-ref − Δλ_cj-new = [λ(m_q)ref − *λ(c_qr)ref] − [λ(m_i)new − *λ(c_ij)new] quantify how the conditions of the new assay (c_j-new) are perturbed with respect to the initial conditions (c_ij-ref) of the assay of reference. The quantities *λ(c_ij) and *λ(c_qr) are the average values of the mean values λ(m_i) and λ(m_q) of the molecular property in question for all new and reference peptides in IEDB that are epitopes under the j-th or r-th boundary condition (González-Díaz et al. 2014). The variational perturbation terms ΔΔλ_cj resemble terms typical of perturbation theory and the moving average functions used in Box-Jenkins time series models (Box and Jenkins 1970; González-Díaz et al. 2013a). This type of information has recently been incorporated into QSAR/QSPR models (Speck-Planche et al. 2013a, b, c; Vázquez-Prieto et al. 2014).
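As an illustration of how the terms defined above enter a fitted model, the following Python sketch evaluates the polarizability-perturbation model using the coefficients of Eq. 4. It follows the definitions given in this paragraph (reference minus new); the function names and the dictionary layout are hypothetical, and any descriptor values passed in would have to come from MARCH-INSIDE and from the IEDB condition averages, which are not reproduced here.

```python
# Coefficients copied from Eq. 4; the keys are labels chosen for this sketch.
EQ4_COEFFICIENTS = {
    "ref": -4.683,
    "seq": -44.099,
    "ho": 2.666,
    "so": 16.482,
    "ip": -21.668,
    "tq": 47.096,
    "intercept": 2.0103,
}
CONDITIONS = ("ho", "so", "ip", "tq")


def delta_seq(lam_ref, lam_new):
    """Sequence perturbation term: lambda(m_q)ref - lambda(m_i)new."""
    return lam_ref - lam_new


def delta_delta(lam_ref, avg_ref, lam_new, avg_new):
    """Condition term: [lambda_ref - avg_ref] - [lambda_new - avg_new],
    where avg_* is the average descriptor of epitopes under that condition."""
    return (lam_ref - avg_ref) - (lam_new - avg_new)


def score_new_peptide(epitope_ref, lam_ref, lam_new, avg_ref, avg_new,
                      coef=EQ4_COEFFICIENTS):
    """Model output for a new peptide.
    epitope_ref: 1 if the reference peptide is a known epitope, else 0.
    avg_ref, avg_new: dicts keyed by 'ho', 'so', 'ip', 'tq'."""
    score = coef["intercept"] + coef["ref"] * epitope_ref
    score += coef["seq"] * delta_seq(lam_ref, lam_new)
    for name in CONDITIONS:
        score += coef[name] * delta_delta(lam_ref, avg_ref[name],
                                          lam_new, avg_new[name])
    return score


# Hypothetical input: no change in sequence or conditions, so every
# perturbation term is zero and only the first term and intercept remain.
same = {"ho": 0.5, "so": 0.5, "ip": 0.5, "tq": 0.5}
unperturbed = score_new_peptide(1, 0.8, 0.8, same, same)
```

The other two models can be evaluated with the same function by passing a dictionary built from the coefficients of Eq. 5 or Eq. 6.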

The models obtained here are stable and robust, yielding values of accuracy, sensitivity, and specificity >90 % for both training and cross-validation series (see Table 1). These results compare well with those of similar models in the literature, including moving average or perturbation models (Speck-Planche et al. 2012a, b; González-Díaz et al. 2013a). The new models do not improve on the model developed by González-Díaz et al. (2014) in terms of specificity (97 and 97.1 %), sensitivity (93.6 and 93.3 %), and accuracy (95.5 and 95.4 %) for the training and cross-validation series, respectively. However, the results obtained are very similar, and the values of the different statistical parameters demonstrate the high significance of the models, validating the consistency of the method. Thus, the information obtained from the four different types of QSPR-perturbation models developed to date may be combined to increase the likelihood of correctly predicting new epitopes or of optimizing known peptides for computational vaccine design (González-Díaz et al. 2014).

Table 1 Detailed training and cross-validation results for the different QSPR models developed in this work

Because user-friendly and publicly accessible web-servers represent the future direction for developing more practically useful models, simulated methods, and predictors (Chou and Shen 2009), efforts will be made in future work to provide a web-server for the method presented in this paper, as done in a series of recent papers (see, e.g., Guo et al. 2014; Lin et al. 2014; Liu et al. 2014; Qiu et al. 2014a, b; Xu et al. 2014).

Conclusions

In conclusion, this work has demonstrated that atomic polarizability, partition coefficient, and polarity values calculated with MARCH-INSIDE are also good molecular descriptors for building QSPR-perturbation models able to predict the results of variations in the peptide sequences and experimental assay boundary conditions reported in IEDB. Consequently, this type of approach may constitute a potentially valuable route for predicting “in silico” new optimal peptide sequences and/or boundary conditions for vaccine development. In addition, this study may serve as a basis for building better and more reliable models in the future (e.g., consensus QSPR models). This computational technique is by no means aimed at replacing experimentation; rather, it helps to rationalize the experimental process while reducing costs in terms of material resources and time.