Background

The humoral immune system protects the extracellular space from foreign objects like bacteria and viruses. A central role in the immune response is played by antibodies that are secreted by B-cells. These proteins recognize the disease-causing agents and thereby trigger their neutralization. The recognition mechanism involves the binding of antibodies to antigens, which are usually proteins or polysaccharides from the pathogenic substances. Epitopes are the regions of the antigen surface that are bound by the antibodies. The localization and identification of epitopes, which are targeted by specific antibodies and are capable of inducing an efficient immune response, is of utmost importance for the rational design of potential vaccines [1–3].

In this paper, we focus on protein antigens. Their epitopes are classified as linear (or continuous) and conformational (or discontinuous), depending on their structure and interaction with antibodies [4]. Specifically, linear epitopes consist of amino acids that are contiguous in the polypeptide chain, while conformational epitopes contain amino acids that are distant along the sequence but spatially close in the native structure. Linear epitopes are often found in peptides and conformational epitopes in proteins.

For over 30 years, computational methods have been developed to facilitate epitope recognition [5]. In the past, the majority of the in silico methods focused on linear epitopes. Most of these approaches are sequence-based and use amino acid-based propensity scales, such as hydrophilicity, solvent accessibility, secondary structure and flexibility; a score derived from the propensity scales is assigned to each residue, and the whole sequence is examined for high-scoring window fragments, which are then predicted as epitopes [6–12]. However, these methods perform only marginally better than random selection [13]. In the last few years, several groups investigated the combination of multiple amino acid propensity scales to predict linear B-cell epitopes [14–17], with no significant improvement of the prediction success rate. Recently, not only sequence-based, but also structure-based, amino acid features have been used in conjunction with machine learning methods and have been shown to slightly improve the accuracy of linear B-cell epitope predictions [14–23].

Although the large majority of B-cell epitopes are conformational [24], they started to be studied later. Many groups have analyzed various physicochemical, structural, and geometrical features of epitopes in order to determine which of them significantly distinguish epitope from non-epitope antigen residues [25–29] and what the characteristics of antigen-antibody interfaces are compared to other protein-protein interfaces [30–33]. The existing conformational epitope prediction tools were developed by combining such informative attributes, which are based either purely on the sequence, or on both the sequence and the structure [34–39]. More recently, machine-learning techniques have been used to improve the prediction performance of conformational epitopes [40–47].

In this study, we describe SEPIa, a conformational epitope prediction method that requires only the amino acid sequence as input and is based on commonly used features, but also on new ones. It utilizes a meta-learning approach, which combines the predictions obtained with two different classifiers through a voting procedure and yields a single prediction with improved accuracy [48].

Methods

Datasets

We constructed a non-redundant data set of 85 antigen-antibody complexes, noted S85, from the Immune Epitope Database (IEDB-3D) [49]; it is an updated and extended version of the set we used earlier [29]. To increase the number of antigen sequences used for developing our method, and given that the 3-dimensional (3D) structure is only required for the identification of epitope residues, we considered structures with resolution better than 3 Å, against 2.5 Å in our previous study. The other criteria remained the same: (i) for complexes represented by more than one crystal structure, the 3D structure with the best resolution was chosen; (ii) structures in which the antibody binds the antigen but involves no residues from complementarity determining regions (CDRs) were excluded; (iii) complexes in which the antibody does not contain both the light and heavy chains were discarded; (iv) for structures with more than one complex in one asymmetric unit and no structural difference between them, only one complex was chosen; and (v) to obtain a non-redundant data set, the sequences were pairwise aligned using the ClustalW program [50], and if two sequences had a sequence identity of more than 70%, only one was kept. Note that epitopes from similar antigens were kept if these antigens were in complex with different antibody CDR sequences. With this procedure, 85 antibody–antigen complexes were selected and the corresponding coordinate files were obtained from the Protein Data Bank (PDB) [51].

Two of the antigens of the S85 dataset have common epitopes, which are not identified as epitopes in all antigen-antibody complexes. We defined the S83 set that contains all 85 antigen chains of S85 except these two. The lists of antigens of the S85 and S83 sets are given in Additional file 1: Table S1.

To determine the epitopes, we proceeded as in reference [29]. We calculated the solvent accessibility values of the antigen residues without taking the antibody into account (ACCunbound), using an in-house program [29], and compared them with the accessibility of antigen residues in the complex (ACCbound). All antigen residues with a solvent accessibility variation of 5% at least upon antibody binding (ACCunbound - ACCbound ≥ 5%) were considered as epitope residues. The S85 set contains 1,667 conformational B-cell epitope residues and 16,780 other residues. The ratio between epitopes and non-epitopes is thus almost exactly 1:10.
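The accessibility criterion above can be illustrated with a minimal Python sketch. It assumes that relative solvent accessibilities (in %) have already been computed for each antigen residue with and without the antibody; the function name and the toy values are hypothetical and do not come from the in-house program of reference [29].

```python
def assign_epitope_residues(acc_unbound, acc_bound, threshold=5.0):
    """Return the indices of residues whose relative solvent accessibility (%)
    decreases by at least `threshold` upon antibody binding."""
    if len(acc_unbound) != len(acc_bound):
        raise ValueError("both accessibility lists must have the same length")
    return [
        i
        for i, (unbound, bound) in enumerate(zip(acc_unbound, acc_bound))
        if unbound - bound >= threshold
    ]


# Toy accessibilities (%) for five residues; illustrative values only.
acc_unbound = [42.0, 10.0, 0.0, 65.5, 30.0]
acc_bound = [12.0, 8.0, 0.0, 60.5, 30.0]
print(assign_epitope_residues(acc_unbound, acc_bound))  # [0, 3]
```

In this toy case, residues 0 and 3 lose 30% and exactly 5% of accessibility, respectively, and are therefore labelled as epitope residues; the other residues stay below the cutoff.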

An independent dataset of 19 antigen sequences [42], noted S19, was used to evaluate the predictor and to compare it with other methods; it has already been used for that purpose in other investigations [42, 45, 52]. These sequences and epitope assignments were taken from the conformational epitope database (CED) [53]. Here, the epitope residues were not identified on the basis of the 3D structure of the complexes, but rather experimentally, with the help of techniques such as surface plasmon resonance, ELISA and immunoblotting. The ratio between epitope and non-epitope residues in this set is 1:13, with 407 epitope and 5,192 non-epitope residues. The members of the S19 set are listed in Additional file 1: Table S2.

The sequences from both datasets S85 and S19 were pairwise aligned using the ClustalW server [50]. None showed a sequence identity of more than 70%, which is the identity threshold used for building S85. The two datasets may thus be considered as independent.

Features

We evaluated 14 sequence-derived features, referred to as F1–F14. These are:

Amino acid composition

It is well known that certain amino acid types show preferences to be located in epitopes, in non-epitope protein surfaces, or in the protein core [29]. We used here two features related to the amino acid composition: the ratio of the amino acid frequency observed in epitopes and in the remaining antigen surface (referred to as F1), and the ratio of the amino acid frequency in epitopes and in the remaining antigen (F2). These two features were computed on the S85 dataset. Their values are given in Additional File 1: Table S3.

Hydrophilicity

A characteristic closely related to the amino acid composition is the hydrophilicity. Epitopes are known to be enriched in charged and hydrophilic amino acids [29]. We used here the hydrophilicity scale of Hopp and Woods [6] as feature F3.

Secondary structure

As epitopes have been shown to be more often located in the loop regions of the antigen [29], the predicted secondary structure was added as feature F4. We used for that purpose the program NetSurfP [54] that provides amino acid propensities for being in an α-helix, β-strand or coil. We also used the program BetaTPred3 [55] that estimates the β-turn propensities in protein sequences, and tested them as epitope feature F14.

Flexibility

Given that epitopes often involve loop regions, flexibility could be expected to be an informative factor. We used two programs to predict flexibility from sequence, DynaMine [56] (F5) and PredyFlexy [57] (F6), which are based on two different definitions of flexibility. DynaMine predicts the backbone flexibility at the residue level in the form of backbone N-H S2 order parameter values; a value of 1 means a rigid conformation, while a value of 0 means highly dynamic. PredyFlexy is instead based on root mean square fluctuations (RMSF) obtained from molecular dynamics simulations.

Intrinsically disordered regions

The tendency of protein sequences to be structured or unstructured is another feature that could help distinguish epitope from non-epitope regions. Two web servers were used to predict such regions from the amino acid sequence: IUPred [58] (F7) predicts intrinsically disordered regions and ANCHOR [59] (F8) disordered binding regions.

Energy-like

A novel feature that we introduce here is based on the statistical potential formalism [60–62] to evaluate the propensity of a residue to be part of an epitope. The first “potential”, Δw1, measures the influence of a residue of type s at position j along the sequence on the epitope/non-epitope state u (u = E for epitopes and u = N for non-epitopes) of the residue at position i:

$$ {\Delta \mathrm{w}}_1\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}}\right)=-\mathrm{R}\mathrm{T} \ln \frac{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}}\right)}{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}}\right)\mathrm{F}\left({\mathrm{s}}_{\mathrm{j}}\right)}, $$
(1)

where F are relative frequencies computed from the learning dataset S85, R is the gas constant (the Boltzmann constant per mole), T the absolute temperature taken to be room temperature, and j − w ≤ i ≤ j + w, with w an integer between 0 and 8. The size I = 2w + 1 of this sequence interval is a parameter that is optimized to obtain the best prediction performance.

The second “potential”, Δw2, measures the influence of two residues of types s and s′ at positions j and k along the sequence on the epitope/non-epitope state u of the residue at position i:

$$ {\Delta \mathrm{w}}_2\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}},{\mathrm{s}'}_{\mathrm{k}}\right)=-\mathrm{R}\mathrm{T} \ln \frac{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}},{\mathrm{s}'}_{\mathrm{k}}\right)}{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}}\right)\mathrm{F}\left({\mathrm{s}}_{\mathrm{j}},{\mathrm{s}'}_{\mathrm{k}}\right)}, $$
(2)

with j − w ≤ i ≤ j + w and k − w ≤ i ≤ k + w.

To correct for sparse data, we applied the usual correction [54–56]:

$$ \begin{array}{c}\hfill \frac{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}}\right)}{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}}\right)\mathrm{F}\left({\mathrm{s}}_{\mathrm{j}}\right)}\to\ \frac{1}{\upsigma +\mathrm{n}\left({\mathrm{s}}_{\mathrm{j}}\right)}\ \left(\upsigma +\mathrm{n}\left({\mathrm{s}}_{\mathrm{j}}\right)\frac{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}}\right)}{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}}\right)\mathrm{F}\left({\mathrm{s}}_{\mathrm{j}}\right)}\right),\hfill \\ {}\hfill \frac{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}},\mathrm{s}{\hbox{'}}_{\mathrm{k}}\right)}{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}}\right)\mathrm{F}\left({\mathrm{s}}_{\mathrm{j}},\mathrm{s}{\hbox{'}}_{\mathrm{k}}\right)}\to\ \frac{1}{\upsigma +\mathrm{n}\left({\mathrm{s}}_{\mathrm{j}},\mathrm{s}{\hbox{'}}_{\mathrm{k}}\right)}\ \left(\upsigma +\mathrm{n}\left({\mathrm{s}}_{\mathrm{j}},\mathrm{s}{\hbox{'}}_{\mathrm{k}}\right)\frac{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}},\mathrm{s}{\hbox{'}}_{\mathrm{k}}\right)}{\mathrm{F}\left({\mathrm{u}}_{\mathrm{i}}\right)\mathrm{F}\left({\mathrm{s}}_{\mathrm{j}},\mathrm{s}{\hbox{'}}_{\mathrm{k}}\right)}\right),\hfill \end{array} $$
(3)

with n (sj) and n (sj,s′k) the number of residues of these types in the learning set, and σ = 10. This correction ensures that the “potentials” tend to 0 when the number of observations in the data set is too small.

Using these “potentials”, we computed an energy-like contribution for each residue i in a protein sequence, which measures its propensity to be an epitope (u = E) or a non-epitope (u = N):

$$ \begin{array}{cc}\hfill {\Delta \mathrm{W}}_1\left({\mathrm{u}}_{\mathrm{i}}\right)={\displaystyle \sum_j}{\Delta \mathrm{w}}_1\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}}\right);\hfill & \hfill {\Delta \mathrm{W}}_2\left({\mathrm{u}}_{\mathrm{i}}\right)={\displaystyle \sum_{j, k}}{\Delta \mathrm{w}}_2\left({\mathrm{u}}_{\mathrm{i}},{\mathrm{s}}_{\mathrm{j}},\mathrm{s}{\hbox{'}}_{\mathrm{k}}\right),\hfill \end{array} $$
(4)

with j and k in a sequence interval I around residue i. We consider as feature F9 the sum ΔW(ui) = ΔW1(ui) + ΔW2(ui). The values of this feature, for different amino acids and amino acid pairs, are given in Additional file 2.
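As an illustration of Eqs. (1), (3) and (4), the following Python sketch computes the single-residue term Δw1 with the sparse-data correction and sums it over a sequence interval. It is a simplified reading of the formalism rather than the implementation used for SEPIa: the frequencies are pooled over all positions j of the interval, n(s) is taken as the number of pooled observations of type s, and RT is set to 1 (arbitrary units). These three choices, as well as all function names, are assumptions made for the example.

```python
import math
from collections import Counter


def train_delta_w1(sequences, labels, w=3, sigma=10.0, rt=1.0):
    """Estimate dw1(u, s) = -RT ln[F(u, s) / (F(u) F(s))] with the
    sparse-data correction of Eq. (3), pooling all positions |i - j| <= w."""
    pair_counts = Counter()   # (state u of residue i, type s of residue j)
    state_counts = Counter()  # state u of residue i
    type_counts = Counter()   # type s of residue j
    total = 0
    for seq, lab in zip(sequences, labels):
        for i, state in enumerate(lab):
            for j in range(max(0, i - w), min(len(seq), i + w + 1)):
                pair_counts[(state, seq[j])] += 1
                state_counts[state] += 1
                type_counts[seq[j]] += 1
                total += 1
    potentials = {}
    for state, n_state in state_counts.items():
        for res_type, n_type in type_counts.items():
            ratio = pair_counts[(state, res_type)] * total / (n_state * n_type)
            # The corrected ratio tends to 1 (potential 0) when n_type is small.
            corrected = (sigma + n_type * ratio) / (sigma + n_type)
            potentials[(state, res_type)] = -rt * math.log(corrected)
    return potentials


def sum_delta_w1(seq, i, state, potentials, w=3):
    """Single-residue part of Eq. (4) for residue i in state `state`;
    residue types never seen in training contribute 0."""
    return sum(
        potentials.get((state, seq[j]), 0.0)
        for j in range(max(0, i - w), min(len(seq), i + w + 1))
    )


# Toy learning set: one label per residue, "E" = epitope, "N" = non-epitope.
potentials = train_delta_w1(["AAKKAA", "KKAAAA"], ["NNEENN", "EENNNN"], w=0)
print(sum_delta_w1("KAK", 1, "E", potentials, w=1))
```

With this sign convention, a negative value indicates that the sequence context is enriched in the given state, as for any statistical potential of this form. The pair term Δw2 would follow the same pattern with counts over residue pairs (sj, s′k).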

Solvent accessibility

As epitopes are located at the protein surface, an indispensable feature is the predicted solvent accessibility. We used two different programs for that purpose. The first is NetSurfP [54] (F10), which not only predicts the secondary structure but also classifies residues into buried (B) and exposed (E) residues. The second (F11) is an energy-like solvent accessibility predictor that is similar to the epitope/non-epitope predictor described in Eqs (1–4), with the state ui of residue i being exposed (u = E) or buried (u = B). We define a residue to be in the state E (B) if its solvent accessibility is higher (lower) than 5%. This quite low percentage was chosen to ensure that all epitope residues are exposed [29]. The values of this feature, for different amino acids and amino acid pairs, are given in Additional file 3.

Solubility

The intrinsic solubility of the amino acid residues in a sequence is closely related to their propensity of having a certain solvent accessibility. The solubility per residue was calculated using the sequence-based version of the CamSol [63] program (F12).

Evolutionary information

It is a priori not obvious whether epitope residues are as conserved during evolution as non-epitope residues. To analyze this, we evaluated the evolutionary conservation of the epitope and non-epitope residues using the position-specific scoring matrix (PSSM) obtained by aligning the target antigen sequence against a non-redundant set of protein sequences with the PSI-BLAST [64] tool. The conservation value per residue obtained in this way was used as feature F13.

Feature windows

As the characteristics of a residue are influenced not only by the residue itself but also by the neighboring amino acids along the chain, we considered the features in a sequence window around the target residue to predict the epitope/non-epitope state of the latter. Note that this effect is already built into the energy-like features. We considered windows W of 3 to 9 residues centered on the target residue.
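A sequence window of this kind can be built as in the short Python sketch below, which turns a per-residue feature profile into one vector of W values per residue. The treatment of chain termini is not specified in the text; padding with a constant value is an assumption of this example, as are the function and parameter names.

```python
def window_features(values, window=9, pad=0.0):
    """Return, for each residue, the feature values in a window of odd size
    `window` centered on that residue; chain ends are padded with `pad`."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = [pad] * half + list(values) + [pad] * half
    return [padded[i:i + window] for i in range(len(values))]


# Toy per-residue feature values (for instance a hydrophilicity profile).
profile = [1.0, 2.0, 3.0, 4.0]
for row in window_features(profile, window=3):
    print(row)
```

Each row can then be concatenated with the rows of the other features to form the input vector of a classifier for the central residue.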

Machine learning

We applied two machine-learning methods to classify residues as epitopes or non-epitopes on the basis of the 13 features F1-F13. These are the Gaussian Naïve Bayes [65] and Random Forest [66] algorithms. All the parameters of these classifiers were optimized first; in particular, better performances were achieved using Random Forest with 100 trees. The predictions of these two basic classifiers were then combined using a voting algorithm, based on averages of the predicted probabilities. This technique tends to balance out the weaknesses of individual machine-learning classifiers. We used the implementation of these techniques in the scikit-learn [67] package.
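The voting step based on averaged probabilities can be sketched in a few lines of plain Python. The example assumes that each base classifier has already produced, for every residue, a predicted probability of being an epitope; the function name, the 0.5 decision threshold and the toy probabilities are illustrative assumptions. In scikit-learn, this kind of averaging corresponds to what VotingClassifier does with voting="soft", although the exact settings used for SEPIa are not given here.

```python
def soft_vote(probabilities_per_classifier, threshold=0.5):
    """Average the epitope probabilities predicted by several classifiers
    and return (averaged probabilities, binary labels)."""
    lengths = {len(p) for p in probabilities_per_classifier}
    if len(lengths) != 1:
        raise ValueError("all classifiers must score the same residues")
    averaged = [
        sum(column) / len(column)
        for column in zip(*probabilities_per_classifier)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Toy probabilities for three residues from two hypothetical classifiers.
naive_bayes_probs = [0.9, 0.2, 0.6]
random_forest_probs = [0.5, 0.4, 0.3]
averaged, labels = soft_vote([naive_bayes_probs, random_forest_probs])
print(labels)  # [1, 0, 0]
```

The third residue shows the balancing effect: one classifier favours the epitope state (0.6), the other does not (0.3), and the average (0.45) falls below the threshold.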

Imbalanced dataset

One of the difficulties in predicting epitopes is the strongly imbalanced dataset. Indeed, the number of epitopes is one order of magnitude smaller than the number of non-epitopes. All classifiers that use this dataset for training tend to predict every residue as non-epitope. We tested several standard techniques to deal with such imbalanced data. We finally selected a variant of the SMOTE [68] algorithm, i.e. the SVM SMOTE algorithm [69]. The SMOTE algorithm proceeds by oversampling the minority class by creating synthetic instances using a k-nearest-neighbor approach. Similarly, the SVM SMOTE is an oversampling method that uses a Support Vector Machines (SVM) classifier to create new instances of the minority class. This approach was implemented with the Imbalanced-learn python toolbox [70], which is compatible with scikit-learn. We optimized the parameters of the radial basis function kernel in SVM; the parameters achieving the highest scores were C = 1 and γ = 0.01.

In this way, the size of the original dataset was changed, leading to roughly the same number of epitope and non-epitope residues. This procedure was found to be superior to the common method of dividing the majority class randomly into N equal parts (N being equal to the ratio of non-epitopes to epitopes) and combining each part with the minority class to form N distinct learning sets.
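The principle of synthetic oversampling can be illustrated with the following Python sketch of the basic SMOTE interpolation step. It is deliberately not the SVM SMOTE variant described above, which additionally uses an SVM classifier to decide where new instances are created, and it is not the Imbalanced-learn implementation; the function name, the default k and the toy data are assumptions of the example.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic minority instances by interpolating between
    a randomly chosen minority point and one of its k nearest neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority instances are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the other minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (e.g. epitope residues) in a 2-dimensional feature space.
epitope_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
for point in smote_sketch(epitope_vectors, n_new=4):
    print(point)
```

Every synthetic point lies on a segment joining two existing minority points, so the minority class is enlarged without simply duplicating instances.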

Performance evaluation metrics

The Kolmogorov-Smirnov (KS) statistic [71, 72] is a nonparametric test that quantifies a distance (D-value) between the empirical distribution functions of two samples, as shown in Fig. 1. We used it to analyze, for each feature, the difference, if any, between the distributions observed for epitope and non-epitope residues.
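For concreteness, the D-value of the two-sample KS test can be computed as in the Python sketch below, which evaluates both empirical cumulative distribution functions at every observed value and keeps the largest absolute difference. The function name and the toy samples are illustrative; the P-value of the test is not computed here.

```python
from bisect import bisect_right


def ks_d_value(sample_a, sample_b):
    """Largest vertical distance between the empirical cumulative
    distribution functions of two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    d_value = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of sample_a <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of sample_b <= x
        d_value = max(d_value, abs(cdf_a - cdf_b))
    return d_value


# Toy feature values for epitope and non-epitope residues.
epitope_values = [1, 2, 3, 4]
non_epitope_values = [3, 4, 5, 6]
print(ks_d_value(epitope_values, non_epitope_values))  # 0.5
```

A D-value of 0 means that the two empirical distributions coincide, and a D-value of 1 that the two samples do not overlap at all.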

Fig. 1 Cumulative distributions for individual features, with the D-value of the KS test indicated. (a) Energy-like solvent accessibility feature F11 for the sequence interval of size I = 7, with a D-value of 0.185; (b) Feature F2 defined as the ratio of the amino acid frequency in epitopes and in the remaining antigen, with a D-value of 0.177

The performance of the prediction models was measured by the area under the receiver operating characteristic (ROC) curve (AUC) [73]. This curve is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) for various threshold values, and illustrates the performance of binary classifiers. An AUC of 0.5 corresponds to random prediction and an AUC of 1 to perfect prediction.
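The AUC can be computed without explicitly drawing the ROC curve, since it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties counting for one half). The Python sketch below uses this pairwise formulation; the function name and the toy data are illustrative, and the quadratic loop is only meant for small examples.

```python
def auc_score(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count for 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: 1 = epitope, 0 = non-epitope, with predicted probabilities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc_score(labels, scores))  # 0.75
```

In the toy example, three of the four epitope/non-epitope pairs are ranked correctly, which gives an AUC of 0.75.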

The prediction performances were evaluated in 10-fold cross validation on the S85 set, as well as on the independent test set S19. Note that in the 10-fold cross validation procedure, the dataset was split before applying the SVM SMOTE algorithm (see section “Imbalanced dataset” here above), to avoid introducing similarities between the training and test sets. The different classifiers were applied on the same training and test folds.
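The order of operations described here, splitting first and oversampling afterwards, can be sketched in Python as follows. The example works on generic item indices and replaces SVM SMOTE by a simple random duplication of minority items, which is an assumption made to keep the sketch self-contained; whether the folds were defined at the level of residues or of antigens is not specified above, and all function names are hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the item indices and distribute them over k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def oversample_minority(indices, labels, seed=0):
    """Stand-in for SVM SMOTE: duplicate randomly chosen minority items
    until both classes are equally represented."""
    rng = random.Random(seed)
    positives = [i for i in indices if labels[i] == 1]
    negatives = [i for i in indices if labels[i] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        return list(indices)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(indices) + extra


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (training indices, test indices); only the training part is
    oversampled, so the test fold never influences the synthetic data."""
    folds = k_fold_indices(len(labels), k, seed)
    for fold_number, test_indices in enumerate(folds):
        train_indices = [
            i for other, fold in enumerate(folds) if other != fold_number
            for i in fold
        ]
        yield oversample_minority(train_indices, labels, seed), test_indices


# Toy labels with a 1:9 class ratio.
labels = [1] * 10 + [0] * 90
for train, test in cross_validation_splits(labels, k=5):
    print(len(train), len(test))
```

Because the oversampling only sees training indices, no duplicated or synthetic instance derived from a test item can leak into the training set.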

Results and discussion

The 14 amino acid-based physicochemical, energy-like, evolutionary and statistical features F1-F14 described in Methods were first evaluated separately and then combined to build our B-cell epitope predictor SEPIa. Importantly, all the features were calculated solely from the amino acid sequences of the antigens and do not require any structural information.

Feature analysis

We first analyzed the 14 features separately, and compared their capacity to distinguish epitope from non-epitope residues. For this purpose, we used the statistical KS-test, which involves computing the D-value that corresponds to the largest vertical distance between the cumulative distributions of epitope and non-epitope samples computed from the S85 learning set (see Methods and Fig. 1). Another measure that we used for that purpose is based on the construction of a prediction model using the Gaussian Naïve Bayes and Random Forest machine learning algorithms, and combining them using a voting procedure, as explained in Methods. The AUC score of this prediction model, calculated in 10-fold cross validation on the S85 set, yields another estimation of the informative power of the individual features.

A novel sequence-based characteristic introduced in this study in the context of B-cell epitope prediction consists of the energy-like features ΔW1 (E/N) and ΔW2 (E/N) (Eq. (4)), which compute the influence of single amino acids and amino acid pairs, respectively, in a sequence interval of I residues centered around a central residue on the epitope/non-epitope state of this central residue. To identify the optimal size of I, we first used the KS-test. For all tested sizes, ranging from I = 3 to 17, the cumulative distributions of energy values for epitope and non-epitope residues were found to be significantly different (P-value < 0.0001). The best size, determined as the one that gives the largest D-value, is I = 3 and I = 7 almost equally for ΔW1, and I = 7 for ΔW2. Testing the informative value of these features on the basis of the AUC of the prediction model also selects I = 7 as the optimal sequence interval size. We hence fixed I = 7, and considered as feature F9 the sum of the two potentials, ΔW (E/N) = ΔW1 (E/N) + ΔW2 (E/N). The cumulative distributions for E/N states obtained from this feature are slightly better separated than for the individual potentials, with a D-value of 0.166 and an AUC score of 0.551 (Table 1).

Table 1 Prediction performance of the individual features F1-13 and of their combination (F), for all window sizes W = 0-9, estimated by the AUC score and evaluated by 10-fold cross validation of the S85 set. The features indicate intrinsically disordered regions (F8 and F7), flexibility (F5 and F6), evolutionary information (F13), energy-like (F9), secondary structure (F4), solvent accessibility (F10 and F11), solubility (F12), hydrophilicity (F3), and amino acid composition (F1 and F2)

Epitope residues are always located near the surface, and solvent accessibility is thus obviously an important epitope characteristic. We introduced an energy-like feature that compiles solvent accessibility propensities in much the same way as the energy-like epitope/non-epitope feature analyzed above. In this case the state of a residue is surface/buried (S/B) rather than epitope/non-epitope. The optimal sequence interval size I was evaluated on the basis of the KS D-value and the AUC score of the prediction model. Again, I = 7 appears as the best compromise, both for the feature ΔW1 (S/B) based on individual amino acid propensities and ΔW2 (S/B) based on pairs of amino acids. The sum of these two potentials, ΔW (S/B) = ΔW1 (S/B) + ΔW2 (S/B) with I = 7 is defined as feature F11, and leads to cumulative distributions that are slightly better separated than the individual ones, with a D-value of 0.185 (Fig. 1), and an AUC score of 0.59 (Table 1).

Note that the other energy-like feature F11, which is based on the preference of amino acids to be at the surface or buried, distinguishes epitopes from non-epitopes better than feature F9, which is based on the preference of amino acids to be (non-) epitopes. This apparent discrepancy is due to the fact that the epitope/non-epitope assignments in the S85 dataset include more errors than the surface/buried assignments. Indeed, the latter are obtained from the structure and thus are basically error-free. The epitope residues are also correctly assigned, as they are obtained from the structures of the antigen/antibody complexes. In contrast, some of the residues assigned as non-epitopes are probably epitopes in other antigen/antibody complexes. This obviously induces noise in the epitope learning dataset.

Besides the features F9 and F11, we tested the informative content of all other features F1-F14. According to the KS-test, the features F1-F13 differ significantly between the epitope and non-epitope samples with P-values < 0.05, and have higher than random AUC scores (see Table 1). Only the β-turn feature F14 did not show a statistically significant difference between epitope and non-epitope residues. Indeed, the KS-test D-value was found to be equal to 0.028 with a P-value of 0.183, and the AUC score is 0.506. We thus dropped this feature and kept the 13 features F1-F13 for building the epitope predictor.

The characteristics of a residue are influenced not only by the residue itself but also by the neighboring residues along the chain. They are also influenced by the residues that are in spatial contact, but as we restrict ourselves to sequence-based features, we cannot consider them. We tested the information gain obtained with sequence windows W between 3 and 9 residues. Note that these windows are related to the sequence intervals considered for the two energy-like features F9 and F11, for which the optimal value was I = 7. These intervals are considered in the feature construction, whereas the windows are used at the level of the prediction. We tested varying window sizes W in designing the predictor for all features including F9 and F11.

The results of the AUC scores, computed in 10-fold cross validation, are given in Table 1 for the different window sizes and the 13 individual features. For 7 out of the 13 features, the optimal value is W = 9. We thus selected this window size.

The 13 features are ranked as a function of increasing AUC in Fig. 2 (for W = 9). The best individual features are F1, F2 and F3, and are related to the amino acid composition. The best one, F2, is equal to the ratio of amino acid frequencies in epitopes and in the remaining antigen, and reaches an AUC of 0.62. The second best, F1, is the ratio of the amino acid frequency in epitopes and in the remaining antigen surface, and has an AUC of 0.61. The third best feature, F3, is hydrophilicity, with an AUC score of 0.59. The energy-like epitope/non-epitope feature F9, which is based on more complex combinations of amino acid propensities, performs slightly worse, with an AUC of 0.55. Note that its AUC is higher (0.58) for W = 5 and 7.

Fig. 2 Prediction performance of the individual features F1-13 and of their combination (F), estimated by the AUC and evaluated by 10-fold cross validation of the S85 set, using a sequence window size W = 9. The bold horizontal line indicates the level of random prediction. From least to best performing: intrinsically disordered regions (F8 and F7), flexibility (F5 and F6), evolutionary information (F13), energy-like (F9), secondary structure (F4), solvent accessibility (F10 and F11), solubility (F12), hydrophilicity (F3), and amino acid composition (F1 and F2)

The next best features are related to the solvent accessibility prediction, which is obviously an important factor since epitopes are at the protein surface. The best of these is the energy-like accessibility feature F11, with an AUC of 0.59, followed by the solvent accessibility feature F10, based on NetSurfP predictions, with an AUC of 0.57. The related feature F12, solubility, also presents an AUC of 0.57.

The flexibility features F5 and F6 also carry some information. F6, obtained from the RMSF computed from molecular dynamics simulations, shows an AUC of 0.58, while F5, obtained from backbone N-H S2 order parameters, is only slightly above random, with an AUC of 0.53. This reflects the observations that epitope residues are usually positioned in flexible regions. Similarly, epitopes are more often localized in loop regions than in helices and strands; the predicted secondary structure, feature F4, has an intermediate AUC of 0.56. The related features F7 and F8, based on the prediction of intrinsically disordered regions, have AUC values in the same range: 0.56 and 0.53, respectively.

The last feature, F13, which is based on evolutionary sequence conservation, has a low AUC of 0.55: epitope regions are slightly less conserved than other regions, but the difference is small.

Overall, the analysis of the individual features indicates that all the tested attributes possess a weak to medium ability of differentiating epitope from non-epitope residues.

SEPIa predictor

We combined the 13 tested features to set up the final predictor, called SEPIa. The algorithm used is the same as for the individual features, a combination of Gaussian Naïve Bayes and Random Forest algorithms using a voting procedure (see Methods). Note that even those features that are only slightly better than random carry some information. Indeed, their elimination decreases the prediction score.

The results obtained in the 10-fold cross validation procedure on the S85 set are given in Table 1 for window sizes W = 0-9 (column F). The best AUC scores are obtained for W = 9, as for the individual features. They reach 0.65, which is a small but significant increase with respect to the individual features. Indeed, the best feature, F2, reached only 0.62. The gradual improvement of the overall performance upon sequential addition of the features is given in Table 2.

Table 2 Increase of the prediction performance upon sequential addition of features. The window size is W = 9, and the AUC score is evaluated in 10-fold cross validation on the S85 dataset

Two of the antigens of the S85 training set have residues that are epitopes in some antigen-antibody complexes and non-epitopes in others. As most such common epitopes have been found related to autoimmunity [74], we removed these antigens from the S85 set, and trained another model on this restricted set S83 (Additional file 1: Table S1). The AUC obtained in 10-fold cross validation is equal to 0.65, and is thus identical to that obtained from the full dataset. We thus chose to keep the SEPIa predictor obtained with the complete S85 training set.

We also tested the SEPIa predictor on S19, an independent dataset of 19 antigen sequences [42], whose epitope assignment was made on the basis of experimental (non-structural) data (see Methods), and which is used here for comparison with other methods (see next section). The results on this test set were quite similar to those obtained from S85, as shown in Table 3: the window size 9 appears to be the best, and the AUC score reaches 0.65. The agreement between the results obtained from these two independent datasets increases the confidence in them.

Table 3 Prediction performance of the combination of features as a function of the window size, estimated by the AUC score and evaluated on the S19 test set

The SEPIa prediction model, obtained with the scikit-learn package, is available as a file SEPIa.zip in Additional file 4. It needs as input the sequence of the target protein, and the 13 features computed on it.

Comparison with other methods

Several other B-cell epitope prediction methods have been developed, of which a certain number have been tested on the S19 set, while trained on an independent set. These are: Zhangbound and Zhangunbound [45], Zhengbound and Zhengunbound [52], CBTOPE [75], EPCES [76], Epitopia [41], DiscoTope [34], BPredictor [43], SEPPA [39], and EPSVR [42]. The first five use the amino acid sequence as sole input, whereas the last six also use the 3D structure. The AUC scores obtained by these methods are given in Table 4; they are taken from the original articles and from [45].

Table 4 The performance of different epitope prediction servers, estimated by the AUC score and evaluated on the S19 test set

With its AUC score of 0.65, SEPIa appears to slightly outperform the other methods. The second best sequence-based method, CBTOPE, has an AUC of 0.61. Note, however, that the S19 dataset is too small for these score differences to be statistically significant.

The structure-based methods do not perform better than the sequence-based methods. The best one, EPSVR, has an AUC of 0.61. This may seem surprising, given that considering the 3D structure obviously adds information. However, in this case, the predictors focus on surface residues only and classify them into epitope and non-epitope regions. Therefore, the scores of the sequence- and structure-based predictors cannot be directly compared: the former distinguish epitope residues out of all surface and core residues, and the latter epitope residues out of surface residues only.

Case study

To test the performance of the SEPIa predictor, we applied it to the human β2 adrenergic G-protein-coupled receptor (β2AR), which was co-crystallized with an antigen-binding fragment (Fab) and solved by X-ray crystallography (PDB ID: 2R4R) [77]. Due to its low resolution of 3.4 Å, this structure was excluded from the learning set S85. Its sequence identity with the antigens from the S85 learning set is below 20%, which makes it a good case study. Several sequence regions are absent in the X-ray structure: the C-terminus, the three extracellular loops and the third intracellular loop. Comparative modeling with the SwissModel server [78] was used to build a complete structural model of β2AR, using the 2R4R structure as a template. The structure of the complex is depicted in Fig. 3.

Fig. 3

Predicted and observed epitope residues in the human β2AR receptor. The predicted epitope residues are in green, the observed epitopes are in red, and the residues that are both predicted and observed as epitopes are in blue. Above: amino acid sequence, with the modeled loop regions in italic and underlined. Below: structure of β2AR co-crystallized with a Fab fragment, shown as ribbons with predicted and observed epitopes in sticks; β2AR is colored in light purple with modeled regions in light pink, Fab heavy chain in dark gray and Fab light chain in light gray

The epitope residues were assigned from the structure of the antibody-antigen complex as described in Methods. There are 12 epitope residues, depicted in red and blue in Fig. 3. The predictions obtained with SEPIa involve 29 epitope residues (in green and blue). Only 4 epitope residues are both predicted and observed (blue). Hence, the number of correctly predicted epitopes is TP = 4, of correctly predicted negatives TN = 272, of incorrectly predicted epitopes FP = 28, and of incorrectly predicted non-epitopes FN = 8. The AUC score on this test protein is equal to 0.77.
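The bookkeeping behind these four counts can be sketched as follows, starting from the sets of observed and predicted epitope residue positions. The function names and the toy residue positions are hypothetical and do not correspond to β2AR.

```python
def confusion_counts(n_residues, observed, predicted):
    """Return (TP, TN, FP, FN) from sets of residue positions.

    By construction, TP + FN equals the number of observed epitope
    residues and TP + FP equals the number of predicted ones.
    """
    observed, predicted = set(observed), set(predicted)
    tp = len(observed & predicted)   # predicted and observed
    fp = len(predicted - observed)   # predicted but not observed
    fn = len(observed - predicted)   # observed but not predicted
    tn = n_residues - tp - fp - fn   # neither predicted nor observed
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Sensitivity, specificity and precision (0.0 when undefined)."""
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical 20-residue protein.
toy_observed = {3, 4, 5, 6}
toy_predicted = {5, 6, 7, 12, 13}
tp, tn, fp, fn = confusion_counts(20, toy_observed, toy_predicted)
toy_rates = rates(tp, tn, fp, fn)
```

Unlike the AUC, these quantities depend on the score threshold used to call a residue an epitope.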

Although the score is quite low, it is worth looking in more detail at the predictions shown in Fig. 3. Six residues that are incorrectly predicted as epitopes are actually very close to the binding interface. Together with the 4 correctly predicted epitope residues, this gives 10 predicted residues that lie in or next to the actual epitope region of 12 residues. All but one of the 18 other incorrectly predicted epitope residues are situated in loop regions at the protein surface; the fact that SEPIa, on the basis of the amino acid sequence alone, does not predict epitopes in the core is in itself a positive result. Moreover, these other incorrectly predicted epitope residues do not form clusters of interacting residues, except in one region. Note that some of them could very well correspond to epitopes in another complex, with another antibody, as discussed above.

Conclusions

The identification of immunogenic regions on antigen proteins provides the basis for the rational design of potential vaccines. In this study, we have developed the SEPIa conformational epitope predictor, which is based on the amino acid sequence alone. It uses a voting algorithm for combining the predictions of two classifiers, a Gaussian Naïve Bayes and a Random Forest classifier. This procedure tends to alleviate the weaknesses of the individual models. Thirteen amino acid-based features were exploited in the predictor. Some of them are directly linked to the amino acid sequence and describe the frequency, physicochemical properties, and evolutionary conservation of the amino acid residues. The others are sequence-based predictions of different features, in particular the epitope state, solvent accessibility, secondary structure, flexibility, and disorder. The values of all these features in a sequence window of 9 residues are taken into account to predict whether or not the middle residue is an epitope. Our method achieves an AUC score of 0.65 in 10-fold cross-validation on the learning dataset. Almost the same performance is obtained on an independent dataset, on which other predictors have also been tested.

The performance of the SEPIa predictor, albeit limited, is slightly better than that of other methods on the S19 test set, although this set is too small for the difference to be statistically significant. Moreover, its application to a test protein, β2AR, illustrated its usefulness. First, many of the predicted epitope residues in this protein are either correctly predicted or spatially close to the experimentally determined epitope residues. Furthermore, most incorrectly predicted epitope residues are located in loops (and could thus correspond to different, not yet identified, epitopes) and/or do not cluster together in space (which could allow them to be marked as unlikely epitopes). This last characteristic opens new perspectives for proteins of known or modeled structure: predicted epitope residues that are close in space could be combined into epitope clusters, with the largest clusters being more likely to correspond to true epitope regions.
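One possible way to form such epitope clusters is sketched below. It is not part of SEPIa: it assumes single-linkage grouping of predicted residues whose representative atoms (for example Cα) lie within a distance cutoff, and the 8 Å cutoff, the function name and the toy coordinates are all hypothetical choices.

```python
import math


def cluster_residues(coords, cutoff=8.0):
    """Group residues into single-linkage spatial clusters.

    coords: dict mapping residue number -> (x, y, z) in angstroms.
    Two residues join the same cluster when a chain of residues, each
    within `cutoff` of the next, connects them. Largest clusters first.
    """
    unvisited = set(coords)
    clusters = []
    for start in sorted(coords):
        if start not in unvisited:
            continue
        unvisited.discard(start)
        stack = [start]
        cluster = []
        while stack:
            current = stack.pop()
            cluster.append(current)
            near = [other for other in unvisited
                    if math.dist(coords[current], coords[other]) <= cutoff]
            for other in near:
                unvisited.discard(other)
                stack.append(other)
        clusters.append(sorted(cluster))
    clusters.sort(key=len, reverse=True)
    return clusters


# Hypothetical coordinates of predicted epitope residues.
toy_coords = {
    10: (0.0, 0.0, 0.0), 11: (3.8, 0.0, 0.0), 12: (7.6, 0.0, 0.0),
    45: (30.0, 0.0, 0.0), 46: (33.0, 0.0, 0.0),
    90: (60.0, 60.0, 60.0),
}
toy_clusters = cluster_residues(toy_coords, cutoff=8.0)
```

In this toy case the isolated residue 90 would be the least likely to belong to a true epitope region, and residues 10 to 12 the most likely.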

Finally, we would like to underline three difficulties with which all epitope predictors are faced, and which limit their performance. The first is the noise in the learning dataset, due to the fact that some residues marked as non-epitopes are in fact epitopes in other antigen-antibody complexes. The second is the strong imbalance between the number of epitope and non-epitope residues, which impedes high-performance machine learning. The third is related to the observation that proteins can exist and be stable without being recognized by antibodies. This implies that the properties of epitope residues are not very different from those of other surface residues, and increases the complexity of the prediction problem.