Background

As the major histocompatibility complex (MHC) in humans, human leukocyte antigens (HLAs) are important immunologic proteins found on the surface of somatic cells [1]. They can present antigenic peptides from the infectious agents to T-cells to induce immune responses [25]. People of different ethnicities or from different regions may carry distinct HLA variations or alleles [6, 7]. According to IMGT/HLA database [6] by Mar 15, 2015, there are more than 12,000 alleles identified for HLAs, making the HLAs as one of the most polymorphic genes in humans. HLA genes contain multiple loci including A-G. The HLA D locus is classified as Class II and the rest are categorized as Class I due to their differences of responding T-cells and functions [25]. Since the binding grooves of Class I HLAs are determined by one single chain and there are a lot of peptide binding data and three-dimensional structures available [8, 9], we selected Class I HLAs in this study to demonstrate the applicability of network analysis to predict HLA-peptide binding.

Patients carrying certain HLA alleles are more likely to develop adverse drug reactions (ADRs) after taking specific drugs. Drug-HLA associations have been identified between abacavir and HLA-B*57:01 [1012], flucloxacillin and HLA-B*57:01 [13], and carbamazepine and HLA-B*15:02 [14], etc. Several mechanisms have been proposed to understand the HLA related ADRs, including the hapten concept, the super-antigen model, the p.i. concept, the altered repertoire model and the danger hypothesis [1518]. In all the hypotheses except the danger hypothesis, the HLAs on the surface of antigen-presenting cells (APCs) present peptides to T-cell receptors (TCRs) on the surfaces of T-cells and the drug molecules interfere with the system through covalent binding to the peptides, instable interaction, or insertion into the binding grooves of HLAs. Ultimately, it is beneficial to predict ADR occurrences of drugs before patients take the drugs. However, the mechanisms for ADRs are complicated and each of the players in the system has a large number of variations in their structures, making it very challenging to study HLA related ADRs. Our previous molecular modeling study showed the drug-HLA binding prediction was improved when the binding peptide was incorporated in the modeling system [19]. Therefore, better understanding and accurately predicting HLA-peptide binding could facilitate predicting ADRs related to genetic predisposition.

Various machine learning models have been developed to predict HLA-peptide binding for individual HLAs [2022]. However, lacking enough experimental HLA-peptide binding data to train machine learning models for many HLAs limits the capability of this approach. In addition, a significant part of machine learning models uses parameters that are derived from peptides with the same length but experiments showed the same HLA can bind peptides with different lengths, making predicting HLA-peptide binding using these methods very challenging. New methods for accurately predicting HLA-peptide binding that overcome the challenges of the reported machine learning models are in urgent need. Therefore, in this study we conducted network analysis to understand the binding characteristics between HLAs and peptides and developed a new method named neighbor-edges based and unbiased leverage algorithm (Nebula) to predict HLA-peptide binding.

Through analyzing the HLA-peptide binding network, we identified nine modules that are densely connected regions in the network [23]. Modularity is the measurement of goodness of a division of a network into modules [24] and was used to yield nine modules. The modularity of the real HLA-peptide binding network was compared to the modularity values yielded from random networks. Peptides and HLAs in the same modules shared similar properties. We further developed Nebula to predict HLA-peptide binding. To our best knowledge, this study is the first one to use network analysis for understanding and predicting HLA-peptide binding.

Methods

Study design

An overview of this study's workflow is shown in Figure 1. We first harvested qualitative Class I HLA-peptide binding data from four major databases that collected and curated HLA binding assays from the literature. The HLA-peptide binding network was then constructed from the harvested data. Thereafter, the fast greedy modularity optimization algorithm was used to identify modules. The modularity analysis on 1,000 randomly generated networks was conducted to verify that the modules yielded from the HLA-peptide binding network could not be generated by chance. Finally, we implemented Nebula to make predictions and evaluated its performance via leave-one-out (LOO) validations.

Figure 1
figure 1

A flowchart of this study's workflow. Qualitative HLA-peptide binding data were harvested from IEDB, SYFPEITHI, MHCBN and AntiJen databases. The HLA-peptide binding network was generated from the binding data. The modularity analysis was then conducted on the HLA-peptide binding network and the 1,000 randomly generated networks. Finally, Nebula was used to predict HLA-peptide binding.

Data preparation and network construction

Four major databases, IEDB [25], SYFPEITHI [26], MHCBN [27] and AntiJen [28], contain HLA-peptide binding data curated from the literature. IEDB, MHCBN and AntiJen provided qualitative binding categories (positive or negative), while SYFPEITHI contains all positive bindings. For IEDB data, "positive-high", "positive-intermediate" and "positive-low" were all considered as positives. For AntiJen data, the "weak binders" are considered as negatives according to the paper's description [28].

We harvested the qualitative binding data of Class I HLA-peptide binding assays from the four databases by Aug 25, 2014 and combined all data into a single dataset. The dataset contains three columns: HLA, peptide and binding category (positive or negative). For an HLA-peptide pair with multiple entries in the databases, we calculated the proportion of positives. If the proportion is larger than or equal to 0.5, then it was stored as a unique record of positive (otherwise it was labelled as a negative). We removed the peptides and HLAs that contain only one binding datum for two reasons: 1) the datum may be in low quality because only the HLA or the peptide shows binding among such a large number of peptides or HLAs; and 2) it would not be able to be predicted by Nebula because no data for the HLAs or peptides could be used for the predictions in the LOO validations. The data filtering process was run through several iterations to make sure all the peptides and HLAs had more than one binding datum. To enable the calculations in the network analysis, we used 2 and 1 to represent positive and negative, respectively. Finally, the HLA-peptide binding network was constructed using the igraph package (version 0.7.1) in R 3.1.3.

Module identification and modularity analysis

We used the fast greedy modularity optimization algorithm by Clauset et al. [24] via the igraph package to identify modules within the HLA-peptide binding network. This algorithm is well-known for its advantage to fast detect modules from large networks [29]. To examine whether the yielded modules really have binding characteristics for HLAs and peptides or could be generated by chance, we generated 1,000 random networks using three criteria: (1) the network topology, both nodes and edges, in the random networks remain the same as in the real HLA-peptide binding network; (2) only weights (positive or negative) were randomly shuffled, while keeping the same amount of positives and negatives; and (3) the modules from random networks were generated using the same algorithm and parameters. The modularity values [24] were then compared between the real HLA-peptide binding network and its randomly permutated networks.

Comparative analysis of modules

The modules yielded from modularity analysis of the HLA-peptide binding network were compared in terms of both the HLAs and the peptides. To compare the HLAs across modules, the available protein sequences of all the HLAs were first downloaded from the IMGT/HLA database [6]. The HLA sequences were then aligned using the MUSCLE method in MEGA 5.2.1 [30] with default parameters. Chelvanayagam [8] identified a uniform list of HLA residues that specifically interact with each amino acid (AA) position in 9-mer peptides. These residue numbers were given referring to the sequence of A*02:01 (PDB ID: 3HLA) [9]. For each AA position, we extracted the corresponding residues from all the HLA sequences and put them together as position-specific pseudo-sequences. The pairwise sequence identities of the pseudo-sequences were calculated using Clustal Omega 1.2.0 [31].

Nebula

Nebula was developed through modification of the collaborative filtering algorithm [32] and the network-based inference (NBI) method [33, 34]. The HLA-peptide binding data can be constructed into a weighted bipartite network, where an edge is drawn between HLA h i and peptide p x if there is a binding datum (positive or negative) between them. The edge weight is given by w h i , p x in equation (1).

w h i , p x = 2 , i f p o s i t i v e b i n d i n g 1 , i f n e g a t i v e b i n d i n g
(1)

If HLA h i and peptide p x do not have a binding datum, a prediction value for the edge between h i and p x can be calculated from the edges neighboring to the edge in prediction using equations (2-6).

P h i , p x = w h i ¯ + j ( w h j , p x - w h i ¯ ) S h i , h j j S h i , h j
(2)

w h i ¯ is the average weight of all edges that connect to HLA h i , h j is a HLA that connects to peptide p x , and S h i , h j is the Pearson correlation coefficient between h i and h j and calculated using equation (3).

S h i , h j = a ( w h i , p a - w h i ¯ ) ( w h j , p a - w h j ¯ ) a ( w h i , p a - w h i ¯ ) 2 a ( w h j , p a - w h j ¯ ) 2
(3)

p a indicates a peptide that connects to both h i and h j . When h i and h j do not share a connected peptide, S h i , h j =0. Likewise, P p x , h i can be calculated using equation (4).

P p x , h i = w p x ¯ + k ( w p k , h i - w p k ¯ ) S p x , p k k S p x , p k
(4)

w p x ¯ is the average weight of all edges that connect to peptide p x , p k is a peptide that connects to HLA h i , and S p x , p k is the Pearson correlation coefficient between p x and p k and calculated using equation (5).

S p x , p k = a ( w p x , h a - w p x ¯ ) ( w p k , h a - w p k ¯ ) a ( w p x , h a - w p x ¯ ) 2 a ( w p k , h a - w p k ¯ ) 2
(5)

h a is a HLA that connects to both p x and p k . When p x and p k do not share a connected HLA, S p x , p k =0.

Nebula treats the contributions from the edges connected to the two nodes of the edge equally in prediction, that is P h i , p x and P p x , h i . Therefore, the final prediction value between HLA h i and peptide p x as F(h i ,p x ) is calculated using equation (6).

F ( h i , p x ) = P h i , p x + P p x , h i 2
(6)

F(h i ,p x ) is a continuous value which is converted into a categorical prediction value C(h i ,p x ) in Nebula using the unbiased leverage (UL) as presented by equation (7). Since we assigned the weights for positive binding as 2 and negative as 1, the UL was set to be 1.5.

C ( h i , p x ) = p o s i t i v e , i f F ( h i , p x ) U L n e g a t i v e , i f F h i , p x < U L
(7)

Evaluation of Nebula performance

To evaluate the performance of Nebula, we used LOO validations. Each of the edges was taken out one at a time from the HLA-peptide binding network, and the remaining network was used to predict the weight of the taken-away edge. A receiver operating characteristic (ROC) curve was generated using the continuous final prediction values F(h i ,p x ) against the binding labels using AUC package in R (version 0.3.0). Sensitivity, specificity and accuracy were calculated by comparing the categorical prediction values C(h i ,p x ) against the labels determined from HLA-peptide binding assays. We did a similar evaluation for NBI method [34] as a comparison. The author of NBI, Dr. Feixiong Cheng, provided the NBI codes to us.

Two-fold cross-validations were also conducted to eliminate potential over-fitting from the LOO validations. Each time the entire HLA-peptide binding network was randomly divided into two even portions and each portion was used to predict HLA-peptide binding in the other portion. We ran 100 iterations and calculated the sensitivity, specificity, accuracy and area under the ROC curve (AUC) values to measure the performance of Nebula.

Results and discussion

Modularity analysis

After data pre-processing, we obtained 118,959 binding data points (39.6% positives and 60.4% negatives) between 18,630 peptides and 211 Class I HLAs for network construction and modularity analysis (Supplementary Table S1 in Additional file 1). Nine modules were identified from the HLA-peptide binding network using the fast greedy modularity optimization algorithm as shown in Figure 2a. A modularity value of 0.489 was found. The calculated results of the peptides and HLAs in the nine modules are given in Table 1. The sequences of the peptides and HLAs in the nine modules are listed in Supplementary Table S2 and S3, respectively, in Additional file 1.

Figure 2
figure 2

Results of modularity analysis. Nine modules were generated from the HLA-peptide binding network and plotted via Cytoscape 3.2.0 (a). The HLAs are shown in red, peptides in cyan and edges in grey. Modularity values of 1,000 randomly permutated networks are given in the histogram (b). As a comparison, the modularity value yield from the real HLA-peptide binding network is indicated by the red arrow.

Table 1 Statistics of peptides and HLAs in the nine modules

Using the same modularity analysis algorithm, we analyzed the 1,000 randomly generated networks. The 1,000 modularity values were plotted as a histogram shown in Figure 2b. All the 1,000 modularity values are lower than the modularity yielded from the real HLA-peptide binding network (the red arrow in Figure 2b), indicating the nine modules harvested by the algorithm is not likely some result obtained by chance. In order to discover potential signals buried in the nine modules, we analyzed the peptide and HLA properties across the modules.

In this modularity comparison, we used very strict criteria to generate the random networks not even altering the topology of the original network. Another way is to generate random networks by reconnecting edges while keeping the same amount of nodes, positive and negative edges, which resulted in modularity values with even larger differences (data not shown).

Peptide analysis across the modules

The distributions of peptides and HLAs in the nine modules are listed in Table 1. For peptides with a specific length such as 8-mers, the peptide distribution across the modules is also shown in Table 1 by column with the column sum equal to 100%. Using 10% as a cutoff, we found 8-mer and 11-mer peptides are more likely to appear in modules 4-7, while 9-mers and 10-mers majorly exist in modules 1-4. Modules 1-3 and 5-7 show a higher specificity on peptide lengths while module 4 is a mixture of 8-mers, 9-mers and 11-mers. The results indicate peptides in different modules may have different binding interactions with HLAs.

Modules 1-7 are the major modules that contain more than 1,000 peptides and 9-mers are the majority of peptides (68.8%). We further analyzed 9-mer peptides across these modules to explore HLA-peptide binding characteristics of the modules. According to Hong et al. [35], the common 20 amino acids (AAs) can be categorized into 3 groups: (1) polar charged (Arg, Asp, Glu, His and Lys), (2) polar uncharged (Asn, Cys, Gln, Gly, Ser, Thr and Tyr), and (3) apolar (Ala, Ile, Leu, Met, Phe, Pro, Trp and Val). For each module, we categorized the AA residues within the 9-mer peptides into the three groups from Position 1 to 9. The result is shown in Figure 3.

Figure 3
figure 3

The distribution of amino acid residues by position from 9-mer peptides in each module. Each bar chart shows the 9-mer peptides from a module indicated at the top of the bar chart. The x-axis marks P1 to P9 represent positions 1 to 9 in 9-mers. At each position, 20 possible amino acid residues were categoried into three groups: polar charged (green), polar uncharged (red) and apolar (blue). The percentages of the three groups are shown at each position.

We found the distributions of AA residue categories are similar across different modules for most positions. However, two positions showed very distinct characteristics across modules. For position 2 (P2), while module 4 contains 80.8% apolar AA residues and module 6 includes 93.2% apolar AA residues, module 7 has a dominant proportion of polar charged AA residues (up to 98.1%). For the last position (PΩ, or P9 for 9-mer peptides), while module 3 showed a majority of polar charged AA residues (79.0%), modules 4, 6 and 7 had 87.5%, 78.4% and 81.4% of apolar AA residues, respectively. Detailed information regarding AA residue distributions at Position 2 and 9 in each module compared to their overall distributions among all nine modules is attached in Supplementary Table S4 in Additional file 1. Similar results were also observed for other lengths of peptides (results not shown). Therefore, we found differences in peptide lengths and properties across the modules.

It has been reported that position 2 (P2) and, especially the last position (PΩ), of peptides have critical effects on drug binding inside the HLA binding grooves which may affect the occurrence of HLA related ADRs [16, 36]. Our network analysis differentiated peptides with specific amino acid properties into different modules and clustered those similar ones together in certain modules, which could be useful to further our understanding of HLA related ADR mechanisms.

HLA sequence analysis across the modules

Since Class I HLA alleles are highly similar, the sequences of their peptide-binding regions (residues 2 to 182, Chain A) aligned well without a single gap. These partial HLA sequences are given in Supplementary Table S3 in Additional file 1. To assess the sequential differences of HLAs across the modules, we plotted the available sequences in Figure 4 to highlight the residues that are different from the most frequent residue for each position. The different residues are colored according to the three AA categories mentioned above.

Figure 4
figure 4

The HLA sequence difference across the modules colored by residue category. HLA sequences from seven major modules are shown. Each line is an HLA partial sequence (residues 2 to 182 on Chain A) of the peptide-binding region. Each column represents a residue at the same position across all the HLAs. For each column, if a residue has their highest frequency within the column, it is not shown (white color). Otherwise, it is colored according to the three categories as explained in the legend to Figure 3.

We observed some HLA sequence differences in the modules at certain sequence positions. For example, in module 7, the residues at position 24 and 67 are apolar and the residues at position 45 are polar charged, which showed a uniformity and difference against the rest of the modules. Since these three HLA residues are reported to interact with position 2 of 9-mer peptides [8], combining the peptide sequence analysis results that indicated the position 2 of 9-mer peptides in module 7 is dominant by polar charged residues (98.1%), we think the peptides and HLAs in the same modules are concordant to form specific binding patterns.

We also analyzed the identities of HLA pseudo-sequences that specifically interact with each AA position of 9-mer peptides, and the results for positions 2 and 9 are given in Supplementary Table S5 in Additional file 1. For positions 2 and 9, we found the pseudo-sequence identities within the modules are generally significantly higher (p < 0.05) than those between modules. Especially, for position 2 pseudo-sequences, module 7 had the highest average identity within the module and lowest average identity between the modules.

In summary, this study revealed that not only the peptides, but also HLA sequences showed more similarities and concordant properties within the modules than between the modules. Modularity analysis of the HLA-peptide binding network is helpful to understand HLA-peptide binding interactions that, in turn, could facilitate understanding of HLAs related ADRs.

Validations for Nebula

To evaluate the performance of Nebula, LOO validations were conducted in which each of the 118,959 HLA-peptide binding data was left out for prediction by the network constructed from the rest of the 118,958 binding data points. The results were plotted as a ROC curve shown in Figure 5a. AUC was calculated to be 0.868. The sensitivity, specificity and accuracy are 70.8%, 88.7% and 81.6%, respectively.

Figure 5
figure 5

HLA-peptide binding prediction performance. The comparison of ROC curves of leave-one-out validations were plotted for Nebula (a) and NBI (b). The sensitivity, specificity, accuracy and AUC values were plotted for the 100 iterations of two-fold cross-validations (c).

As a comparison, the performance of NBI for predicting HLA-peptide binding in the same dataset was evaluated using the same LOO validations. The results were given in Figure 5b, with a lower AUC of 0.799. The sensitivity, specificity and accuracy are 33.2%, 94.4% and 68.1%, respectively. The results indicated that Nebula generally outperformed NBI and holds a promising application in analyzing big and sparse datasets such the HLA-peptide binding dataset used in this study.

In order to reduce the potential over-fitting from the LOO validations, we conducted 100 iterations of two-fold cross-validations for Nebula. The results were shown in Figure 5c. The sensitivity, specificity, accuracy and AUC values are 69.2% ± 0.2%, 86.3% ± 0.2%, 79.5% ± 0.1% and 0.830 ± 0.001, respectively, slight lower than the LOO validations as expected, indicating over-fitting is not a big concern for Nebula.

Machine learning methods such as artificial neuron network (ANN) [37] and support vector machine (SVM) [38] were used for HLA-peptide binding predictions. However, most conventional machine learning methods have been applied for a limited number of HLAs and peptides of specific lengths because they require a large enough amount of data to train a reliable prediction model. Moreover, the independent variables used in most reported HLA-peptide binding prediction models were derived from peptides with a fixed length unless an extra process [39] was implemented to process peptides with different lengths. As demonstrated by the results, Nebula not only achieved good prediction accuracy, but also does not require a large amount of experimental data for an HLA allele or a fixed length for peptides.

Conclusions

We identified nine modules in the HLA-peptide binding network using the fast greedy modularity optimization algorithm. The modules showed distinct distributions and properties for both peptides and HLAs across the modules, indicating network analysis is a promising approach to understand structures and characteristics of big and sparse data. We developed Nebula for prediction based on network analysis and used HLA-peptide binding dataset as a case study to demonstrate it is reliable and practicable for big data analysis. Our results suggest that the network analysis methods such as Nebula are applicable and effective to interpret and predict large and sparse datasets such as the HLA-peptide binding dataset used in this study. We showed such methods could accurately predict HLA-peptide binding that, in turn, could improve predictions of HLA related ADRs to better implement precision medicine.

Disclaimer

The findings and conclusions in this article have not been formally disseminated by the US Food and Drug Administration (FDA) and should not be construed to represent the FDA determination or policy.