Background

HIV and Bevirimat

Bevirimat (BVM) [1] belongs to a new class of anti-HIV substances that inhibit maturation of virus particles by preventing cleavage of precursor polyprotein by the retroviral protease (PR). BVM prevents the final cleavage of precursor protein p25 to p24 and p2, hence p25 proteins are accumulating in the immature virions. These immature viral particles are not capable of transforming to an infectious stage, and the viral replication cycle is interrupted. A first set of mutations conferring resistance to BVM were found in selection experiments with BVM and were located at CS p24/p2 [14]. In clinical phase II trials, polymorphisms in the QVT-motif of p2 were found to prevent antiretroviral activity of BVM and were extensively studied in phenotypic resistance assays [57].

Machine learning

The notion of a resistance mutation is often useful as a first, simple approximation to describe relations between point mutations and resistance phenotypes. However, it is often observed that the more data become available the more complex are the relations between genotype and phenotype that show up. For instance, it has been observed that mutations in the QVT motif (wild type sequence 369-371) are preferentially associated with resistance to BVM [8]. However, as the data analyzed in the current study shows, the same set of mutations of QVT to QAS can be associated with BVM resistance [5] or susceptibility [6], depending on the mutational background. Machine learning methods are built to cope with such complex associations.

There are several machine learning methods that have been successfully employed to this end, e.g. rule-based methods [9], decision trees [10, 11], support vector machines [12], random forests (RFs) [13], or artificial neural networks (ANNs) [1416].

ANNs are universal approximators that can be used to solve non-linear classification problems; they are prone to overtraining if not properly set up [17, 18]. RFs are also excellent non-linear models, and in general perform better than single decision trees (DTs) [19]. They are less easily interpretable than DTs, although they provide variable importance measures [20]. In contrast, rule based systems yield rules that are well intelligible, but often classify not optimally [21, 22].

Methods

Data

Sequences of the p24/p2 region of 45 strains of HIV-1 with susceptibility or intermediate resistance to BVM (here defined as IC50 ≤ 10) were used, and 110 sequences of resistant strains (IC50 > 10). The phenotype was determined in experiments in which HIV-1 was cultured in the presence of increasing concentrations of BVM. The concentration of BVM inhibiting 50% of viral replication compared to cell culture experiments without BVM is defined as IC50 (50% inhibitor concentration). In general, drug resistance means reduced inhibition of viral replication by antiretroviral drugs, resulting in increased IC50 values. The IC50 values of the drug resistant isolates and HIV wild type are used to calculate resistance factors

a standardized measure of HIV drug resistance. The cut-off value of the resistance factor used to define the classes "resistant to BVM" and "susceptible to BVM" was previously derived from data obtained in phase II clinical trials with BVM correlating phenotypic resistance and clinical response [6, 7].

All data were collected from several studies that have investigated polymorphisms in p2, especially in its C-terminal half [1, 57] (see additional file 1 for complete set). Duplicated sequences in each class were removed prior to analysis.

Multiple Sequence Alignment

Multiple sequence alignments of the sequences were produced with clustalw [23], t-coffee [24], muscle [25], and prank [26]. Clustalw and muscle gave very compact alignments with a width of 21 columns and most rows free of gaps. The alignment from t-coffee was wider by one column, and the prank alignment much wider with 36 columns. Since clustalw and muscle gave similar alignments, and the prank alignment led to a relatively poor predictive performance, we restrict ourselves in the following to reporting results based on the output of clustalw and t-coffee (see additional files 2 and 3).

Descriptor set

It is often helpful to analyze not the sequences of amino acids as strings of characters, but to associate with each amino acid a numerical "descriptor" value, for instance a value that captures a physico-chemical property of this amino acid. Recently, it has been shown that the descriptor set is the most critical element in classification [27, 28], and that physico-chemical descriptors outperform simpler descriptors [29]. In our search for a method with maximum predictive power we tested several numerical descriptors, including hydrophobicity values of Kyte and Doolittle [30], molecular weight, isoelectric point (IEP) and pKa values for each amino acid. Moreover, we used the predicted probability for cleavage by HIV protease as a descriptor [31]. The numerical descriptor values for gaps from the multiple sequence alignment are undefined a priori. We therefore tested three values for gaps, namely 0, -1 and an interpolated value (mean of the two amino acid descriptor values on both sides of gap). In the case of 0 and interpolated values for gaps the descriptor values of the amino acids were normalized to the interval [-1,1], and in the case of -1 for a gap they were normalized to [0,1]. Apart from using numerical descriptors, we also trained an RF with the multiply aligned p2 sequences using as factors the single letter codes of the amino acids and "-" for gaps.

Neural Networks

We used a Java implementation http://www.heatonresearch.com/encog of neural networks with one hidden layer and three to seven hidden neurons. Resilient propagation (Rprop) was applied as a learning rule [32]. We used the identity function as activation function for the input layer and the logistic function for the hidden and output layer, respectively. We have used the logistic function because it has been shown in recent studies that it leads to a better generalization ability [33, 34]. The weights of the ANNs were initiated by applying the Nguyen-Widrow-method [35]. Stop-training was performed in order to avoid overfitting of the neural networks [36].

Random Forests

As an alternative to ANNs we trained Random Forests (RFs) [20] for the prediction of BVM resistance, using the implementation in the randomForest package of R [37]. In our application each RF consisted of 2000 randomly and independently grown decision trees. When using the trained RF for prediction, an unseen sequence was assigned to the resistance class voted for by at least 50% of the trees. The importance of each variable, i.e. sequence position, for the correct classification can be assessed by determining the increase in misclassification rate due to leaving this variable [20]. Furthermore, we used the rpart package of R [37] to create single decision trees.

Rule-based systems

We used the rule-based algorithms JRip [38] and PART [39] as implemented in R [37].

Cross-validation

All machine learning methods were validated using 100-fold leave-one-out [40] validation to assess for the different machine learning methods the mean prediction sensitivity, specificity, and accuracy (see formulas below) and the ability to generalize to unseen sequences. In addition to this cross-validation, we report for RFs an out-of-bag (OOB) error, an upper limit of the classification error [20].

For each test in the cross-validation, the sensitivity, specificity, and accuracy were calculated according to:

with true positives TP, false positives FP, false negatives FN and true negatives TN. Figure 1 shows sensitivities and specificities as ROC curves (Receiver Operating Characteristics) [41] for the non-discrete methods in our study. Table 1 gives the corresponding areas under the curve (AUC). ROC curves were drawn with R-package ROCR [42].

Table 1 Area under the curve.
Figure 1
figure 1

ROC curve. Averaged ROC curves of the best performing descriptor for each (non-discrete) machine learning approach. The standard deviation bars mark the 95% confidence intervals. blue: SVM; green: ANN; red: RF.

Structure and cleavage-site prediction

Secondary structures of all p2 sequences of 20 or more residues were predicted using JPred [43]. Based on statistical evidence, the secondary structure predictions did also yield a reliability index from 0 (unreliable) through 9 (highly reliable) for each residue being in a predicted secondary structure state.

HIV protease cleavage sites for all p2 sequences were predicted with HIVcleave [31] based on earlier work by Chou et al. [44].

Statistical comparison

All models were compared by applying Wilcoxon Signed-Rank test [45] on the AUC distributions from the 100-fold leave-one-out cross-validation runs [46]. The null hypothesis was that there are no differences between the compared classifiers.

Results and Discussion

Prediction performance of machine learning methods

All machine learning methods were trained in various configurations and with several descriptors as described in methods. The prediction qualities, such as the mean AUCs (), standard deviation (sd) and coefficient of variation (cv = ) are shown in Table 1.

The ANNs yielded AUCs between 0.72 ± 0.036 (descriptor IEP) and 0.84 ± 0.028 (descriptor hydrophobicity). According to the Wilcoxon Signed-Rank test [46] with significance level α = 0.001 the mean AUC for descriptor molecular weight was not significantly different from that obtained with descriptor hydrophobicity, while all other descriptors gave significantly lower values of mean AUC. There were no significant differences (α = 0.001) between the mean AUCs of each descriptor with regard to the number of hidden neurons.

RFs performed consistently better than ANNs for all descriptors, reaching AUC values between 0.85 ± 0.003 (cleavage site prediction) and 0.93 ± 0.001 (hydrophobicity). Again, the best results, with only small differences, were obtained from hydrophobicity and molecular weight as descriptors. The OOB error with this descriptor was 7.59%. For comparison, the best single decision tree, which was created with rpart in R [37], reached a pro forma AUC of 0.841 (see Table 1).

The RFs find the most important sequence positions for resistance prediction in the second half of the p2 sequence, especially at sequence positions 369-376 (Figure 2) in the clustalw alignment; in the wild type sequence this region corresponds to the motif QVTNSATI. The two positions 370 (V in wild type) and 372 (N in wild type) have by far the highest importance in the investigated data set. This finding is in partial agreement with the findings of other workers who identified the QVT motif at positions 369-371 as important [7]. Positions 363 and 364 are not as prominent in terms of importance, although they were previously identified as crucial [47] for resistance to BVM. The apparently lower importance of these positions in the current study can be explained by the nature of our data set, which focuses on resistance mediated by baseline mutations within the p2 region in clinical HIV isolates.

Figure 2
figure 2

Importance of sequence positions in RF predictions. Importance of sequence positions in p2 for prediction of BVM resistance by RFs. The y-axis denotes the "percental increase in misclassification rate" [20]. The upper horizontal axis indicates wild type sequence.

We also trained RFs on the actual sequences, i.e. without numerical descriptors. These RFs gave OOB errors above those trained with hydrophobicities, namely of 13.55% for the t-coffee alignment and 14.84% for the clustalw alignment. For comparison, other machine learning methods were tested as well, including Hidden Markov Models (HMMs) [48] and linear models. We tested linear support vector machines (SVMs) and logistic regression as implemented in R [37], and furthermore, simple perceptrons implemented in Java http://www.heatonresearch.com/encog. All of these models performed worse compared to the RFs. The best linear model (AUC 0.826 ± 0.008) was a linear SVM using hydrophobicity as a descriptor. In Table 1 we report the results of the best linear model for each descriptor. The HMMs were not able to classify the sequences. In Figure 1 the ROC-curves for the descriptors performing best for each (non-discrete) machine learning method are shown.

Genotype-phenotype rules

The two algorithms JRip and PART for rule extraction provided rule sets that performed well in the cross-validation with accuracies reaching almost that of RFs. Since the rules derived from the t-coffee alignment had lower errors than those based on the clustalw alignment, we here report only the former rules.

During cross-validation JRip generated most frequently a set of three rules relating alignment positions, hydrophobicities, and BVM resistance class. Translated to amino acid residues, the rules are:

  1. 1.

    IF position 370 ∈ {I, V} AND NOT position 373 ∈ {K, R} AND position 374 ∈ {I, V, L, F, C, M, A} ⇒ susceptible

  2. 2.

    IF position 372 ∈ {K, R} AND position 373 ∈ {P, H, E, N, Q, D} ⇒ susceptible

  3. 3.

    ELSE resistant

JRip reaches in the cross-validation a mean sensitivity of 77.01% at a specificity of 88.14%. Dropping the first rule leads to a sensitivity of 11.76% and a specificity of 99.21%. Dropping the second rule leads to a sensitivity of 72.54% with a corresponding specificity of 88.1%.

In the cross-validation PART most frequently extracted fifteen rules (see additional file 4) with a sensitivity of 85.5% and a specificity of 93.27%. Remarkably, the PART rules did take exactly those sequence positions into account that had non-zero importance in the RF analysis (see Figure 2). As suggested by the JRip and PART rules, resistance is generally caused by patterns of two or more residues. However, the importance plot (Figure 2) show that single positions may be useful indicators as well. E.g. we found that at sequence position 372 the hydrophobicity values of resistant and susceptible group clustered around two different values, 0.39 for the resistant and 0.26 for the susceptible. From this we could derive the rule (Rule372): a sequence is resistant if the hydrophobicity at 372 is closer to the mean hydrophobicity of the resistant cluster than to that of the susceptible cluster and vice versa. The rule is predictive with 52% sensitivity and 90% specificity.

Structural and functional implications of resistance mutations

After experiments [49] have excluded the classical molecular mechanism of protease inhibition, i.e. blocking of its catalytic site, there are still several molecular mechanisms for BVM action considered in the literature (for review see [1]): BVM could directly occlude the protease cleavage site ("direct" mechanisms, possibly with contact of BVM and protease), or it could stabilize a Gag structure that has to be weakened or dissolved to make the cleavage sites accessible to the protease ("indirect" mechanisms, possibly without contact of BVM and protease). Accordingly, there are several possible resistance mechanisms discussed in the literature, such as mutations that perturb the BVM binding site, that weaken the mentioned Gag structure, or that make the affected cleavage site easier digestible for the protease. A hypothetical resistance mechanism that to our knowledge so far has not been addressed is a shift of the cleavage site. We have therefore investigated associations of resistance mutations with cleavage site locations properties, as predicted computationally. In all susceptible and most resistant sequences the predicted PR cleavage sites with maximum probability were unchanged with respect to the wild type (see additional file 5): cleavage was predicted to be most probable at P1-sites 363 and 367 in agreement with experimental findings [50], and cleavage probabilities at P1 363 were rather invariable across the data set. In a few resistant sequences cleavage sites probabilities were indeed predicted to shift (see additional file 6). Amongst these sequences we observed a tendency for the second cleavage site at P1 367 to have lower probabilities whereas position 365 did emerge as a new possible P1 site. However, since this occurs rather rarely, the data do not support a shift of the cleavage site as a major resistance mechanism. It is notable that in the studied data the positions 372-376 most relevant for resistance (Figure 2) lie outside the protease binding region P4-P4' for P1 at 363 (P4' 367). Even for the internal cleavage site at P1 367 (P4' 371), more than half of these important positions are outside the protease binding site. This finding is consistent with a model that allows for an "indirect" mechanism of BVM, though it cannot exclude "direct" mechanisms. In fact, mutations found in other studies closer to the cleavage sites [47, 49] also allow for a direct model.

A key component of an indirect mechanism is a structure within Gag that has to be weakened prior to cleavage of p24/p2. A candidate structure is the α-helix first predicted by Accola et al. [50]. We have extended secondary structure predictions to all sequences of the data set, including the wild type. All these structures were predicted as mainly α-helical in the central part (additional file 7). This gross feature is consistent with the experimental structure by Morellet et al. [51], though the predicted helices are shorter. While in the Morellet structure the helix comprises all of the residues starting at position 358, the predicted helices comprise between seven and twelve of the 21 sequence positions and typically start at position 361 (Figure 3A). Apart from the deficiencies of the prediction method the difference between experiment and prediction may be due in part to the experimental conditions [51] where a substantial amount of trifluoroethanol in the solution could have led to a helix content exceeding that in the native state. The earlier work by Worthylake et al. [52] supports the view that the helix formed by p2 as such is not very stable. A very stable helix at the cleavage site could possibly prevent PR from cleaving, because the protease requires its substrate in an extended conformation [53]. On the other hand, recent data from electron microscopy [54] are compatible with bundles of six p2 helices stabilizing the immature matrix of the virus. In summary predictions and experiments point to a weak p2 helix that is stabilized by its environment.

Figure 3
figure 3

Helix length and confidence. Secondary structure predictions for p2 with susceptibility and resistance to BVM. A: For all p2 sequences the secondary structure was predicted by JPred [43] and then for each sequence position the helix probability (fraction of helix at this position) was computed separately for the susceptible and resistant sequences. B: Histograms of the confidence with which JPred predicts a helix (0 lowest, 9 highest). The confidence values were averaged for each sequence over all positions predicted to be helical.

It is remarkable that the end of the predicted helices around position 369 coincides with start of the sequence region most important for resistance (Figure 2) in our data set; in other words, the sequence positions most important for resistance in our data lie outside the predicted α-helix in a region of unspecified secondary structure. Moreover, the resistant sequences have a tendency for shorter helices compared to susceptible sequences, as can be seen in the earlier drop of helix probability at around position 368 in Figure 3A.

We have also analyzed the confidence with which the secondary structure prediction algorithm assigns residues to a helical state. If we assume that the prediction is based on a representative sample of sequences observed as helices and non-helices, respectively, then this confidence could have a positive correlation with helix stability. A comparison of resistant and susceptible sequences with respect to mean confidence along the helix shows that resistant sequences have a tendency to lower helix confidence, and, if the assumption holds, lower helix stability (Figure 3B).

The above tendencies of resistance class, predicted helix length and confidence may reflect possible "indirect" resistance mechanisms: shorter and weaker helices could limit the effect of BVM in several ways, e.g. by destabilizing the binding site of BVM that may lie on the six-helix bundle mentioned above, or by easing the unraveling of the remaining helix, and thus cleavage by PR in presence of BVM. This argument suggests to test whether helix length and helix confidence are predictive of resistance. We have therefore trained another random forest solely with the predicted helix lengths and confidences, and without reference to the detailed sequences. This random forest had an OOB error of 23%, which is not as good as the errors of 8-15% reported above for random forests or rule-based methods trained on sequence information, but still much better than random guessing. This means that tuning of helicality of p2 could indeed be a BVM resistance mechanism.

Conclusions

BVM was the first drug of the new class of maturation-inhibitors of HIV-1 that has reached phase II clinical trials. Several polymorphisms in p2 of HIV-1 hampered the sustained suppression of viral replication in these patients and conferred phenotypic resistance [7]. Since these polymorphisms were found in about 30% of treatment naïve HIV isolates and were significantly accumulated in PI resistant HIV isolates [55], genotypic resistance testing seems to be mandatory before administration of BVM.

Our analysis has shown that with the available sequences and corresponding phenotypic data it is possible to train machine learning methods that predict phenotypic resistance to BVM, mediated by baseline mutations of the p2 region, for unseen sequences with an error of less than 10%. This result is compatible with the view that mutations in p2 are the main reason for BVM resistance observed in clinical isolates not responding to BVM in clinical phase I and II studies. The high classification accuracy is encouraging for personalized therapy based on genotypical testing in case BVM-like drugs will become part of the antiretroviral repertoire. With a larger, representative data set of genotype-phenotype pairs, it could become feasible to use machine learning methods not only for classification but also for regression, i.e. prediction of quantitative resistance factors.

Random forests, the method with the best classification results amongst those tested, also allowed for the identification of the sequence positions most relevant for resistance. In our data set, these sequence positions cluster in the C-terminal half of p2, mostly outside the P4-P4' protease binding region. This is in agreement with the outcome of rule-based methods.

As judged from predicted cleavage positions, resistance mutations do usually not shift the cleavage site. Secondary structure prediction shows that resistance mutations may affect the length and strength of the α-helix formed by at least sequence positions 371-377 and covering also the cleavage site. This hypothesis is in agreement with propositions by other workers [1] and suggests possible resistance mechanisms that also may occur in combination, e.g. (a) resistance mutations could destroy the BVM binding site that may lie in the C-terminal half of p2, formed by several p2 peptides in the six-helix bundle suggested by Wright et al. [54]; (b) resistance mutations could weaken the α-helix in p2, and thus, the six-helix bundle in the immature virus. This could ease unraveling of the helix prior to cleavage by PR, and hence, may functionally outweigh a stabilizing effect of BVM on the helix bundle.