Background

For over fifteen years, drug resistance has been a primary challenge in the effective treatment of HIV, and our understanding of resistance mechanisms has evolved along with the virus itself as new therapies have emerged[16]. Thanks to worldwide efforts to tackle HIV drug resistance, many successful treatment regimens have been developed, including combination therapies[7, 8] such as the Highly Active Anti-Retroviral Therapy (HAART) regimens[9, 10], but treatment options have been uncertain for patients who fail these regimens due to the accumulation of drug-resistant mutations[11]. More recently, in addition to targeting molecules other than HIV-1 reverse transcriptase (RT) and protease, second-generation RT and protease inhibitors (PIs) have been developed such that they remain potent against variants resistant to first-generation inhibitors. Specifically, tipranavir[12] and darunavir[13], the two PIs most recently approved for clinical use, have been shown to be potent against viruses harboring multidrug resistance mutations such as V82A and L90M, in the cases of both tipranavir and darunavir[1316], and V82T or I84V in the case of darunavir[13, 16]. However, even these drugs have been shown to lose potency in the presence of certain mutations or mutation patterns[14, 1720]. In fact, the existence of HIV-1 variants showing resistance to all clinically-approved inhibitors highlights the issue of cross resistance, or the existence of mutation patterns arising from a certain therapeutic regimen that simultaneously cause resistance to other drugs as well. Cross resistance among HIV-1 PIs has been studied[2126] and reviewed[1, 4, 2729] extensively for over a decade, with several key mutation patterns thought to confer cross resistance to the vast majority of PIs. Consequently, one strategy is to take advantage of the lack of cross resistance when a mutation confers resistance to one PI but maintains susceptibility to other PIs. For example, D30N and I50L are associated with resistance specifically to either nelfinavir and atazanavir, respectively, but such mutations do not greatly reduce susceptibility (and I50L actually increases susceptibility) to other PIs[3033]. Sequential or simultaneous administration of regimens that are each potent against variants toward which the other fails may be a potential strategy to prevent drug resistance and treatment failure[34]. In light of the combinatorial number of both potential treatment regimens and potential mutation patterns, it is becoming increasingly important to understand both the major mutation patterns conferring resistance on the genotypic level as well as the major phenotypic patterns of cross resistance - or lack thereof - of these mutation patterns toward the nine clinically-approved PIs.

Computational analyses have played a key role in increasing our understanding of the genotypic and phenotypic patterns of HIV drug resistance and our ability to predict drug response phenotype from genotype[3537]. The large amount of publicly available data has greatly facilitated these analyses[35, 38]. Several computational studies have analyzed new or existing data to identify mutations associated with one or more PI or RT drugs[3948]. Some studies have presented longitudinal mutagenetic tree or mutation pathway models for the temporal appearances and contingencies of such mutations[4952]. Others have uncovered pairs or clusters of correlated mutations associated with PI or RT therapy through direct enumeration, statistical or information-theory based methods, clustering, or a combination of techniques[39, 4346, 51, 5363]. One particularly successful application of computational analysis is the accurate prediction of drug resistance (phenotype) - often measured as a fold-change in IC50 of a drug toward the mutant vs. wild-type - of a target variant given its amino acid sequence (genotype). Many approaches have been used to create prediction models, including regression-based methods[26, 6469], decision trees[70], and other machine learning methods, including artificial neural networks, support vector machines, and others[67, 7174]. Several studies have also comparatively evaluated or combined methods to improve accuracy[67, 72, 73, 75]. Models have also been created for predicting drug resistance phenotype[76] and virological success or failure[7780] resulting from combination therapies. In addition to these data-driven approaches, structure-based approaches for predicting drug response have also been developed, often in conjunction with the bioinformatics-based approaches[66, 81, 82]. Taken together, the large collection of available predictive methods still require interpretation and comparison when making patient treatment decisions[83, 84], but overall they have been valuable tools both for practical decision-making and for increasing scientific understanding.

The many computational studies of HIV genotype-phenotype data therefore demonstrate the power of uncovering patterns in data, with each study providing a valuable perspective on important features of HIV drug resistance. However, the vast majority of studies have offered a perspective at the genotypic level first - that is, they look for patterns on the genotypic level that correlate with phenotypic responses, usually to one drug or drug regimen at a time, in turn. To our knowledge, a rigorous cluster-based analysis of genotype-phenotype data that first uncovers patterns within the complete phenotypic space and then determines representative genotypes giving rise to the multidrug response phenotypes has yet to be done. The goal of this study is therefore to provide this unique, simultaneous view into the existing phenotypic patterns amongst all the HIV-1 PIs, as such a perspective can provide novel insights into the major combinations of PIs for which cross resistance can occur.

In this work, we analyze phenotypic drug resistance patterns by considering experimental resistance data of 398 clinical isolates of HIV-1 protease measured against the nine clinically-approved HIV-1 protease inhibitors. To determine phenotypic drug resistance patterns toward all nine drugs, a constrained k-medoids clustering method implemented via integer programming was employed. Clusters were validated by quantifying their ability to predict a sequence's level of resistance toward one drug knowing the sequence's level of resistance toward other drugs. The selection of representative genotypic sequences from each cluster indicated mutations associated with common patterns of phenotypic resistance and can serve as a "panel" of mutants that collectively represent clinically important variants. Furthermore, our direct analysis of phenotypic space allowed us to determine that the virus often utilizes multiple genotypes to achieve similar phenotypic patterns of multidrug resistance. We also show that certain drugs show highly correlated antiviral activities, while other drugs - especially tipranavir - have unique responses. Finally, information theoretic approaches were employed to determine amino acid positions and identities within HIV-1 protease that are most informative for selection into a phenotypic cluster. Taken together, this work provides a simplified framework for understanding major drug resistance patterns toward clinically-approved HIV protease inhibitors and the mutation patterns that best characterize them.

Methods

Data set

We analyzed 398 HIV-1 isolates in the HIV Drug Resistance Database[38] (HIVDB) for which cell-based in vitro PI susceptibility testing had been performed by the PhenoSense (Monogram, South San Francisco, CA) assay[85]. Susceptibility was quantified by the Monogram-measured fold-change[85], defined as the ratio of the 50% inhibitory concentration (IC50) of the isolate to the IC50 of a wild-type control. Only those isolates for which susceptibility had been tested against all nine clinically-approved inhibitors were included. The nine inhibitors considered were amprenavir (APV), atazanavir (ATV), indinavir (IDV), lopinavir (LPV), nelfinavir (NFV), ritonavir (RTV), saquinavir (SQV), tipranavir (TPV), and darunavir (DRV). The data set size was limited by the availability of isolates tested for DRV susceptibility. Many clinical isolates contained mixtures at one or more amino acid positions. Due to the limited data, mixtures were not excluded from the data set. In this work, we will refer to clinical isolates as "sequences," though we recognize that some contain mixtures at certain positions.

To estimate the degree to which mutation frequencies in the genotype/phenotype (n = 398) data set are representative of true population frequencies, the frequencies of non-polymorphic treatment-selected mutations within non-WT sequences were compared between a larger genotype-only data set of 12,290 sequences[38] and the data set used here. Reasonable correlation (Spearman's ρ = 0.88) was found between the data sets (Fig. S1, Additional File 1).

Fold-change values were log-scaled such that for a given drug, a constant factor of fold-change is represented by a constant numerical difference. Because the relationship between fold-change and clinical response is different for each drug, scaled values were standardized so that they represent predicted clinical responses, the phenotype of interest in this work. To do this, the logarithm base used for the log scaling of each drug was set to either the Monogram biological cutoff, the geometric mean of the Monogram lower and upper clinical cutoffs, or the single clinical cutoff provided, depending on which type of cutoff was available for a particular drug (Table 1). Monogram biological cutoffs are defined as the fold-change values below which 99% of the WT sequences reside, and therefore fold-changes above this value likely have decreased susceptibility. Monogram lower and upper clinical cutoffs are fold-change values at which reduced clinical response and unlikely clinical response occur for a given drug, respectively. Ritonavir-boosted cutoff values were used when available. After log-scaling, scaled resistance values of 1 and 0 qualitatively signify decreased susceptibility and susceptibility equal to WT, respectively, for all drugs. To equalize the range of variation in the scaled resistances for each drug and to confine variation to a clinically meaningful range, we capped the maximal and minimal scaled resistances of all drugs to the least extreme value of these among the nine inhibitors -- those of DRV (Table 1). The upper cap of the scaled values (1.83) corresponded to a raw fold-change value for DRV of 500, the upper-limit value used when the fold-change toward DRV was greater than the upper limit of the assay. Sequences with scaled resistances equal to the capped values are therefore considered either highly resistant (upper cap) or potentially hypersusceptible (lower cap). An interpretation of scaled resistance values is in Table 2.

Table 1 Scaling and capping of raw fold-change values.
Table 2 Interpretation of the scaled resistance values used throughout this work.

Clustering

Sequences were clustered based on their drug-resistance phenotypes, quantified by scaled resistance values. A globally-optimal constrained k-medoids clustering approach was implemented via a linear integer program similar to other variations of integer and mixed-programming-based k-means and k-medoids clustering formulations[8689]. The k-medoids approach was chosen after exploration of multiple clustering methods (k-means, hierarchical, and a method based on a tight clustering approach[90]), as it was deterministic, provably optimal, and allowed for the easy implementation of hard constraints, which we felt were crucial here for generating clusters that were phenotypically similar across all drugs.

The clustering method was as follows: First, each sequence was assigned a point in a 9-dimensional space whose coordinates are the scaled resistances toward the nine inhibitors. From these points, a distance matrix was generated, in which element dij is the Euclidean 2-norm distance between the ith and jth sequences. The goal was to select k cluster centers (medoids) from within the data set and assign each point in the data set to one of these k medoids such that the sum of the distances from points to their assigned medoids was minimized.

Constraints were placed on this optimization to guarantee phenotypic similarity within a cluster, as the goal of this work is for the clusters to represent major phenotypic patterns. First, a hard constraint was set to bound the distance between any cluster member and its medoid to be less than or equal to a specified value, C. Secondly, a hard constraint was set to cap the maximum infinity norm of the distance between any cluster member and its medoid to a specified value, C. Such a constraint prohibits grouping together two sequences that are highly similar toward 8 drugs but differ qualitatively in their level of resistance toward only one drug - an undesirable outcome if we wish for our clusters to highlight major cross resistance patterns.

k, the number of clusters, is determined by feasibility; it is the minimum value of clusters for which the constraints are satisfied. In this work we use C = 0.95 and C = 0.58; the value of C = 0.95 occurs roughly at the "elbow"[91] or "kink"[92] of a plot of the minimum k needed as a function of tightness (C and C) (Fig. S2, Additional File 1), suggesting that it allows a reasonable balance between maintaining both a low number of clusters and adequately tight clusters. A C of 0.58 guarantees that a given cluster members' scaled resistances toward any given drug cannot vary by more than 2 C = 1.16; there will not be a pair of cluster members in which one sequence shows no resistance to a given drug while another shows high levels of resistance (see Table 2). Higher values of C would make clusters too diffuse along individual dimensions, preventing their interpretation as clinically-relevant phenotypic patterns. Lower values were found to be too restrictive and generated additional clusters with redundant patterns (data not shown). To check for robustness of clustering as a function of these parameters, C and C were each varied in turn up to +/-0.05 units in increments of 0.025. Qualitative phenotypic patterns remained very similar, and pairs of sequences that were clustered together in the original clustering remained together an average of 71% as these parameters were varied.

Figure S3 (Additional File 1) is a plot of the number of clusters (k) vs. data set size, using random subsets of the data. As our data set is currently not large enough to show robust convergence (k increases with increasing data set size), the quantitative results that are affected by data set size are to be considered preliminary; more data could allow for more robust convergence in future studies and would increase confidence in the quantitative conclusions.

The integer programming formulation used is shown in Supplementary Methods (Additional File 1). All integer programs in this work were implemented using the GAMS interface (GAMS Development Corporation, Washington, D.C.) and were solved using CPLEX 11.0.0 (IBM ILOG, Armonk, NY).

Validation

The clustering was validated by its effectiveness (relative to controls) in predicting the level of drug resistance of a sequence to one drug based on the sequence's levels of drug resistance toward other drugs, using the following n-fold cross-validation procedure[92]:

remove each sequence (in turn) from the data set - label it sequence "A."

cluster the remaining sequences using the above method.

choose one of the nine drugs and eliminate its phenotypic data for sequence "A".

Assign sequence "A" to the cluster to whose centroid it is closest, based on 8-dimensional distance (i.e. removing the eliminated drug's dimension)

Predict the level of drug resistance of sequence A toward the eliminated drug to equal the cluster centroid's scaled resistance value for the eliminated drug. Based on this value, classify sequence A with a resistance score from 0-4 (Table 2).

For each drug, the total RMS error and the percent correctly classified after leaving out each sequence in turn was compared to two controls:

Control 1 ("Random Control"): To predict the resistance of a sequence toward a drug, randomly choose a value from the distribution of scaled resistances in the data set toward the particular drug, and classify it using the corresponding resistance score. This control assumes that the level of resistances between drugs is not correlated.

Control 2 ("Average Control"): To predict the resistance of a sequence toward a given drug, simply use the mean of the levels of sequence "A's" scaled resistances to the other eight drugs, and classify with the corresponding resistance score. This control assumes that resistances toward the nine drugs are highly correlated.

Genotypic Analyses

In the absence of amino acid mixtures at positions within isolates, the genotypic distance between any two sequences was defined simply as the number of positions at which their amino acid sequence differed. For some analyses, all 99 protease positions were considered. To reduce noise due to polymorphic positions in certain analyses, only 21 positions that have been associated with resistance or drug treatment by previous statistical learning or analysis methods [26, 39, 48] were considered, unless otherwise noted: 10, 24, 30, 32, 33, 43, 46, 47, 48, 50, 53, 54, 71, 73, 74, 76, 82, 83, 84, 88, and 90. We note that there may be unavoidable arbitrariness in the selection of such a set without considerable initial genotypic-phenotypic analysis (which was exactly what we sought to avoid in this study), and in the course of our research we tried multiple sets, allowing us to check for robustness.

To account for mixtures in isolates, the contribution toward the genotypic difference between two sequences due to a position, dm, was defined in the general case as follows:

where "c" is the number of amino acids that the isolates have in common at that position, and max(s) is the number of amino acids in the mixture with the greater number of amino acids at that position. As an example, if one isolate contained a mixture of leucine and methionine at a position and another contained only leucine, then dm for this position would be (1-(1/2)) = 1/2.

Intracluster genotypic or phenotypic variability was estimated as the average of all the pairwise genotypic or phenotypic distances. A bootstrapping procedure was used to generate p-values to assess statistical significance of either distance for selected clusters. Random clusters of a size equal to the considered cluster were selected with replacement from the unclustered data, and the distance metrics were calculated. This procedure was repeated 10,000 times to generate distributions for both genotype and phenotype distances, from which p-values were calculated. Bootstrap studentized statistics were obtained by dividing the difference between a value and the bootstrapped distribution mean by the standard deviation of the distribution.

From each cluster, representative sequences were selected. For genotypically diverse clusters, we wished to select multiple representative sequences from each cluster to highlight genotypic diversity. To that end, constrained k-medoids optimizations were run on each cluster using integer programming; the resulting medoids became the representative sequences. For each phenotypic cluster, the minimum value of k was determined such that all sequences within the cluster would be within a genotypic distance of ti of at least one medoid. We used a value of ti = 9 when possible, as it produced one representative sequence for all but the most diverse clusters (except for other exceptions noted below), allowing for easy interpretability. Additionally, at this k, the sum of the distances between each sequence and its assigned medoid was minimized. Sequences containing mixtures at any of the 21 positions listed above were excluded from being representative, as were sequences with any of the 99 amino acid positions undefined (only 2 within the data set). With this constraint, it becomes possible for phenotypic clusters (other than single-membered ones containing mixtures at relevant positions) not to generate any representative sequences with ti = 9. To account for this, ti was increased to 10 for clusters 3 and 19 and 10.5 for cluster 10. The integer-programming formulation used here is shown in Supplementary Methods (Additional File 1).

Sets of sequence positions or amino acid residue identities most informative of overall cluster assignment or membership in an individual cluster were identified according to an incremental mutual information (MI)-based method described previously (MIST)[93]. Briefly, the method approximates high-order joint entropies to determine an optimal small subset of features (e.g., residue positions) that collectively have the highest mutual information (MI) with a given output (e.g., phenotypic cluster). These approximated MI values have also been shown to correlate with classification error and with exact MI values in analytically solvable systems. First, the MI between variables of interest was computed, using the frequencies to estimate probabilities. For each MI, the bias in the value was estimated by computing the MI of the pair after randomizing the ordering of the sequence data for each variable 100 times. Variables whose MI with the outputs exceeded their maximum shuffled MI were considered statistically significant and included in subsequent steps; remaining positions were omitted. Sequence positions or binary mutation variables were then selected incrementally to maximize the joint-MI (as estimated by MIST) between the set of all chosen variables and either the cluster assignment or membership in a specific cluster. Mixtures were not included in the distributions. Features were added incrementally until all positions or mutations were included, yielding a full ranking.

Miscellaneous

Data scaling and other matrix manipulations, including principal component analysis, were done using Matlab 2010a and 2011a (The Mathworks, Natick, MA). Matlab and Microsoft Excel (Microsoft, Inc., Bellevue, WA) using VBA were used for figure generation.

Results

Cluster Analysis Reveals Specific Phenotypic Resistance Patterns Among Clinical Isolates

Globally-optimal k-medoids clustering was used to find groups of sequences with similar multidrug phenotypes, using the tightness constraints C and C mentioned in the Methods to enforce thresholds of phenotypic similarity. The clustering yielded 36 multi-membered clusters, along with 14 outliers. Figure 1 shows the resulting clusters; each cluster is represented as a row, with each of the colored boxes within the row representing the resistance score (Table 2) toward the corresponding drug of the cluster's centroid (i.e., average phenotype), according to the legend. At right, representative sequences are shown for each cluster, with non-WT amino acid identities shown at selected positions. A listing of mutations at all positions for each representative sequence is provided as Supplementary Information (Table S1, Additional File 1). For two clusters (5 and 9), more than one representative sequence was needed due to the genotypic diversity.

Figure 1
figure 1

Optimal phenotypic clustering of clinical data set. The optimal set of clusters obtained by using constrained k-medoids clustering with integer programming. 36 multi-membered clusters and 14 single-member "clusters", or outliers, were obtained. Each row represents one cluster. The second column indicates the cluster size. The next 9 columns represent the cluster centroids' phenotypic drug resistance scores, colored according to the legend. The columns at right indicate mutations in the sequence selected to represent the cluster at selected positions. Because isolates with mixtures at any of the specified positions were not allowed to represent a cluster, certain single-membered clusters do not have a representative "sequence." The representative sequences chosen for clusters 29, 31, 34, and 36 show no mutations at the positions listed here, but they have substitutions at other positions (Table S1, Additional File 1).

Generally, the largest clusters were those in which (a) there was no resistance (or very mild resistance) to any drug, (b) there was high resistance to all drugs, (c) there was high resistance toward all drugs except DRV, to which there was moderate resistance, (d) there was high resistance toward all drugs except DRV and TPV, (e) there was resistance toward only NFV and RTV, and (f) there was high resistance to APV, ATV, NFV, RTV, and SQV.

The clusters demonstrate that there is often cross resistance of sequences toward many drugs. Generally, sequences are most commonly resistant to RTV and NFV, followed by ATV and SQV, then APV, IND, and LPV, and finally TPV, and DRV. In general, resistance to DRV implies resistance to nearly all other drugs, with a few exceptions: Three clusters showed moderate to high levels of resistance against all drugs except TPV (clusters 5, 8, and 12), and two clusters showed moderate to high levels of resistance against all drugs except SQV (clusters 11 and 15). In both cases, the representative sequences of the clusters each had at least one mutation that has been associated with hypersusceptibility toward the particular drug in a previous study in which mutations were the independent variables and fold-change was the dependent variable[26]. These mutations include L10F, G48V, I50V, I54L, and L76V in the case of the clusters with unique susceptibility to TPV and I47A in the case of the clusters with unique susceptibility to SQV.

One may ask if grouping 398 sequences into 36 phenotypic clusters and 14 outliers shows that HIV is exploring a large or small part of the available phenotypic space. To address this question, we repeatedly generated sets of 398 random points within the same nine-dimensional scaled space of our data set and clustered them using the same constraints applied to the true data set. The average minimum number of clusters needed over 300 trials was 375, with the smallest number of clusters needed being 357. Clearly, the fact that only 50 clusters (including outliers) were needed to partition the actual data within the constraints demonstrates that HIV protease is exploring a very small portion of possible phenotypic space. In fact, due to the constraints used in the clustering, the volume of 9-dimensional phenotypic space occupied by each cluster must be less than the smaller of either the volume of a hypersphere of radius C or a hypercube of length 2C. Using our constraint values, the smaller of these is the former, with a value of ~2.1 volume units. The volume of clinically-relevant phenotypic space can be calculated from the maximum and minimum scaled values in Table 1 to be 1800 volume units. Therefore, only (2.1*50)/1800 = ~6% of phenotypic space, at best, has been explored by the considered isolates, compared to (2.1*375)/1800 = ~44% for a random data set of equal size.

If a drug is removed from the data set, the minimal number of clusters needed to represent the phenotypic diversity must be less than or equal to the minimal number needed with that drug included. One way to measure the additional phenotypic diversity provided by each drug is to remove each drug in turn and re-cluster using the k-medoids approach under the same distance constraints. Drugs that, upon removal, greatly reduce the number of required clusters have phenotypes that vary somewhat independently from the other drugs. Drugs that, upon removal, do not greatly reduce the number of required clusters have phenotypes that vary predictably with (though not necessarily in a correlated manner with) the remaining drugs. When this analysis was carried out, it was found that removal of TPV reduced the number of needed clusters by the most (from 50 to 31), suggesting that TPV's response toward sequences varies somewhat independently from other drugs. In other words, TPV might show varied, graded responses toward certain groups of sequences toward which other drugs show relatively constant responses. Removal of ATV, SQV, or APV also reduced the number of needed clusters by over 10 (from 50 to 37, 38, and 38, respectively). Removal of LPV, DRV, NFV, RTV, or IDV reduced the number of required clusters the least (to 44, 44, 43, 43, and 41, respectively) suggesting that their scaled resistances either vary predictably with those of the other drugs or do not vary appreciably in general.

Phenotypic clustering allows for potentially improved prediction of unknown drug phenotypes given phenotypic information for other drugs

Our results indicate that a small portion of the full phenotypic space has been explored by the virus, assuming a representative data set; consequently, one may be able to successfully predict resistance to a given inhibitor given resistance data toward other inhibitors, without knowing any genotypic information. To test this hypothesis, we used a cross-validation procedure in which each sequence from the data set was removed in turn and the sequence's resistance toward each drug was estimated based on a clustering assignment using the other eight resistance phenotypes (see Methods). Pairs of sequences that were clustered together in the original clustering remained together an average of 99.3% of the time across all n runs of the validation, not counting runs in which a member of the pair was excluded in turn, demonstrating the stability of the clustering during the cross-validation procedure. The results of the cluster-based prediction are summarized in Table 3.

Table 3 Cluster-based prediction of phenotypic resistance relative to controls.

Two controls were used for comparison and are described in the Methods. Control 1 ("Random"), which randomly reported a value from the distribution of scaled resistances in the data set toward the particular drug, was able to correctly categorize resistance 21%-36% of the time, depending on the drug. The RMSE's of the actual scaled resistance values were often over a whole unit away, meaning that it would often predict no resistance when there was in fact resistance, and vice versa. NFV and RTV were classified correctly most often; the clustering suggests that this may be because they were more likely to exhibit either no resistance or complete resistance, providing a less graded distribution overall from which to sample.

Control 2 ("Average"), which guessed the "unknown" phenotype to be the average of the other 8 known phenotypes for the isolate, performed much better overall than Control 1, categorizing resistance correctly for more than half of the sequences for ATV, APV, IND, LPV, and SQV. Its strong performance is additional evidence for the high level of both correlation between drug responses and cross resistance. Performance was worse for (1) NFV and RTV, which are often inactive to viruses toward which other drugs are effective, as Figure 1 indicates, (2) DRV, which, according to Figure 1, often remains effective toward viruses resistant to other drugs, and (3) TPV, which, as shown above, has less phenotypic similarity to other drugs.

Compared to either control, the cluster-based prediction correctly classified a higher percentage of viruses for every drug, although the improvement over Control 2 was modest in some cases, with the RMSE's being marginally higher in some cases as well, suggesting that when the cluster-based classification was incorrect, it was quite different. The improvement in classification was largest for NFV, RTV, and DRV. Classification rates overall were well over 50% correct with RMS errors being fairly small (generally <= 0.5 units away). The notable exception is TPV, again supporting TPV's uniqueness.

The relatively large number of sequences susceptible to all drugs in our data set might bias the prediction accuracy of certain methods to be higher than what would be expected from a data set that contained a more even distribution of all multidrug phenotypes. To control for this, we redid the above analysis after having left out the sequences corresponding to the two clusters shown in Figure 1 that show no or very little resistance to all nine drugs (clusters 36 and 34, with 77 and 71 members, respectively). Not surprisingly, Control 1 performs much better with RTV and NFV, as now, nearly all sequences in the data set are resistant to either drug. Also unsurprisingly, Control 2 performs worse because the two clusters that were removed contained sequences whose responses to all drugs were highly correlated. The cluster-based classifier still has the highest classification accuracy, but again, the RMSE values were sometimes greater than those for Control 2. Nevertheless, these results show that an understanding of major phenotypic resistance patterns can allow for reasonable prediction of a sequence's resistance toward one drug given resistance information toward other drugs, and the strong performance of the controls under certain circumstances further highlights the underlying structure in the resistance patterns.

The accumulation of HIV protease mutations results in a "path" in phenotypic space

Principal component analysis (PCA) was used to project the nine-dimensional, columnwise-centered drug-resistance phenotypes of all sequences onto the two dimensions along which there is most variation. Figure 2 is a plot of the sequences in this two-dimensional space, colored by the total number of amino acid differences from consensus-B wild type protease (considering all 99 amino acid positions). The first two principal components are able to capture approximately 90% of the variation in the data, again suggesting that there are large correlations between drug responses toward the sequences. As indicated in Table 4, the first principal component indicates resistance toward all drugs (i.e., complete cross resistance), with slightly less resistance toward TPV and DRV, relative to their means. The second principal component indicates resistance toward NFV and RTV, less resistance to ATV, SQV, and IDV, and low resistance or even increased susceptibility toward APV, LPV, DRV, and especially TPV, relative to each drug's mean resistance value.

Figure 2
figure 2

Projection of the phenotypic data onto its first and second principal components. Points are colored by the total number of amino acid substitutions relative to the consensus B WT sequence, according to the scale at right; a mixture at a position (including those containing the WT amino acid) is counted as one substitution. The phenotypes and genotypes of selected sequences are indicated. The 9-digit shorthand phenotypic code used to describe the sequences indicates the resistance score (Table 2) to each of the 9 PIs in the order shown in Fig. 1: RTV, NFV, ATV, APV, IDV, LPV, SQV, TPV, DRV. All "outlying" sequences are fully listed in Supplementary Information (Fig. S4, Additional File 1).

Table 4 The nine principal components in scaled phenotypic space.

Interestingly, the points in Figure 2 form a "path" through phenotypic space. Such "horseshoe"-shaped paths are often indicative of a non-linear ordering or underlying gradient in the data[94]. Here, the path clearly tracks the genotypic mutations accrued by the sequences. Sequences with few mutations appear to have resistance toward NFV, RTV, ATV, SQV, and IDV, but little resistance to APV, LPV, DRV, or TPV (i.e., the phenotypic path "veers upward" in the principal component space), while sequences with many mutations are resistant to all drugs (far right in the principal component space). Three sequences along the path are selected in Figure 2 and their corresponding scaled phenotypes and genotypes are listed to the right of the plot. The point selected on the intermediate portion of the path represents a sequence that includes the mutations M46I and L90M, which have been shown to be highly correlated[59] and to be associated with resistance to NFV, IDV, and RTV, and other drugs to a lesser extent[56]. The point selected at the right end of the path represents a sequence that shows at least moderate resistance to all drugs, and includes the mutations V82T, I84V associated with resistance to TPV[18], and L33F, I47V, and I54M, associated with resistance to both TPV[18] and DRV[20], in addition to containing mutations that harbor resistance toward first-generation drugs.

As a whole, Figure 2 supports the historical "path" of drug development, in that it is relatively easy to become resistant to first-generation drugs with relatively few mutations (RTV, NFV, SQV, etc.), but many accumulated mutations appear to be necessary to confer resistance to the newer drugs, such as darunavir[16, 19]. Whether or not this pathway is due to history and treatment regimens or whether it is a fundamental consequence of the structural features of the drugs and the viable evolutionary space of HIV-1 protease requires further study.

A handful of sequences lie "off" the pathway. Three such sequences are indicated in Figure 2, and several more are listed in Fig. S4 (Additional file 1). The top and bottom sequences indicated in Figure 2 are both uniquely susceptible to SQV and have the mutation V82L which has been associated with increased SQV susceptibility[26]. The middle sequence shows low levels of resistance across all nine drugs. All three of these sequences fall off the pathway because of their non-negligible levels of resistance toward one or more second-generation drugs while maintaining susceptibility to one or more first-generation drugs. Additional outliers are shown in the Supplementary Information (Additional File 1).

Phenotypic Similarity Does Not Imply Genotypic Similarity

Figure 3a is a plot of scaled phenotypic distance vs. genotypic distance for all (398*397)/2 = 79003 sequence pairs, using all amino acid positions to compute genotypic distances. Not surprisingly, sequences that are genotypically similar are phenotypically similar; there are no points in the upper-left corner of the plot. However, there are many sequences that are very different genotypically and yet have similar scaled resistance phenotypes (there are many points in the lower-right corner), suggesting that HIV-1 may arrive at the same multidrug resistance phenotype via rather varied genotypes. Figure 3b is again a plot of all pairwise phenotypic distances vs. their genotypic distances, except now, only the resistance-associated positions specified in the Methods have been included in calculating genotypic distance. While the upper left corner of this plot is still sparse, this plot indicates that polymorphic or accessory positions not considered in genotypic distance may still affect resistance profiles in the absence of mutations commonly associated with drug resistance (i.e. there are pairs of sequences with a genotypic distance of zero in Figure 3b but a moderate phenotypic distance). Again, there are still sequences that are genotypically very different yet show similar resistance phenotypes.

Figure 3
figure 3

Pairwise phenotypic distance vs. pairwise genotypic distance for all pairs of sequences. (a) Scaled phenotypic distance vs. genotypic distance with all positions considered in calculating genotypic distance and (b) with only resistance-associated positions used in calculating genotypic distance. The density of points is colored according to the scale at right. The sequence pairs corresponding to two points are indicated. The 9-digit shorthand used to describe the two pairs of sequences indicates the resistance score (Table 2) to the PIs in the order used in Fig. 1: RTV, NFV, ATV, APV, IDV, LPV, SQV, TPV, DRV.

Mutations from two sample pairs of sequences from the lower-right quadrant of each figure are shown. In Figure 3b only the mutations contributing to the genotypic distance are shown. As can be seen, very different genotypes can generate similar resistance patterns. For example, the sequences shown in the lower box at the right of Figure 3a show high levels of resistance toward all drugs; each sequence has a subset of documented drug resistance mutations, such as V32I, L33F, M46I, I47V, F53L, G73S, V82A, and L90M in the case of the first sequence and M46L, I54V, V82F, and I84V in the case of the second sequence, but the sequences have few mutations in common (K20R, E35D, M36I, L63P, A71V, and I93L), most of which are considered highly polymorphic accessory mutations[95]. The variety of mutations through which the protease is able to achieve similar multidrug clinical phenotypes demonstrates that phenotypic similarity does not imply genotypic similarity. Recall here that two sequences that are both sufficiently above the clinical fold-change cutoff for resistance for a given drug are both considered phenotypically identical toward that drug, due to the capping of scaled resistance values above a threshold. Therefore, while they are phenotypically similar from a clinical perspective, they may possess quite different (but both large enough to be considered resistant) raw fold-change values toward a given drug.

Another way to understand the genotypic variation for a given phenotypic pattern is to analyze the genotypic diversity within each phenotypic cluster. For each individual phenotypic cluster obtained in the above analysis, we used a k-medoids approach to identify representative genotypes for that cluster. Through constraints, a more genotypically diverse phenotypic cluster would require more sequences to represent it. Figure 1 shows the representative sequences chosen for all phenotypic clusters. As can be seen, two clusters (5 and 9), even though they are of similar sizes to others, require multiple representative genotypic sequences. Multiple representative sequences for a cluster suggest multiple genotypic paths to the phenotype.

To quantify phenotypic and genotypic diversity within clusters, resampling was carried within each cluster as described in the Methods. Table 5 summarizes the results for all clusters with more than 6 members. The p-values for intracluster phenotypic distance ("P Pheno") show significantly low variation, but hard constraints in the clustering enforced phenotypic similarity so this low variation is by design. It is also not surprising that the genotypes of non-resistant clusters are also statistically similar (bootstrap studentized statistics for clusters 34 and 36 are -11.3 and -13.3), as none of these sequences would be expected to bear a resistance-associated mutation, so they should all effectively be "wild-type". However, among multidrug resistant phenotypes, there is either no more or no less genotypic variation between members within a cluster than there is between any two random sequences in the data set (insignificant "P_Geno" values), or there is more genotypic variation than would be expected by random sampling in the cases of clusters 5 and 7 (P_Geno < 0.01; bootstrap studentized statistics are 2.26 and 2.16). Furthermore, on average, pairs of sequences from the same cluster generally share less than 50% of their mutations (using resistance-associated positions listed in the Methods); the one exception is the cluster containing sequences resistant to all drugs (cluster 1), whose members share 54% of their mutations on average; indeed the average intra-cluster genotypic distance for this cluster is in some cases less than that for clusters containing fewer mutations on average, suggesting that a higher number of mutations may not mean greater genotypic variation, and also indicating that the most highly resistant sequences might need to have some "key" mutations in common. When removing from the data set one from each pair of 28 sequences from the same patient at two different time points and reclustering, the most highly resistant cluster still had >50% shared mutations on average and a lower intra-cluster genotypic distance than some other resistant clusters, although it now required two representative sequences, suggesting that some - but not all - of this similarity may be due to including data at different time points from the same patient. This idea is further addressed in the Discussion. Nevertheless, while a larger data set would allow for a more rigorous control for the number of mutations within a cluster when computing p-values and for the exclusion of data from the same patients at multiple time points, thus allowing for fairer comparisons, this simple analysis suggests again that in general, phenotypic similarity does not imply genotypic similarity, and certain multidrug phenotypes may be achieved by more varied genotypes than others.

Table 5 Statistical analysis of phenotypic and genotypic variability within each cluster containing 6 or more members.

Feature selection uncovers important positions and mutations for cluster assignment

Finally, we sought to rigorously determine sets of amino acid positions and mutations that were most informative of membership in the phenotypic clusters. Figure 4a shows the results of greedily selecting one position at a time such that at each step (going left to right), the (approximate) mutual information (MI) between the chosen set of features and the cluster assignment is maximized. Only those positions that had significant MI with the output are included. The red bars indicate the MI between an individual position and the cluster assignment, with the yellow star indicating the threshold for statistical significance (p = 0.01). The blue bars indicate the joint MI between the subset selected thus far and the cluster assignment. Note that positions are not strictly selected in decreasing order of individual MI. Because mutations at certain positions may be highly coupled with positions already in the feature set, less individually informative positions may contribute to a more informative set of positions. This technique therefore chooses highly non-redundant features that are still informative of the output. Finally, the black bar shows the total information content of the output, the cluster assignments.

Figure 4
figure 4

Stepwise selection of HIV-1 protease positions and mutations that are important for cluster assignment. (a) Stepwise selection of HIV-1 protease positions (from left to right) such that at each step, the mutual information between amino acids at positions selected so far and the cluster assignment is maximized. Red bars indicate (bias-adjusted) MI between each individual position and cluster assignment. ‘x’s’ are the standard deviation of MI estimation for two independent variables, and asterisks indicate the threshold for statistical significance (p=0.01). Blue bars are the estimated joint MI between the subset chosen and the cluster assignment. The black bar indicates the total information content of the cluster assignments. (b) Stepwise selection of the most informative amino acid identities at specific positions for assignment into phenotypic clusters. (c) and (d): Stepwise selection of particular amino acid identities whose collective presence or absence are maximally informative of membership specifically into cluster 1 (c) and cluster 36 (d).

Figure 4a indicates that several positions have significant MI with the final cluster assignment, especially positions 54, 90, 84, 46, 33, 20, 82, 32, 88, and 71. This is consistent with findings that these positions are known to mutate in the presence of drug resistance, either as primary or accessory mutations[4, 47, 48]. Collectively, these positions are computed to be nearly as informative of ultimate cluster assignments as the entire set of positions considered. The fact that position 54 is chosen as the most informative feature is not surprising, given the large range of drug-resistant mutations commonly found at this position and their varied effects toward certain drugs as either primary or secondary mutations; I54L, I54M, I54V, etc., can have different consequences toward drugs such as TPV, DRV, and APV[4, 95] Also interesting is the redundancy of position 10 and, to a lesser extent, position 71; although position 10 has a high mutual information with the cluster output, it does not provide additional information once the identities at the ten positions listed above are known. Position 71 provides some additional information but is also quite redundant. These results are consistent with the amino acids at positions 10 and 71 both being highly correlated with those at other positions such as 54, 90, 82, 84, and others[54, 55, 59], as it is believed that mutations at these positions can be compensatory in nature[54, 55, 96]. Finally, one should note that the approximate joint MI calculated between all of the positions and the output is still quite less than the true information content of the output, suggesting that amino acids considered at all positions still may not result in perfect prediction of these output data. This is likely due to the true importance of higher-order information (i.e. patterns of three or more amino acids occurring together) in contributing to ultimate phenotypes - the importance of which has been noted previously[61] - as well as noise in the measurement and clustering of the phenotypic data, thus highlighting the inherent difficulty of accurately predicting phenotype from genotype in these complex systems. The limitations of the second-order approximation also result in the approximated total joint mutual information between the features and the output (blue bars) failing to be monotonically increasing as they would be were an exact calculation feasible, again highlighting the complex relationship between various protease positions and phenotype.

Figure 4b shows the specific amino acid identities calculated to be most informative of ultimate cluster assignment. Here, key resistance mutations are chosen that cause broad resistance to many of the older drugs, such as L90M and I84V. At positions that can bear several identities, such as 54, 46, and 82, the selection of the wild type amino acid suggests the importance of the lack of any mutation at these positions in determining cluster assignment.

Figures 4c and 4d show sample results for mutations that are informative of assignment into specific clusters - cluster 1 (c), the most resistant cluster, and cluster 36 (d), the completely nonresistant cluster. All other results for clusters with 8+ members are shown in Figure S5 (Additional File 1). Figure 4c indicates that the amino acid identities most informative of membership into the "most" resistant cluster include several mutations that have been associated with resistance to DRV[97] including V11I, L33F, V32I, L89V, and G73S, as well as mutations such as I84V and L90M that are associated with broad cross resistance toward other PIs.

Finally, many of the informative residues in the non-resistant cluster (Figure 4d) are actually wild type amino acids. This suggests that the lack of mutations at these positions correlates with low levels of resistance. Additionally, several mutations listed are at accessory positions such as 10, 63 and 36, suggesting that mutations at these positions are reasonable markers for "any" resistance in general. It is important to note that while this method highlights which mutations are most informative of cluster assignment, it does not identify whether it is the presence or the absence of the mutation that is associated with cluster membership.

Discussion

This study highlighted major patterns of phenotypic resistance across all nine clinically-approved HIV-1 PIs. Cluster analysis yielded several phenotypic patterns, including clusters showing resistance to all drugs, all but one specific drug (such as TPV, SQV, or DRV), a large subset of drugs, a small subset of drugs, and only one drug (such as NFV or ATV). Through choosing representative sequences for each phenotypic pattern, we have corroborated previously reviewed observations[4, 27, 29] that mutations such as L33F, V82A, I84V, and L90M are associated with broad cross resistance, while others, such as D30N and I50L are associated with resistance to only one drug and still others such as I47A and I54L are linked with hypersusceptibility toward a given drug. While we have uncovered a variety of phenotypic patterns, not every possible resistance pattern was sampled, suggesting that cross resistance and other factors cause highly correlated drug responses, assuming our data set is representative. Indeed, our considered isolates occupy only a small portion (~6%) of the available, clinically-relevant phenotypic space. For example, no cluster shows a moderate or high level of resistance toward DRV without resistance to several other drugs, including APV and LPV. Whether this result is due to patient treatment histories or the intrinsic properties of the drug--protease interactions requires further study. If the latter is at least partly the case, it corroborates the observation that DRV may have a higher genetic barrier to resistance[16, 19].

TPV's response toward sequences often shows little relationship to other drugs' responses. The relative lack of cross resistance to TPV may make it particularly useful[14] in conjunction with other inhibitors to "cover" the mutation space of the virus. TPV's differing response profile may follow from its unique structural characteristics. It is the only clinically-approved inhibitor that does not use a water molecule to mediate hydrogen bonds with the flap regions of the protease, suggesting the importance of developing structurally diverse drug molecules toward a target as a strategy to combat resistance[98].

The representative sequences of four clusters (29, 31, 34, and 36) had no mutations at the 21 positions considered in computing genotypic distance for this purpose, and yet their phenotypes were not identical on average. This suggests a potential role for mutations at other positions that may not be associated with primary drug resistance. A rigorous study that analyzes the differences in mutation frequencies in such clusters and considers their impacts on the susceptibilities of individual cluster members, while beyond the scope of the current work, would be interesting potential future work, especially when more data are available.

We demonstrated that phenotypic clustering may allow for prediction of resistance to a particular drug based only on resistance information toward other drugs and no genotypic information. While our goal was not to develop a prediction method that is superior to the available genotypic-based methods specific to each drug, especially as it may be rare to have multidrug phenotypic data available, it is interesting to assess how well our "genotype-blind" method performs when compared to genotype-based methods. Rigorous comparisons to mean standard error values in other studies are difficult due to different scaling and capping procedures used here for phenotypic standardization. Nevertheless, some studies used a Pearson correlation coefficient (R) between predicted and actual log-fold-change as a measure of accuracy. R values for PIs available at the time of selected studies ranged from 0.85-0.97[69], 0.65-0.93 (across multiple methods)[67], and 0.78-0.89[64]. From the cross-validation procedure used to generate Table 3, our "genotype-blind" method gave R-values ranging from 0.84-0.94 using all 398 data set members, with the exception of TPV, although these numbers may be artificially high due to our capping of extreme values. Predictions of resistance to TPV had an R value of only 0.45, consistent with the observed difficulty in predicting TPV resistance based on the phenotypes shown toward other drugs. Finally, our reported classification accuracies are lower than those reported for genotype-based predictions, but this is partly because we use five categories as opposed to the binary or 3-way classifications commonly used. If we adopt a naive binary classification scheme (scaled resistance < 1.0 is not resistant; scaled resistance >= 1.0 is resistant), our cluster-based classification accuracies using the n-fold cross validation procedure for the entire data set range from 85%-95% excluding TPV(79%), compared with 85%-95% for binary classification schemes reported in the literature [65, 72, 74] (TPV and DRV were not part of these studies). It is interesting to note that while not the major goal of our paper, we have shown that with the exception of TPV, it may be possible to approach comparable drug resistance prediction accuracy without any genotypic information; this level of accuracy demonstrates the restricted phenotypic space occupied by the virus.

Our analysis was limited by the number of accessible isolates that have each undergone phenotypic resistance testing against all nine inhibitors. A large priority for future work is acquiring enough data such that the number of clusters is robust to the data set size such that one could be confident that all or nearly all phenotypic patterns have been sampled. One strategy is to pool isolates phenotyped by different assays to bolster the amount of data; indeed, a preliminary clustering was carried out in which the data analyzed here were combined with 196 isolates phenotyped using the Antivirogram (Virco, Mechelin, Belgium)[99] assay, but differences between the assays may have subtle but important effects on the interpretation of scaled resistance values, even when using cutoffs specific to each assay, creating potential artifacts in the clustering (we obtained 67 clusters with the combined data set, a larger number than expected given the pattern shown in Fig. S3). More data would allow for larger cluster sizes in general, and therefore a higher confidence in associating certain genotypic features with cluster assignments; one could also look for differences in phenotypes between virus subtypes if such data were concurrently available. Additionally, more data may allow for cluster sizes to accurately represent the relative frequencies of phenotypes within the population and would allow us to exclude isolates containing mixtures at key positions; such an exclusion would have been too restrictive with the amount of data currently accessible.

Finally, larger clusters would also allow us to account for and potentially exclude sequences that may be from the same patient at different times, allowing for more robust conclusions to be made about the genotypic variability within a cluster. Preliminary analyses were conducted in which one sequence each from 28 sequences pairs from the same patient in our data set was arbitrarily excluded (even if the pair differed significantly in genotype), yielding a 370-member set. Qualitative results of genotypic variability remained similar, in that several resistant clusters showed as much or more genotypic diversity than randomly chosen data set members, although again, the most resistant cluster showed a higher percentage of shared mutations between cluster members on average even though it now required two representative sequences. 48 clusters were needed to cluster the "unique-patient" data set as opposed to 50 for the original data set, suggesting that data from the same patient taken at different time points can provide additional phenotypic diversity. 98% of sequence pairs grouped together in the smaller data set were grouped together in the original data set, showing that the overall clustering remained very similar.

Since the time the manuscript had been originally drafted, we obtained approximately 50 more isolates, and we have carried out very preliminary analyses of a larger (n = 453) data set including these new sequences. 52 clusters were needed to group the data using the same constraints with the original data set, and the phenotypic patterns of most clusters were identical or highly similar; 86% of sequences pairs that had been grouped together originally remained together in the clustering of the larger data set. We also used our original (n = 398) clusters to predict resistance to each drug for each of the new isolates, using the other drugs' resistance values to select the closest centroid (i.e., the same procedure used in the n-fold cross validation). Scaled resistance scores (0-4) were predicted correctly from 66%-82% of the time, depending on the drug; interestingly, predictions for TPV (67%) and DRV (82%) were better than seen in the n-fold cross validation, while those for NFV (66%) and RTV (76%) were worse. Prediction accuracy may be affected by the points in time at which the data were obtained, as resistance patterns may change over time.

Treatment histories were not entirely available for the current data set; acquiring such information and analyzing future data in their context can provide additional insights. For example, one could determine the extent to which treatment histories affect the "path" seen in Figure 2 and the dependence on individual multidrug resistance phenotypes on past treatment; such analyses could highlight the extent to which treatment histories affect the genotypic variation within a phenotypic cluster.

While the methodology and analyses were applied here to the HIV-1 protease system, the framework is generally applicable to any system for which there are phenotypic data across multiple drugs. In addition to continuing to analyze HIV-1 protease as the available data grow, another natural next step is to apply these methods to the HIV-1 nucleoside or non-nucleoside reverse transcriptase inhibitor systems and to compare the patterns of cross resistance within those systems with the ones obtained in the present study. By rigorously studying phenotypic resistance patterns of multiple systems, one may begin to address more general ideas, including whether cross resistance has equally affected all target systems and whether potential genotypic diversity within phenotypic clusters is a general feature of target systems.

Conclusions

To our knowledge, this study provided the first cluster-based analysis of the clinically-explored multidrug phenotypic space of HIV-1 protease, uncovering major multidrug patterns of resistance, cross resistance, and potential hypersusceptibility. We showed that while genotypic similarity implies clinical phenotypic similarity, the converse is not necessarily the case. We also provided genotypic determinants of phenotypic patterns. Rather than consider each drug in turn, as others have done, we have accounted for their relationships and collapsed the vast nine dimensional space into a smaller one through clustering, allowing us to consider genotypic features that are associated with a simultaneous nine-drug response. We have therefore provided a new perspective on existing drug resistance patterns and their associated genotypic features. Such a framework will be useful as new therapies emerge and will require evaluation in the context of existing drug resistance.