Background

AIDS (Acquired Immunodeficiency Syndrome) is one of the most severe pandemic diseases, and approximately 35.5 million people were infected in 2012 [1]. It has been almost three decades since the first case of AIDS was found in the US and the cause of AIDS was identified as HIV (Human Immunodeficiency Virus) [2]. Currently, a total of 26 licensed drugs are used in anti-AIDS therapy [3]. These drugs target different steps during the HIV life cycle, including viral entry, reverse transcription, integration and maturation. HIV protease (PR) is the enzyme responsible for processing viral precursor proteins after budding of the virus from the host cell during the maturation stage of the viral life cycle [4]. PR inhibitors block the proteolytic activity, preventing formation of the infectious virus [5, 6]. HIV reverse transcriptase (RT) converts the viral RNA genome into DNA during the early stages of the HIV life cycle. The nucleoside analog zidovudine (AZT), which inhibits RT, was the first FDA-approved anti-AIDS drug [7, 8]. The HIV RT inhibitors can be categorized into two classes: nucleoside/nucleotide analog reverse transcriptase inhibitors (NRTIs) and non-nucleoside reverse transcriptase inhibitors (NNRTIs). NRTIs are structural analogs of natural nucleosides or nucleotides and compete with the enzyme's natural substrates during the reverse transcription step. NNRTIs specifically target a separate site on HIV-1 RT to decrease its enzymatic activities [9].

During typical anti-AIDS treatment, which is often referred to as highly active antiretroviral therapy (HAART), three or more antiretroviral drugs chosen from different categories are given to patients. Such treatment extends the lifespan of the patients [10].

HIV is a member of the retrovirus family [11], and its genomic information is carried by RNA [11, 12]. Due to the lack of proofreading by RT [13] and the high replication rate of as many as $10^{9}$ viral particles daily [14], drug resistance is one of the most severe challenges for successful long-term AIDS therapy [15, 16]. Drug pressure causes the selection of resistant viral strains, which can replicate in the presence of drugs [17, 18]. This drug resistance can cause the failure of antiviral therapy. Two strategies have been pursued to overcome the challenge of drug resistance. First, in the clinic, genotyping of the infecting virus is used to guide the choice of effective drugs for therapy. Drug resistance can be predicted from genotype data by a variety of algorithms [19-22], including our approach of applying a structure vector from Delaunay triangulation [23, 24]. Second, research to understand the molecular mechanisms of drug resistance is important and could help in the design of new drugs for improved anti-AIDS therapy.

Several possible mechanisms have been described for drug resistance [25, 26]. Laboratory studies can only be performed on a small number of mutants. However, a huge number of possible mutants can occur, since HIV has a high mutation rate of about $10^{-4}$ to $10^{-5}$ mutations per nucleotide per cycle of replication [26] and a naturally polymorphic genome. Taking HIV PR as an example, mutations of more than thirty different residues have been associated with resistance to PR inhibitors [16]. Moreover, multiple mutations accumulate as the virus evolves higher levels of resistance [27, 28]. For instance, we have studied a PR mutant with 20 substitutions, which shows more than 1000-fold worse binding to the inhibitors darunavir (DRV) and saquinavir (SQV) compared to wild-type PR [29]. Therefore, considering the huge number of possible mutants, can a tractably small number of mutants be identified as the most representative of high-level resistance? Answering this question could save both time and money, and facilitate the study of drug resistance mechanisms.

One approach to selecting a small number of meaningful mutants uses mean shift clustering, which was first introduced in 1975 by Fukunaga and Hostetler [30] for the purpose of seeking the modes of a density function in a given sample set. Fukunaga and Hostetler [30] also suggested that mean shift is an instance of gradient ascent that uses a decreasing function of the distance from a given point to each point in the sample set; this function is often referred to as a kernel. The algorithm became more widely used after 1995, when Cheng [31] developed a more generalized formulation. By clarifying the relationship between mean shift and optimization, that work showed that the algorithm can be applied to clustering and global optimization problems by declaring each mode as the representative of one cluster and assigning each data point to the mode to which it converges. Applications of the mean shift algorithm include image/video segmentation, image representation/retrieval, discontinuity-preserving smoothing [32, 33], higher level tasks like appearance-based clustering [34, 35], tracking including blob tracking [36] and face tracking [37], and shape detection and recognition [38]. Subsequently, applications of this algorithm were extended to other fields like biology. These applications include analysis of structural variation in genomes [39], DNA microarray analysis [40], and time-warped gene expression analysis [41].

In this paper, we propose a new algorithm based on non-parametric iterative mean shift clustering and our recently reported protein encoding method to extract the most representative drug resistant mutants from the Stanford HIV database [42].

Results

Mean shift clustering, multiple regression and quantile analysis were performed on the data for both HIV-1 PR and RT mutants whose sequences and structures were encoded by Delaunay triangulation.

Mean shift clustering on HIV protease inhibitor resistance

After each of the mutated sequences was represented by a 210-dimensional vector, we performed mean shift clustering on the drug resistance data to select the most representative mutants. Data were analyzed for the PR inhibitors atazanavir (ATV), nelfinavir (NFV), ritonavir (RTV), indinavir (IDV), lopinavir (LPV), tipranavir (TPV) and saquinavir (SQV). The results show that the larger the bandwidth, the fewer mutants were selected. The plot for the PR inhibitor ATV is given as a representative example in Figure 1.

Figure 1

The relationship between the bandwidths and the number of selected mutants for PIs. The bandwidth is plotted against the number of selected mutants. The trend line is shown in blue. Plots show regression for drug resistance to ATV.

Mean shift clustering on HIV reverse transcriptase inhibitor resistance

Similarly, mean shift clustering was performed on the drug resistance data for HIV-1 RT inhibitors. The bandwidth and the number of selected mutants were compared for the RT inhibitors, including the NRTIs lamivudine (3TC) (Figure 2), abacavir (ABC), zidovudine (AZT), stavudine (D4T), didanosine (DDI) and tenofovir (TDF), and the NNRTIs nevirapine (NPV) (Figure 3), delavirdine (DLV), and efavirenz (EFV).

Figure 2

The relationship between the bandwidths and the number of selected mutants for NRTIs. The bandwidth is plotted against the number of selected mutants. The trend line is shown in blue. Plots show regression for drug resistance to 3TC.

Figure 3

The relationship between the bandwidths and the number of selected mutants for NNRTIs. The bandwidth is plotted against the number of selected mutants. The trend line is shown in blue. Plots show regression for drug resistance to NPV.

Multiple regression on HIV protease inhibitor resistance

Afterwards, multiple regression was applied to the selected mutants to evaluate the selection. The R2 values for relative resistance were plotted against the number of selected mutants, as shown in Figure 4 for the PR inhibitors ATV, NFV, RTV, IDV, LPV, TPV and SQV. The x-axis is the number of selected mutants, while the y-axis is the R2 value obtained by applying multiple linear regression to the selected protein sequences and their drug resistance values.

Figure 4

The relationship between the multiple regression results and the number of selected mutants. The R2 value is plotted against the number of selected mutants. The trend line is shown in blue. Plots show regression for resistance to drugs: (A) ATV, (B) NFV, (C) RTV, (D) IDV, (E) LPV, (F) TPV, and (G) SQV.

Multiple regression on HIV reverse transcriptase inhibitor resistance

Multiple regression analysis was performed similarly on genotype-phenotype data for drugs inhibiting HIV-1 RT. The R2 values for relative resistance were plotted against the number of selected mutants for the NRTIs 3TC, ABC, D4T, DDI, TDF and AZT (Figure 5), and for the NNRTIs NPV, DLV and EFV (Figure 6).

Figure 5

The relationship between the multiple regression results and the number of selected mutants. The R2 value is plotted against the number of selected mutants. The trend line is shown in blue. Plots show regression for resistance to drugs: (A) 3TC, (B) ABC, (C) D4T, (D) DDI, (E) TDF and (F) AZT.

Figure 6

The relationship between the multiple regression results and the number of selected mutants. The R2 is plotted against the number of selected mutants. The trend line is shown in blue. Plots show regression for resistance to drugs: (A) NPV, (B) DLV and (C) EFV.

Bandwidth selection and multiple regression on HIV-1 PR and RT inhibitor resistance

The following experiments were performed to test the consistency of the mutants selected at different R2 targets. The results above indicate that the larger the bandwidth, the fewer representative mutants are selected, and the lower the resulting R2. If the selection is meaningful, the mutants selected at a lower R2 should be a subset of, or overlap heavily with, the mutants selected at a higher R2. Based on the above results, in this experiment the higher R2 target was set to be above 0.80 and the lower R2 target was set to 0.60. The group of mutants common to both selections (the overlap group) was then calculated.

The overlap group of mutants is a large fraction of those selected for the lower R2, which suggests that the procedure selects representative mutants consistently. The fractional overlap ranges from 0.79 to 0.94 for HIV PR inhibitors and from 0.89 to 0.94 for NNRTIs (Tables 1, 2). For NRTIs, the results in Figure 5 show that the R2 value does not increase smoothly as the number of selected mutants increases. There are many ripples in the plots, making it difficult to select the R2 cutoffs in this experiment. Therefore, the overlap analysis was not performed for NRTIs.
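The fractional overlap can be illustrated with a short sketch. It assumes, following the description above, that the denominator is the set selected at the lower R2 target; the function name and the mutant identifiers are hypothetical.

```python
def fractional_overlap(selected_low, selected_high):
    """Share of the lower-R2 selection that also appears in the higher-R2 selection."""
    low, high = set(selected_low), set(selected_high)
    return len(low & high) / len(low)


# Hypothetical identifiers: three of the four low-R2 mutants are also
# present in the larger high-R2 selection.
example = fractional_overlap(["m1", "m2", "m3", "m4"],
                             ["m1", "m2", "m3", "m7", "m8"])
```

A value close to 1 means that the smaller selection is nearly a subset of the larger one.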

Table 1 The number of selected mutants and R2 for HIV-1 PR Inhibitors.
Table 2 The number of selected mutants and R2 for HIV-1 RT NNRTIs.

Quantile information analysis on HIV-1 protease inhibitor resistance

In order to further analyze the mutants selected by mean shift, quantile information analysis was performed and the result indicates that the proposed algorithm could successfully cluster the datasets, and pick the potentially most drug resistant mutants from the cluster centers (Tables 3, 4). In the tables, the numbers are given for selected/total number in each bin, and R2 used here is around 0.70.

Table 3 Number of selected mutants in each bin for PIs.
Table 4 Selected ratios in each bin for PIs (%).

Bin I includes the mutants with the least resistance to each inhibitor, while bin X has the mutants with the highest resistance. As shown in Table 4, the selected ratio in bin X is larger than that in bin I. This result suggests that the mutants vary more in the drug resistant category than in the non-drug resistant one.

Quantile information analysis on HIV-1 reverse transcriptase inhibitor resistance (NRTIs)

In order to further analyse the mutants selected by mean shift, all the drug resistant mutants were grouped and separated into 10 bins based on their drug resistance value. Both the total number of mutants and the selected number of mutants were counted and recorded in each corresponding table (Tables 5, 6). In the tables, the numbers are selected/total number in each bin, and R2 used here is around 0.70. Similar to the results for PIs, as shown in Table 6, the selected ratio in bin X is larger than that in bin I.

Table 5 Number of selected mutants in each bin for NRTIs.
Table 6 Selected ratios in each bin for NRTIs (%).

Quantile information analysis on HIV-1 reverse transcriptase inhibitor resistance (NNRTIs)

In order to further analyze the mutants selected by mean shift, all the drug resistant mutants were grouped and separated into 10 bins based on their drug resistance value. Both the total number of mutants and the selected number of mutants were counted and recorded in each corresponding table. In the tables, the numbers are selected/total number in each bin, and R2 used here is around 0.70. Tables 7 and 8 show the number of mutants in each bin before and after selection.

Table 7 Number of selected mutants in each bin for NNRTIs.
Table 8 Selected ratios in each bin for NNRTIs (%).

Analysis of multi-drug resistance information for the most highly resistant mutants of HIV-1 protease

In order to further analyze the mutants selected by mean shift in the most drug resistant category (bin X), those mutants having resistance to multiple drugs were picked and compared. The results show that the more inhibitors a mutant is resistant to, the fewer representative mutants appear (Figure 7). The number of mutants representing high resistance to three or more PIs falls to between one and three, which is few enough to be verified by in vitro experiments.
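The counting behind this comparison could be sketched as follows. The sketch assumes that a mutant counts towards m inhibitors when it falls in bin X for at least m of them; the function name, the input layout and the example identifiers are hypothetical.

```python
from collections import Counter


def multi_drug_counts(top_bin_by_drug):
    """Map m -> number of mutants found in bin X for at least m inhibitors.

    top_bin_by_drug: dict mapping an inhibitor name to the mutant ids in its bin X.
    """
    hits = Counter(m for mutants in top_bin_by_drug.values() for m in set(mutants))
    n_drugs = len(top_bin_by_drug)
    return {m: sum(1 for c in hits.values() if c >= m) for m in range(1, n_drugs + 1)}


# Hypothetical bin X memberships for three inhibitors.
example = multi_drug_counts({
    "ATV": ["a", "b", "c"],
    "NFV": ["a", "b"],
    "SQV": ["a"],
})
```

Running the same count on the full bin X and on the selected mutants only would give the two curves being compared.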

Figure 7

The relation between total number of mutants and number of resistant inhibitors for PIs. The red line shows the result for the original data in bin X; while the blue line shows the result for the selected mutants in bin X.

Analysis of multi-drug resistant information for the most highly resistant mutants of HIV-1 reverse transcriptase

Similar results are also obtained for NNRTIs. As shown in Figure 8, the number of mutants representing high resistance to two NNRTIs falls to values of 9-12, and when all three drugs are considered, only four mutants are representative of multidrug resistance. These low numbers of mutants can be verified by in vitro experiments.

Figure 8

The relation between total number of mutants and number of resistant inhibitors for NNRTIs. The red line shows the result for the original data in bin X; while the blue line shows the result for selected mutants in bin X.

Discussion

The serious problem of drug resistance arising during therapy of HIV-infected individuals can cause failure of the treatment. Many scientists are working on revealing the drug resistance mechanisms using a variety of experimental techniques. However, since there is an extremely large number of mutants, it is difficult to choose representative mutants for detailed research in the laboratory.

In this study, we have developed a new selection algorithm based on a simple graph representation of protein structure to solve this problem. The three-dimensional protein structure can be efficiently represented by Delaunay triangulation [44]. Based on this encoding method, mean shift clustering was applied to select the most representative mutants. Multiple linear regression was performed to evaluate the selection results.

This selection algorithm works well for selecting drug resistant mutants from both HIV PR and RT inhibitor genotype/phenotype data. From all the mutants, around 250 of the most representative mutants were selected, with the number ranging from 215 to 360 for the different drugs (PIs and NNRTIs). The selection depends on the kernel bandwidth and the target R2 value. In this study, a selection was considered successful if its R2 value was above 0.60. After selection, multiple linear regression was applied to the drug resistance values of the selected mutants, and the R2 values fell in the range of 0.65 to 0.83, indicating excellent correlation. This high correlation suggests that the number of selected mutants could be decreased further if a lower target R2 value were applied.

Identifying a small number of representative mutants will enable laboratory studies of the molecular mechanisms of resistance, which is currently impossible due to the huge number of possible mutants.

Materials and methods

Datasets and data preparation

All the genotype-phenotype datasets were downloaded from the Stanford HIV drug resistance database [42] (http://hivdb.stanford.edu/pages/genopheno.dataset.html). The proposed algorithm was tested on both HIV-1 PR and HIV-1 RT resistance data sets. For HIV-1 PR, the inhibitors atazanavir (ATV), nelfinavir (NFV), ritonavir (RTV), indinavir (IDV), lopinavir (LPV), tipranavir (TPV) and saquinavir (SQV) were tested. For HIV-1 RT, the NNRTIs nevirapine (NPV), delavirdine (DLV) and efavirenz (EFV), and the NRTIs lamivudine (3TC), abacavir (ABC), zidovudine (AZT), stavudine (D4T), didanosine (DDI) and tenofovir (TDF) were tested.

All the datasets were pre-processed using the methods and the cutoff values described previously in [24]. The expansion for each of the HIV-1 PR inhibitors resulted in: a total of 16846 sequences from 1622 isolates with assays for IDV resistance; a total of 16269 sequences from 1322 isolates for LPV; a total of 10228 sequences from 744 isolates for TPV; a total of 17118 sequences from 1640 isolates for SQV; a total of 12084 sequences from 1012 isolates for ATV; a total of 17545 sequences from 1674 isolates for NFV; and a total of 16652 sequences from 1589 isolates for RTV.

For each of the HIV-1 RT inhibitors the expansion resulted in: a total of 11367 sequences were obtained from 746 isolates with assays for NPV resistance; a total of 11299 sequences from 732 isolates for DLV; a total of 11354 sequences from 734 isolates for EFV; a total of 4850 sequences from 633 isolates for 3TC; a total of 4846 sequences from 628 isolates for ABC; a total of 4847 sequences from 630 isolates for AZT; a total of 4845 sequences from 630 isolates for D4T; a total of 4849 sequences from 632 isolates for DDI; and a total of 2004 sequences from 353 isolates for inhibitor TDF.

Encoding structure and sequence with Delaunay triangulation

The sequence and structure of the protein were represented using a graph-based encoding as described in [43]. Delaunay triangulation was used to define a graph which spanned the protein structure and defined structurally adjacent pairs of amino acid residues. Adjacent pairs of amino acids were summarized into a vector of the 210 unique pairs of the 20 standard amino acids by calculating the distance for each adjacent pair in the structure and tabulating by the types of amino acids in that adjacent pair. Only the sequences of the mutated proteins are needed and only one protein structure is necessary. As a result, all mutants are represented as vectors of the same dimensionality, which is a desired property for most of the pattern recognition algorithms. The X-ray crystal structures 3OXC for HIV-1 PR, and 2WOM for HIV-1 RT (from http://www.pdb.org) were used as templates for Delaunay triangulation.
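The tabulation step can be sketched as follows. This is a minimal illustration rather than the implementation from [43]: it assumes the adjacent residue pairs and their distances have already been obtained once from the template structure by Delaunay triangulation, and that distances are summed per amino acid pair type. The names `PAIR_INDEX`, `encode` and `contacts` are hypothetical.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical

# Map each unordered pair of residue types to one of the 210 vector slots.
PAIR_INDEX = {
    pair: i
    for i, pair in enumerate(combinations_with_replacement(AMINO_ACIDS, 2))
}


def encode(sequence, contacts):
    """Return a 210-dimensional vector for one mutant sequence.

    contacts: iterable of (i, j, distance) for structurally adjacent
    residue positions (0-based), assumed to be precomputed from the
    single template structure.
    """
    vector = [0.0] * len(PAIR_INDEX)
    for i, j, distance in contacts:
        a, b = sorted((sequence[i], sequence[j]))
        # Assumption: distances are accumulated (summed) per pair type.
        vector[PAIR_INDEX[(a, b)]] += distance
    return vector
```

Because the contact list comes from one template, only the sequence changes between mutants, and every mutant maps to a vector of the same length (20 x 21 / 2 = 210).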

Regression analysis for drug resistance prediction

The genotype-phenotype datasets provide an experimentally measured drug resistance value, with respect to a certain type of drug, for each genotype. The mutants, encoded relative to a standard sequence, are denoted as $x_1, x_2, \ldots, x_N$ with $x_i \in \mathbb{R}^{210}$, where N is the total number of mutants and $\mathbb{R}^{210}$ is the space of structure vectors. The corresponding drug resistance values are denoted as the real numbers $y_1, y_2, \ldots, y_N$ with $y_i \in \mathbb{R}$, including 0 for the resistance value of the wild type virus. We then seek a linear model between the $x_i$'s and $y_i$'s by minimizing the cost function E:

$$E := \sum_{i=1}^{N} (y_i - A x_i - b)^2 \qquad (1)$$

with respect to the 210 dimensional vector A and scalar b.
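Equation (1) is an ordinary least-squares problem. The sketch below solves it through the normal equations with plain Gaussian elimination; this is one possible way to minimize E, chosen for self-containment, and not necessarily the solver used in this work. The function names `solve` and `fit_linear` are hypothetical.

```python
def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [rhs] for row, rhs in zip(M, v)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - known) / aug[r][r]
    return z


def fit_linear(X, y):
    """Minimize sum_i (y_i - A.x_i - b)^2 and return (A, b)."""
    rows = [list(x) + [1.0] for x in X]  # extra column of ones for the intercept b
    d = len(rows[0])
    gram = [[sum(r[p] * r[q] for r in rows) for q in range(d)] for p in range(d)]
    moment = [sum(r[p] * t for r, t in zip(rows, y)) for p in range(d)]
    coef = solve(gram, moment)
    return coef[:-1], coef[-1]
```

With 210-dimensional structure vectors, some pair types may never occur and the normal equations can become singular, in which case this sketch fails with a division by zero; a library least-squares routine or a regularized fit would be the more robust choice.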

Furthermore, in order to better utilize the available data set, we performed a k-fold cross-validation (in this work, k = 5). Specifically, the training set of size N is randomly divided into k groups. Among them, k-1 groups are utilized for constructing the linear model as in Equation (1). Then, the linear model is used to predict the drug resistance for the remaining group with N/k mutants. The predicted resistances are compared with the measured ones and the R2 values are recorded. Finally, the average and standard deviation of the k R2 values are computed.
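The evaluation step can be sketched as follows. The text does not give the formula used for R2, so the sketch assumes the coefficient of determination, 1 - SS_res / SS_tot, and the sample standard deviation across folds; both function names are hypothetical.

```python
from statistics import mean, stdev


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    centre = mean(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - centre) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def summarize_folds(fold_results):
    """Average and standard deviation of R2 over the k held-out groups.

    fold_results: list of (measured, predicted) pairs, one pair per fold.
    """
    scores = [r_squared(m, p) for m, p in fold_results]
    return mean(scores), stdev(scores)
```

A perfect prediction gives an R2 of 1, while predicting the mean of the measured values for every mutant gives 0.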

Mean shift clustering and bandwidth selection

The mathematical derivation of the mean shift algorithm was first introduced by Fukunaga and Hostetler [30], then adapted by Cheng [31], and later extended by Comaniciu, Meer, and Ramesh [45]. In the mean shift procedure, a gradient ascent is performed for each data point in the feature space until convergence. The stopping points of the procedure are the local maxima of the kernel density function, which can also be considered as the cluster centers.

Given N data points $\{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^{210}$, the kernel density estimate with Gaussian kernel $K(t) = e^{-t/2}$ for $t \geq 0$ is:

$$p(x) = \sum_{i=1}^{N} \pi_i \frac{1}{Z_i} K\left(d(x, x_i; \Sigma_i)\right),$$

where $\pi_i \in (0, 1)$ is the mixing proportion of point i (satisfying $\sum_{i=1}^{N} \pi_i = 1$), $\Sigma_i$ is its covariance matrix (positive definite), $Z_i = |2\pi\Sigma_i|^{1/2}$ is a normalization constant and $d(x, x_i; \Sigma_i) = (x - x_i)^T \Sigma_i^{-1} (x - x_i)$ is the Mahalanobis distance.

The dense regions of the data correspond to the local maxima of p(x), which can be found by seeking stationary points $\partial p(x) / \partial x = 0$. With equal mixing proportions and isotropic covariances $\Sigma_i = \sigma^2 I$, the mean-shift update is applied with the rule:

$$p(n \mid x) = \frac{\exp\left(-\frac{1}{2}\|(x - x_n)/\sigma\|^2\right)}{\sum_{n'=1}^{N} \exp\left(-\frac{1}{2}\|(x - x_{n'})/\sigma\|^2\right)}, \qquad x \leftarrow f(x) = \sum_{n=1}^{N} p(n \mid x)\, x_n$$

This rule corresponds to a fixed point iteration to find the expected value for the centre of a Gaussian kernel, and is computationally more efficient than a gradient based numerical optimization for this problem. The rule maps any point $x \in \mathbb{R}^{210}$ to a weighted mean of the points in the dataset, denoted as f(x). The difference f(x) - x is the mean shift vector and is clearly of zero magnitude at convergence.
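The fixed point iteration can be sketched in a few lines. This is an illustrative version for an isotropic Gaussian kernel with a single bandwidth; the function names, the tolerance and the iteration cap are assumptions made for the sketch.

```python
import math


def mean_shift_step(x, data, sigma):
    """One update x <- f(x): the kernel-weighted mean of all data points."""
    weights = [
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, xn)) / sigma ** 2)
        for xn in data
    ]
    total = sum(weights)
    dim = len(x)
    return [sum(w * xn[k] for w, xn in zip(weights, data)) / total for k in range(dim)]


def find_mode(x, data, sigma, tol=1e-8, max_iter=1000):
    """Iterate until the mean shift vector f(x) - x is numerically zero."""
    for _ in range(max_iter):
        new_x = mean_shift_step(x, data, sigma)
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x
```

Starting `find_mode` from every data point and grouping the points by the mode they reach gives the clustering. Starting from a data point also keeps the weight sum away from zero, which can otherwise underflow for very small bandwidths in high dimensions.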

The mean shift algorithm is non-parametric and the resolution of the clustering is determined by the kernel bandwidth σ. The initial step is to find the range of the bandwidth. Following that, by choosing different bandwidths, different numbers of mutants were selected. A multiple regression was performed to evaluate the selected results.
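The effect of the bandwidth on the number of selected cluster centers can be illustrated on a toy one-dimensional data set. The data values, the bandwidths and the merge tolerance below are arbitrary choices for illustration and are unrelated to the resistance data.

```python
import math


def mode_of(x, data, sigma, tol=1e-9, max_iter=10000):
    """Run 1-D Gaussian mean shift from x until the update is below tol."""
    for _ in range(max_iter):
        w = [math.exp(-0.5 * ((x - xn) / sigma) ** 2) for xn in data]
        new_x = sum(wi * xn for wi, xn in zip(w, data)) / sum(w)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


def count_modes(data, sigma, merge_tol=1e-3):
    """Number of distinct convergence points when every sample is a start."""
    modes = []
    for x in data:
        m = mode_of(x, data, sigma)
        if all(abs(m - seen) > merge_tol for seen in modes):
            modes.append(m)
    return len(modes)


# Toy data: three groups of nearby values.
data = [0.0, 0.1, 0.2, 3.0, 3.1, 3.2, 10.0, 10.1]
counts = {sigma: count_modes(data, sigma) for sigma in (0.01, 0.5, 20.0)}
```

With a very small bandwidth every sample is its own mode, an intermediate bandwidth recovers the three groups, and a very large bandwidth merges everything into one mode, which mirrors the trend of fewer selected mutants at larger bandwidths.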

Quantile information analysis

All the drug resistant mutants were grouped and separated into 10 bins based on their drug resistance value. For example, the resistance values for ATV range from 0 to 700. Therefore, mutants with a resistance value between 0 and 70 were put into bin I, those with a resistance value above 70 and up to 140 were put into bin II, and so on.

After splitting all the data into ten bins, both the total number of mutants and the selected number of mutants were counted and recorded in each corresponding table. For each bin, the number of mutants before and after the selection was calculated and compared. The selected ratio, the number of selected mutants divided by the total number of mutants in the bin, was also calculated.
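A minimal sketch of the binning and ratio calculation follows. It assumes equal-width bins from 0 to the maximum resistance value, as in the ATV example, and places a value that falls exactly on a boundary in the higher bin, which is an arbitrary choice; the function names and the input layout are hypothetical.

```python
def bin_counts(resistance, selected, n_bins=10):
    """Equal-width bins over [0, max]; returns (selected, total) for each bin.

    resistance: dict mapping a mutant id to its resistance value
    selected: set of mutant ids chosen by the clustering step
    """
    width = max(resistance.values()) / n_bins
    totals = [0] * n_bins
    chosen = [0] * n_bins
    for mutant, value in resistance.items():
        index = min(int(value / width), n_bins - 1)  # the maximum goes in the last bin
        totals[index] += 1
        if mutant in selected:
            chosen[index] += 1
    return list(zip(chosen, totals))


def selected_ratio(chosen, total):
    """Selected ratio in percent; 0.0 for an empty bin."""
    return 100.0 * chosen / total if total else 0.0
```

For a drug whose values range from 0 to 700 this gives bins of width 70, with the first list entry corresponding to bin I and the last to bin X.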

k-fold validation

In order to fully use all the data, a k-fold cross-validation was performed in all the experiments for all the drugs. Specifically, we randomly choose (k-1)/k of all the sequences (some are drug resistant, while others are non-drug resistant) for training the classifier and the remaining 1/k data are used for testing. These tests used k = 5. Independent randomly selected k-folds were chosen throughout the study to avoid bias in the results. The apparent polymorphism in the original sequence data requires extra care when generating k-fold data sets for testing or training. When a sequence was removed from a k-fold in generating a testing or training dataset, all derived instances of that sequence were removed as well. This ensures that the individual k-fold datasets are truly independent from each other and thus ensures that the estimated accuracies are meaningful. The R2 values were averaged over the k-folds.
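Keeping all derived instances of a sequence on the same side of the split can be sketched as a grouped k-fold. The sketch assumes each expanded sequence carries the identifier of the isolate it was derived from; the function names, the input layout and the fixed seed are hypothetical.

```python
import random


def grouped_folds(isolate_of, k=5, seed=0):
    """Split sequence ids into k folds, keeping each isolate's sequences together.

    isolate_of: dict mapping a sequence id to the id of its source isolate
    """
    isolates = sorted(set(isolate_of.values()))
    random.Random(seed).shuffle(isolates)
    fold_of_isolate = {iso: i % k for i, iso in enumerate(isolates)}
    folds = [[] for _ in range(k)]
    for seq, iso in isolate_of.items():
        folds[fold_of_isolate[iso]].append(seq)
    return folds


def split_fold(folds, test_index):
    """Use one fold for testing and the remaining k-1 folds for training."""
    test = folds[test_index]
    train = [s for i, fold in enumerate(folds) if i != test_index for s in fold]
    return train, test
```

Because whole isolates are assigned to folds, no sequence derived from a test isolate can appear in the training data, which is the independence property described above.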