Background

Presence of the carcinogenic human papillomavirus (HPV) is necessary for the development of the invasive carcinoma of the genital tract [1]. Persistent infection with carcinogenic HPV causes virtually all cervical cancers. According to the latest global estimates this cancer is the second most common in women after breast cancer [2]. Almost half a million new cases of cervical cancer occur each year among women world-wide causing 274.000 deaths, 85% of them happening in underdevelopment countries[3].

The HPV genome has three different regions: two coding (E – early and L – late expression) and a regulatory non-coding region. The early region codifies regulatory, transforming and replication proteins, among which E6 and E7 are known to act like oncoproteins in high risk virus types [4, 5]. The late region (L) contains two coding genes, L1 and L2, which encode viral capsid proteins.

Among more than 100 types of HPV known today, approximately 30 infect the genital tract. Within these, HPV16 and HPV18 are the two types with the highest oncogenic power. A prospective way of fighting cervical cancer is with an anti-HPV vaccine. Phase III vaccine trials are being developed by Merck, GlaxoSmithKline and the National Cancer Institute [5, 6]. Besides, research is ongoing on different aspects of HPV biology, such as the mechanisms of down-regulation through which HPV causes cell transformation, the evaluation of biomarkers for risk progression, the role of environmental co-factors and the determinants of immune response to the viral infection [1]. However, to have a better understanding of the HPV, it is key to gain an improved insight from an evolutionary perspective [7, 8]. The possibility for HPV recombination was first suggested by several facts: The biological viability of artificial HPV strains with chimerical proteins [9]. The appearance of some HPV16 variants that seemed mosaics between different established types [10]. The isolation of a novel HPV type (HPV77) with an unusual pattern of sequence similarity over the E6, E7 and L1 regions [11, 12]. The plurality of HPV types [13] and also by the frequent observed co-infection (M. Angulo, personal observation). The latter will be specially important in AIDS patients which are often co-infected with very diverse mixtures of human papillomavirus (HPV) [1417].

Because of HPV extreme diversity, the occurrence of recombination was initially discarded, in part because of the technical difficulties for aligning extremely diverse sequences, and in part because of the less accurate methods available for the researchers until the past decade. Nevertheless, recently reported phylogenetic incongruence at the putative high-risk ancestor node, showing that one or more presumed old recombination events should explain a non monophyletic evolution of oncogenic HPVs [18, 19], has provided new convincing support of recombination in α-PVs. In addition, a recent rigorous analysis, using several recombination estimation methods, has provided fresh evidence of ancient recombination in papillomavirus, especially for the L2 gene [20].

However, the methods used to assess the presence of recombination signals were either phylogenetic based or substitution based. No model-based method was used. Hence, the difficulty of aligning all currently sequenced PVs imposes an additional challenge. Here we addressed the existence of recombination in HPV using a very efficient composite likelihood method [21, 22]. The advantages of this kind of method, which is a model-based method, over the model-free ones, are well-known [23] including the fact that with model-free ones the true level of recombination that has occurred is greatly underestimated.

We have centered on human alpha PV sequences, which alignment is much more reliable. Our goal was to get recombination estimates of different genes (E6, E7, L1 and L2) in different groups. We defined three groups (GI, GII and GIII) attending to their phylogenetic relationships but also to their epidemiology and clinical outcome. The identification of the specific recombinant sequences or the recombination break points is a more complex problem and requires different algorithms and software, being out of the scope of the present paper.

Results

Evolutionary model and rate variation

Table 1 shows the best-fit models of nucleotide substitution selected for each data set. Different models were selected for the different genes and for the different groups. The simplest models were found within the HPV16 group and the most complex ones (GTR) for the gene L2 in GI and GII groups. L2 had also the most complex model within the HPV16 and the GIII groups (Table 1).

Table 1 Evolutionary models for the different data sets

In addition, rate variation among sites has been detected in several of the data sets, though only in few ones the shape value of the gamma distribution was below one indicating an important rate heterogeneity [24]. The significant rate variation (below one) was mainly detected for GII and GIII groups (Table 1).

Population recombination

Using a gene conversion model with a Jukes-Cantor nucleotide evolution model [25] and assuming no rate variation we found significant recombination in all genes and in all groups (Figure 1). The highest value of recombination was found associated with E6 and the highest number of groups with recombination was linked to L2. Recombination was detected in the genes L1, L2 and E6 only for the GI group (high risk). For the gene E7 recombination was detected only within HPV16 type within which, however, no other recombinant gene were detected.

Figure 1
figure 1

HPV population recombination estimates under a gene conversion model assuming a Jukes-Cantor evolution model with 2 alleles. For each gene and from left to right: Bars with upper right lines: HPV-16. Empty bars: GI. Light shaded bar: GII. Bars with horizontal lines: GIII. *: Recombination test was significant.

When using the best-fit models of nucleotide substitution and considering the estimated rate variation the obtained results were qualitatively similar but with a bit higher estimates and a new signal for L1 in the GII group (cf. Figures 1 and 2).

Figure 2
figure 2

HPV population recombination estimates under a gene conversion model with complex substitution models and rate variation among sites. For each gene and from left to right: Bars with upper right lines: HPV-16. Empty bars: GI. Light shaded bar: GII. Bars with horizontal lines: GIII. *: Recombination test was significant.

Given the estimated recombination values and the number of sequences at each group, the expected number of recombination events associated with each data set can be computed [26]. For example, for HPV-16 with a recombination value of 13 for the gene E7 (Figure 2) and n = 8 sequences, the expected total number of recombination events in the history of this sample is 34. The expected numbers of recombination events for the different data sets with detected significant recombination are shown in Table 2.

Table 2 Expected number of recombination events

Simulation experiment

A set of simulations was performed using the parameter values corresponding to the evolutionary model, base composition and rate variation among sites for the E6 gene in the group GI (Table 1) to obtain some control samples of DNA sequences. This gene and group were selected because this combination had the highest estimated significant value (Figures 1 and 2 and Table 2). We set the recombination value to zero so that the simulated set of sequences was obtained without recombination. We then estimated from these sequences the population recombination value using the composite likelihood. The average recombination value for the simulated set of sequences was 0.08 ± 0.03 and the percentage of false positive recombination tests was 1%. The expected number of recombination events for this average number is 0.26.

Discussion

The estimation of the best-fit model of nucleotide substitution is relevant in phylogenetics [27]. However, model-based approaches for estimating recombination do not rely on a specific phylogeny and in consequence, they are expected to be robust to model misspecification. This seems to be the case when estimating recombination using the composite likelihood [21, 22]. Nevertheless, we have provided the best-fit models of nucleotide substitution under the Akaike information criteria (AIC) and have used them to estimate the population recombination rate. As expected, the model complexity had not a major effect on the estimation.

The existence of the recombination signal in the DNA sequences of HPVs is important because the genes tested are commonly used to build HPV phylogenies [28] and it is known that recombination can mislead phylogenetic inferences [29]. Furthermore, the detected recombination is in agreement with recently reported phylogenetic incongruence at the putative high-risk ancestor node showing that one or more presumed old recombination events should explain a non monophyletic evolution of oncogenic HPVs [18, 19]. The confirmation of ancient papillomavirus recombination has also been recently thoroughly argued by a statistical and phylogenetic recombination detection study [20].

In contrast to that previous study, we have used a model-based method (the coalescent composite likelihood estimator) to infer recombination from HPV DNA sequences. Model-based methods are known to be preferred over substitution and phylogenetic ones [23, 3032]. Thus, the composite-likelihood estimator maximizes the chances of detecting recombination avoiding, however, the inference of recombination when it is absent (false positive detection).

Regarding the existence of model complexity and rate variation among sites in the HPV samples, it is known that the amount of divergence and rate variation in the data could affect recombination estimation in some cases [32]. However, rate variation i.e. variation in the rate of nucleotide substitution along the sites in the DNA sequence should have no effect on the recombination estimates obtained using the composite-likelihood estimator as has been previously shown [22]. However, the Pairwise program assumes a simple Jukes-Cantor model with two alleles [21]. Therefore, to account for possible effects of model complexity and rate variation we also estimated the recombination rate using the extended model possibilities of Kpairwise [22] and confirmed that model complexity and rate variation had no qualitative effect onto recombination estimates.

Moreover, we have designed an experiment to check the possibility of false positive detection due to recombination artifacts because of the model complexity and the high diversity underlying the data. As shown above, there was not significant recombination estimation under the parameters considered.

In addition, we were able to estimate the expected number of recombination events in those cases with significant recombination detection. As expected, the higher numbers were found for group GI, which incorporates a major number of branches from PV phylogeny [19]. Importantly, not all recombination events are detectable [32], thus those expectations just provide an upper-bound below which the real number of detectable events should relay [30].

Although we have detected significant recombination signal at all genes and groups in one combination or another, perhaps the most important result is that recombination was detected at intra-type level in HPV16. This may indicate that recombination is occurring at a relative high frequency in current carcinogenic HPV types and variants. HPV recombination should not be exceptional nowadays since the frequency of co-infection with more than one HPV type is not a rare event, and new recombinant types could be currently being generated. Provided that the oncogenicity of specific HPV intra-type variants appear to vary geographically and also with the ethnic origin of the population studied [33], more research is necessary to assess whether such a variation could relate to different recombinant forms. Moreover, the majority of the vaccines under investigation, both the therapeutic and the prophylactic ones, are based on the use of these genes or their products, to obtain the prevention and the treatment of the infections of the more prevalent high risk (types 16 and 18) and low risk types (6 and 11) [6/and references there in, 34/and references there in]. From the present work it is clear that a better knowledge of HPV evolutionary relationships will be important concerning the optimal number of types to include in vaccines as well as the possibility of cross-reactive immunity among HPV types [2, 5]. Also, to obtain consensus and ancestor HPV sequences, this could be used in vaccine design to minimize genetic differences between vaccine strains [35]. The existence of recombination should also be of interest to pharmacogenomic studies, i.e. to learn how genetic variation influence response to drugs.

Conclusion

We provide new support to the recent evidence of recombination in HPV. In addition, we perform an evolutionary characterization, estimating best-fit models of nucleotide substitution and rate variation among sites, of some important HPV DNA sequence sets. Using simulations, we have shown that the detected recombination signals should not be artifacts. Thus, we found that the gene with recombination in most of the groups is L2 but the highest values were detected in L1 and E6. Gene E7 was recombinant only within the HPV16 type. The topic deserves further study because recombination is an important evolutionary mechanism that could have high impact in both pharmacogenomics (i.e. the effect of genetic variation on response to drugs) and vaccine development.

Methods

Sequences, groups and models of nucleotide substitution

HPV sequences for the genes E6, E7, L1 and L2 were obtained from the public database at Los Alamos National Laboratory [36] (GenBank accession numbers are given in Table 3). We classified these sequences according to phylogenetic criteria [19] but also by epidemiological criteria and clinical outcome [37]. The groups set as follows: Group I (GI) included the 14 most common high-risk types (16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 73, 82; n = 14 sequences including just one variant of type 16). Group II (GII) included 6 low risk types (6, 11, 40, 42, 43, 44; n = 8 sequences including the 3 variants of type 6, see Table 3). Group III (GIII) included 3 low risk types plus 5 undetermined risk types which cluster together [19] (61, 72, 81, 62, 71, 83, 84, 89; n = 12 or 11 sequences including 5 or 4 variants of type 71, see Table 3. For this group, L1 has 13 sequences because of 4 variants from type 71 plus 2 additional variants from 72 and 81 types). Finally, we consider per se the group HPV-16, which included HPV16 variants (n = 8 sequences, Table 3). All selected sequences pertain to the genus Alpha [38]. Sequences were aligned with ClustalX [39] and then corrected by hand. The best-fit model of nucleotide substitution was selected under the Akaike information criteria (AIC) with Modeltest v3.6 [40], using maximum likelihood (ML) estimates from PAUP* [41].

Table 3 Accession numbers for HPV sequences including genes E6, E7, L2 and L1

Recombination estimation

To study HPV recombination we used the composite likelihood estimator [42] and its permutation test [21], which is one of the most powerful techniques to detect the recombination signal from DNA sequences [22]. This method is a model-based population genetic approach, which allows for both a linear recombination and a gene conversion model. The composite likelihood estimator is implemented by the program Pairwise from the package Ldhat, freely available at [43], and also by the extension Kpairwise[22] which allows for complex nucleotide models and rate variation among sites and is freely available at [44]. We considered using a gene conversion model as more adequate since the PV genome is circular. Recombination was estimated as the population recombination rate, which is 4Nr where N is the effective population size and r the recombination rate per gene.

Given the recombination values estimated and the number of sequences at each group the expected number of recombination events E(R) associated to each data set can be computed by using formulae 5 in Hudson and Kaplan [26],

E ( R ) = ( 4 N r ) i = 1 n 1 1 i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGfbqrcqGGOaakcqWGsbGucqGGPaqkcqGH9aqpcqGGOaakcqaI0aancqWGobGtcqWGYbGCcqGGPaqkdaaeWbqaamaalaaabaGaeGymaedabaGaemyAaKgaaaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOBa4MaeyOeI0IaeGymaedaniabggHiLdaaaa@420E@
(1)

, where n is the number of sequences in the sample.

Simulations

To check that HPV sequences do not generate recombination false positives under the composite likelihood estimator due to their particular combination of model complexity and rate variation, we simulated 100 DNA samples of 15 sequences each and longitude 500 bp using a coalescent model [45]. We used the software recoal1.7 from David Posada and available upon request from him [46]. Specifically, we set the simulations with the parameter values corresponding to the evolutionary model, base composition and rate variation for the gene and group which had the highest estimated significant value. We also set the recombination value to zero. We then used the obtained sequences to estimate the average recombination value and the percentage of false positive recombination tests.