Background

In treating infection with hepatitis C virus, knowledge of a patient's viral genotype informs the choice of appropriate therapy [13]. Although the HCV subtype afflicting a patient is not currently used to make clinical treatment decisions, knowing the viral subtype is important for studies of its origin, transmission, and evolution [14]. For example, new emerging variants can be characterized better when they can be assigned an unequivocal subtype classification [5]. Molecular epidemiology analyses rely on information about sequence variation at the subtype level [4, 5]. Vaccine-design strategies are informed by the diversity of HCV variants and the antigenic determinants (epitopes) therein [6, 7]. The risk of hepatocellular carcinoma, a frequent complication for HCV infection, might be assessed better in light of HCV subtype [8]. Thus, effective methods for both genotype and subtype classification are important tools to manage HCV infections.

Techniques to infer phylogenies combine an optimality criterion with an algorithm to search for the best tree. Optimality criteria quantify how well the tree describes the data, and are either distance-based or character-based [9, 10]. An algorithm can quickly construct a single tree that minimizes all the pairwise distances among taxa. However, this approach is less able to use information from different taxa to model variation in evolutionary rates across sites than the optimality criterion of maximum likelihood ([9], p. 175). Search algorithms are deployed by character-based methods to find trees that best explain the data, given an evolutionary model with known assumptions. The search algorithms of character-based methods take more time to evaluate alternative candidate trees than rapid distance-based methods. Perhaps for this reason, many more distance-based than character-based phylogenies of HCV genotypes have been published. However, maximum-likelihood phylogenetic inference is known to outperform distance-based methods when such complications as substitution rate heterogeneity or covariation between sites are present [9, 10]. Formal comparisons between topologies are thus more appropriate for maximum-likelihood phylogenies than for the approximations that result from distance-based methods.

This study evaluates phylogenies derived from coding (NS5B) and non-coding (5' UTR) regions of whole-genome HCV sequences for consistent classification of viral subtypes into distinct genetic groups, or clades, with the aim of evaluating their suitability for genotype and subtype classification. Concordance with the whole-genome phylogeny is desired. Nucleotide characters in NS5B are over five times more abundant than in the 5' UTR, though only a small portion of this region is amplified for subtyping. To compensate for this, we also considered a smaller, oft-studied portion of NS5B that we call the "Okamoto region" (from nt 8282 to 8610 in the H77 reference genome) for its ability to represent the phylogeny of NS5B and the entire HCV genome. We tested the hypothesis that phylogenetic trees obtained from different genomic regions of HCV differ significantly. We also compared tree topologies for their ability to group genotypes and subtypes consistently into clades.

Results

Phylogenetic inferences

Among the 38 whole-genome HCV sequences representing 18 confirmed subtypes as summarized in Table 1, the most general substitution model, the general time reversible model (GTR, also known as REV) with a discrete gamma approximation for rate heterogeneity, was consistently supported as superior among the twelve nucleotide substitution models evaluated (not shown). Models adjusted for rate heterogeneity consistently fit the data better than models that assume a fixed evolutionary rate across sites (not shown). Substitution models with fewer parameters or an assumption of equal base compositions performed significantly worse than GTR, regardless of whether or not the sequences analyzed contained protein-coding regions. Adding a parameter for the estimated proportion of invariant sites significantly improved the substitution model, yielding parameters as shown in Table 2. The same model was selected when the AIC was adjusted to compensate for a low ratio of sample data to parameters (not shown). Thus, GTR with a gamma distribution of evolutionary rates per site and accommodation of invariant sites (GTR+Γ+I) is the best substitution model for HCV variation among those considered, and was used for maximum-likelihood phylogeny inference.

Table 1 Confirmed subtypes and accession numbers of HCV genomes studied.
Table 2 Substitution model (GTR+Γ+I) parameters and alignment properties.

The 5' UTR is represented by the smallest number of aligned nucleotide sites (300 nt; the 5' most 42 nt were excluded from analysis because of extensive gaps throughout the available sequence data), followed by the Okamoto region of NS5B (329 nt), then the polyprotein (9177 nt), and the whole genome (9791 nt, Table 2). The proportion of invariant nucleotide sites for the 5' UTR is 2/3, much lower than for the protein-coding regions, for which less than 1/3 of sites do not vary (Table 2). The 5' UTR is known to be less variable than protein-coding regions of HCV [3, 6, 11, 12].

Tree topologies from the entire HCV genome and the polyprotein are identical (Figs. 1a, b and 2a, b). The tree from the Okamoto region of NS5B resembles trees from the whole genome and the polyprotein, except for rearrangements in the ordering of deeply rooted branches (Figs. 1d and 2d). Trees from sequences that include protein-coding regions clearly group subtypes from the same genotype into clades, while the tree from the non-coding terminus conflates subtypes of genotypes 1 and 6 with subtypes 4a and 5a, and subtypes of genotypes 1 and 6 cannot be distinguished (Figs. 1c and 2c). Thus, the phylogenetic trees of the 5' UTR are less able to group subtypes from the same genotype together into clades than trees from protein-coding sequences (Figs. 1 and 2), regardless of the method used for phylogenetic inference. Parsimony analysis yields comparable results, with similar trees for the whole genome, polyprotein, and the Okamoto region of NS5B, while the tree from the 5' UTR contains a basal polytomy that does not resolve genotypes 1,4, 5, or 6 (not shown).

Figure 1
figure 1

Neighbor-joining phylogenies. Unrooted neighbor-joining phylogenetic trees from (a) complete HCV genome, (b) polyprotein, (c) 5' UTR, and (d) the Okamoto region of NS5B. Due to our focus on the consistency of subtype classification and the relative branching topology among subtypes, each tree is scaled independently.

Figure 2
figure 2

Maximum-likelihood phylogenies. Unrooted maximum likelihood phylogenetic trees from (a) complete HCV genome, (b) polyprotein, (c) 5' UTR, and (d) the Okamoto region of NS5B. Taxon labels indicate HCV genotype and subtype from Table 1. Due to our focus on the consistency of subtype classification and the relative branching topology among subtypes, each tree is scaled independently.

Hypothesis tests

Log-likelihood scores and SH-test results for alternative trees are summarized in Table 3. All tests yield the same outcomes, regardless of whether or not RELL optimization was used. Comparisons of alternative trees with the 5' UTR data fail to reject the null hypothesis of no difference in likelihoods (P > α; see Methods). Comparisons among alternative trees with data from the Okamoto region of NS5B indicate that the 5' UTR tree has a significantly different likelihood (P < 0.0001) than trees obtained from NS5B, polyprotein, or whole-genome data, which are statistically indistinguishable (P > α). Comparing parsimony trees from 300-nt windows in NS5B with trees from the 5' UTR via the incongruence length difference test [13], which uses the difference in tree lengths as a test statistic, rather than the likelihood difference, yielded the same pattern of significant differences (not shown).

Table 3 Shimodaira-Hasegawa test results from 10,000 bootstrap replicates.

Consistency and homoplasy indices

Increasing window sizes represent the CI as an increasingly smooth function, as more nucleotides better approximate the whole-genome phylogeny than fewer nucleotides. However, increasing window size yields poorer resolution in the 5' UTR (Fig. 3a) because fewer windows are able to represent this region. Contrary to expectations, the rescaled homoplasy index is not constant. Despite large fluctuations within the 5' UTR, the rescaled homoplasy index is generally greater in the 5' UTR than in other regions of the HCV genome and particularly NS5B (Fig. 3b). After correcting for the substitution rate in this manner, the consistency of sites with the whole-genome phylogeny is lower in the 5' UTR than in NS5B.

Figure 3
figure 3

Consistency and homoplasy indices. Moving-window averages of (a) character consistency with the whole-genome phylogeny for windows of 100 (red), 300 (blue), or 500 (black) nucleotides and (b) proportion of informative sites (red) and rescaled homoplasy index (black) for windows of 100 nucleotides as a function of the window midpoint in the whole-genome alignment. Regions corresponding to the 5' UTR (left) and NS5B (right) are indicated with grey bands, with a white band in the middle of NS5B to indicate the 329 nt Okamoto region.

Discussion

An earlier investigation of phylogenetic relations among 27 complete HCV genomes used maximum likelihood and careful determination of the appropriate nucleotide substitution model, and reported a star-like phylogeny among the six known HCV genotypes [12]. The best substitution model was also found to be the most general. In the earlier study, the 5' UTR was found to have lower phylogenetic signal, lower evolutionary rate, and greater phylogenetic noise than alternative regions of the HCV genome, including NS5B [12]. Our observations concur with those previously reported. Methodological refinements in our approach include the use of information-based model selection criteria to determine the best nucleotide substitution model, more complete HCV genomes, the revised nomenclature for subtypes [5], and formal comparisons between alternative topologies for the purpose of subtype determination.

The tree from the Okamoto region of NS5B is a significantly better fit to the HCV whole-genome and polyprotein data than the 5' UTR tree, regardless of the optimality criterion used for phylogenetic inference. Trees obtained from the 5' UTR perform worse at classifying HCV subtypes into clades of the same genotype than do trees from the whole genome, polyprotein, or the Okamoto region of NS5B. Discordant topologies of maximum-likelihood phylogenetic trees obtained from the 5' UTR and NS5B have been described for a subset of HCV genotypes [14, 15]. The inconsistent ordering of deeply rooted branches among trees from protein-coding regions indicates a basal polytomy whose resolution is contingent on the data available, which accords with the star-like phylogeny of all six known HCV genotypes previously reported elsewhere [3, 5, 12, 16].

The same evolutionary model (GTR with a discrete-gamma distribution of rate variation) used here has been utilized previously for likelihood phylogenies of the hepatitis B virus [17] and, with accommodation of invariant sites, for both HIV [18] and HCV [12]. Instantaneous substitution rates (normalized to the G-U rate) are greater among sites in the non-coding 5' UTR than in the regions that encode proteins, despite the fact that overall sequence conservation is greater in the UTR (Table 2). In particular, the instantaneous substitution rate between cytidine and uridine is much greater for the 5' UTR than for protein-coding regions. The accelerated C-U (or C-T for DNA sequences) substitution rate has previously been reported and discussed for protein-coding regions [19], though the rate is even greater for the non-coding terminus than for regions having codon usage constraints. Spontaneous deamination of cytosine to uracil may inflate the C-U substitution rate.

Conservation of single-stranded RNA secondary structure in both coding and non-coding regions of HCV has already been reported [15, 2023]. The high C-U rate bias may additionally be explained by the formation of non-canonical base pairs between guanosine and uridine in single-stranded RNA molecules, which is consistent with selection to conserve secondary structure, because a mutation from cytosine to uridine is less disruptive to secondary structure formation than other point mutations [24]. The may also be explained by the fact that all rates are rescaled such that the G-U rate is unity. A low G-U substitution rate thus inflates other rates. A mutation between G and U is disruptive to RNA secondary structure, because it eliminates the possibility of bases pairing without a compensatory mutation elsewhere. Overall, the elevated C-U substitution rate seen for the 5' UTR probably results from several interacting factors.

Though the same evolutionary model applies to the non-coding 5' UTR and the Okamoto region of NS5B, the two regions are subjected to different constraints. While coding sequences have codon-usage constraints and selective pressure for amino-acid mutations to escape detection by the host immune system, the UTR must preserve long-range interactions with complementary nucleotides at the other terminus of the viral genome if cyclization of the genome is essential to viral replication [6, 20]. Because of these differences in selective regimes, it should not be surprising that phylogenies of the two differ.

HCV diagnostic technologies include serologic (antibody based) and genetic (sequence based) techniques to detect infected samples [4, 6, 25]. Population screens are the most commonly deployed genetic HCV tests, which benefit from low false-positive rates because they utilize the conserved 5' UTR as targets for PCR amplification. However, it is clear both from the results of this study and from previous investigations that the 5' UTR does not contain sufficient information to resolve subtypes [2631]. Phylogenetic signal in protein-coding regions, such as NS5B, provides a useful alternative [12, 32], but few commercial assays exploit this information at present. The "gold standard" for subtype determination is direct sequencing, which has a lower cost for reagents but requires more time than commercial assay kits [4, 25].

There exist further complications to subtype classification, including coinfection [30, 33, 34], recombination [35, 36], within-host evolution [37, 38], and compartmentalization of genotypes into different cell types [39]. Diagnostic assays that are informed by the 5' UTR will be less able to accommodate these difficulties than methods that are able to resolve subtypes.

Conclusion

Ultimately, HCV infection outcome results from an interaction between the virus and its host. The current standard of care is limited in efficacy, and treatment outcome is contingent on viral genotype [13, 6, 25, 34]. To improve HCV therapies, perform effective public-health surveillance for new variants and modes of transmission, and further vaccine development efforts, detailed information about the interacting genotypes is needed. Diagnostic methods that assign viral subtype classifications are thus greatly desired. Such methods perform better when they are not informed by sequence variation from the non-coding 5' UTR, and should instead favor protein-coding regions, such as the Okamoto region of NS5B.

Methods

Phylogenetic inference

We used multiple methods for phylogenetic inference, including neighbor joining (NJ), maximum parsimony (MP), and maximum likelihood (ML) [9, 10]. This was done to evaluate whether the inferential technique has an influence on the ability of the resulting phylogenies to resolve subtypes into clades. We used PAUP*, version 4.0b10 [40] for phylogenetic inference. Neighbor-joining trees were constructed with the F84 distance metric [41] and the BioNJ algorithm [42]. For parsimony analyses, uninformative invariant characters were excluded and gaps were treated as a fifth character state.

To select an appropriate nucleotide substitution model, we used FindModel, an independ-ent, online implementation of ModelTest [43]. This approach uses an information-based goodness-of-fit criterion, in the sense that the best model minimizes the quantity of bits required to encode both the model and the model-encoded data for electronic transmission [4446]. Such an approach includes a penalty term for the number of parameters, and thus facilitates comparing models with varied numbers of parameters [44]. The fit of each model to the data was evaluated both with and without a four-category discrete approximation to a gamma distribution of substitution rates per site. Because FindModel does not test models with invariant sites, we also used ModelTest (version 3.6) to evaluate nucleotide substitution models with invariant sites [43]. Akaike's information criterion (AIC) was used to quantify the suitability of alternative models having varied numbers of parameters to fit the data [47].

Hypothesis tests

To evaluate the significance of differences in ML phylogenies obtained from different regions of the HCV genome, we used the Shimodaira-Hasegawa (SH) test [48] as implemented in PAUP*, version 4.0b10 [40]. The null hypothesis of the SH test is that none of the trees evaluated has a likelihood that differs significantly from any other. Rejecting the null hypothesis indicates a significant difference in likelihood scores, and thus in tree topologies [49].

For a pair of trees defined a priori, the SH test computes the difference in their likelihoods (Δ). This difference is compared with the null distribution of likelihood scores, obtained by building trees from character data generated by iterative bootstrap resampling with replacement of the nucleotide sites. A computationally efficient optimization (RELL) may be applied, which simply adds together per-site likelihoods over the resampled sites. Otherwise, the tree parameters are optimized on the resampled data (FULL). The resampled likelihood differences are denoted Δ i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuqHuoargaqbamaaBaaaleaacqWGPbqAaeqaaaaa@2FA5@ , where i indexes the replicate, and they are subsequently transformed by subtracting the mean resampled difference <Δ'>, a procedure called centering. The original difference in likelihoods is compared with the null distribution in a one-tailed, non-parametric manner, whereby the rank of Δ is evaluated against the centered, sorted Δ' distribution. If the rank of Δ is found to lie outside the interval of the null distribution between 0 and the (1-α) × 100 percentile, the difference in likelihoods is significant with (1-α) × 100% confidence, and the null hypothesis is rejected in favor of the alternative. (The acceptable type I, or false positive, error rate per test is denoted α.)

Here the tree topologies are ML phylogenies that represent different regions of the HCV genome. The reference alignment of 38 HCV whole-genome sequences representing 18 confirmed subtypes (Table 1) was obtained from the LANL HCV database [50]. We conducted SH tests with data from the 5' UTR, the Okamoto region of NS5B, and whole genome. Topologies were paired such that the ML tree T x MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGubavdaqhaaWcbaGaemiEaGhabaGaey4fIOcaaaaa@3072@ inferred from the data of region x (either the 5' UTR or Okamoto region) was compared with the ML tree T y MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGubavdaqhaaWcbaGaemyEaKhabaGaey4fIOcaaaaa@3074@ from data of region y representing each other region (either 5' UTR, Okamoto region, polypeptide, or whole genome, provided yx), yielding the likelihood difference Δ ≡ L x ( T x MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGubavdaqhaaWcbaGaemiEaGhabaGaey4fIOcaaaaa@3072@ ) - L x ( T y MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGubavdaqhaaWcbaGaemyEaKhabaGaey4fIOcaaaaa@3074@ ), where L x ( T y MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGubavdaqhaaWcbaGaemyEaKhabaGaey4fIOcaaaaa@3074@ ) is the likelihood of the ML tree from region y evaluated with data from region x. We randomly resampled 10,000 replicate data sets for each pair of trees and compared the original difference in likelihoods with the null distribution that resulted. The type I error rate was reduced to accommodate six hypothesis tests (α = 0.05/6 = 0.00833). This reduction preserves the experiment-wide false-positive rate by making each comparison more stringent.

Consistency and homoplasy indices

To understand better phylogenetic inconsistencies over the HCV genome, we computed the character consistency index (CI) for each site in PAUP with the whole-genome phylogeny, and summarized CI with a moving-window (running) average over 100, 300, and 500 nt. The 100 nt window size was used subsequently because it allows for clear visualization of the 342 nucleotides that constitute the 5' UTR. Because the consistency and homoplasy indices (HI) are complementary (CI+HI = 1), character consistency is high when homoplasy is low, and vice versa. Thus, we expect lower homoplasy to result from fewer informative sites. Further, homoplasy decreases rapidly with decreasing substitution rates. To control for variation in the number of informative sites across the genome, we rescaled the homoplasy index against the square of the proportion of informative sites in the window region. This was done because, in the limit of short branch lengths, the number of informative sites should be proportional to the substitution rate r, while the number of homoplasies should be proportional to r2. The result was subsequently normalized against the maximum, to facilitate comparison with the proportion of informative sites. As a result, if all parts of the HCV genome are equally informative, one can expect the rescaled homoplasy index to be roughly constant over the viral genome.