Aligning to the sample-specific reference sequence to optimize the accuracy of next-generation sequencing analysis for hepatitis B virus

Background Hepatitis B virus (HBV) quasispecies are crucial in the pathogenesis of chronic liver disease. Next-generation sequencing (NGS) is powerful for identifying viral quasispecies. To improve mapping quality and single nucleotide variant (SNV) calling accuracy in the NGS analysis of HBV, we compared different mapping references, including the sample-specific reference sequence, same genotype sequences and different genotype sequences, according to the sample. Methods Real Illumina HBV datasets from 86 patients, and simulated datasets from 158 HBV strains in the GenBank database, were used to assess mapping quality. SNV calling accuracy was evaluated using different mapping references to align Real Illumina datasets from a single HBV clone. Results Using the sample-specific reference sequence as a mapping reference produced the largest number of mappable reads and coverages. With a different genotype mapping reference, the consensus sequence derived from the Real Illumina datasets of the single HBV clone showed 21 false SNV callings in polymerase and surface genes, the regions most divergent between the mapping reference and this HBV clone. A ~6 % coverage of most of these false SNVs was yielded even with a same genotype mapping reference, but none with the sample-specific reference sequence. Conclusions Using sample-specific reference sequences as a mapping reference in NGS analysis optimized mapping quality and the SNV calling accuracy for HBV quasispecies. Electronic supplementary material The online version of this article (doi:10.1007/s12072-015-9645-x) contains supplementary material, which is available to authorized users.


Introduction
Next generation sequencing (NGS), also known as ultrahigh throughput sequencing, is a powerful tool for discovering diseases with novel mutations and for detecting traces of pathogenic microorganisms [1,2]. It has also been used for sequencing human and microbial genomes and for identifying species.
Hepatitis B virus (HBV) infection causes a multitude of clinical conditions ranging from acute hepatitis to cirrhosis and hepatocellular carcinoma [3][4][5]. HBV is classified into genotypes A-J with an inter-genotypic divergence of at least 8 % in the complete genome sequence [6]. HBV genotypes B and C are prevalent in Taiwan [7]. Many studies have suggested that HBV genotypes show not only geographical distribution and ethnic specificity but are also associated with disease progression and responses to interferon treatment [8,9]. HBV highly replicates its genome and daily releases more than 10 11 virions. Because HBV viral polymerase reverse transcriptase (RT) has no proofreading activity, HBV has higher mutation rates than other DNA viruses and complex quasispecies [10]. Viral quasispecies have been implicated in the development of drug resistance, the exacerbation of chronic hepatitis B (CHB), and the development of liver cancer [10].
NGS has been used to characterize single nucleotide variations (SNVs) and their dynamics in HBV polymerase RT genes in serum from patients who had undergone antiviral treatment [11][12][13]. HBV sequence-reads from NGS were aligned to one reference sequence from either a consensus genotype sequence in public viral databases [12,13] or a major viral sequence identified using polymerase chain reaction (PCR)-director sequencing [11]. However, the mapping reference for optimizing the results in NGS analysis has never been identified.
In this study, we used various types of mapping references: genotype sequences identical to and different from the NGS sample, and a sample-specific reference sequence derived from its NGS dataset to investigate the mapping quality and the accuracy of the SNV callings for the full HBV genome. We also illustrated the effect of divergence between the mapping reference and NGS datasets on mapping quality and the accuracy of the SNV callings.

Patients
Informed consent was obtained from each patient. Eightysix patients between the ages of 34 to 75 were enrolled. All patients were treatment-naïve and had serum HBV DNA C200,000 IU/mL and detectable hepatitis B surface antigen for at least 6 months. Patients with hepatitis C or D infection, human immunodeficiency virus infection, or autoimmune hepatitis were excluded. Serum samples of all the patients were stored at -80°C until used. Clinical characteristics of the 52 genotype-C patients and 34 genotype-B patients are shown in Table 1.
Extracting and amplifying full HBV genomic DNA HBV DNA from 200 lL of serum was extracted using the Viogene Blood and Tissue Extraction Mini DNA Extractor kit (Viogene BioTek, New Taipei City, Taiwan). Fragments of the HBV DNA full genome were amplified using PCR with nine primer sets (Supplementary Table 1) and High Fidelity DNA polymerase (Thermo Fisher Scientific, Pittsburgh, PA, USA). The PCR condition was 5 min at 94°C followed by 40 cycles of 1 min at 94°C, 1 min at 50°C, and 1.5 min at 72°C. All amplified PCR products were run in a 1 % agarose gel electrophoresis and were purified using the GEL/PCR Purification Mini Kit (Favorgen Biotech, Ping-Tung, Taiwan). Nine PCR fragments were mixed and were subjected to NGS.

Viral genome sequencing using massively parallel NGS
Massively-parallel NGS with multiplexed tags was carried out using a genome analyzer, as previously described [11]. In brief, HBV DNA was fragmented using sonication and was cut into suitable sizes. These fragments were purified and were then end-repaired and A-tailed using DNA Polymerase I Klenow Fragment (3 0 ? 5 0 exo-) (New England Biolabs, Ipswich, MA, USA). DNA fragments ligated with indexed adapters were amplified using 10-18 cycles of PCR reaction. The DNA library was quantified by Qubit fluorometer (Qubit dsDNA HS assay, Thermo Fisher Scientific, Life Technology) and real time PCR (KAPA Library Quantification Kit IlluminaÒ platforms, KAPA biosystems). Experion Automated Electrophoresis System (Bio-Rad Laboratories, Hercules, CA, USA) was used to validate the size of the library. After it had been validated, the library was sequenced (HiSeq TM 2500; Illumina, San Diego, CA, USA).

Genomic analysis of NGS data
For NGS high-throughput data, low-quality bases of raw reads were first trimmed using Seqtk (https://github.com/ lh3/seqtk), which uses a modified Mott trimming algorithm. All parameters were default settings except maximally trimming down to 1 bp to remove as many lowquality bases as possible but still keeping paired-end information. At the alignment stage, the trimmed reads were then mapped to the mapping reference genome using BWA (BWA-MEM) [14] with a -M parameter setting and 16 threads to get correct and consistent mapping statistics. Because almost all genome mappers, like BWA, were designed for linear genomes, they are not well suited for circular genomes like HBV genomes, especially when reads spanning the end of the genome have worse mapping performance. Therefore, we manually concatenated the end parts, 600 bases, ranging from the beginning of the 5 0 end to the 3 0 end to avoid overhanging reads. The mapping results were then processed using SAMTools [15] to remove reads that mapped to multiple positions and reads that had poor mapping quality scores.

Direct Sanger sequencing and NGS of HBV clones
Serum HBV DNA of two patients with CHB were extracted and amplified using PCR. The designed primers were modified [7] to amplify the full-length HBV genome and then were cloned into yT&A plasmid (Yeastern Biotech, Taipei, Taiwan). HBV full genomes of Clone_N6 (KJ790199) and Clone_H44 (KJ790200) were sequenced using a direct Sanger sequencer (Applied Biosystems, Life Technologies, Taipei, Taiwan). For NGS of the two HBV clones, the HBV full genomes in plasmids were amplified using primers with restriction site sapI [16], then self-ligated using T4 ligase to produce a circular form HBV genome, and amplified into nine fragments that were subjected to NGS as described in ''Viral genome sequencing using massively parallel NGS''.

Mapping references chosen at the alignment stage during NGS data analysis
To optimize the NGS data analysis, five different mapping references were used at the alignment stage to compare the mapping results. Four full-length HBV genome sequences, FJ787477 (genotype B, Asia), JN315779 (genotype C, Asia), KJ790200 (Clone_H44; genotype B, Taiwan), and KJ790199 (Clone_N6; genotype C, Taiwan) in the GenBank database were used. Furthermore, a sample-specific reference sequence, the consensus sequence obtained from the NGS reads of each sample through alignment with its same genotype mapping reference (FJ787477 or JN315779), was also used as a mapping reference sequence for the NGS datasets.

HBV genotyping
The HBV genotype was determined using a melting curve analysis with LightCycler hybridization probes as previously described [17]. The derived consensus sequence of NGS reads were aligned to the standard full-length HBV genomes (genotype A-H) from the GenBank database to confirm the genotyping results. HBV genotype was identified using phylogenetic analysis software (Mega 6.0) [18].
Evaluating the mapping quality of simulated datasets from 158 HBV strains in the GenBank database with the alignment to different genotype sequences To evaluate the quality of mapping results with the alignment to different genotype references, 158 HBV complete genomes (34 genotype A, 33 genotype B, 39 genotype C, and 52 genotype D strains) with 4 common genotypes from the GenBank database were collected. The simulated NGS reads were produced from each strain using sequence alignment/mapping software (SAMtools wgsim) [15]. To analyze these 158 simulated NGS datasets, the full genome of each HBV strain was also used as a mapping reference for alignment, respectively. The mapping results-mappable reads, properly paired reads, broken paired reads, and singleton reads-were evaluated.

Calculating nucleotide divergence
Nucleotide divergences were calculated using DNA sequence polymorphism software (DNasp 5.10.1) [19]. One hundred sites for window length were set when we calculated the divergences.

Statistical analysis
Continuous variables were compared using Student's t test for two independent groups. The changes in mapping quality between different mapping references were compared using paired t tests. The frequencies and distributions of categorical variables were compared using the Chi square tests or Fisher's exact tests.

Results
The mapping quality of the NGS dataset was optimized when using a sample-specific reference sequence as the mapping reference We compared the mapping quality of real Illumina datasets of viral genomes from 52 patients with genotype C HBV and 34 patients with genotype B HBV using different mapping references. The sample-specific reference sequence had the best quality, followed by the Taiwanese strain with the same genotype, the Asian strain with the same genotype, the Taiwanese strain with a different genotype, and the Asian strain with a different genotype ( Table 2; Supplementary Table 2). In patients with genotype C HBV, the sample-specific reference sequence had the best mapping quality: the largest number of mappable reads, properly paired reads, and broken paired reads, and the lowest number of singletons. In addition, using a sample-specific sequence as a reference yielded the largest average coverages (38,362 ± 75,502), minimum coverages, and maximum coverages per nucleotide, and the smallest number of nucleotides with fewer than 30 coverages. Moreover, when using a sample-specific sequence, 97 % of the nucleotides in the full genome had more than 1000 coverages. Using a Taiwanese genotype C mapping reference yielded larger mappable reads (p \ 0.0001) and average coverage per nucleotide (p \ 0.0001) than did using the Asian genotype C mapping reference. A similar pattern of mapping quality was detected in NGS datasets from 34 genotype B patients aligned to different HBV mapping references (Supplementary Table 2).
Simulated NGS datasets for which a sample-specific sequence was used as a reference yielded improved mapping quality Simulated NGS reads were produced from 158 HBV full genomes (genotypes A-D HBV) obtained from the Gen-Bank database using the Wgsim read simulator [15]. To analyze these simulated NGS datasets, each full genome was also used as mapping reference for alignment. The mapping qualities categorized by subgenotypes were shown in Supplementary Table 3. The improved mapping qualities of simulated NGS datasets aligned to the same genotype sequence, especially the same subgenotype sequence in genotype B, showed a higher percentage of mappable reads and properly paired reads and a lower percentage of singletons than did datasets aligned to different genotypes.
Datasets aligned to different genotypes yielded false SNVs in the consensus sequence derived from NGS reads of a single HBV clone The mapping results of NGS reads from a Taiwanese genotype C HBV Clone_N6 using different mapping references: sample-specific reference sequence JN315779 (Genotype C, Asia) and reference sequence FJ787477 (Genotype B, Asia) were assessed. When compared with the sequence of Clone_N6 derived from direct sequencing, the consensus sequence contained 21 false SNVs when using a different genotype mapping reference (Fig. 1). Consensus sequences derived from the same dataset were identical to the sequence of Clone_N6 using either the same genotype reference or a sample-specific reference sequence as mapping references.

Regions of high divergence between single HBV clone and mapping references involved the false SNVs of NGS analysis
The percentage of mappable reads from the Clone_N6 NGS dataset was 98.7 using a sample-specific reference sequence, 97.0 % using JN315779, and 96.8 % using FJ787477. The coverage and percentage of false SNVs in the consensus sequences derived from Clone_N6 (genotype C) NGS reads when aligned to different HBV mapping references are shown in Table 3. When the NGS reads were aligned to a different genotype strain (FJ787477), the derived consensus sequence showed 21 false SNVs with 2-3 log reductions of coverage and an enormous change in nucleotide percentages. Interestingly, these false SNVs were located exactly at the three highest divergence regions, viz., nt929-942 (P gene/RT domain), nt2733-2741 (P gene/terminal protein domain), and nt2980-3015 (P gene/spacer domain; pre-S1 region), with a divergence of [18 % between Clone_N6 (genotype C) and the mapping reference FJ787477 (Genotype B, Asia) (Fig. 2a).
Using a mapping reference with the same genotype (JN315779) for Clone_N6 NGS reads, even the derived consensus sequence was identical to the Clone_N6 sequence, and decades of SNVs were still detected with the coverage percentage up to 6 % ( Table 3). The sequence divergences of each sliding window between Clone_N6 and mapping references with JN315779 were all below 10 % in the HBV full genome (Fig. 2a). Using a mapping reference with the sample-specific reference sequence showed the best results of NGS analysis, with a percentage of false SNVs \0.3. This false SNV rate was considered mismatch error values and could be ignored in NGS analysis [11]. Furthermore, we found 14 false SNVs in the consensus sequence derived from Clone_H44 (genotype B) NGS reads when mapping to reference JN315779 (Genotype C, Asia). All the false SNVs located in nt940-942 (P gene/RT domain), nt2759-2775 (P gene/terminal protein domain), and nt2783-2790 (P gene/terminal protein domain), showed high divergence ([15 %) between Clo-ne_H44 and JN315779 ( Supplementary Fig. 1A).

High prevalence of SNVs in 52 patients associated with a high divergence between their derived HBV consensus sequences and mapping references
We obtained all the sequence divergences of each sliding window by comparing each HBV-derived consensus sequence from 52 patients infected with genotype C HBV with mapping references JN315779 (Genotype C, Asia) and FJ787477 (Genotype B, Asia). Mean sequence divergences of genotype C consensus strains relative to two different genotype mapping references were shown as two demarcating curves (Fig. 2b). The profiles were similar to those in Fig. 2a. When using FJ787477 as a mapping reference, all the mean sequence divergences of each window except those within nt2137-2250 of the core gene were  significantly higher than those referenced to JN315779 (p \ 0.0001). Interestingly, the derived consensus sequence at 94 nucleotide positions of the HBV full genome in at least one of 52 patients with genotype C HBV contained inconsistent variants when using different genotype mapping references (Supplementary Table 4). Seventeen inconsistent variants occurred in more than 20 % and ten inconsistent variants in more than 40 % of these patients (Table 4). These inconsistent SNVs were located at high divergence regions (over 13 %): nt939-942, nt1353-1362, and nt2980-3015, and they were probably false SNVs (Fig. 2b). In addition, mean sequence divergences between 34 genotype B-derived consensus sequences from CHB patients and each of two different genotype mapping references are shown as two demarcating curves in Supplementary Fig. 2A, in which the profiles are similar to those in Fig. 2b. When comparing different genotype mapping references, 111 nucleotides in the HBV full genome of derived consensus sequences expressed inconsistent variants in at least one patient (Supplementary Table 5). When using JN315779 as a mapping reference, all the mean sequence divergences except those within nt368-481, nt685-790, nt1207-1313, nt1802-2000, and nt2133-2584 were significantly higher than those referenced to FJ787477 (p \ 0.0001) (Supplementary Fig. 1B).
The mean divergence of any consecutive 100-nucleotide segments in genotypes A-D We collected 158 HBV full genomes (genotypes A-D) from the GenBank database. Within the same genotype, the mean divergence of any consecutive 100-nucleotide segment was B8.0 % (Fig. 3). Between any two different Fig. 2 a Sequence divergence between Clone_N6 (genotype C) and mapping references, FJ787477 (genotype B, Asia) and JN315779 (genotype C, Asia), respectively. Arrows indicate three regions with the highest divergence over 18 % at nt929-942, nt2733-2741, and nt2980-3015. b Mean sequence divergence between derived consensus sequences from NGS reads of 52 patients with genotype C chronic hepatitis B and mapping references, FJ787477 (genotype B, Asia) and JN315779 (genotype C, Asia). A schematic diagram of the hepatitis B virus complete genome and four genes are shown in the bottom panel and the positions correspond to the x-axis of (a, b) genotypes, there were high divergences in most regions of the HBV full genome, except the core gene between genotypes B and C.

Discussion
Various NGS platforms, including the Illumina HiSeq, Illumina Genome Analyzer, Illumina Miseq, Roche 454, and SOLiD4, differ in engineering configurations and sequencing chemistry. In the present study, we used the Illumina Hiseq 2500 system to analyze HBV full genome in viral quasispecies; it has the advantages of short run times, long read lengths, and high data quality. With an average read length of 110 bp and an average coverage C25,000 after quality trimming, the depth and quality of the sequencing results are admissible for analyzing viral quasispecies.
At the alignment stage, mapping short reads against a reference genome is typically the first step in analyzing such NGS data, and it should be as accurate as possible. The mapping reference is crucial for mapping quality and for the quality of the SNV calls. Previous HBV-related NGS analyses used the consensus genotype sequence [12,13] or a sequence from direct sequencing of PCR products [11] as mapping references. This is the first study to compare the mapping quality and the accuracy of the SNV calls for NGS analysis of the HBV full genome using different mapping references. When a sample-specific reference sequence was used, the mappable reads were 89.4 % of total reads with highest properly paired reads and broken paired reads and lowest singleton reads, which indicated that mapping quality was substantially improved. Coverage is important for the quality of SNV calls. High coverage regions or bases tend to have a higher call quality. Using a sample-specific reference sequence as mapping reference increased the average coverage for each nucleotide. When simulated NGS datasets were aligned to the same genotype sequence, especially the same subgenotype sequence in genotype B or from the same country, mapping qualities were also improved.
HBV quasispecies are importantly implicated in the pathogenesis of chronic liver diseases. For example, several mutations of the HBV genome are crucial for developing HCC and cirrhosis [20][21][22][23]. Therefore, accurate SNV calls are important for analyzing the HBV full genome in NGS. Along with the verification using cloned sequences, we identified numerous false SNVs in a major strain when mapping with different viral genotypes to the preceding sample (Table 3). These false SNV calls were contributed by fewer mappable reads, a profound reduction in coverages, and the change of nucleotide composition at one site. In addition, we also confirmed that there were probably many false SNVs in derived consensus sequences from patients when using a different viral genotype as a mapping reference (Table 4). More than 20 % of the patients had  (Table 4; Supplementary Tables 4, 5), which suggested that a different genotype sequence was an inappropriate mapping reference for HBV NGS analysis. HBV mutants have been demonstrated in patients with acute fulminant or chronic infection [10]. Mutations in the RT region of viral P gene have been associated with the resistance to antiviral drugs [24,25]. Mutations on the PreS1 or PreS2 promoter were correlated with the occurrence of HCC [26,27]. While using different genotype mapping references, we found that false SNV calls of major strains were located in high-divergence regions, such as the P-gene/RT domain, P-gene/terminal protein domain, and preS1 region in the HBV genome for genotypes A-D (Fig. 2). We suggest using a sample-specific reference sequence as a mapping reference for NGS analysis in studies of P-gene variants for drug-resistance [24,25] and S-gene variants for the pathogenesis of HCC [26,27]. Emerging evidence supports the notion that certain drugresistant HBV minor strains are crucial for the progression of liver diseases and are predictors of subsequent treatment failure [23,28]. They may accumulate and eventually dominate under a long-term selection effect during antiviral treatment [23]. Fig. 3 Comparison of mean sequence divergences between different hepatitis B virus (HBV) genotype populations. A total of 158 HBV strains were collected from the GenBank database (34 genotype A, 33 genotype B, 39 genotype C, and 52 genotype D) and analyzed. a Mean sequence divergences within the same genotype were expressed. b, c Mean sequence divergences between different genotypes were expressed. A schematic diagram of the hepatitis B virus complete genome and four genes are shown in the bottom panel and the positions correspond to the x-axis of (a-c). Geno. A genotype A, Geno. B genotype B, Geno. C genotype C, Geno. D genotype D By taking advantage of an abundance of genetic information when using NGS, more accurate and detailed SNVs of HBV quasispecies can be obtained than with previous methods, such as INNO-LiPA and clonal HBV sequencing. However, even when we used the same genotype sequence as a mapping reference, decades of false SNVs still existed with a coverage up to 6 % (Table 3). Using a mapping reference with a sample-specific reference sequence yielded the best NGS analysis results with a coverage percentage of false SNVs\0.3 %, which was considered a mismatch error value and could be ignored for NGS analysis in the Illumina platform [11]. Therefore, using a sample-specific reference sequence to align the NGS dataset is crucial for accurate SNV calls of viral quasispecies.
In conclusion, we conclude that a sample-specific reference sequence, which provides the highest mapping quality and SNV call accuracy, should be used as the mapping reference in the NGS analysis of emerging HBV variants, especially for the studies of the P gene and S gene.