An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar
How viruses evolve within hosts can dictate infection outcomes; however, reconstructing this process is challenging. We evaluate our multiplexed amplicon approach, PrimalSeq, to demonstrate how virus concentration, sequencing coverage, primer mismatches, and replicates influence the accuracy of measuring intrahost virus diversity. We develop an experimental protocol and computational tool, iVar, for using PrimalSeq to measure virus diversity using Illumina and compare the results to Oxford Nanopore sequencing. We demonstrate the utility of PrimalSeq by measuring Zika and West Nile virus diversity from varied sample types and show that the accumulation of genetic diversity is influenced by experimental and biological systems.
Keywords: Viral sequencing, Amplicon sequencing, Intrahost evolution, Zika, West Nile, SNP calling
RNA viruses, including HIV, influenza, West Nile, and Zika, pose significant threats to public health worldwide. Part of this burden stems from their ability to rapidly evolve within hosts. Generation of intrahost genetic diversity allows virus populations to evade host immune responses [2, 3, 4], alter the severity of disease, and adapt to changing environments [6, 7]. Studying virus populations, both within naturally infected hosts and during experimental evolution, can therefore lead to breakthroughs in our understanding of virus-host interactions and novel approaches for outbreak response [8, 9, 10, 11].
In many cases, however, accurately measuring intrahost RNA virus diversity using deep sequencing remains a significant challenge. Multiple factors, such as virus titer, sample preparation, sequencing errors, and computational inferences, can bias measures of genetic diversity [12, 13, 14, 15, 16]. Moreover, for many clinical samples, low ratios of viral to host RNA often necessitate enrichment of viral nucleic acid to recover sufficient templates for deep sequencing. This is especially true for Zika virus, where low viremias (< 1000 copies/μL of RNA) are often detected during natural and experimental infections [18, 19, 20, 21]. PCR amplification of virus nucleic acid is a common approach to overcome this challenge [4, 22, 23], although it can introduce biases by altering the composition of intrahost genetic variants [14, 24]. Therefore, to ensure accuracy, comprehensive validation of deep sequencing approaches should accompany diversity measures from biological samples.
We previously developed a multiplex primer design tool (“Primal Scheme”) coupled to a laboratory protocol (“PrimalSeq”) to sequence RNA viruses directly from clinical samples in a way that is cheap, accurate, and scalable under resource-limited conditions. Versions of PrimalSeq have been used to sequence the majority of Zika virus genomes from the epidemic in the Americas [19, 25, 26, 27], yellow fever virus in Brazil, and West Nile virus in the USA. PrimalSeq has also been used to characterize Zika virus during infection of non-human primates [21, 30]. While PrimalSeq was shown to be superior to other methods for obtaining consensus sequences [25, 26], it has yet to be validated for measuring intrahost diversity.
In this study, we benchmarked PrimalSeq for sequencing diverse virus populations, highlighting its limitations and providing recommendations for accurately measuring intrahost single-nucleotide variants (iSNVs) from both Illumina and Oxford Nanopore data. We used these results to develop comprehensive laboratory protocols and a computational tool (iVar), and further tested PrimalSeq to characterize Zika virus populations generated from cell culture, mosquito, non-human primate, and human clinical samples. We demonstrate the utility of PrimalSeq for other viruses by designing an amplicon scheme for West Nile virus and measuring genetic diversity from field-collected mosquitoes and birds. Our data show that virus diversity can be significantly impacted by the experimental and biological systems, and we provide a framework to uncover the underlying mechanisms. PrimalSeq and iVar provide a scalable platform for viruses other than Zika and West Nile that can be applied to discover ecological, epidemiological, and immunological drivers of virus evolution in a variety of systems.
Virus concentration and sequencing depth impact intrahost variant calling
We previously developed Primal Scheme (Quick et al.; primal.zibraproject.org), a multiplex primer design tool for amplicon-based sequencing of RNA virus genomes directly from clinical samples. Our Zika virus PrimalSeq protocol generates 35 overlapping amplicons of ~ 400 base pairs from two multiplexed PCR reactions, an approach similar to “RNA jackhammering,” which was developed to sequence HIV. The process of PCR amplification to generate sufficient templates for high-throughput sequencing, however, may bias the measurements of intrahost virus diversity through differential amplification efficiencies for divergent virus haplotypes present in a population [32, 33].
For our initial evaluation, we selected three of the 35 Zika virus primer sets that flanked at least five variable genome positions (amplicons 5, 24, and 33). We then made two sets of mixed virus populations: (1) altering the ratios of mixed viruses, while keeping the overall input concentration constant at 1000 virus RNA copies (Fig. 1b) and (2) maintaining a constant ratio of 14% of virus #2 while altering the input concentrations of virus RNA used for cDNA synthesis (Fig. 1c). For each test, we measured the frequencies of the 18 iSNVs between virus #1 and #2 (Fig. 1a). We generated the amplicons independently three times and sequenced each using the Illumina MiSeq platform (“technical replicates”).
We found that the measured mean iSNV frequencies were accurate from populations containing 50%, 25%, 14%, 7%, and 3% of virus #2 (Fig. 1b). At 1.5% of virus #2, the standard deviation of our measured mean iSNV frequency (0.2–1.2%) fell below the expected frequency, indicating that we could not measure the true iSNV frequency at that dilution (Fig. 1b). This demonstrates that the lower limit of accurate iSNV detection for PrimalSeq in this scenario is between 1.5 and 3%. When we altered input concentrations of a population containing 14% virus #2 from 100,000 to 10 virus RNA copies (10-fold serial dilutions), we found that the variances of measured frequencies became significantly higher at input concentrations of 100 or fewer copies (Levene’s test for variance, p < 0.05; Fig. 1c). Therefore, input virus concentrations can dramatically alter iSNV detection, as others have also discovered. We conclude that a minimum of 1000 virus RNA copies should be used with PrimalSeq to accurately measure iSNVs greater than 3% frequency.
Sequencing coverage depth is another important factor for iSNV detection, so we sought to define the level of sequencing coverage needed to accurately measure iSNVs. From our samples containing 1000 virus RNA copies with 97% virus #1 and 3% virus #2, we sequenced the targeted genome regions in triplicate to a coverage depth of ~ 3000×. We randomly downsampled these datasets to generate coverage depths of 1000, 600, 400, 200, 100, and 50× (Fig. 1d). We found that the variances of iSNV frequencies became significantly higher at coverage depths lower than 400× (i.e., at 200, 100, and 50×; Levene’s test for variance, p < 0.05; Fig. 1d). Thus, we conclude that a minimum sequencing coverage depth of 400× is required to maintain iSNV measurement accuracy at the lower limit of frequency detection (3%) and input concentration (1000 virus RNA copies).
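As a rough guide to why variance grows at low depth, the spread expected from read sampling alone can be sketched with a binomial model. This is an idealized assumption on our part (independent reads, no PCR or sequencing error), not the analysis behind Fig. 1d, and the function name is hypothetical.

```python
import math

def binomial_sd(freq: float, depth: int) -> float:
    """Standard deviation of an observed allele frequency when `depth`
    independent reads are drawn from a site with true frequency `freq`."""
    return math.sqrt(freq * (1.0 - freq) / depth)

# Sampling spread for a 3% iSNV at the downsampled depths used above.
for depth in (1000, 600, 400, 200, 100, 50):
    print(f"{depth:>5}x  sd = {binomial_sd(0.03, depth) * 100:.2f}%")
```

Under this simplified model, the sampling spread at 50× is several times larger than at 1000× and becomes comparable to the 3% frequency being measured, which is in line with the loss of accuracy observed at low depth.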
Primer mismatches impact intrahost variant frequency measurements
We amplified the Mix10% virus population (1000 RNA copies) independently three times and sequenced each replicate to a minimum coverage depth of 1000× using the Illumina MiSeq. We measured iSNV frequency at each site and calculated the mean iSNV frequency of all iSNVs within an amplicon to estimate the computed virus #2 haplotype frequency (Fig. 2b). We found that iSNV frequencies measured from amplicons without primer mismatches were significantly closer to the expected value of 10% than amplicons with one or more mismatches in the primer regions (Welch’s t test, p < 0.05, Fig. 2c). Moreover, we found that mismatches closer to the 3′ end of the primer were more likely to lead to inaccurate frequency measurements (Pearson r, p < 0.05, Fig. 2d). Overall, our data demonstrate that the accuracy of intrahost virus diversity measures is highly impacted by primer mismatches during PCR. Thus, when iSNVs are detected from amplicons with mismatches in the primer binding sites, the resulting diversity data from those amplicons should be interpreted with caution.
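The per-amplicon haplotype estimate described above can be sketched as follows. This is a minimal illustration rather than the code used for Fig. 2; the function name, the input layout, and the example coordinates and frequencies are all hypothetical.

```python
from statistics import mean

def haplotype_frequency_by_amplicon(isnvs, amplicons):
    """Mean iSNV frequency per amplicon.

    isnvs:     dict mapping genome position -> measured iSNV frequency
    amplicons: dict mapping amplicon name -> (start, end), inclusive
               coordinates of the region between the primers
    Amplicons that contain no iSNVs are omitted from the result.
    """
    result = {}
    for name, (start, end) in amplicons.items():
        freqs = [f for pos, f in isnvs.items() if start <= pos <= end]
        if freqs:
            result[name] = mean(freqs)
    return result

# Made-up coordinates and frequencies, for illustration only.
example_isnvs = {120: 0.10, 250: 0.12, 640: 0.04, 700: 0.06}
example_amplicons = {"amp_A": (100, 400), "amp_B": (600, 900), "amp_C": (1000, 1300)}
print(haplotype_frequency_by_amplicon(example_isnvs, example_amplicons))
```

Comparing each amplicon's value with the expected mixture frequency (10% for Mix10%) then highlights amplicons whose primers may be amplifying one haplotype preferentially.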
Removal of false positive intrahost variants with replicate sequencing
Measurements of virus intrahost genetic diversity are sensitive to PCR and sequencing errors [12, 14, 16]. These factors, combined with others such as virus concentration and sequencing coverage (Fig. 1), can lead to erroneous iSNV detection (i.e., false positives) and bias measures of genetic diversity. To improve accurate iSNV detection, we examined the distribution of false positive iSNV calls and investigated methods to remove them during analysis. We found that (1) the distribution of false positive iSNVs more closely matched the profile of sequencing errors than PCR errors and that (2) the majority of false iSNVs > 3% could be removed by replicate sequencing (Fig. 3).
Knowing the general distribution of false positive iSNVs, we sought to remove them post sequencing. Based on previous investigations [12, 35, 36], we proposed to remove false positive iSNVs by (1) amplifying and sequencing each sample as technical replicates (at least twice) and (2) only calling iSNVs detected in all replicates. Using our Mix10% virus population, we analyzed each replicate in isolation or in combination and calculated the mean iSNV frequencies (Fig. 3c). From individually sequenced replicates, we found that 1–2 false positive iSNVs per sample were within the frequency distribution of our true iSNVs (Fig. 3c, panel “1 replicate”), demonstrating that a simple frequency cutoff will either leave false positives or remove true positive iSNVs. When considering replicates in combination, however, we found that the percent of sites with a false positive iSNV call (above 0.1%) dropped from ~ 16% (Fig. 3c “1 replicate”) to ~ 9% (Fig. 3c “2 replicates”). More importantly, we found that all of the false iSNVs that passed the duplicate filter had frequencies below the 3% limit of accurate iSNV measurements (Fig. 1b). This allowed us to use a secondary frequency cutoff (3%) to remove the remainder of the false positives, while maintaining all of the true (10%) iSNVs. We found that the addition of a third technical replicate only resulted in a moderate reduction of sites with false iSNVs above 0.1% (9 to 6%) and did not help us to decrease the frequency cutoff filter (Fig. 3c “3 replicates”). When we instead used pseudo replicates (i.e., using the same replicate more than once, instead of using technical replicates) to filter variants, we found no improvement in eliminating false positives (Additional file 2: Figure S1). This finding shows that the elimination of false positives when using technical replicates is not due to an apparent increase in sequencing coverage, but rather is due to the independent nature of each replicate sequencing library.
We conclude that PrimalSeq can be used for accurate iSNV detection above 3% when using at least two technical replicates.
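The two-step filter described above (keep only iSNVs seen in every technical replicate, then apply a 3% cutoff) can be sketched as below. This is an illustrative sketch, not iVar's implementation; the function name and data layout are hypothetical, and applying the cutoff to the mean frequency across replicates is our assumption.

```python
from statistics import mean

def filter_isnvs(replicates, min_freq=0.03):
    """Keep iSNVs called in every replicate whose mean frequency is >= min_freq.

    replicates: list of dicts, one per technical replicate, each mapping
                (position, alt_base) -> measured frequency
    Returns a dict mapping (position, alt_base) -> mean frequency.
    """
    # Step 1: variants present in all replicates.
    shared = set(replicates[0])
    for rep in replicates[1:]:
        shared &= set(rep)

    # Step 2: secondary frequency cutoff on the mean across replicates.
    kept = {}
    for key in sorted(shared):
        avg = mean(rep[key] for rep in replicates)
        if avg >= min_freq:
            kept[key] = avg
    return kept

# Made-up calls: one true iSNV, one replicate-specific error, one shared low-frequency call.
rep1 = {(100, "T"): 0.10, (200, "G"): 0.05, (300, "A"): 0.010}
rep2 = {(100, "T"): 0.11, (300, "A"): 0.012}
print(filter_isnvs([rep1, rep2]))
```

In this toy input the call at position 200 is dropped because it appears in only one replicate, and the call at position 300 is dropped by the frequency cutoff.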
The accuracy of PrimalSeq is comparable to metagenomic sequencing
High intrahost variant false discovery rate using Oxford Nanopore sequencing
To further explore Oxford Nanopore sequencing for measuring intrahost virus diversity, we examined if we could (1) differentiate between mixed genotypes within a virus population and (2) computationally remove false positive iSNVs. Though the mean iSNV frequencies measured from Nanopore were not significantly different from Illumina (Fig. 4) and we could assign reads to the correct haplotype (i.e., virus #1 or #2), we found it difficult to differentiate between true and false positive iSNVs (Fig. 5). For this evaluation, we used our Nanopore data generated from the Mix10% Zika virus population (Fig. 4a).
First, using a reference database containing virus #1 and #2, plus two divergent Zika viruses (French Polynesia, 2007; Uganda, 1947), we determined if we could differentiate between virus haplotypes in a mixed population. We found that 92.38% of the aligned reads mapped to virus #1 and 7.35% mapped to virus #2, the roughly expected 90%:10% proportions (0.27% of reads mapped erroneously to haplotypes not present in the mixture). Overall, the results indicate that nanopore sequencing reads are useful for identifying highly divergent haplotypes within a mixture, as might be expected for some co-infections, despite a high error rate. This approach, however, will be less useful for detecting co-infections if the divergence between the haplotypes is small or the haplotypes are unknown.
To attempt to differentiate between true and false positive iSNVs, we limited our analysis to regions only covered by perfect primer matches and analyzed 54 true positive and 4173 true negative sites, as we did above for the Illumina data (Fig. 3). We filtered the sequencing data using technical replicates and a 3% frequency cutoff, which we demonstrated above could be used to remove false iSNV calls in our Illumina data (Fig. 3c). Using these filters, we found that > 17% of the 4173 invariant sites had false iSNV calls in the Nanopore data, even when including all three replicates (Fig. 5a). This is because the majority of the false positive iSNVs had measured frequencies as high as, or higher than, the 10% true positives (Fig. 5c), leading to a > 93% false discovery rate (Fig. 5a). To investigate if the false discovery rate could be reduced using additional data within the sequencing reads, we trained a logistic regression model incorporating iSNV frequency and strand bias as features, and the presence or absence of known iSNVs as the response variable (Fig. 5b). Based on this analysis, we found that a combined frequency and strand bias filter had greater power to discriminate true from false iSNVs, as shown by its greater area under the curve, than either feature alone (Fig. 5b). Using this filter for individual replicates, we were able to reduce the number of false positive iSNVs from ~ 900–1000 (3% cutoff) to ~ 50–70 (frequency + strand bias) and the false discovery rate from ~ 95 to ~ 55% (Fig. 5a). By including replicate sequencing (either 2 or 3), we could further reduce the false discovery rate to < 40% (Fig. 5a). Despite this significant reduction, the remaining false positive iSNVs still had high frequencies in our dataset (~ 5–25%, Fig. 5d).
It should be noted that because we are comparing two divergent viruses, the false discovery rates will likely increase when sequencing virus populations with fewer true iSNVs; however, applying the frequency and strand bias filter will still provide higher true or false iSNV discriminatory power. These findings show that estimating intrahost virus genetic diversity using the Oxford Nanopore platform will require additional technological and computational innovations for anything other than simple scenarios of co-infections with diverse virus haplotypes.
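For reference, the false discovery rates quoted above follow from the usual definition, FP / (FP + TP). The sketch below assumes, for illustration only, that all 54 true positive sites are retained by each filter; the function name is hypothetical.

```python
def false_discovery_rate(false_pos: int, true_pos: int) -> float:
    """Fraction of all called iSNVs that are false positives."""
    called = false_pos + true_pos
    return false_pos / called if called else 0.0

TRUE_POSITIVES = 54  # known iSNV sites between virus #1 and virus #2

# Approximate false positive counts reported for individual replicates.
for label, false_pos in (("3% cutoff", 900), ("frequency + strand bias", 60)):
    fdr = false_discovery_rate(false_pos, TRUE_POSITIVES)
    print(f"{label}: FDR = {fdr:.0%}")
```

Because the numerator is set by the error profile of the platform while the denominator depends on how many true iSNVs exist, the same number of false calls yields a higher false discovery rate in populations with fewer true variants.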
Accurate analysis of amplicon-based sequencing data using iVar
Our computational package, iVar, contains functions broadly useful for viral amplicon-based sequencing that cannot be accomplished using currently existing tools. We programmed iVar in C++ with minimal dependencies, and created the following functions to accurately call intrahost variants and generate virus consensus genomes from sequencing data across multiple replicates: (1) trimming of primers and low-quality bases, (2) consensus calling of virus sequences, (3) intrahost variant calling of iSNVs, insertions, and deletions, followed by a filtering step that uses variants called across multiple replicates to exclude false positives, and (4) identification of mismatches in primer sequences and exclusion of the corresponding reads from alignment files. We also created two pipelines using iVar to call intrahost variants from samples with or without known reference sequences, and prepared with or without technical replicates (with no limit on the number of technical replicates). When using iVar in combination with PrimalSeq, from our empirically-derived data we found that the following guidelines produced robust and reproducible results: (1) only call intrahost variants detected in two or more technical replicates greater than 3% frequency (Figs. 1a and 3c) and (2) remove reads from amplicons with mismatched primers to normalize comparisons of intrahost populations (Fig. 2).
We incorporated several functions into iVar for accurate intrahost variant calling that are currently not available in other software packages (Fig. 6). First, iVar removes primer sequences from aligned reads in an input BAM file, based on a BED file with primer positions. This allows iVar to accurately trim primer sequences irrespective of potential mismatches in the aligned region of the sequencing reads and primer sequences. Following the trimming of primer sequences, iVar uses a sliding window approach to remove low quality bases based on phred score thresholds that can be specified by the user. During the trimming process, iVar stores primer sequences that were trimmed off as auxiliary data for each read in each input BAM file. Second, for virus consensus sequence generation, iVar uses the output of mpileup taking into account ambiguous nucleotides and a minimum threshold for base coverage that can be specified by the user. Third, to detect iSNVs, deletions, and insertions, iVar uses the output of mpileup taking into account a minimum threshold for base quality and a minimum threshold for variant frequency. iVar then uses the intrahost variants called across multiple technical replicates to exclude variants that may have been introduced into individual replicates due to amplification, library preparation, and/or sequencing errors (Fig. 3c). Fourth, to identify primer sequences with mismatches, iVar calls variants on an alignment of primer sequences and identifies those with mismatches to the reference. Reads with auxiliary data that matches these identified primers are selectively removed from the alignment. This ensures that varying primer binding efficiency will not bias the frequency of the intrahost variants called with iVar (Fig. 2). 
Thus, iVar provides an inclusive software package that integrates a set of critical functions for accurate primer and quality trimming, consensus calling, and intrahost variant detection from data generated using amplicon-based sequencing, including PrimalSeq.
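The sliding window quality trimming step can be illustrated with the following sketch. It is a simplified stand-in written for this explanation, not iVar's C++ code; the function name is hypothetical, and the window width and quality threshold shown are placeholder values that a user would set.

```python
def sliding_window_keep_length(quals, window=4, min_mean_q=20):
    """Number of bases to keep from the 5' end of a read.

    Slides a fixed-width window from 5' to 3' and cuts the read at the
    start of the first window whose mean phred score is below min_mean_q.
    Reads shorter than the window are evaluated as a single window.
    """
    n = len(quals)
    if n == 0:
        return 0
    width = min(window, n)
    for start in range(n - width + 1):
        if sum(quals[start:start + width]) / width < min_mean_q:
            return start
    return n

# Made-up quality string: five high-quality bases followed by a low-quality tail.
example_quals = [30, 30, 30, 30, 30, 10, 10, 10]
print(sliding_window_keep_length(example_quals))
```

Averaging over a window rather than testing single bases keeps an isolated low-quality base from truncating an otherwise good read.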
We benchmarked iVar against the pre-existing tools VarScan2, MAFFT, Geneious, Trimmomatic, and cutadapt to validate the trimming, consensus sequence generation, and intrahost variant calling functions in iVar. We found that iVar performed as well as, or better than, each of these tools (Additional file 2: Figures S4-S6). We used two simulated datasets and two clinical Zika virus samples sequenced using PrimalSeq to validate iSNV calling, and found an almost perfect correlation between iVar and VarScan2 (Spearman’s ρ = 1; Additional file 2: Figure S4). We also found zero nucleotide differences in the consensus sequences called using iVar and Geneious at all four thresholds (0%, 25%, 50%, and 90%) and across the four datasets (Additional file 2: Figure S5). We found that iVar was better than cutadapt at trimming primer sequences in amplicon-based sequencing datasets (Additional file 2: Figure S6). This is because iVar uses primer positions specified in a BED file to soft clip the primer regions after alignment, whereas cutadapt trims sequencing reads by comparing the primer nucleotide sequence with the nucleotides at the 5′ end of each read, before alignment. As a result, iVar was able to trim sequencing reads that might not start, or end, exactly at the beginning of the primer sequence (Additional file 2: Figure S6). Since cutadapt uses the actual primer sequences, which are assumed to be anchored at the 5′ or 3′ end to do the trimming, it misses these cases. For the comparison with Trimmomatic, we trimmed the length of the longest primer sequence (22 bp) from the 5′ end of all the sequenced reads using the “HEADCROP” option. This approach, however, is crude and will result in a loss of 22 bp from the 5′ end of all sequenced reads (Additional file 2: Figure S6). Thus, iVar contains functionality that is critical for performing primer and quality trimming, consensus calling, and variant calling from datasets generated using amplicon-based protocols.
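The difference between position-based and sequence-anchored primer trimming can be sketched as follows. This is a toy illustration for forward-strand reads only, with hypothetical function names and made-up coordinates; it does not reproduce the logic of iVar or cutadapt.

```python
def position_based_clip(read_start, read_end, primer_intervals):
    """Bases to clip from the left end of an aligned read.

    read_start, read_end: half-open reference coordinates of the alignment
    primer_intervals:     list of (start, end) half-open primer coordinates,
                          as they would appear in a BED file
    A read is clipped if it starts anywhere inside a primer interval.
    """
    for p_start, p_end in primer_intervals:
        if p_start <= read_start < p_end:
            return min(p_end, read_end) - read_start
    return 0

def anchored_sequence_match(read_seq, primer_seq):
    """True only if the read begins with the complete primer sequence."""
    return read_seq.startswith(primer_seq)

# Made-up primer occupying reference positions [100, 110).
primer_seq = "ACGTACGTAC"
primers = [(100, 110)]

# A read that starts 3 bases into the primer and so lacks its first 3 bases.
read_seq = primer_seq[3:] + "TTTTGGGG"
print(anchored_sequence_match(read_seq, primer_seq))    # anchored matching misses it
print(position_based_clip(103, 103 + len(read_seq), primers))
```

In this example the anchored comparison fails because the read does not begin with the full primer, while the position-based approach still clips the 7 primer-derived bases.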
PrimalSeq and iVar can be used to measure intrahost virus genetic diversity from primary samples
Start with at least 1000 RNA copies of the virus for the initial cDNA synthesis step.
Prepare the RNA from virus populations for sequencing in duplicate.
Sequence each library to a depth of at least 400× at each genome position using the Illumina platform.
Only call iSNVs greater than 3% frequency that are detected in both replicates (a lower frequency may be achievable with higher RNA quantities).
For multi-sample comparisons of genetic diversity, omit genome regions amplified with primers that contain iSNVs within the binding sites.
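The quantitative recommendations above can be expressed as a simple pre-analysis check. The sketch below is illustrative only; the class, field, and function names are hypothetical, and the thresholds are the ones stated in the list.

```python
from dataclasses import dataclass

@dataclass
class SampleQC:
    rna_copies: float   # virus RNA copies used for cDNA synthesis
    n_replicates: int   # number of technical replicates prepared
    min_depth: int      # lowest per-position coverage across replicates

def guideline_failures(qc: SampleQC) -> list:
    """Return the recommendations that a sample does not meet."""
    problems = []
    if qc.rna_copies < 1000:
        problems.append("fewer than 1000 input RNA copies")
    if qc.n_replicates < 2:
        problems.append("fewer than 2 technical replicates")
    if qc.min_depth < 400:
        problems.append("coverage depth below 400x")
    return problems

print(guideline_failures(SampleQC(rna_copies=500, n_replicates=1, min_depth=1000)))
```

Samples that fail a check are not necessarily unusable, but their iSNV calls near the 3% threshold warrant more cautious interpretation.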
Several factors can alter the accuracy of measuring intrahost virus diversity. In particular, we found that input virus concentrations, sequencing coverage depths, and primer mismatches can have profound effects on iSNV estimations. Using the recommendations above, however, we could consistently and accurately detect iSNVs at 3% frequency and higher. We predict that the lower limit of iSNV detection can be improved with a higher effective sampling depth (i.e., more input virus and deeper coverage).
Given no primer mismatches to the virus sequences, we found that measures of iSNV frequencies from PrimalSeq were nearly as accurate as an untargeted metagenomics approach. Because iVar removes the primer sequences from downstream analysis and the amplicons overlap, frequency measures of iSNVs within the primer regions themselves are not skewed. Instead, iSNVs within primer regions can alter the measured frequencies of other iSNVs within that particular amplicon. In these cases, results should be interpreted with caution, and we incorporated a step in iVar to mask out such regions for comparative analyses. It is plausible that using primers with degenerate nucleotides at mismatched iSNV positions could help alleviate this bias.
False iSNV calls significantly influence measurements of intrahost virus diversity. We found that the positive association of false positive iSNVs with sequencing read lengths better fit the profiles of Illumina sequencing errors, rather than PCR errors [16, 48]. In fact, we estimate that the Illumina MiSeq error rate (~ 0.9%) is ~ 60× greater than the error rates during PCR in our approach (~ 0.02%). Therefore, PrimalSeq likely does not add significantly more error, and by extension false iSNVs, than what was already inherent to the Illumina sequencing platform. Indeed, we found that PrimalSeq was comparable to PCR-free metagenomic sequencing in estimating intrahost virus diversity.
The ease and portability of Oxford Nanopore technologies, particularly the MinION, are revolutionizing the way we sequence viruses, including their use in near real-time outbreak tracking [19, 28, 37]. Our data indicate, however, that the Nanopore platform is not yet adequate for detection of minor alleles and measures of intrahost diversity. While it may provide value in tracking frequency changes of known iSNVs over time, we found that the high error rates (10–15% [39, 40]) make it difficult to differentiate between true and false iSNV calls. We found that stringent post-filtering, such as combining iSNV frequency and strand bias filters across replicate samples, significantly reduces false positive iSNV calls, but a high false discovery rate remains. Effectively using Nanopore sequencing for intrahost virus diversity measurements will require improved sequencing and base-calling accuracy, exploitation of co-occurring variants (i.e., haplotyping), or utilization of different molecular approaches, including the 1D2 method (where template and complementary strands of each fragment are sequenced), tandem repeat consensus techniques [36, 51], or unique molecular identifiers.
For viruses that utilize multiple hosts, like mosquito-borne viruses, being able to compare results from many sample types is critically important. A lack of standardization, however, means that the field does not yet have a consensus as to whether the mosquito vector or the vertebrate host contributes the most to virus genetic diversity [53, 54, 55, 56, 57, 58, 59, 60, 61, 62]. The development of PrimalSeq and iVar allows for such measurements to be performed across diverse environments, sample types, and experimental designs. Using PrimalSeq, for example, we found that in vitro Zika virus diversity was significantly greater in human cells than in mosquito cells. However, we found that these results were reversed during in vivo studies. Furthermore, we did not detect significant differences between field-collected mosquito and vertebrate samples for either Zika or West Nile virus. A caveat for the field samples, however, is that we do not know the reference sequence and cannot account for consensus-changing mutations introduced through intrahost bottlenecks and genetic drift [63, 64, 65]. In addition, a limitation for all of our samples is that we can only compare diversity measurements from iSNVs greater than 3% frequency, and iSNVs below this threshold may be important for the virus population structure and phenotype [5, 11, 66]. Even so, our incongruent results among experimental designs help to explain why there is still debate about the relative impact of vectors and hosts on virus evolution, and further use of PrimalSeq will help to resolve these issues.
We demonstrate that PrimalSeq can accurately measure intrahost virus genetic diversity if properly validated. We benchmarked our highly multiplexed and streamlined amplicon-based sequencing method using a series of experiments with mixed virus populations, developed an all-inclusive computational analysis tool (iVar), and showcased its utility by measuring intrahost virus diversity from cells, mosquitoes, primates, birds, and humans. Furthermore, using our free online primer designer, Primal Scheme (primal.zibraproject.org), PrimalSeq can be modified for use with a wide range of viruses. Overall, the detailed laboratory and computational approaches presented here can reveal important insights about intrahost virus evolution directly from clinical or experimental samples in a way that is cheap, accurate, and scalable.
Mixed virus populations
Zika virus RNA from isolates PRVABC59 (Puerto Rico 2015, Genbank KX087101, “virus #1”) and FSS13025 (Cambodia 2010, Genbank KU955593, “virus #2”) was quantified by qRT-PCR (as previously described). The consensus sequences from PRVABC59 and FSS13025 were determined using untargeted metagenomics (see below) and a strict > 99% majority nucleotide threshold at each site. Sites that were mixed (i.e., containing an iSNV > 1% frequency) were not used to evaluate iSNVs at known frequencies (Fig. 1). Using quantified virus RNA copies, the two viruses were mixed to achieve the desired total RNA copies (one half of the required amount, because 2 μL of RNA was used for cDNA synthesis) and ratios of PRVABC59:FSS13025. Metagenomic sequencing of a 10:1 mixed virus population (i.e., 10% FSS13025) was used to verify our mixing approach (Fig. 4). Each mixed virus population was sequenced in triplicate using the metagenomic and amplicon approaches described below.
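The mixing arithmetic can be sketched as below. This is an illustrative calculation under our reading that the stated total is the number of copies delivered in the 2 μL of RNA used for cDNA synthesis, so the mixture is prepared at half that amount per μL; the function name and output keys are hypothetical.

```python
def mix_plan(total_copies, frac_virus2, input_volume_ul=2.0):
    """Per-microliter concentrations needed for a two-virus mixture.

    total_copies:    virus RNA copies wanted in the cDNA reaction
    frac_virus2:     desired fraction of virus #2 (e.g., 0.10 for Mix10%)
    input_volume_ul: volume of RNA added to the cDNA reaction
    """
    per_ul = total_copies / input_volume_ul
    return {
        "virus1_copies_per_ul": per_ul * (1.0 - frac_virus2),
        "virus2_copies_per_ul": per_ul * frac_virus2,
    }

# Example: 1000 total copies with 10% virus #2.
print(mix_plan(1000, 0.10))
```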
Laboratory-infected cells, mosquitoes, and primates
Zika virus was collected from in vitro and in vivo experiments to compare intrahost diversity between mosquitoes (Ae. aegypti) and primates (humans and macaques, Additional file 1: Table S2). All in vitro and in vivo experiments were conducted using Zika virus isolate PRVABC59 (Puerto Rico, 2015, KX087101). All Zika virus RNA was quantified by qRT-PCR, as described.
Aag2 (derived from Ae. aegypti embryos) and HeLa (derived from human cervical epithelial cells, ATCC CCL-2) cells were infected using a multiplicity of infection of 0.01 and supernatant was harvested 5 days post infection. Both cell lines were maintained using Minimal Essential Medium (Sigma-Aldrich) supplemented with 10% (v/v) fetal bovine serum, l-glutamine, sodium bicarbonate, and antibiotics (penicillin and streptomycin). Aag2 and HeLa cells were incubated with 5% CO2 at 27 °C and 37 °C, respectively.
Ae. aegypti mosquitoes were infected with Zika virus as previously described. In brief, colonized mosquitoes originating from Los Angeles, California, in 2016 fed on viremic mice inoculated with 5 log10 Vero plaque-forming units of Zika virus (PRVABC59). At 14 days post infection, individual mosquitoes were collected and homogenized. Viral RNA was extracted from 50 μL of mosquito homogenate using the MagMax Viral RNA Extraction Kit and eluted in 50 μL of elution buffer (Buffer EB, Qiagen). Indian origin rhesus macaques (Macaca mulatta) were inoculated subcutaneously with 3 log10 Vero plaque-forming units of Zika virus (PRVABC59) and plasma was collected 5 days post infection, as described [69, 70]. RNA was extracted from at least 300 μL of rhesus macaque plasma using the MagMax Viral RNA Extraction Kit and was eluted in 60 μL of elution buffer. RNA extracts from laboratory-infected mosquitoes and macaque plasma used for this study had been thawed at least once previously.
Field-collected mosquitoes and clinical samples
Clinical and entomological samples were collected during the 2016 Florida Zika virus outbreak to compare intrahost Zika virus diversity between naturally infected humans and mosquitoes (Additional file 1: Table S2). Human clinical samples were obtained for diagnostic and surveillance purposes and excess human sera were used for this study. RNA was extracted using the RNeasy kit (Qiagen) and eluted into 50–100 μL using the supplied elution buffer. Entomological samples were collected by the Miami-Dade Mosquito Control for surveillance of Zika virus activity. Ae. aegypti mosquitoes were collected using BG-Sentinel mosquito traps (Biogents AG) and sorted into pools of up to 50 females per trap. The pooled mosquitoes were stored in RNAlater (Invitrogen), RNA was extracted using the RNeasy kit (Qiagen), and Zika virus RNA was quantified by qRT-PCR. The Zika virus positive pools used in this study contained 13–39 individual mosquitoes; however, considering that ~ 1 in 1600 mosquitoes were infected, it is highly unlikely that any pool contained > 1 infected mosquito.
Culex quinquefasciatus mosquitoes (up to 50 per trap) and dead American crows were collected by the San Diego County Vector Control Program during 2015. RNA was extracted using the RNeasy kit (Qiagen) and screened for the presence of West Nile virus RNA using standard qRT-PCR.
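The claim that a positive pool is unlikely to hold more than one infected mosquito can be checked with a short binomial calculation. The sketch assumes, as a simplification of ours, that each mosquito in a pool is infected independently with probability 1/1600; the function name is hypothetical.

```python
def prob_multiple_given_positive(pool_size: int, prevalence: float) -> float:
    """P(at least 2 infected | at least 1 infected) for a pool of mosquitoes,
    assuming independent infections at the given prevalence."""
    p_none = (1.0 - prevalence) ** pool_size
    p_one = pool_size * prevalence * (1.0 - prevalence) ** (pool_size - 1)
    return (1.0 - p_none - p_one) / (1.0 - p_none)

# Smallest and largest positive pools used in this study.
for size in (13, 39):
    p = prob_multiple_given_positive(size, 1 / 1600)
    print(f"pool of {size}: {p:.2%}")
```

Under this assumption, even the largest pool has only about a 1% chance of containing more than one infected mosquito. Local clustering of infected mosquitoes within a trap would raise that figure.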
Quantification of virus RNA copies
Zika virus RNA copies were quantified using a qRT-PCR assay targeting the NS5 protein coding region of the genome using the BioRad One-step qRT-PCR for probes kit. In a 20-μL reaction, 2 μL of virus RNA was added to 10 μL of iTaq universal probes reaction mix, 0.5 μL of iScript RT, 6 μL of nuclease-free water, 0.5 μL of the forward primer (5′-AGTGCCAGAGCTGTGTGTAC-3′; genome positions 9007–9027), 0.5 μL of the reverse primer (5′-TCTAGCCCCTAGCCACATCT-3′; genome positions 9097–9117), and 0.5 μL of the 6-FAM labeled probe (5′-GGCAGCCGCGCCATCTGGT-3′; genome positions 9078–9096). The reactions were then amplified on a thermocycler with the following conditions: 50 °C for 10 min, 95 °C for 3 min, followed by 40 cycles of 95 °C for 10 s and 57 °C for 10 s (fluorescence read at the end of the 57 °C step). To calculate the number of virus template copies using standard curves, we included 10-fold serial dilutions of partial Zika virus RNA genomes spanning the primer sites (10^7 to 10^0 copies per reaction). The Zika virus RNA standards were constructed by PCR amplifying an 848 bp segment of the Zika virus NS5 protein coding region (genome positions 8644 to 9492) with the following primers: forward containing a T7 promoter region (5′-TAATACGACTCACTATAGGGAGATCAGGCTCCTGTCAAAACCC-3′; underlined = T7 promoter sequence; genome positions 8644–8664) and reverse primer (5′-AGTGACAACTTGTCCGCTCC-3′; genome positions 9472–9492). The amplified cDNA was converted into RNA to be used as standards using the Invitrogen MEGAscript T7 Transcription Kit.
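Converting a measured Ct value into RNA copies with a standard curve can be sketched as follows. This is a generic least-squares illustration, not the instrument software used in this study; the function names are hypothetical, and the Ct values for the standards are synthetic numbers generated from an idealized line, not measured data.

```python
def fit_standard_curve(log10_copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate copies per reaction."""
    return 10 ** ((ct - intercept) / slope)

# Synthetic standards spanning 10^7 to 10^0 copies per reaction.
standards_log10 = [7, 6, 5, 4, 3, 2, 1, 0]
standards_ct = [40.0 - 3.3 * x for x in standards_log10]  # idealized, not measured

slope, intercept = fit_standard_curve(standards_log10, standards_ct)
print(copies_from_ct(30.1, slope, intercept))
```

Because 2 μL of RNA goes into each reaction, dividing the estimated copies per reaction by 2 gives copies per μL of RNA extract.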
The accuracy of measuring virus RNA copies by targeting one small genome region, in this case positions 9007–9117, depends on all regions of the virus genome being present in relatively equal proportions in the sample. To assess this, we used untargeted metagenomic sequencing of our 1000 RNA copy stocks of virus #1, virus #2, and three replicates of the Mix10% population. The normalized coverage shows that depth is consistent across the virus genome (Additional file 2: Figure S3). The changes in normalized coverage are consistent among virus samples and replicates, suggesting that coverage depth depends more on intrinsic features of the virus genome that influence replication efficiency (i.e., GC content) than on significant RNA degradation leading to the loss of a fraction of the virus genome. Hence, we are confident that our qRT-PCR results are relatively informative for determining the virus RNA copy numbers across the whole genome.
For all Zika virus samples, 1000 virus RNA copies were used for sequencing, unless otherwise specified (e.g., Fig. 1c). For all West Nile virus samples, 10,000 virus RNA copies were used. Normalizing input copy numbers allowed us to more accurately compare sequencing results.
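One way to compare coverage profiles between samples sequenced to different depths is sketched below. The text does not state how coverage was normalized, so dividing each position's depth by the sample's mean depth is an assumption made for illustration, and the depth values are hypothetical.

```python
def normalize_coverage(depths):
    """Scale per-position depth by the sample's mean depth (assumed normalization)."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    if mean_depth == 0:
        raise ValueError("mean depth is zero; nothing to normalize")
    return [d / mean_depth for d in depths]


def max_profile_difference(profile_a, profile_b):
    """Largest absolute difference between two normalized profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same positions")
    return max(abs(a - b) for a, b in zip(profile_a, profile_b))


# HYPOTHETICAL per-position depths for two samples sequenced to different depths.
sample_1 = [100, 200, 300]
sample_2 = [10, 20, 30]

norm_1 = normalize_coverage(sample_1)
norm_2 = normalize_coverage(sample_2)
difference = max_profile_difference(norm_1, norm_2)
```

Two samples whose depth profiles differ only by a constant factor normalize to the same profile, so any remaining difference reflects position-specific effects rather than total sequencing depth.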
PCR amplification of the virus genomes
Virus RNA (2 μL) was reverse transcribed into cDNA using Invitrogen SuperScript IV VILO (20 μL reactions). Virus cDNA (2 μL) was amplified as 35 fragments of ~400 bp across two multiplexed PCR reactions using Q5 DNA High-fidelity Polymerase (New England Biolabs) under the conditions previously described. For the data shown in Fig. 1, the mixed Zika virus populations were amplified in one multiplexed reaction containing primer sets 5, 24, and 33. A detailed protocol can be found in Additional file 3 and the Zika and West Nile virus primers can be found in Additional file 1: Tables S3 and S4, respectively.
Amplicon-based Illumina sequencing
A detailed protocol for our amplicon-based sequencing methods can be found in Additional file 3. Protocol updates will be released online at http://grubaughlab.com/open-science/amplicon-sequencing/  and https://andersen-lab.com/secrets/protocols/ . Virus amplicons from the two multiplex PCR reactions (above section) were purified using Agencourt AMPure XP beads (Beckman Coulter) and combined (25 ng each) prior to library preparation. The libraries were prepared using the Kapa Hyper prep kit (Kapa Biosystems, following the vendor’s protocols but with one fourth of the recommended reagents) and NEXTflex Dual-Indexed DNA Barcodes (BIOO Scientific, diluted to 250 nM). Agencourt AMPure XP beads (Beckman Coulter) were used for all purification steps. The libraries were quantified and quality-checked using the Qubit (Thermo Fisher) and Bioanalyzer (Agilent). Paired-end 250 nt reads were generated using the MiSeq V2 500 cycle or V3 600 cycle kits (Illumina).
Untargeted metagenomic Illumina sequencing
We followed the general outline of a previously developed protocol for untargeted sequencing of the mixed viral populations. In brief, cDNA was generated as described for the amplicon-based methods. Second-strand cDNA was generated using Escherichia coli DNA ligase and polymerase (New England Biolabs). The cDNA was purified by Agencourt AMPure XP beads (Beckman Coulter) prior to library preparation using Nextera XT (Illumina) following the vendor’s protocols, but with reduced reagent volumes. Specifically, for tagmentation (12.5 μL reaction), we concentrated our cDNA to 4 μL using a DNA speedvac and used 5 μL of Tagment DNA Buffer (one half recommended) and 1 μL of Amplicon Tagment Mix (one fifth recommended). After incubation, the reaction was stopped using 2.5 μL of Neutralize Tagment Buffer (one half recommended). The libraries were indexed and amplified using one half of the Nextera PCR reagents and primers in a 25-μL reaction. Agencourt AMPure XP beads (Beckman Coulter) were used for the final purification step (purified twice at a ratio of 0.7:1 beads to sample). The libraries were quantified and quality-checked using the Qubit (Thermo Fisher) and Bioanalyzer (Agilent). Paired-end 251 nt reads were generated using the MiSeq V2 500 cycle kit. The paired-end reads were aligned to a provided reference genome using BWA, the reads were quality trimmed (Phred quality score < 20) using Trimmomatic, and iSNVs were called based on frequency from the bam files using Geneious v9.1.5. To allow a closer comparison with the amplicon-based Illumina data, no other iSNV filters, such as strand bias, were used.
Illumina data processing and variant calling using iVar
Validation of iVar
iVar was validated against existing tools (Additional file 1: Table S7, Additional file 2: Figure S4-S6). We validated iSNV calling in iVar against the `mpileup2snp` and `mpileup2indel` commands in VarScan2 (v2.3.9) using four datasets: two simulated datasets and two clinical Zika virus samples sequenced using PrimalSeq. We ran both tools with no thresholds and with a quality threshold of 20 and a minimum frequency threshold of 3%. We validated the consensus calling in iVar against the consensus calling available in Geneious (v11.1.4) using the same four datasets. We did the consensus calling at four different thresholds: 0% (majority), 25%, 50% (strict), and 90%. We counted the mismatches between the resulting consensus sequences from iVar, Geneious, and the reference sequence by performing a multiple sequence alignment using MAFFT (v7.388). While counting mismatches, we ignored positions where one sequence had a gap and the other had an “N.” We validated primer trimming and quality trimming in iVar against anchored adapter trimming in cutadapt (v1.16) and against Trimmomatic. iVar uses a sliding window starting from the 5′ end and checks if the average quality within the window drops below the threshold. As soon as the quality drops below the threshold, it trims the sequence by soft clipping the read. This is different from the algorithm cutadapt uses to do quality trimming, but is similar to the sliding window approach used by Trimmomatic. The data and code used for the validation are at github.com/andersen-lab/paper_2018_primalseq-ivar.
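The sliding-window quality trimming described above can be illustrated with a short sketch. This is not iVar's source code: the window size of 4, the choice to clip from the start of the first failing window, and the function names are assumptions made for the example; only the general approach (scan from the 5′ end, clip once the window average falls below the threshold) comes from the text.

```python
def sliding_window_trim(quals, window=4, min_quality=20):
    """Return how many bases to keep from the 5' end.

    Scans windows from the 5' end and stops at the first window whose mean
    Phred quality is below `min_quality`. Clipping at the start of that
    window is an assumption of this sketch.
    """
    n = len(quals)
    if n == 0:
        return 0
    w = min(window, n)
    for start in range(n - w + 1):
        window_quals = quals[start:start + w]
        if sum(window_quals) / w < min_quality:
            return start
    return n


def soft_clip_cigar(read_length, keep):
    """CIGAR-style string marking kept bases (M) and soft-clipped bases (S)."""
    clipped = read_length - keep
    if keep == 0:
        return f"{clipped}S" if clipped else ""
    return f"{keep}M" + (f"{clipped}S" if clipped else "")


# HYPOTHETICAL read: six high-quality bases followed by four low-quality bases.
example_quals = [30] * 6 + [10] * 4
kept = sliding_window_trim(example_quals)
cigar = soft_clip_cigar(len(example_quals), kept)
```

Soft clipping keeps the low-quality bases in the alignment record but excludes them from the aligned portion, so the original read can still be recovered from the BAM file.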
Oxford Nanopore sequencing and analysis
Using the same PCR amplicons used for amplicon-based Illumina sequencing, we sequenced three replicates of the mixed Zika virus population (90% virus #1, 10% virus #2) using the Oxford Nanopore GridION sequencer. Native, 1D barcode libraries (SQK-NSK007, Oxford Nanopore Technologies, UK) were prepared according to previously published methods, with three amplification replicates corresponding to barcodes 1, 2, and 3. The pooled sequencing library was sequenced on an R9.4 version flowcell (FLO-MIN106, Oxford Nanopore Technologies, UK). Reads were basecalled using Albacore 2.3.1 with the command line `read_fast5_basecaller.py -c r94_450bps_linear.cfg -i fast5 -o fastq -r -t 12`. Reads were subsequently demultiplexed with Porechop 0.2.3_seqan2.1.1 using default (lenient) settings (github.com/rrwick/Porechop). A total of 2.4 million reads were generated, which after alignment and trimming covered 95.66% of the reference genome (Genbank KX087101). For the purposes of assigning to genotypes (i.e., unique virus haplotypes), reads were assigned to individual strains using BWA-MEM against a custom reference database comprising four Zika virus genomes: Genbank KX087101 (virus #1), KU955593 (virus #2), EU545988 (an Asian lineage virus isolated in 2007), and NC_012532 (MR766, an African lineage virus). Counts for each assignment were retrieved, ignoring multi-mapping reads, using the shell command `bwa mem -x ont2d | samtools view -h -F 256 - | samtools view -h -F 2048 - | cut -f 3 | sort | uniq -c`. Next, each replicate was aligned to the PRVABC59 reference genome with BWA-MEM using the setting `-x ont2d`. Primer binding sites and any residual adaptor sequence were masked in the resulting BAM alignment using the align_trim script from the Zibra pipeline. Allele frequencies and putative iSNVs (ignoring insertions or deletions) were extracted from BAM files using a Python script, freqs.py (included in the accompanying code repository: github.com/nickloman/zika-isnv). This script utilizes the pileup functionality of samtools via the pysam Python interface module (github.com/pysam-developers/pysam). Only predicted variants with more than 10 supporting forward and 10 supporting reverse reads were considered. The logistic regression model was trained and tested under a 10-fold cross validation scheme using the train function with the parameters method = “glm” and family = “binomial” from the caret (github.com/topepo/caret/) library in R. Class probabilities for the ROC curve were captured from the same function and plotted using ggplot2 (github.com/tidyverse/ggplot2).
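The read-assignment counting performed by the shell pipeline above (`-F 256` and `-F 2048`, then `cut -f 3 | sort | uniq -c`) can be expressed in a few lines of Python. This is an illustrative sketch rather than part of the published workflow: the function name and the example SAM records are hypothetical, while the flag values are those defined by the SAM format.

```python
from collections import Counter

SECONDARY = 0x100      # SAM flag 256: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag 2048: supplementary alignment


def count_primary_assignments(sam_lines):
    """Count reads per reference name, skipping secondary and supplementary records."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        counts[fields[2]] += 1  # column 3 (RNAME) is the assigned reference
    return counts


# HYPOTHETICAL SAM records (read names and coordinates are made up).
example_sam = [
    "@SQ\tSN:KX087101\tLN:10807",
    "read1\t0\tKX087101\t1\t60\t*\t*\t0\t0\t*\t*",
    "read2\t16\tKX087101\t50\t60\t*\t*\t0\t0\t*\t*",
    "read2\t2064\tKU955593\t900\t60\t*\t*\t0\t0\t*\t*",  # supplementary, ignored
    "read3\t256\tEU545988\t10\t0\t*\t*\t0\t0\t*\t*",     # secondary, ignored
    "read4\t0\tKU955593\t5\t60\t*\t*\t0\t0\t*\t*",
]
assignment_counts = count_primary_assignments(example_sam)
```

Each read then contributes exactly one count, from its primary alignment. Unmapped reads, whose reference name is `*`, are tallied under `*` just as they would be by `cut -f 3`.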
where p is the iSNV frequency and the mean Sn from all evaluated sites within the virus genome was used to determine the population complexity (Fig. 7b). Distance was calculated by the sum of all of the iSNV frequencies per population (Fig. 7c).
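The two summary metrics named above can be sketched as follows. The equation defining Sn did not survive in this copy of the text, so the per-site form used here (two-allele Shannon entropy with natural logarithms, unnormalized) is an assumption and may differ from the authors' definition. The averaging over all evaluated sites and the distance as a sum of iSNV frequencies follow the text. Function names and frequencies are hypothetical.

```python
import math


def site_entropy(p):
    """Two-allele Shannon entropy for an iSNV at frequency p (ASSUMED form of Sn)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if p in (0.0, 1.0):
        return 0.0  # a fixed site carries no uncertainty
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def complexity(isnv_frequencies, n_sites_evaluated):
    """Mean Sn over all evaluated sites; sites without an iSNV contribute zero."""
    if n_sites_evaluated <= 0:
        raise ValueError("n_sites_evaluated must be positive")
    return sum(site_entropy(p) for p in isnv_frequencies) / n_sites_evaluated


def distance(isnv_frequencies):
    """Sum of all iSNV frequencies in a population, as described in the text."""
    return sum(isnv_frequencies)


# HYPOTHETICAL population: two iSNVs at 50% among four evaluated sites.
example_frequencies = [0.5, 0.5]
example_complexity = complexity(example_frequencies, n_sites_evaluated=4)
example_distance = distance(example_frequencies)
```

Averaging over all evaluated sites, rather than only variable ones, keeps the complexity comparable between populations with different numbers of iSNVs.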
We thank Barney Graham (VRC; NIAID/NIH) for support with the non-human primate studies, the Florida Department of Health for providing clinical samples, Miami-Dade Mosquito Control for providing collected Ae. aegypti pools, San Diego County Vector Control Program for providing West Nile virus samples, and Glenn Oliveira, Mark Zeller, Refugio Robles, Emily Spencer, Dylan Grubaugh, and Sophie Taylor for technical support.
NDG was supported by NIH training grant 5T32AI007244-33. JQ is supported by a grant from the NIHR Surgical Reconstruction and Microbiology Research Centre (SRMRC). KKAVR is supported by the Office of Research Infrastructure Programs/OD (P51OD011107) to CNPRC and NIH R21AI129479. SI and SFM are supported by NIH NIAID R01AI099210. LLC was supported by startup funds from the UC Davis Department of Pathology, Microbiology and Immunology and the Pacific Southwest Regional Center of Excellence for Vector-Borne Diseases funded by the U.S. Centers for Disease Control and Prevention (Cooperative Agreement 1U01CK000516). BJM was supported by Abt Associates and a consortium of vector control districts in California: Coachella Valley, Orange County, Greater Los Angeles County, San Gabriel Valley, West Valley, Kern, Butte County, Tulare, Sacramento-Yolo, Placer, and Turlock. The rhesus macaque studies were supported by NIH 1R21AI129479–01 & Supplement, California National Primate Research Center pilot research grant P51OD011107 and FDA HHS HHSF223201610542P. NJL is supported by a Medical Research Council Bioinformatics Fellowship as part of the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) project. KGA is a Pew Biomedical Scholar, and is supported by NIH NCATS CTSA UL1TR001114, NIAID contract HHSN272201400048C, NIAID R21AI137690, NIAID U19AI135995, and The Ray Thomas Foundation.
Availability of data and materials
All additional files can be found at github.com/andersen-lab/paper_2018_primalseq-ivar and raw sequencing files can be found at console.cloud.google.com/storage/browser/andersen-lab_project_ivar-primalseq. The laboratory protocols generated from this study can be found in Additional file 3. Our computational tool, iVar, is licensed under an OSI-compliant open source license (GPL-3.0), is installable via bioconda ("conda install ivar"), and the source code is available at github.com/andersen-lab/ivar. The version of the code used in this paper is available at https://doi.org/10.5281/zenodo.2471612. Protocol updates and additional primer schemes can be found at grubaughlab.com/open-science/amplicon-sequencing/ and andersen-lab.com/secrets/protocols/. The validation analyses from this study can be found in Additional file 2, github.com/andersen-lab/ivar-validation/, github.com/nickloman/zika-isnv, NCBI Bioproject PRJNA438514 (Illumina data), and ENA project PRJEB30574 (nanopore data).
The study was conceived and coordinated by NDG, KG, NJL, and KGA. The samples were provided by BJM, ALT, LMP, DEB, SG, NG, KKAVR, SI, SFM, and LLC. Library preparation and sequencing were performed by NDG, JGDJ, and JQ. The variant calling pipeline (iVar) was designed and built by KG and NLM. The data were analyzed and interpreted by NDG, KG, NJL, and KGA. The manuscript was written by NDG, KG, NJL, and KGA with input from all co-authors. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Research on human subjects was conducted in compliance with existing regulations relating to the protection of human subjects and was evaluated and approved (#IRB-15-6664) by the Institutional Review Board/Ethics Review Committee at The Scripps Research Institute. Clinical samples were obtained from the Florida Department of Health (DOH) and Antibody Systems Inc. Samples collected in Florida were collected under a waiver of consent granted by the Florida DOH Human Research Protection Program. The work received a non-human subjects research designation (category 4 exemption) by the Florida DOH since this research was performed with leftover clinical diagnostic samples involving no more than minimal risk. Hence, written informed consent was not obtained. All samples were de-identified prior to receipt by the study investigators. The experimental methods used comply with the Helsinki Declaration.
Research involving Indian origin rhesus macaques was conducted at the California National Primate Research Center, and experimental infections of mice upon which Ae. aegypti mosquitoes fed were performed at the University of California, Davis, School of Veterinary Medicine. Both institutes are fully accredited by the Association for the Assessment and Accreditation of Laboratory Animal Care International. Animals were cared for in accordance with the National Research Council Guide for the Care and Use of Laboratory Animals and the Animal Welfare Act. Animal experiments were approved by the Institutional Animal Care and Use Committee of UC Davis (protocols #19211 and #19695 for rhesus macaques, protocol #19404 for mice). All macaque samples used in this study were from approved studies, and none were generated specifically for this work.
Consent for publication
NJL has received travel and accommodation expenses from Oxford Nanopore Technologies to attend meetings, and an honorarium to speak at an internal company meeting. NJL has previously received free-of-charge reagents and consumables in support of research projects from Oxford Nanopore Technologies. The other authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 13. Illingworth CJR, Roy S, Beale MA, Tutill H, Williams R, Breuer J. On the effective depth of viral sequence data. Virus Evol. 2017;3:vex030. Available from: https://academic.oup.com/ve/article/3/2/vex030/4629376?searchresult=1.
- 18. Magnani DM, Rogers TF, Beutler N, Ricciardi MJ, Bailey VK, Gonzalez-Nieto L, et al. Neutralizing human monoclonal antibodies prevent Zika virus infection in macaques. Sci Transl Med. 2017;9. Available from: https://doi.org/10.1126/scitranslmed.aan8184.
- 19. Faria NR, Quick J, Claro IM, Thézé J, de Jesus JG, Giovanetti M, et al. Establishment and cryptic transmission of Zika virus in Brazil and the Americas. Nature. 2017. Available from: https://doi.org/10.1038/nature22401.
- 20. Barzon L, Pacenti M, Franchin E, Lavezzo E, Trevisan M, Sgarabotto D, et al. Infection dynamics in a traveller with persistent shedding of Zika virus RNA in semen for six months. Euro Surveill. 2016;21. Available from: https://doi.org/10.2807/1560-7917.
- 22. Zanini F, Brodin J, Thebo L, Lanz C, Bratt G, Albert J, et al. Population genomics of intrapatient HIV-1 evolution. Elife. 2015;4. Available from: https://doi.org/10.7554/eLife.11282.
- 25. Metsky HC, Matranga CB, Wohl S, Schaffner SF, Freije CA, Winnicki SM, et al. Zika virus evolution and spread in the Americas. Nature. 2017;66:366.
- 26. Grubaugh ND, Ladner JT, Kraemer MUG, Dudas G, Tan AL, Gangavarapu K, et al. Genomic epidemiology reveals multiple introductions of Zika virus into the United States. Nature. 2017;90:4864.
- 27. Black A, Potter B, Dudas G, Feldstein L, Grubaugh ND, Andersen KG, et al. Genetic characterization of the Zika virus epidemic in the US Virgin Islands: bioRxiv; 2017. p. 113100. [cited 2017 May 11]. Available from: http://biorxiv.org/content/early/2017/03/03/113100.abstract.
- 28. Faria NR, Kraemer MUG, Hill S, de Jesus JG, de Aguiar RS, Iani FCM, et al. Genomic and epidemiological monitoring of yellow fever virus transmission potential. Science. 2018;361:894–99. Available from: http://science.sciencemag.org/content/361/6405/894.
- 29. Hepp CM, Cocking JH, Valentine M, Young SJ, Damian D, Sheridan K, et al. Phylogenetic analysis of West Nile Virus in Maricopa County, Arizona: Evidence for dynamic behavior of strains in two major lineages in the American Southwest. PLOS ONE. 2018;13:e0205801. Available from: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205801.
- 31. Worobey M, Watts TD, McKay RA, Suchard MA, Granade T, Teuwen DE, et al. 1970s and “Patient 0” HIV-1 genomes illuminate early HIV/AIDS history in North America. Nature. 2016 [cited 2016 Oct 26]. Available from: https://doi.org/10.1038/nature19827.
- 47. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17:10–2.
- 50. Ebler J, Haukness M, Pesout T, Marschall T, Paten B. Haplotype-aware genotyping from noisy long reads: bioRxiv; 2018. p. 293944. [cited 2018 Jul 13]. Available from: https://www.biorxiv.org/content/early/2018/04/03/293944.abstract
- 72. Grubaugh ND. Amplicon sequencing: Grubaugh Lab. [cited 2018 Dec 19]. Available from: http://grubaughlab.com/open-science/amplicon-sequencing/
- 73. Andersen KG. Protocols: Andersen Lab. [cited 2018 Dec 19]. Available from: https://andersen-lab.com/secrets/protocols/
- 74. Gangavarapu K, Andersen KG. iVar: Github. [cited 2018 Dec 19]. Available from: https://github.com/andersen-lab/ivar
- 77. Gangavarapu K, Grubaugh ND, Andersen KG. Additional files and data for iVar and PrimalSeq: Github. [cited 2018 Dec 19]. Available from: https://github.com/andersen-lab/paper_2018_primalseq-ivar
- 78. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM: arXiv; 2013. Available from: http://arxiv.org/abs/1303.3997
- 79. Loman NJ. Additional files and data for calling iSNV using MinION sequencing: Github. [cited 2018 Dec 19]. Available from: https://github.com/nickloman/zika-isnv
- 80. Grubaugh ND. BioProject: PRJNA438514: NCBI. [cited 2018 Dec 19]. Available from: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA438514
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.