Viral borne diseases exert a significant impact on human health with millions of individuals infected yearly and diseases such as HIV/AIDS ranking as a leading cause of death worldwide ( Critical to the design of effective vaccines and therapeutics to combat this burden is a comprehensive map of the genetic composition of viral populations and the characterization of the selective pressures that shape these populations. Compiling such a map is a challenge: in viral infections the low fidelity of the genome replication process and various evolutionary pressures can result in a single infected host harboring a heterogeneous population of genetic variants[1]. Previous studies[29] have shown that many viruses including Dengue, HCV, HIV, Influenza, Polio and West Nile all maintain diverse populations within a single host. This genetic diversity means that the population may already contain variants that are advantageous in the face of challenges such as host immune responses and drug treatments. As such, understanding the extent and composition of intra-host population diversity, even at low frequencies, can be very important in evaluating disease progression, transmission, and response to changes in therapy.

However, identifying intra-host variation depends on an alignment of the sequence data. In cases where the reads cover identical sequences, multiple alignment can be used to solve this problem[10], but this is currently not feasible for whole genome sequences as no platform exists to sequence whole genomes as single reads at high throughput. Alternatively, the sequencing data can be aligned to a reference, as is commonly done for variant detection in human and other organisms[11, 12]. However, the use of a reference genome that is too genetically distant from the sample population will yield inaccurate read alignments and, as previously reported, substantial data loss; both factors decrease the ability to detect biological variants[13, 14]. Because viral consensus can vary substantially between patients it may be difficult to find an existing reference that allows unbiased and complete alignment of reads[2, 15]. One solution to this problem is to start the analysis by de novo assembly of each patient sample, allowing use of the patient consensus as the reference for variant detection[2]. Since the sample consensus will be near the centroid of the intra-host variation, it should be optimal for alignment of all sequence data to a single reference.

Further, the consensus sequence itself is of value. Notably, the majority of publicly available genomic sequences were captured using bulk Sanger sequencing strategies and as such a single consensus assembly is the only data available to compare against to derive biological insights. A consensus serves as a single datum that represents the entire underlying population or some subset of the population and thereby enables the identification of dominant genetic mutations that vary between two populations or subsets of the same population[2, 15, 16]. Lastly, for samples infected with unknown viruses a reference guided mapping strategy will not be applicable and a de novo approach is required[2, 17, 18].

There are two major frameworks for de novo genome assembly. Overlap-layout-consensus based methods[1921] first identify reads that share good suffix-prefix alignment. This operation divides the input reads into disjoint sets, termed contigs. Then, multiple sequence alignment is performed for each contig to derive the consensus sequence, an approximation to a target genome fragment. The relative positions and orientations of these contigs are estimated by paired reads that land in different contigs. The de Bruijn graph based methods[13, 2227] first decompose input reads into k mers, denoted as vertices, then create a directed edge between any pair of vertices if the last (k−1) bases of the source vertex is identical to the first (k−1) bases of the target vertex. The graph is then simplified by chain compaction to shorten paths that have a unique entry and exit, and edited to remove small tips and bubbles that are likely attributed to sequencing errors. Finally, contigs are generated and oriented during graph traversal, guided by paired reads and coverage information.

Assembly algorithms devised specifically for the sequencing of less diverse haploid and diploid genomes by short reads tend to fare poorly on data derived from variant populations such as viruses, both because of the difficulty of separating continuous variation from error, and because of other process-related challenges[2]. In addition, sample preps of RNA viral genomes currently require reverse transcription and usually require amplification. These protocols tend to result in highly variable coverage along the length of the genome and may introduce large amounts of contamination from other RNA species present in the starting material. These artifacts of the process tend to further confuse existing assembly algorithms, which rely on depth of coverage to both indicate copy number of sequences and distinguish errors from true variants.

Recent work utilizing an overlap-layout-consensus strategy has shown that it is possible to generate high quality de novo assemblies from relatively deep coverage data produced by massively parallel pyrosequencing of diverse populations[2]. However, this method adapts poorly to other platforms such as the Illumina and Ion Torrent, due to a much larger amount of data produced, which increases both the computational complexity, and the difficulty of merging divergent genotypes and handling process error. An alternative strategy was recently reported by Iqbal et al.[13] These authors demonstrated de novo assembly of diploid genomes or a population consisting of a small number of eukaryotic genomes using a de Bruijn graph strategy, but the amount of diversity inherent in viral populations is beyond the target range for this algorithm.

Here we present VICUNA, a de novo assembler of very high but variable coverage short read data from a population of diverse but non-repetitive genomes. VICUNA is an overlap-layout-consensus based assembly algorithm. Unlike assemblers optimized for large repetitive genomes, VICUNA aggressively merges similar sequences, and has the capacity to retain low frequency single nucleotide and length polymorphisms. We validated VICUNA on 12 viral population samples. These 12 samples were obtained from patients infected with Dengue Virus (DENV), Human Immunodeficiency Virus (HIV) and West Nile Virus (WNV), which represent a spectrum of intra-host population variation. VICUNA recovers the full target regions with high fidelity in all samples. The algorithm captures low frequency non-dominant length variants, specified by a tunable threshold. In a handful of these samples, we recovered alternate consensus containing large deletions (≥ 500bp). VICUNA runs on workstations or blades, and hence is readily accessible to research labs with limited compute resources. Although our immediate target application is the sequencing of RNA viruses from infected hosts, we anticipate that our method may also have utility for a range of applications which pose similar challenges, including rRNA sequencing and whole genome metagenomic analyses.

Results and discussion

VICUNA assembly strategy

We assume a typical viral population consists of an unknown number of genomic variants that vary in their frequencies. A subset of these variants dominates the population and minor variants also exist. We assume that the phylogenetic relatedness within variant clusters is greater than that between clusters. Notably, the premise of a heterogeneous population comprised of closely and distantly related groups of sequences occurring in varied frequencies is applicable to other biological investigations such as metagenomics. The goal of generating a set of consensus sequences that represents a viral population is confounded by high genomic mutation rates, which results in frequent single nucleotide and length polymorphisms between genomes. This paralyzes the key error correction component of de Bruijn graph based assemblers. VICUNA applies the basic rule that a contig is produced to represent a spectrum of genomic fragments such that their pairwise genetic distances, defined as one hundred minus the percentage of sequence alignment identity, is bounded by p, a user specified parameter. When p is larger than the pairwise distance between dominant variants in the population, a subset of contigs would represent these while other contigs would capture the more distant ones. With this idea in mind, VICUNA outputs a set of contigs that are represented as a multiple sequence alignment (MSA) of their constituent reads. Each MSA contains substitutions and indels, which enable the capture of polymorphism in the data. Meanwhile, a single consensus sequence can be readily calculated by taking the dominant nucleotide base in each alignment column with ties broken arbitrarily. Notably, the MSAs are constructed incrementally starting from single reads to the final contigs rather than aligning reads back to the consensus post-assembly. When the viral target is known, as is the case in the examples provided in this study, a final contig representing the population mean can be built by merging contigs spanning the target region, selecting the ones that have the most read support. Nonetheless, when the target is unknown such as a microbiome sample or in cases of unknown infections, we can recover a set of contigs that represents the spectrum of the genomes in the population. This enables the identification of samples from patients co-infected with multiple viruses.

VICUNA consists of four key steps: read trimming, contig construction & clustering, contig validation, and contig extension & merging (Figure1). In addition, the algorithm includes two optional modules (see Additional file1), one to identify target and non-target reads prior to assembly, and a second to guide contig merging using a relevant reference genome if one is available.

Figure 1
figure 1

Schematic of the VICUNA assembly algorithm.

Read trimming

Primer sequences or adapters may still be attached to the fragments pre-sequencing. End of reads that match any substring of primers are removed. In this step, we also remove low complexity reads that likely result from process artifacts[28].

Contig construction & clustering

Identifying similarities among input reads is the ab initio step in building contigs. To avoid the compute-expensive pairwise alignment, existing methods typically identify common k mers (de Bruijn graph based methods) or maximal substrings (overlap-layout-consensus based methods) among input reads, achieved using efficient data structures such as de Bruijn graphs[24, 25], suffix tree/array[29], and Burrows-Wheeler transformation[30]. All use exact string matching strategy as the first pass to identify potential reads with good similarities.

To build contigs at the population level, we need to account for scenarios where diversity varies considerably from one genomic region to another, and the abundance of sequencing errors differs considerably due to highly uneven read coverage. These characteristics of the data pose the following challenges for existing methods. Exact string matching strategies may have difficulty balancing specificity vs. sensitivity in identifying similar reads: whereas a shorter k mer may considerably reduce specificity, sensitivity can suffer with a longer k mer. Also, the degree of vertices in de Bruijn graph increases considerably in both highly diversified and highly covered genomic regions, increasing the required compute resources.

To address these challenges, we adopt a min hash and an inexact string matching based similarity searching methods. The min hash is space and run-time efficient and was originally applied to cluster billions of web documents[31], and it was recently applied to DNA sequences[32]. VICUNA decomposes each read into k mers, each hashed uniformly to a large integral space (e.g. 264). The min hash value is used to represent each read. Based on the proof[31] that the similarity between any two reads, defined as the Jaccard Index, is equivalent to the probability that they have the same min hash, similar reads are held to form the same contig via common min hashes without resorting to pairwise comparison. This strategy is particularly effective in reducing read redundancy in the data. In practice, we used a two-phased min hash strategy to increase specificity (see Methods). However, this method will miss many overlaps that fail to have a common min hash, so we combine it with a seed-based approach. Using spaced seeds is proven to be more sensitive in similarity detection while maintaining the same specificity compared to searching by exact strings[33]. To identify good suffix-prefix overlaps among reads, we utilize a sliding window approach, where the window size is w. We then compute the subsequence of k (< w) selected bases, termed spaced-k mers, in each window. Sequencing errors or nucleotide polymorphism-incurred differences are neglected if they appear in the remaining (wk) bases. Finally, reads that share any common spaced-k mer are held in the same contig. Unlike previous methods that store the entire spaced k-spectrum, we developed a strategy that keeps only a small fraction of it in the memory. Relying on the observation that a read may concurrently appear in multiple contigs, we devised an efficient partitioning algorithm to merge such contigs (see Methods).

Contig validation

Given p, the rate of variation, we would like to capture in the population, define a contig to be consist of a set of reads where each read r differs from the consensus of the contig by at most p×|r| bases where |r| denotes the read length. Abiding by this definition, we validate each contig by iteratively aligning each constituent read to the consensus, discard those that do not satisfy the criteria and update the contig consensus base calls accordingly (see Methods). This novel strategy effectively separates mis-placed reads from each contig generated in the previous step, and creates new contigs when applicable.

Contig extension & merging

VICUNA identifies contigs that share good suffix-prefix overlaps and iteratively merges them to form longer contigs until no further extension can be made. At each iteration, VICUNA starts with the longest contig C. The contig Cʹ that shares the maximum number of paired reads with C is examined to identify a good suffix-prefix alignment. However, such an alignment may not be discovered due to polymorphisms captured in either contig. We resolve this issue by editing each contig before alignment. Specifically we (i) convert any consensus base that has nucleotide polymorphisms to the smallest alphabetic base, and (ii) remove non-dominant low frequency insertions. These polymorphisms will be recovered if C and Cʹ were to be merged. The contig alignment is carried out by first computing a sequence of non-overlapping common substrings between the consensuses of C and Cʹ. We then use the Needleman-Wunsch algorithm for alignment and require the common substring blocks be directly aligned. To account for length polymorphisms that maintain the coding frame and hence represent true biological variants, any small gap comprising a multiple of three nucleotides is not penalized. Once all de novo merges have been completed, contigs can be further merged if mappings to a previously assembled reference genome can validate the merge.

Identification of target and non-target reads

Viral samples derived from clinical specimens typically contain non-target nucleic acids (e.g. host or other microbe DNAs and RNAs), which can comprise a large portion of the sequencing output[28]. These “contaminants” can be inherent in the sample or be artifacts that result from a sample prep method. Non-target reads increase assembly cost, memory and run-time, and can also negatively impact the quality of the assembly of the target genome. To overcome this issue, we developed an optional component (see Additional file1) in VICUNA that identifies and flags such contaminant data and removes it from the assembly process. The optional module utilizes a multiple sequence alignment (MSA) comprised of available reference genomes for the target. The high degree of polymorphism in some populations compounded by errors introduced via the sample preparation process can impact the ability to identify target reads. As previously reported this situation can result in a failed alignment as a target read r looks too dissimilar to any known target genomic sequence[13]. To address this issue, we employ a profiling method that divides the MSA into multiple bins, and compares k mer compositions of r to each bin. If a sufficient number of similar (i.e. within a limited Hamming distance[34]) k mers spanning a sufficient length of r can be identified, r is considered to be target-alike. Moreover, we use a less stringent similarity threshold to place paired reads which could not be placed otherwise but obey the distance constraint imposed by the paired read library.

Sequence data generation & analysis

Using previously described protocols, we amplified four large overlapping amplicons that captured the complete coding regions for DENV[35], HIV[2], and WNV[36]. Amplified products from each sample were acoustically sheared (Covaris, Woburn, MA). Indexed Illumina libraries were prepared as described[37] except that PFU Ultra II enzyme (Agilent, Santa Clara, CA) was used and 7 cycles of PCR enrichment were employed. We pooled all the samples, gel purified them (target insert size: 500-700bp), and generated 225bp paired reads using the Illumina MiSeq platform (see Methods).

The number of Illumina reads generated for each sample ranged from 0.13M to 0.95M (Additional file1: Table S1). To evaluate the general characteristics and coverage profile of these data prior to assembly we aligned the reads to a standard reference genome for each virus (see Methods). In all cases ≥ 87.8% of reads were uniquely aligned to the chosen reference. As expected, alignments to the standard references were suboptimal as compared to de novo assemblies (Table1 and Additional file1: Table S1), with the WNV reference better representing the sample population than the DENV and HIV references. The WNV samples in general had a larger percentage of reads that could be uniquely aligned and a considerably smaller rate of discrepancy between the bases in the reference and the ones derived from the read alignment (Additional file1: Table S1). The average coverage across all the samples ranged from 3,210 to 15,456 and in all cases the region of the genome targeted for sequencing was fully covered. Coverage varied across the genome within each sample; sequence coverage within an individual sample varied between the minimum and maximum by 7 - 620 fold (Figure2 and Additional file1: Figure S1). In general, WNV and DENV samples had greater uniformity of coverage than HIV samples, likely a result of either the greater diversity in the HIV samples or the high degree of secondary structure in the HIV genome[38].

Table 1 Assembly Results for VICUNA , SOAPdenovo and AV 454
Figure 2
figure 2

Coverage plot. Fold sequence coverage across the target regions of one representative sample for DENV, WNV, and HIV full-length genomes. Coverage is measured as the total number of reads uniquely aligning over a given residue; alignments are to standard references (see Methods).

We evaluate assembly results using the following major metrics[2]: 1) the percentage of the target region captured (see Methods), 2) the percentage of reads captured by the consensus, an indicator of how closely the consensus captures the population, and 3) the non-dominant base call rate, defined as the percentage of consensus bases that do not represent the majority base at a given position in the read data set. In addition, we report the number of contigs that were used for reference guided merging, which indicates how “fragmented” the assembly results would be if the patient was infected with unknown viruses.

De novo assembly of viral samples with less genetically diverse populations – DENV & WNV

Clinical samples derived from WNV and DENV infected hosts are typically less genetically heterogeneous compared to samples from a host harboring a chronic infection such as HIV; they are acute diseases and either provide limited time for viruses to develop mutations or undergo very rapid selection, quickly removing a majority of early variants before samples are acquired. Consensus generation in these samples is similar to haploid genome assembly, and existing assemblers are expected to work reasonably well since variants can be treated as sequencing errors and removed from the data. Nonetheless, even high quality haploid genome assemblers like SOAPdenovo work poorly in dealing with such data (Table1).

VICUNA fully recovered the target region in all samples, and in 7 out of 8 samples the target region is captured by a single contig (Table1) prior to guided contig merging using a reference. For WNV samples, a comparable number of reads aligned to the VICUNA consensus and the standard reference (Table1 and Additional file1: Table S1). However, VICUNA called the correct dominant consensus base in all cases. The VICUNA consensus also better represented the DENV samples; we observed an average 0.005% non-dominant call rate, and up to 2.7% more reads aligned as compared to the standard reference (Table1 and Additional file1: Table S1).

De novo assembly of viral samples with more genetically diverse populations – HIV

HIV/AIDS is a chronic disease. As such, samples obtained from HIV infected hosts typically harbor greater genetic diversity compared to samples from acute diseases as the virus has had sufficient time to develop mutations. Given this greater genetic heterogeneity, existing short read assemblers perform poorly[2], and the use of reference genomes not derived from the sample population itself often results in a poor backbone on which to map variants.

VICUNA successfully assembled the full target region of genetically diverse HIV clinical samples, and captured the genome in one or two contigs (Table1) prior to guided contig merging using a reference. Compared to the standard reference genome, on average over 4% more reads can be aligned to VICUNA consensus; we observed an average 0.043% non-dominant call rate (Table1 and Additional file1: Table S1).

Comparing VICUNA assembly results to other assemblers

We compare the performance of VICUNA with existing de novo genome assemblers SOAPdenovo[24] and AV 454[2]; VICUNA outperformed both. We report the result of SOAPdenovo, which is a representative for de Bruijn graph based short read assemblers and its result is comparable to several of other assemblers of the same kind, such as MetaVelvet[39], used for metagenomic data sets, and Velvet Columbus[22], used for reference assisted assembly. AV 454 is selected as a representative for overlap-layout-consensus based assemblers, as it was designed specifically for viral population consensus assembly. The AV 454 algorithm was originally designed for 454 read data sets, and for application here we optimized it to assemble paired Illumina reads (see Additional file1). Because AV 454’s performance deteriorates as the coverage increases beyond a certain threshold, we down-sampled each dataset by randomly selecting about 5,500 read pairs for each sample. Cortex[13] was recently developed for assembling diploid genomes or a mixed population using color-coded de Bruijn graphs. However, it appears not to be suitable for viral population assembly as when applied to current samples, it generated an excessive number of contigs, which cannot be reasonably handled.

SOAPdenovo generated a large number of contigs, but only a small percentage of them exceed the minimum length cutoff of 350bp (Table1). In total, these contigs account for less than 5% of the target region in each sample. Since the algorithm is not able to utilize sufficient reads to form contigs, the dominant base call rate was < 85%. SOAPdenovo’s performance was not markedly better at assembling acute versus chronic viruses.

AV 454 outperformed SOAPdenovo for all viruses, but was inferior to VICUNA. VICUNA has the capacity to scale to tens of thousands fold coverage, whereas, AV 454 is limited to a few hundred fold coverage. This may significantly affect the performance of AV 454 in clinical samples rich in contamination (e.g. > 99% contaminants), where the down-sampling required by AV 454 would result in insufficient coverage of target genome[28]. In all samples, VICUNA has a better accuracy in producing intact genes that have no frame shift (Table1). For WNV samples, the dominant base accuracy was similar, but on average 0.5% fewer reads align to AV 454 consensus than to VICUNA consensus (Table1). For DENV samples V4816 and V4820, the VICUNA consensus was significantly more accurate in dominant base calls (p-value < 2 × 10−5; χ2 test), and approximately 1-2% more reads aligns to it as compared to the AV 454 consensus. For HIV samples, where the genetic total diversity is higher, the advantage of VICUNA is more pronounced. Up to 6% more reads can be aligned to VICUNA consensus compared to AV 454 consensus, and it is over 100 times more accurate (p-value < 1 × 10−8; χ2 test) in dominant base calling.

VICUNA is able to capture low frequency length polymorphisms and large deletions in viral samples

Other than nucleotide polymorphisms, length polymorphisms are frequently observed in viral samples, exemplified by a small number of nucleotide insertions at certain genomic loci in a subset (typically a small fraction) of viral genomes in the population. With the pressure to preserve coding frames, these insertions typically consist of a multiple of three bases. VICUNA is capable of capturing non-dominant insertions that occur at the frequency of at least p, a user specified parameter, when compared to the dominant variant at the same loci. In viral samples, we are interested in p ≤ 5%, which are typically treated as sequencing errors in haploid genome assembly. When p=5% , we observed in WNV, DENV, and HIV samples, 0.003%, 0%, and 0.082% low frequency insertions, respectively, which fits the expectation that HIV genomes have a larger amount of length polymorphisms compared to the other two viruses.

Large deletions have been observed previously in viral population studies (e.g.[36, 40]). Since there is no clear definition on the minimum size, we recorded all deletions ≥ 500bp in length (Additional file1: Table S2). We observed large deletions in two DENV and all four HIV samples and, in some of them, we found multiple types of variants. The majority of these deletions occur in the Env and Pol genes in HIV and in the NS3 gene in DENV. These large deletions may be real biological mutants or they could be amplification or sequencing artifacts (e.g. chimeras). To eliminate the possibility that our observations are due to Illumina specific artifacts, we validated the observed deletions using 454 read data sets from the same samples. Almost all large deletions in HIV samples are observed with the exact breakpoints in the 454 data (Additional file1: Table S2), while none of the DENV deletions were observed. These results suggest that de novo assembly of Illumina reads via VICUNA provides the ability to observe novel large deletion events.

VICUNA is applicable to other data types

Many existing sequence data sets that characterize viral populations were generated using the Roche 454 technology. To demonstrate the applicability of VICUNA to such data, we used VICUNA to generate a consensus for WNV, DENV, and HIV clinical samples sequenced by 454. In all cases, the full target region is recovered with high fidelity (Additional file1: Table S3). VICUNA achieves a 100% dominant call rate. Since reads generated by the Life Sciences Ion Torrent technology share similar error properties to 454 reads, we anticipate that VICUNA will be suitable for assembly of these data as well.


We presented VICUNA, a program for de novo consensus assembly of genetically heterogeneous viral populations. For each read data, VICUNA outputs consensus sequence(s) and for each consensus sequence, the multiple sequence alignment of its constituent reads. We have demonstrated that VICUNA recovers consensus genomes with high fidelity for viral intra-host samples obtained from patients infected with either acute or chronic diseases and that it outperforms other available assemblers in both the accuracy and continuity of the consensus genome. The ability to assemble de novo accurate and contiguous consensus genomes from ultra-deep short read sequence data derived from the Illumina platform or other comparable technologies provides the ability to capture a higher resolution map of viral genetic variants in an infected host with greater cost efficiency than previously possible. The availability of a consensus derived from the sample itself provides greater accuracy with respect to aligning reads prior to variant detection and as such VICUNA will enable improved detection of low frequency variants and subsequent haplotype identification in genetically heterogeneous populations. While VICUNA was developed for the analysis of viral populations its application to other heterogeneous sequence data sets such as metagenomic or tumor cell population samples may prove beneficial in these fields of research.


Illumina MiSeq platform configuration

To increase the read length on the Illumina MiSeq platform, the configuration files were modified to increase signal intensity as the read progresses; this helps to ensure fidelity of clusters. Specifically, Exposure was set to 342 for bases A/C and 165 for bases T/G. Ramp values were increased to 4.8 for bases A/C and to 2.3 for bases T/G. LEDSnapCurrentMA value was changed to 1500.

Read alignment to reference genomes

The reference genomes chosen for aligning the input reads are NC001475 (NCBI accession number) for Dengue, NC009942 for West Nile, and a modified version of the HXB2 reference K03455 for HIV. The HIV reference genome is a subsequence of HXB2 (from 779bp to 9551bp) and is corrected for a frame-shift and a premature stop codon (available upon request). The target regions for analyses consist of the complete CDS of each reference genome, which corresponds to positions 95-10267 for Dengue, 97-10398 for West Nile and 12-8638 for HIV.

Read alignment to reference genome or assemblies are carried out using Mosaik v1.1.0013 (, with parameter setting “st = illumina, hs = 10, act = 15, bw = 29, mmp = 0.25 and minp = 0.25”.

General definitions, notations, and techniques

Given two set of integers S1 and S2, the similarity measure Jaccard Index between them is given by | S 1 S 2 | | S 1 S 2 | . The k-spectrum (all possible k mers) of a DNA sequence r, including both forward and reverse complementary strands, is denoted as rk. Let hd(r1, r2) denote the Hamming distance between DNA sequences r1 and r2(|r1 |=| r2 |), then the d-neighbors of r1 in S is given by{ r |(hd( r 1 , r )d)(|r|=| r |)( r S)}. The substring from position i to j of r is denoted as r[i,j].

We index each read by a 2-tuple 〈 id(file),offset 〉, where the offset is the start position of the read in the file. A read can be retrieved from hard drive using its index. Reads are loaded into memory from hard drive in batches, where the batch size is a user specified parameter.

Define a function id(x) to return a unique identifier of x, let it be a file, a read, or a contig. We represent each contig C in the form of a multiple sequence alignment of its constituent reads. The alignment encoding of r i is given by a 5-tuple a i = 〈 id(r i ), |r i |,d i ,I i ,g 〉, where d i is the start position of r i on C, I i indicates if r i belongs to the forward (I i = 1) or reverse complementary (I i = 0) strand of C, and g specifies insertions in r i . Each insertion is denoted as a 2-tuple, specifying its start position on r i and the size of insertion. Using this representation, the dominant base in each alignment column is considered as the consensus base. Further, we define the layout of C to be an ordering of reads (r1, r2, …,r i , …) such that d i di + 1. A low frequency non-dominant fragment of C is a subsequence of C with significantly lower coverage (a pre-specified value) compared to flanking regions. A polymorphic site in C is a position where more than one valid type of nucleotides occur in the alignment.

VICUNA algorithm

Contig construction

Initially, two min hash values are generated for each read, one for the forward, and the other for the reverse complementary strand. The min hash of the forward strand of read r is calculated as follows. First, compute the hash value of each k mer in the forward strand. After sorting, we obtain an integer sequence (h1, h2, …, h|r| − k + 1), where h i hi + 1. In this order, h i s are converted to string representation and concatenated to form a new string S. Apply the same procedures to S as to the forward strand of r: decompose S to k-substrings, hash, then sort. The min hash is the first integer in the sorted sequence. The same strategy is applied to the reverse complementary strand of r. This way, two hash values (integers) are generated to represent every read in the input data. Two reads sharing any min hash value have a high probability to be near identical, albeit a small edit distance. Clustering reads based on common min hash values avoids expensive pairwise comparison. Next, we transform each read cluster to a contig by first identifying a k mer that is shared among all reads, then generate alignment of these reads by directly matching this k mer. Min hash technique may not necessarily lead to the discovery of all good suffix-prefix overlaps among reads. We further employ spaced seed based similarity search strategy to initialize new contigs or add homologous reads to existing contigs. The concept of a k mer can be generalized to denote a concatenation of k selected bases from a substring with length w for wk. In the following, we require w > k. r i k is generated by sliding a window of length w across r i and its reverse complementary strand, and k fixed positions are selected and concatenated to form a k mer. The template that specifies which positions to be selected is termed a spaced seed. We use common k mers as the seeds to identify good read similarity. Rather than recording all possible k mers present in the dataset, which is memory demanding when k is large, we maintain a database D, initialized to be empty. An entry in D is a 2-tuple 〈 x,(id(r),p) 〉, where the key x denotes a k mer and the value is a pair with p denoting the start position of x on r. D is populated by including r i k if none of the k mers in r i k is already included in D. If there exists a read rʹ that shares a common k mer with r i , then we add r i to contig Cʹ where rʹ resides. Since the k mer positions in both r i and rʹ are known as well as the relative position of rʹ in Cʹ, the addition of r i to Cʹ takes constant time.

The above strategies are outlined in Additional file1: Algorithm 1 and a simplified example is given in Additional file1: Figure S2.

Contig clustering

Thus far, a read can occur in multiple contigs, which are to be merged. In the context of graph theory, let each contig be a vertex and two vertices are connected by an edge if they share some common read. The goal is to then identify connected components of this graph, each representing a new contig after merging. Union-find algorithm is typically used for such a purpose. When applied to current application, we can obtain a solution in O (n2) time, where n is the number of input reads and assuming to merge two contigs takes constant time. As in the worst scenario, the total number of vertices in the graph is O(n2). Instead, we devise a graph partitioning strategy, presented in Additional file1: Algorithm 2, which reduces run time to O(n log n). Initially, contigs are sorted in a decreasing order with respect to the number of reads they contain. The graph is iteratively partitioned to be non-overlapping stars. The center of each star is chosen to be the contig that was not yet contained by other stars but has the largest number of reads, and all its neighbors become leaves of the star. After partitioning, all contigs in each star are merged to form a new contig. These newly formed contigs will participate in the next round of merging until the partition of contigs also corresponds to a partition of input reads. The number of iterations for graph partitioning is bounded by O(log n) since each iteration at least halves the number of vertices. Given a star, each leaf contig shares a common read with the center contig. During the merging, the common read guides a direct alignment of different contigs, which takes linear time in the total number of reads involved in the star. At a given iteration, since no stars share common reads, the time complexity to generate merged contigs for all stars is O(n). Therefore, the overall clustering step takes O(n log n) time.

Contig validation

Reads sampled from different genomic locations may be misplaced in the same contig either due to hash collisions (contig construction stage) that enable dissimilar reads to share the same min hash value; or because of the graph partitioning strategy (contig clustering stage) employed to merge contigs via common reads. We set the following two goals for validating a contig: 1) Put an explicit distance constraint ma x d between the consensus of each contig and each of its constituent read. ma x d corresponds to an upper bound of the diversity (or polymorphisms) we would like to capture in the sample. 2) Split potential chimeric contigs. As illustrated in Additional file1: Figure S3, a chimeric contig may be formed either due to the presence of short homologous regions on the reference genome (a), or due to chimeric reads resulted as sequencing artifact (b).

Contig validation procedure is presented in Additional file1: Algorithm 3. First, for each contig C, a consensus is generated and is compared against every constituent read r. r is retained only if the edit distance between r and the consensus is within d. Otherwise, it is set aside in C rem for further consideration. The consensus of C is updated whenever a read is removed, which may result in a re-comparison between the updated consensus with each of the constituent read. Updates stop when no changes were made. The same procedure is applied to C rem . To achieve the second goal, the two types of chimeric contigs are resolved differently. The first type is resolved by requiring adjacent reads in the layout of the contig to have a minimum overlap of mi n ol . For the second type, we use the observation that chimeric reads typically occur less frequently compared to the number of non-chimeric reads sampled from the two corresponding genomic regions. This corresponds to observing a much smaller number of reads colored both red and blue than the number of reads colored red or blue alone (Additional file1: Figure S3 (b)). Formally, for each read r, consider only the reads that have sufficient overlap (≥ min ol ) with r in the layout of the contig, and let n b and n a be the number of reads before and after r, respectively. Break the contig after any read, where n b n a (ma x rt ) is a local maximal. The input parameter ma x rt specifies the ratio in coverage change that is likely due to a chimeric formation. Although the attempt to break chimeric contigs may also lead to the split of valid contigs, they will be considered later.

Contig extension

Validated contigs are extended to form longer ones when good suffix-prefix overlaps can be identified among them (Additional file1: Algorithm 4). These contigs are sorted in the order of decreasing length and considered in that order. For the current longest contig C l , all its neighboring contigs, defined as ones that are linked to C l by paired reads, are identified and stored in N. Then, N is sorted in an increasing order of the number of links shared with C l . Define a delegate of contig C to be a sequence derived from the consensus of C by 1) removing low frequency non-dominant fragments, and 2) converting the consensus base in each polymorphic site to be the alphabetically smallest one. We then compute the delegates for C l and each contig in N. The delegate of C l is compared with each delegate of contig in N to identify good suffix-prefix overlap (Additional file1: Figure S4 and Additional file1: Algorithm 5): Given two sequences s0 and s1, first, identify a sequence of non-overlapping common substrings between them relying on common k mers, stored in a hash table. The coordinates of all substrings so identified are stored in a vector V. Needleman-Wunsch algorithm is then applied to align s0 and s1 by forcing the common substring blocks to be directly aligned. To capture length polymorphisms, a gap in the alignment is not penalized if its length is less or equal than a user specified threshold and is divisible by three (to avoid penalizing length polymorphisms that preserve coding frame). A valid alignment satisfies the following criteria: 1) the similarity between aligned region is ≥ mi n s , 2) the length of the aligned region is ≥ mi n ol , and 3) the overhang of the alignment is ≤ ma x oh . The first two parameters constrain the quality of alignment while the third parameter loosens the constraint to account for insufficient trimming or bad quality bases in the end of contigs, which are likely sequencing errors. Note that there may exist multiple valid alignments between two sequences, where we take the first one that is found to be valid.

When a valid alignment exists between C l and a contig C r N, they are merged to form a longer contig. Since a contig is represented as the MSA of its constituent reads (recall from section “General definitions, notations, and techniques”), the merged contig consists of all reads belonging to C l and C r . The start position of each read in the new contig is adjusted to its distance with respect to the beginning of the resulting alignment; any indel that was introduced in the consensus of C l or C r as the result of alignment was included in the corresponding reads that overlap with the indel. The time complexity of merging two contigs post-alignment, i.e. to merge two existing MSAs, is linear to the number of reads involved.

Previously assembled viral genomes of the same type provide valuable information to further improve the assembly if we have not yet generated any single consensus that represent the full length genome. By aligning contigs to the selected reference, we can further merge any pair of them if they overlap on the reference.

Authors contributions

XY & MCZ conceived and designed the algorithm; XY implemented VICUNA; PC & SG provided supporting code; NJL, JZL, JQ & EMR optimized laboratory protocols for Illumina sequencing; XY, PC, SG, MCZ & MRH analyzed data; XY, PC, MCZ & MRH wrote the manuscript with input from all authors; MCZ and MRH contributed equally to this paper. All authors read and approved the final manuscript.