Keywords

1 Introduction

Viruses are the most abundant biological entities on Earth and have high mutation rates, up to a million times higher than their hosts [11, 26]. Variations in viral genetic sequences lead to the emergence of new viral strains during evolution and are also known to be associated with many diseases [31]. One challenge in viral studies is to analyze a mixture of closely related viral strains, referred to as viral quasispecies. The problem of inferring individual viral strains from sequencing data of viral quasispecies is called strain-aware assembly or viral haplotype reconstruction. Viral haplotype reconstruction in individual patients provides a signature of genetic variability and thus informs us about disease susceptibilities and evolutionary patterns of distinct viral strains [10]. Sequencing data of viral quasispecies from next-generation sequencing (NGS) techniques are short and have a low error rate [33] (\(\le \)0.5%) while that from third-generation sequencing (TGS) techniques are long and have a high error rate [9] (1.6–2.7% for deletions, 1.2–2.2% for mismatches and 1.1–2.4% for insertions). Due to the low pairwise strain divergence in viral quasispecies, it is challenging to distinguish the sequencing error in TGS data and highly similar viral strains. Therefore, various approaches have been proposed to infer individual viral strains from NGS data and can mainly be classified into two categories [15], reference-based and de novo (or reference-free). Reference-based approaches (such as PredictHaplo [30] and NeurHap [36]) rely on the alignment between reads and references and thus suffer from the lack of high-quality references due to high mutation rates [8], and biased variant calling introduced by a selected reference [3, 34]. De novo approaches directly assemble viral strains from sequencing reads without references and have the potential to identify novel viral strains and provide deep insights for viral genetic novelty [31].

While de novo (meta)-genomic assemblers such as SPAdes-series [1, 5, 7, 24, 28] could be applied to assemble individual strains, they tend to produce fragmented contigs rather than complete viral strains, or collapsed contigs ignoring differences between strains, as they are not specifically designed to distinguish closely related viral strains. Specialized de novo assemblers such as SAVAGE [3], PEHaplo [8], viaDBG [12] and Haploflow [13] directly assemble reads into strains from viral quasispecies and have achieved promising results. More recently, VG-Flow [4] was proposed to extend pre-assembled contigs (produced by the specialized assembler SAVAGE [3]) into full-length viral strains using flow variation graphs and significantly outperformed other approaches on recovering viral strains from viral quasispecies. While VG-Flow [4] guarantees its runtime to be polynomial in the genome size, all the recovered viral strains must be selected from a set of candidate paths inferred by greedy path extraction strategies and thus some viral strains not covered by the candidate set are infeasible to be reconstructed.

Here we propose VStrains, a de novo approach for reconstructing strains from viral quasispecies. VStrains employs SPAdes [5] to build the assembly graph from paired-end reads and incorporates contigs and coverage information to iteratively extract distinct paths as reconstructed strains. We benchmark VStrains against multiple state-of-the-art de novo and reference-based approaches on both simulated and real datasets. Experimental results demonstrate that VStrains achieves the best overall performance on both simulated and real datasets under a comprehensive set of metrics such as genome fraction, duplication ratio, NGA50, error rate, etc. In particular, in more challenging real datasets, VStrains achieves remarkable improvements in recovering viral strains compare to other methods.

2 Methods

2.1 Preliminary

An assembly graph generated by SPAdes is a directed graph \(G=(V, E)\). Each vertex \(v \in V\) represents a double-stranded DNA segment where its forward and reverse strands are denoted by \(seq(v^+)\) and \(seq(v^-)\), respectively. A distinctive feature of assembly graphs built by SPAdes (with iterative k-mer sizes up to \(k_{max}\)) is that each (\(k_{max}\)+1)-mer appears in at most one vertex and thus can be used to uniquely identify the corresponding vertex in the assembly graph. Two vertices u and v can be connected by an edge \(e=(u^{o_{u}},v^{o_{v}}) \in E\), where \(o_{u}, o_{v} \in \{+,-\}\) denote the strandedness of u and v, respectively. Note that the suffix of \(u^{o_{u}}\) overlaps \(k_{max}\) positions with the prefix of \(v^{o_{v}}\) under the de Bruijn graph model [29] behind SPAdes. Moreover, the assembly graphs built by SPAdes also contain contigs information, i.e., a contig in G(VE) is defined as a path of vertices together with their strandedness information. The coverage of a vertex v is estimated by the number of reads containing the DNA segment corresponding to v and denoted by cov(v). The coverage of a contig c is estimated by the average coverage along all the vertices in c and denoted by cov(c).

Fig. 1.
figure 1

The framework of VStrains

2.2 Algorithm Overview

VStrains takes paired-end reads from viral quasispecies as the input and aims to recover individual viral strains. During pre-processing, VStrains first employs SPAdes to construct an assembly graph and contigs from paired-end reads, then canonizes the strandedness of vertices and edges, and further complements the assembly graph with additional linkage information from paired-end reads. After pre-processing, VStrains makes use of the contigs, paired-end links, and coverage information to perform branch splitting and non-branching path contraction to disentangle the assembly graph. Finally, VStrains outputs strain-specific sequences from the assembly graph via iterative contig-based path extraction. Refer to Fig. 1 for an overview of our algorithm. Details of each step are explained in the following sections.

2.3 Preprocessing

2.3.1 Canonize Strandedness

Recall that each vertex in the assembly graph produced by SPAdes represents both the forward and reverse strands of a DNA segment, and thus contigs reported by SPAdes may refer to two different strands of the same DNA segment. Therefore, there is no obvious correspondence between a viral strain and a directed path in the assembly graph. To unravel this correspondence, VStrains performs the strandedness canonization to choose the strandedness \(o_v \in \{+,-\}\) for each vertex \(v \in V\) where all adjacent edges only use \(v^{o_v}\). VStrains first chooses an arbitrary vertex \(s \in V\) as the starting vertex and fixes its strandedness to \(o_s\). Let \(\bar{o_s}\) denotes the opposite strandedness of \(o_s\), i.e., \(\bar{o_s} = \{+,-\} {\setminus } \{o_s\}\). VStrains flips its adjacent edges if necessary to ensure these edges only use \(s^{o_s}\), e.g., \((u^{o_u}, s^{\bar{o_s}})\) and \((s^{\bar{o_s}},u^{o_u})\) will be flipped into \((s^{o_s}, u^{\bar{o_u}})\) and \((u^{\bar{o_u}},s^{o_s})\), respectively. VStrains iteratively performs the above step until all the vertices have a fixed strandedness or one vertex has to use both strandedness. VStrains resolves the latter case by splitting this vertex into a pair of vertices, representing its forward and reverse strands, respectively. As a result, \(seq(v^{o_v})\) can be simplified to seq(v) where \(o_v\) is the chosen strandedness of v, and \((u^{o_u},v^{o_v}) \in E\) can be simplified to (uv), where \(o_u\) and \(o_v\) are the chosen strandedness of u and v, respectively. The vertices without any in-coming edges are defined as source vertices, whereas the vertices without any out-going edges are defined as sink vertices.

2.3.2 Inferring Paired-End Links

While SPAdes uses paired-end reads to construct contigs as paths in the assembly graph, VStrains uses unique k-mers of vertices in the assembly graph to establish mappings between paired-end reads and pairs of vertices and thus infers paired-end links.

VStrains uses minimap2 [20] to find exact matches of (\(k_{max}\)+1)-mers between pairs of vertices in the assembly graph produced by SPAdes (with iterative k-mer sizes up to \(k_{max}\)) and paired-end reads. Assume u and v are a pair of vertices in the assembly graph. A PE link is added between u and v if a paired-end read contains at least one (\(k_{max}\)+1)-mer in u and at least one (\(k_{max}\)+1)-mer in v. Note that the (\(k_{max}\)+1)-mer can uniquely identify the corresponding vertex and thus makes it extremely unlikely to produce false-positive PE links unless errors in reads coincide with rare variations between strains. Since (\(k_{max}\)+1)-mer is usually smaller than the read length, even erroneous paired-end reads may infer paired-end links as long as they still contain error-free (\(k_{max}\)+1)-mers.

Note that paired-end reads are commonly used to infer paired-end links between vertices in the assembly graph. For example, overlap-graph-based assemblers, such as SAVAGE [3] and PEHaplo [8], use pairwise alignments between paired-end reads to build overlap graphs [27], and thus face challenges to choose appropriate parameters (e.g., the overlap length cutoff along with allowed mismatches) to distinguish false-positive links between perfect reads from different strains and true positive links between erroneous reads from the same strain in overlap graphs. De Bruijn graph-based assemblers, such as SPAdes and viaDBG, split paired-end reads into k-bimers [5] or bilabel [23], pairs of k-mers of exact (or near exact) distances, one from the forward read and the other from the reverse read. Although this split strategy helps adjust k-bimers/bilabel distances [5, 23] and efficiently construct de Bruijn graphs, it does not make full use of all available k-mer pairs (with varying distances) between forward and reverse reads to further simplify their assembly graphs. To overcome this limitation, VStrains uses all available k-mer pairs without any distance constraints between forward and reverse reads to create PE links between vertices in the assembly graph, which unravels its potential to further simplify the assembly graph produced by SPAdes.

Fig. 2.
figure 2

Graph disentanglement of VStrains. (a) P is a balanced branching vertex. (b) P is split into P\(_1\) and P\(_2\) in a balanced split corresponding to O\(\leftrightarrow \)R (contig path) and N\(\leftrightarrow \)Q (PE link). (c) A trivial branching vertex L in (b) is split into L\(_1\) and L\(_2\) in a trivial split, and thus previously unbalanced branching vertex K becomes a balanced branching vertex. (d) K is split into K\(_1\), K\(_2\) and K\(_3\) in a balanced split, corresponding to three one-to-one mappings, I\(\leftrightarrow \)L\(_1\)NP\(_1\)Q (contig path), J\(\leftrightarrow \)L\(_2\)OP\(_2\)R (PE link), and H\(\leftrightarrow \)M (coverage compatible pair). D and G are not split during graph disentanglement.

2.4 Graph Disentanglement

After pre-processing, paired-end link information has been incorporated, and viral strains are expected to correspond to directed paths in the assembly graph. However, strains may share vertices and edges, and thus result in an entangled assembly graph. In this section, the graph disentanglement iteratively splits branching vertices and contracts non-branching paths as follows.

2.4.1 Branching Vertex Splitting

A vertex \(v \in V\) is called a branching vertex if either the in-degree or out-degree of v is greater than 1. A branching vertex is non-trivial if both its in-degree and out-degree are greater than 1, and trivial otherwise. For example, vertex L is a trivial branching vertex while vertices D, G, K, and P are non-trivial branching vertices in Fig. 2(a).

Without loss of generality, assume a trivial branching vertex v has multiple in-coming edges \( \{(u_i,v)\in E\ | i=1, \ldots , n\}\). In a trivial split, vertex v will be replaced by vertices \(\{v_1, \ldots , v_n\}\), and each in-coming edge \((u_i,v)\) will be replaced by \((u_i, v_i)\), respectively. The coverage of \(v_i\) and the capacity of \((u_i, v_i)\) are set to the capacity of \((u_i,v)\). If v has an out-going edge (vw), (vw) will be replaced by edges \(\{(v_i,w) | i=1, \ldots , n\}\) where the capacity of \((v_i,w)\) is set to the coverage of \(v_i\).

A non-trivial branching vertex v is called balanced if v has the same number of in-coming edges \( \{(u_i,v)\in E\ | i=1, \ldots , n\}\) and out-going edges \( \{(v,w_j)\in E\ | j=1, \ldots , n\}\). For example, vertex P is a balanced branching vertex in Fig. 2(a). Let \(U = \{u_i | i=1, \ldots , n\}\) and \(W = \{w_j | j=1, \ldots , n\}\). In a balanced split, the balanced branching vertex v will be replaced by vertices \(\{v_1, \ldots , v_n\}\), and an in-coming edge \((u_i,v)\) and out-going edge \((v, w_i)\) will be replaced by \((u_i,v_i)\) and \((v_i, w_i)\) if \(u_i\) and \(w_i\) are both contained in at least one contig or connected by a PE link. The above balanced split of v corresponds to a bijection between U and W, and most of these one-to-one mappings can be perfectly inferred by contigs and PE links between U and W. For example, a balanced split of P from Fig. 2(a) to (b) corresponds to two one-to-one mappings, O\(\leftrightarrow \)R and N\(\leftrightarrow \)Q, inferred by the contig and PE link information, respectively. In case U and W form a partial bijection using contigs and PE links information, VStrains further uses coverage information and aims to find more one-to-one mappings. For the \(u_i \in U\) and \(w_j \in W\) not in the current partial bijection, an one-to-one mapping is established between \(u_i\) and \(w_j\) if and only if \(u_i\) and \(w_j\) form a coverage compatible pair, i.e., \(u_i = \arg \underset{u}{\min }|cap(v,w_j) - cap(u,v)|\) and \(w_j = \arg \underset{w}{\min }|cap(v,w) - cap(u_i,v)|\). For example in Fig. 2(c) to (d), vertices H and M form a coverage compatible pair, together with two other one-to-one mappings I\(\leftrightarrow \)L\(_1\)NP\(_1\)Q (from contig path) and J\(\leftrightarrow \)L\(_2\)OP\(_2\)R (from PE link), lead to a balanced split on the balanced branching vertex K in Fig. 2(d).

Note that not all non-trivial branching vertices are balanced (e.g., vertex K is an unbalanced branching vertex in Fig. 2(a)). For such unbalanced vertex v, VStrains performs a trivial split on its adjacent trivial branching vertices and aims to convert v into a balanced vertex. For example, after performing a trivial split on vertex L in Fig. 2(b) to (c), vertex K now becomes a balanced branching vertex in Fig. 2(c) and becomes a candidate for a balanced split. VStrains performs a balanced split on v if the above bijection can be established by contigs, PE links, and coverage information.

2.4.2 Non-branching Path Contraction

The above branch split operation in the assembly graph creates non-branching paths, i.e., path \(p=(v_1,v_2, \ldots ,v_n), v_i \in V \) \(\forall i = 1, \ldots , n\), where the in-degree of \(v_2,...v_n\) and the out-degree of \(v_1,...v_{n-1}\) are all 1. Following the similar idea of graph simplification in SPAdes, VStrains contracts all the non-branching paths. For example, the above non-branching path p is contracted into one vertex \(v_p\), and each in-coming edge \((u,v_1)\) of \(v_1\) is replaced by \((u,v_p)\) and each out-going edge \((v_n,w)\) is replaced by \((v_p, w)\) with the same capacity, the coverage of \(v_p\) is set to be the average coverage of \(v_1,\ldots , v_n\). Moreover, \(v_p\) inherits all the PE links of vertices in non-branching path p. For example in Fig. 2(d), three non-branching paths in three different colors on the right are contracted, respectively.

Fig. 3.
figure 3

The contig-based path extraction of VStrains. (a) Assembly graph after graph disentanglement with three contigs: I\(\rightarrow \)K\(_1{\rightarrow }\)L\(_1{\rightarrow }\)N, O\(\rightarrow \)P\(_2{\rightarrow }\)R, K\(_3{\rightarrow }\)M. (b) The longest contig I\(\rightarrow \)K\(_1{\rightarrow }\)L\(_1{\rightarrow }\)N is firstly extended to I\(\rightarrow \)K\(_1 \ldots \rightarrow \)P\(_1{\rightarrow }\)Q (terminated at I due to the lack of coverage compatible pair with respect to 200x). Afterwards, the second longest contig O\(\rightarrow \)P\(_2{\rightarrow }\)R is extended to J\(\rightarrow \)K\(_2{\rightarrow }\ldots \rightarrow \)P\(_2{\rightarrow }\)R (terminated at J due to the lack of coverage compatible pair with respect to 300x). At last, the shortest contig K\(_3{\rightarrow }\)M is extended to C\(\rightarrow \)D\(_1{\rightarrow }\ldots \rightarrow \)K\(_3{\rightarrow }\)M (thanks to the coverage compatible pairs with respect to 400x) (c) Contigs J\(\rightarrow \ldots \rightarrow \)R and I\(\rightarrow \ldots \rightarrow \)Q are extended to full paths (thanks to the path extraction and coverage/topology update in (b)).

2.5 Contig-Based Path Extraction

While VStrains effectively disentangles the assembly graph through branching vertex splitting and non-branching path contraction in the above step, there still exist branching vertices (e.g., D and G in Fig. 2(d)) which may introduce ambiguity in distinguishing full paths that correspond to individual viral strains.

Recall SPAdes outputs contigs as paths of vertices on the assembly graph, which are usually sub-paths of viral strain induced paths on the assembly graph. The remaining problem is to extend these contigs (sub-paths) into corresponding viral strains (full paths) on the assembly graph. Note that a full path on the assembly graph starts from a source vertex and ends at a sink vertex or is a cyclic path (i.e., its first and the last vertex coincide). Ideally, the strain-specific (not shared by multiple viral strains) contigs should be extended and extracted first. The longer a contig is, the more likely that this contig is strain-specific.

Therefore, VStrains iteratively selects the longest contig and extends its corresponding sub-path on both ends. Without loss of generality, consider the right extension of the current contig \(C=(v_1,v_2,\ldots , v_n)\). If \(v_n\) has only one out-going edge (\(v_n, v_{n+1}\)), the current contig will be extended into \((v_1, \ldots , v_n, v_{n+1})\). If \(v_n\) has multiple out-going edges \(\{(v_n, v_{n+1}^1), \ldots , (v_n, v_{n+1}^m)\}\), we follow the same strategy in Sect. 2.4.1 to look for one-to-one mapping between \(v_{n-1}\) and \(\{v_{n+1}^1, \ldots , v_{n+1}^m\}\) using contigs, paired-end reads and coverage information. If \(v_{n-1}\) forms the one-to-one mapping with \(v_{n+1}^{'}\), the current contig will be extended into \((v_1, \ldots , v_n, v_{n+1}^{'})\), otherwise, the extension to the right terminates. If \(v_n\) is a sink vertex (without any out-going edges), the extension to the right terminates. If \(v_n\) is a visited vertex during extension on the other end, the extension to the right terminates and a cyclic path is obtained by combining both left and right extensions.

If the currently selected contig can be extended into a full path, VStrains will include this extended path as one of the output strains. Otherwise, VStrains will contract this extended path as a single vertex, and wait for confident extension in the future step. VStrains will also update the coverage information of the assembly graph. More specifically, VStrains estimates the path coverage as the median coverage of all the non-branching vertices along the path, and reduces the path coverage from all the traversed vertices. For example, VStrains extends the contig H\(\rightarrow \)K\(_3{\rightarrow }\)M into the full path C\(\rightarrow \)D\(_1{\rightarrow }\ldots \rightarrow \)K\(_3{\rightarrow }\)M (using coverage compatible pairs F\(\leftrightarrow \)H and C\(\leftrightarrow \)F described in Sect. 2.4.1) in Fig. 3(b).

By iteratively extending and extracting the contig from the assembly graph, VStrains obtains a set of distinct paths as the final viral strains. This iterative strategy contracts/extracts the most confident sub-paths/full-paths first, and updates topology and coverage information of the assembly graph on the fly, which in turn reveals more strain-specific paths in the updated graph and facilitates the subsequent path extractions. For example, after extracting the full path C\(\rightarrow \)D\(_1{\rightarrow }\ldots \rightarrow \)K\(_3{\rightarrow }\)M in Fig. 3(b), the (sub)-paths I\(\rightarrow \)K\(_1{\rightarrow }\ldots \rightarrow \)P\(_1{\rightarrow }\)Q and J\(\rightarrow \)K\(_2{\rightarrow }\ldots \rightarrow \)P\(_2{\rightarrow }\)R are able to further extend to the left (using coverage compatible pairs A\(\leftrightarrow \)I and B\(\leftrightarrow \)J) into two full paths A\(\rightarrow \)D\(_3{\rightarrow }\ldots \rightarrow \)P\(_1{\rightarrow }\)Q and B\(\rightarrow \)D\(_2{\rightarrow }\ldots \rightarrow \)P\(_2{\rightarrow }\)R.

Note that, unlike other greedy path finding strategies [4, 8, 13] deriving a set of candidate paths based on the original flow-variation graph, VStrain iteratively extracts the most confident path and updates the assembly graph on the fly, which results in more accurate reconstruction of viral strains. For example, VG-Flow employs three greedy strategies (maximum capacity, minimum capacity, shortest paths) to derive the set of candidate paths, from which the final output strains will be selected. However, such greedy strategies are directly applied on the original flow-variation graph, making it likely to find erroneous paths (e.g., the maximum-capacity path C\(\rightarrow \)D\(\rightarrow \)E\(\rightarrow \)G\(\rightarrow \)H\(\rightarrow \)K\(_3{\rightarrow }\)M and the minimum-capacity path A\(\rightarrow \)D\(\rightarrow \)F\(\rightarrow \)G\(\rightarrow \)I\(\rightarrow \ldots \rightarrow \)P\(_1{\rightarrow }\)Q in Fig. 3(a) are both erroneous paths).

3 Experimental Setup

3.1 Experimental Datasets

3.1.1 Simulated Datasets

To evaluate the performance and scalability of VStrains, we used three simulated viral quasispecies datasets from [3] consisting of 6 Poliovirus, 10 hepatitis C virus (HCV), and 15 Zika virus (ZIKV) mixed strains, respectively. These datasets were commonly used to benchmark strain-aware viral assemblers and simulated from known reference genomes using SimSeq [6] with the default error profile.

3.1.2 Real Datasets

Two real datasets with different coverages are obtained from the NCBI database. The first 5-HIV-labmix (20,000x) dataset [14] is a lab mix of 5 known real human immunodeficiency virus (HIV) strains (NCBI accession number SRR961514). The second 2-SARS-COV-2 (4,000x) dataset [37] is a mixture of 2 real severe acute respiratory syndrome coronavirus 2 (SARS-COV-2) strains (BA.1 and B.1.1). Two independently assembled strains, BA.1 (NCBI accession number SRR18009684) and B.1.1 (NCBI accession number SRR18009686), are used as the ground-truth for evaluation. Table 1 presents a summary of the simulated and real datasets.

Table 1. Quasispecies characteristics of the simulated and real benchmarking datasets.

3.2 Baselines and Evaluation Metrics

VStrains was benchmarked against the state-of-the-art methods on assembling viral strains including SPAdes [5], SAVAGE [3], VG-Flow [4], PEHaplo [8], viaDBG [12], Haploflow [13], and PredictHaplo [30]. Three recent reference-based machine-learning approaches GAEseq [17], CAECseq [16] and NeurHap [36] were not included for comparison because all of them have only been applied to a gene segment of viral strains in their experiments and failed to handle the above datasets on whole viral strains (i.e., memory usage exceeding 300 GB). Note that PredictHaplo [30] is a reference-based approach and needs an accurate reference as the input. Therefore, we randomly select a ground-truth viral strain and provide it to PredictHaplo [30] for each dataset. All above tools under evaluation use default settings unless specified otherwise. Refer to Section S3 in the Supplementary Materials for a detailed description of baselines.

Similar to previous studies, we employ MetaQUAST [25] to evaluate all assembly results of viral strains. As viral quasispecies contains highly similar strains, we run MetaQUAST with the “–unique-mapping” option to minimize ambiguous false positive mapping. For each assembly, we report the genome fraction, duplication ratio, NGA50, error rate, and number of contigs. Genome fraction is defined as the total number of aligned bases in the reference, divided by the genome size. A base in the reference genome is counted as aligned if there is at least one contig with at least one alignment to this base. Duplication ratio is defined as the total number of aligned bases in the assembly, divided by the total number of aligned bases in the reference. NG50 is the contig length such that using longer or equal length contigs produce half of the bases of the reference genome, whereas NGA50 counts the lengths of aligned blocks instead of contig lengths, such that the contig is broken into smaller pieces when it has a misassembly with respect to the reference genome. Error rate is defined as the sum of mismatch rate, indel rate, and N’s rate, which reflects the number of errors with respect to the reference genome size.

4 Experimental Results

In this section, we show the performance of VStrains and other baselines on both simulated and real datasets. In consistent with previous observations [4, 12], we found that SPAdes in general outperforms metaSPAdes [28] and other specialized versions (refer to Section S1 in the Supplementary Materials for detailed comparison among SPAdes-series assemblers) and thus is employed to build assembly graphs for VStrains.

4.1 Performance on Simulated Datasets

Table 2 summarizes the performance of de novo and reference-based approaches on reconstructing viral strains in three simulated datasets. Note that specialized strain-aware assemblers such as PEHaplo, viaDBG, and SAVAGE typically outperform the general-purpose assembler SPAdes, especially in genome fraction. One possible reason is that SPAdes is not designed to distinguish highly similar viral strains. When two or more strains share long and identical sequences, SPAdes may result in fragmented assemblies (i.e., low NGA50) and keep only one copy of such shared sequences (i.e., low genome fraction). VG-Flow uses assembled contigs from SAVAGE (VG-Flow+SAVAGE) and SPAdes (VG-Flow+SPAdes) to build flow-variation graphs and effectively improves their genome fraction and NGA50. While the greedy strategies in building candidate paths make VG-Flow tractable, VG-Flow may be forced to use more (incorrect) paths to cover all strains (i.e. high duplication ratio and error rate) when not all the ground-truth paths have been included in its selected candidate set. VStrains uses contigs, paired-end reads, and coverage information to extract strain-specific paths iteratively from assembly graphs built by SPAdes (VStrains+SPAdes) and achieves the best overall performance on a comprehensive set of metrics (including genome fraction, duplication ratio, NGA50, error rate and number of contigs). It is worth noting that VStrains is able to adapt the general-purpose assembler SPAdes to assemble highly similar strains and even outperform existing specialized strain-aware assemblers.

Table 2. Performance of de novo and reference-based approaches on reconstructing viral strains in simulated datasets

4.2 Performance on Real Datasets

Table 3 summarizes the performance of VStrains and other de novo and reference-based approaches on two real datasets, 5-HIV-labmix [14] and 2-SARS-COV-2 [37]. While viaDBG results in high genome fraction and low error rate, it comes at a cost of (extremely) high duplication ratio and an excessive number of contigs, making it infeasible to distinguish correct and incorrect contigs from its output. As a reference-based approach, PredictHaplo achieves a high genome fraction (99.22%) and high NGA50 (9604) for 5-HIV-labmix dataset because a ground-truth viral strain was provided as its input. However, in reality, accurate reference genomes are usually not available and species are unknown in viral quasispecies, it is impossible to provide good reference genomes for PredictHaplo.

Table 3. Performance of de novo and reference-based approaches on reconstructing viral strains in real datasets

Similar to the performance on the simulated datasets, SPAdes results in very fragmented assemblies with low genome fraction (49.11%) and NGA50 (614) for 5-HIV-labmix dataset. SAVAGE as a specialized strain-aware assembler produces almost double the genome fraction (87.34%) and NGA50 (1397) compare to SPAdes. While VG-Flow+SAVAGE increases the NGA50 to 7235, it decreases the genome fraction to 77.7%, doubles the duplication ratio (from 1.56 to 3.12), and significantly increase the error rate from 0.115% to 1.038%, which indicates its limitation on handling real datasets. On the other hand, VStrains+SPAdes significantly improves the overall performance on SPAdes, e.g., increase genome fraction from 49.11% to 86.42%, NGA50 from 614 to 7583, and decreases the error rate from 0.515% to 0.237%. Note that VG-Flow+SPAdes fails to achieve such an improvement as VG-Flow is mainly designed to couple with SAVAGE. However, SAVAGE typically assumes at least 10,000x total coverage of viral sequencing data (quote from its GitHub site), which limits the application of VG-Flow on datasets without ultra-high coverage (e.g., on the 2-SARS-COV-2 dataset with only 4,000x coverage).

5 Software and Resource Usage

All the experiments were run under the National Computational Infrastructure (NCI) Gadi supercomputer by submitting jobs to the Gadi biodev queue with the default job dependencies. The allocated RAM size was limited to 500 GB and CPU time was limited to 800 h. The peak memory (maximum resident set size) refers to the peak amount of memory throughout the program execution. Table 4 summarize the CPU time and peak memory for different approaches on all viral quasispecies benchmarks, respectively.

From Table 4, we observe that Haploflow is much more efficient than all other approaches in runtime and peak memory, but may not be a preferred tool due to its extremely high error rate and low genome fraction (as shown in Table 2 and 3). The general-purpose assembler SPAdes is more efficient than specialized strain-aware assemblers such as PEHaplo, Haploflow, viaDBG and SAVAGE in terms of runtime and peak memory. To reconstruct full-length viral strains from pre-assembled contigs, VG-Flow and VStrains cost comparable running time and memory usage. However, since VG-Flow employs SAVAGE to generate the pre-assembled contigs and the running time and memory usage of SAVAGE are almost the most expensive when compared to other tools, which results in high memory usage and running time of “SAVAGE+VG-Flow”. On the contrary, the running time and memory usage of “SPAdes+VStrains” are comparable with other specialized assemblers.

Table 4. CPU Time and peak memory usage of de novo and reference-based approaches on viral haplotype reconstruction

To test the scalability of VStrains, we further compared its runtime and peak memory usage to VG-Flow on simulation datasets with increasing genome size and a variable number of strains (Fig. 4). VG-Flow was selected to compare as it is the most state-of-the-art viral quasispecies assembly post-processing tool. The runtime and memory usage of VStrains do not demonstrate significant correlations with respect to the increase in the number of strains while the runtime of VG-Flow increases with the number of strains. When increasing the genome size, the runtime and memory usage of both VStrain and VG-Flow increased (Fig. 4(a) and (c)). It is worth noting that, when including the runtime and memory usage of the pre-assemblers (SAVAGE and SPAdes), SAVAGE+VG-Flow is much more sensitive to the genome size than VStrains+SPAdes (Fig. 4(b) and (d)). Thus, we concluded that VStrains+SPAdes is more efficient than VG-Flow+SAVAGE in the analysis of large-scale datasets. Taking into account the good performance of VStrains+SPAdes at Table 2 and Table 3, it is more cost-effective to employ VStrains as a postprocessing tool for SPAdes in terms of both performance and program efficiency compare to VG-Flow.

Fig. 4.
figure 4

CPU time and peak memory for VG-Flow, VStrains, VG-Flow+SAVAGE, and VStrains+SPAdes on simulated datasets consist of 2, 4, 6, 8 strains with increasing genome size (bp) (2500, 5000, 10.000, 20.000, 40.000, 100.000, 200.000). The x-axis is plotted on a logarithmic scale. The CPU time for VStrains+SPAdes (VG-Flow+SAVAGE) is the addition of VStrains (VG-Flow) and SPAdes (SAVAGE), and the peak memory for VStrains+SPAdes (VG-Flow+SAVAGE) is the maximum between VStrains (VG-Flow) and SPAdes (SAVAGE).

6 Conclusion and Discussion

In VStrains, we first introduce a strategy to canonize the strandedness of assembly graphs from SPAdes, which reduces the strain reconstruction problem into the path extraction problem. Secondly, we use all available k-mer pairs in paired-end reads to infer PE links in the assembly graph produced by SPAdes. Thirdly, we propose an effective way to incorporate PE links together with contigs and coverage information to disentangle the assembly graphs. Finally, we demonstrate how to extract confident strain-specific paths via iterative contig-based path extraction. Experimental results on both simulated and real datasets show that VStrains achieves the best overall performance among the state-of-the-art approaches.

Currently, VStrains relies on both assembly graphs and contigs from SPAdes and thus cannot couple with assemblers which does not explicitly output assembly graphs such as SAVAGE. The current implementation of VStrains requires additional alignments of paired-end reads to vertices in assembly graphs to infer PE links, which dominates the total runtime and peak memory usage. It is worth exploring how to make VStrains more flexible and efficient.

With the advance of third-generation sequencing (TGS), multiple approaches (including Strainline [22], Strainberry [35], VirStrain [21], viralFlye [2], etc.) have been proposed for strain-aware assembly using TGS data. VStrains has the potential to be extended to handle TGS data by taking advantages of assembly graphs built from Canu [19], Flye [18], wtdbg [32] and others.