Background

To cope with the massive sequence data generated by next-generation sequencing machines, highly scalable and efficient parallel solutions for fundamental bioinformatic applications are important [1, 2]. With the help of high performance computing, cloud computing [3, 4], and many-core GPUs [5], successful scalable examples have been seen in many embarrassingly parallel applications: sequence alignment [6-8], SNP searching [9, 10], expression analysis [11], etc. However, for tightly coupled graph-related problems, such as genome assembly, a scalable solution is still a big challenge [12, 13].

State-of-the-art attempts at parallel assemblers include ABySS [14], Ray [15], PASHA [16], and YAGA [17-19]. ABySS adopts the traditional de Bruijn graph data structure proposed by Pevzner et al. [20] and follows an assembly strategy similar to EULER SR [21] and Velvet [22]. Parallelization is achieved by distributing k-mers to multiple servers to build a distributed de Bruijn graph, and error removal and graph reduction are implemented over MPI communication primitives. Ray extends k-mers (or seeds) into contigs with a heuristic greedy strategy by measuring the overlap level of reads in both directions. Based on the observation that the time-consuming parts of genome assembly are generating and distributing k-mers and constructing and simplifying the distributed de Bruijn graph, PASHA concentrates its effort on parallelizing these two stages to improve efficiency. However, PASHA allows only a single process for each unanimous path, and this limits its degree of parallelism. In their experiments, ABySS and PASHA take about 87 hours and 21 hours, respectively, to assemble the Yoruban male genome with a coverage of 42X.

To avoid merging k-mers residing on two different servers, which results in many small inter-process messages and high communication latency, YAGA constructs a distributed de Bruijn graph by maintaining edge tuples in a community of servers. Reducible edges belonging to one unanimous path are grouped onto one server using a list ranking algorithm [23], and these unanimous paths are then reduced locally on separate servers. The complexity of YAGA is bounded by O(n/p) computing time, O(n/p) communication volume, and O(log²(n)) communication rounds, where n is the number of nucleotides in all reads and p denotes the number of processors. However, the recursive list ranking algorithm used in YAGA has a memory usage of O(n log(n)/p), which leads to a large memory footprint and low efficiency.

Our previous work [24] tried to avoid access collisions when merging two neighboring edges. In that work, a 1-step bi-directed graph and a computational model named SWAP were proposed for the edge merging operation. In its experiments, the prototype edge merging algorithm using SWAP scaled to 640 cores on both the Yeast and C. elegans datasets. However, this exploratory work only focused on the edge merging operation of genome assembly; other important problems, such as contig extension and complexity analysis, were not addressed.

The scalability of previous assemblers is limited by the computational interdependence of merging k-mers/edges in unanimous paths. Sequential assemblers, for example Velvet and SOAPdenovo, process each path sequentially. Parallel assemblers can process several paths in parallel; however, k-mers/edges sharing one path are merged one by one. SWAP-Assembler resolves the computational interdependence of merging edges sharing the same path with MSG. For each path, at most half of its edges can be merged concurrently in each round, and merging multiple edges on the same path can be done in parallel using the SWAP computational framework. In Figure 1, the parallel strategy of SWAP-Assembler is compared with that of other assemblers using an example of two linked paths; a deeper parallelism on edge merging is achieved by SWAP-Assembler.

Figure 1
figure 1

Parallel strategy comparison on k-mers/edges merging for different assemblers. Two linked paths with 3 nodes and 5 nodes are given as an example; merging a linked path with 3 nodes needs 2 operations/rounds, and merging a path with 5 nodes needs 4 operations/rounds. To assemble these two paths, sequential assemblers need 6 operations, and parallel assemblers need 4 rounds. For SWAP-Assembler, different processes can merge several edges on the same path in parallel using the SWAP computational framework, and merging of these 2 paths can be finished in 2 rounds. For a given sequencing dataset, if we treat the sequencing coverage as a constant, the upper bounds of the three assembly strategies on merging k-mers/edges are O(g), O(log(g)), and O(log(log(g))), respectively, where g denotes the genome size and the longest path for a genome of length g is bounded by O(log(g)). The upper bound of edge merging operations in SWAP-Assembler and the expected length of the longest path are proved in Appendix 3. References for the k-mer merging strategies of these assemblers can be found in their papers or code. For Velvet 1.1.04, the k-mer merging method can be found in its code "./src/concatenatedGraph.c"; for SOAPdenovo-V1.05, its method is in the code "./src/31mer/contig.c"; for ABySS 1.3.5, the method can be found in "./Parallel/NetworkSequenceCollection.cpp"; for YAGA, the method is described in the last paragraph of the methods section [18]; for PASHA, its method is presented in the last paragraph of the graph simplification subsection [16].

In this paper, we present a highly scalable and efficient genome assembler named SWAP-Assembler, which can scale to thousands of cores when processing massive sequencing data such as Yanhuang (500 GB). SWAP-Assembler includes five fully parallelized steps: input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. Compared with our previous work, two fundamental improvements have been made for graph reduction. Firstly, MSG is presented as a comprehensive mathematical abstraction for genome assembly. Using MSG and semi-group theory, the computational interdependence on merging edges is resolved. Secondly, we have developed a scalable computational framework for SWAP; this framework triggers the parallel computation of all operations without interference. The complexity of this framework and of SWAP-Assembler is also analyzed and proved in detail. In addition, two steps in SWAP-Assembler are used to improve the quality of contigs. One is graph cleaning, which adopts the traditional strategy of removing k-molecules and edges with low frequency; the other is contig extension, which resolves special edges and some cross nodes using a heuristic method. Experimental results show that SWAP-Assembler can scale up to 2048 cores on the Yanhuang dataset using only 26 minutes, which is the fastest among the compared assemblers ABySS, Ray, and PASHA. Contig evaluation results confirm that SWAP-Assembler generates good N50 sizes with the lowest error rate on the S. aureus and R. sphaeroides datasets. When processing larger datasets (Fish and Yanhuang) without using external error correction tools, SWAP-Assembler generates the longest N50 contig sizes of 1309 bp and 1368 bp for these two datasets.

Methods

In this section, our method for genome assembly towards thousands of cores is presented. We first abstract the genome assembly problem with MSG. Generating longer sequences (contigs) from shorter sequences corresponds to merging semi-extended edges into full-extended edges in MSG. In addition, the computational interdependence of edge merging is resolved by introducing a semi-group over a closed edge set E_S ∪ {0} in MSG. The edge set E_S ∪ {0} is proved to be a semi-group with respect to the edge merging operation. By the associativity law of semi-groups, the final result is the same as long as all edges have been merged, regardless of the merging order; thus these edge merging operations can be computed in parallel.

In order to maximally exploit the parallelism exposed by MSG, a scalable SWAP computational framework is developed. As one edge may be accessed by two merging operations in two different processes at the same time, the lock-computing-unlock mechanism introduced in [24] is adopted to avoid conflicts. For problems that can be abstracted as a semi-group, the corresponding operations can be done in parallel, and the SWAP computational framework can scale up linearly for these problems.

Based on MSG and the SWAP computational framework, SWAP-Assembler is developed with five steps: input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. In the following, we first present MSG and the SWAP computational framework; details of SWAP-Assembler's five steps then follow.

Mathematical formulation of genome assembly using MSG

Given a biological genome sample with reference sequence w ∈ N^g, where N = {A, T, C, G} and g = |w|, a large number of short sequences called reads, S = {s1, s2, ..., s_h}, can be generated by the sequencing machines. Genome assembly is the process of reconstructing the reference genome sequence from these reads. Unfortunately, the genome assembly problem of finding the shortest string containing all reads as substrings is NP-hard [25].

Finding the original sequence among all possible Euler paths cannot be solved in polynomial time [26]. In real cases, gaps and branches caused by uneven coverage, erroneous reads, and repeats prevent obtaining the full-length genome, and a set of shorter genome sequences called contigs is generated by merging unanimous paths instead. Our method focuses on finding a mathematical and highly scalable solution for the following standard genome assembly (SGA) problem, which is also illustrated in Figure 2.

Figure 2
figure 2

The process of genome assembly and the standard genome assembly (SGA) problem.

Problem of Standard Genome Assembly (SGA)

Input: Given a set of reads without errors S = {s1, s2, ..., s_h}

Output: A set of contigs C = {c1, c2, ..., c_w}

Requirement: Each contig maps to a unanimous path in the de Bruijn graph constructed from the set of reads S.

Preliminaries

We first define some variables. Let s ∈ N^l be a DNA sequence of length l. Any substring of s with length k is called a k-mer of s, and it is denoted by α = s[j]s[j + 1] . . . s[j + k − 1], 0 ≤ j < l − k + 1. The set of k-mers of a given string s is written as Z(s, k), where k must be odd. The reverse complement of a k-mer α, denoted by α', is obtained by reversing α and complementing each base by the bijection M : {a→t, t→a, c→g, g→c}. Note that (α')' = α.

A k-molecule α̂ is a pair of complementary k-mers {α, α'}. Let ≻ be the partial ordering relation between strings of equal length, where α ≻ β indicates that α is lexicographically larger than β. We designate the lexicographically larger one of two complementary k-mers as the positive k-mer, denoted α+, and the smaller one as the negative k-mer, denoted α−, where α+ ≻ α−. We choose the positive k-mer α+ as the representative k-mer of the k-molecule {α, α'}, implying that α̂ = α+ = {α+, α−} = {α, α'}. The relationship between k-mers and k-molecules is illustrated in Figure 3. The set of all k-molecules of a given string s is known as the k-spectrum of s, written S(s, k). Note that S(s, k) = S(s', k).
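As a concrete illustration of this convention, the following C++ sketch (our own minimal example, not code from SWAP-Assembler) computes the reverse complement of a k-mer and selects the lexicographically larger of the two complementary k-mers as the positive, representative k-mer of the k-molecule.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Reverse complement of a k-mer over the alphabet {A, C, G, T}.
std::string reverseComplement(const std::string &kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    for (char &c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
        }
    }
    return rc;
}

// Representative (positive) k-mer of the k-molecule {alpha, alpha'}:
// the lexicographically larger of the two complementary k-mers.
std::string representative(const std::string &kmer) {
    std::string rc = reverseComplement(kmer);
    return std::max(kmer, rc);
}

int main() {
    // "TAG" and its reverse complement "CTA" form one k-molecule;
    // "TAG" is lexicographically larger, so it is the positive k-mer.
    assert(representative("TAG") == "TAG");
    assert(representative("CTA") == "TAG");
    return 0;
}
```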

Figure 3
figure 3

Each k-molecule consists of one positive k-mer and one negative k-mer.

The notation suf(a, l) (pre(a, l), respectively) denotes the length-l suffix (prefix) of string a. The symbol ∘ denotes the concatenation of two strings; for example, if s1 = "abc" and s2 = "def", then s1 ∘ s2 = "abcdef". The number of edges attached to a k-molecule α̂ is denoted degree(α̂). All notations are listed in Table 1.

Table 1 List of notations.

1-step Bi-directed Graph

Definition 1: 1-step bi-directed graph. The 1-step bi-directed de Bruijn graph of order k for a given string s can be presented as,

$G_k^1(s) = \{V_s, E_s^1\}$
(1)

In the rest of the paper, the 1-step bi-directed de Bruijn graph of order k is abbreviated as the 1-step bi-directed graph. In equation (1), the vertex set V_s is the k-spectrum of s,

$V_s = S(s, k)$
(2)

and the 1-step bi-directed edge set E_s^1 is defined as follows,

$E_s^1 = \{ e_{\alpha\beta}^1 = (\alpha, \beta, d_\alpha, d_\beta, c_{\alpha\beta}^1) \mid \hat{\alpha}, \hat{\beta} \in S(s, k),\ suf(\alpha, k-1) = pre(\beta, k-1),\ (\alpha \circ \beta[k-1]) \in (Z(s, k+1) \cup Z(s', k+1)) \}$
(3)

Equation (3) declares that two overlapping k-molecules are connected by a 1-step bi-directed edge when they are consecutive in sequence s or in the complementary sequence s'. Here d_α denotes the direction of k-mer α: if α = α+, then d_α = '+', otherwise d_α = '−'. c_αβ^1 is the content, or label, of the edge and is initialized with β[k − 1], that is, c_αβ^1 = β[k − 1], and we have suf(α ∘ c_αβ^1, k) = β.

Lemma 1. Given two k-molecules α̂, β̂ ∈ S(s, k), there are four possible connections, and for each type of connection exactly two equivalent 1-step bi-directed edge representations exist,

  1. $e_{\alpha^+\beta^+}^1 = (\alpha^+, \beta^+, +, +, c_{\alpha^+\beta^+}^1)$, $e_{\beta^-\alpha^-}^1 = (\beta^-, \alpha^-, -, -, c_{\beta^-\alpha^-}^1)$

  2. $e_{\alpha^+\beta^-}^1 = (\alpha^+, \beta^-, +, -, c_{\alpha^+\beta^-}^1)$, $e_{\beta^+\alpha^-}^1 = (\beta^+, \alpha^-, +, -, c_{\beta^+\alpha^-}^1)$

  3. $e_{\alpha^-\beta^+}^1 = (\alpha^-, \beta^+, -, +, c_{\alpha^-\beta^+}^1)$, $e_{\beta^-\alpha^+}^1 = (\beta^-, \alpha^+, -, +, c_{\beta^-\alpha^+}^1)$

  4. $e_{\alpha^-\beta^-}^1 = (\alpha^-, \beta^-, -, -, c_{\alpha^-\beta^-}^1)$, $e_{\beta^+\alpha^+}^1 = (\beta^+, \alpha^+, +, +, c_{\beta^+\alpha^+}^1)$

In each type of connection, the first bi-directed edge representation and the second one are equivalent. The first bi-directed edge is associated with k-molecule α̂, and the second one is associated with β̂. Figure 4 illustrates all four possible connections. For example, in Figure 4-(a), the positive k-mer "TAG" points to the positive k-mer "AGT" with the label "T", and the corresponding edge is $e_{TAG\,AGT}^1 = (TAG, AGT, +, +, T)$.

Figure 4
figure 4

The illustration of four possible connections.

Given a set of reads S = {s1, s2, . . . , s_h}, the 1-step bi-directed graph derived from S with order k is,

$G_k^1(S) = \{V_S, E_S^1\} = \{\bigcup_{1 \le i \le h} V_{s_i}, \bigcup_{1 \le i \le h} E_{s_i}^1\}$
(4)

Each read s_i corresponds to a path in G_k^1(S), and read s_i can be recovered by concatenating its first k-mer with the edge labels along the path formed by S(s_i, k). As an example, the 1-step bi-directed de Bruijn graph derived from S = {"TAGTCG", "AGTCGA", "TCGAGG"} is plotted in Figure 5.
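The following minimal C++ sketch (a hypothetical helper, not part of the SWAP-Assembler sources) illustrates this path property: given the first k-mer of a read and the one-base labels of the edges along its path, the read is recovered by simple concatenation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Recover a read from its path in the 1-step bi-directed graph:
// start from the first k-mer and append one edge label per edge.
std::string recoverRead(const std::string &firstKmer,
                        const std::vector<char> &edgeLabels) {
    std::string read = firstKmer;
    for (char label : edgeLabels)
        read.push_back(label);
    return read;
}

int main() {
    // For s = "TAGTCG" with k = 3, the path is TAG -> AGT -> GTC -> TCG
    // with edge labels 'T', 'C', 'G' (each label is beta[k-1]).
    assert(recoverRead("TAG", {'T', 'C', 'G'}) == "TAGTCG");
    return 0;
}
```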

Figure 5
figure 5

An example of a 1-step bi-directed graph. Semi-extended k-molecules are colored yellow, and semi-extended edges are plotted with dashed lines.

Multi-step Bi-directed Graph and Its Properties

Definition 2: edge merging operation. Given two 1-step bi-directed edges $e_{\alpha\beta}^1 = (\alpha, \beta, d_\alpha, d_\beta, c_{\alpha\beta}^1)$ and $e_{\beta\gamma}^1 = (\beta, \gamma, d_\beta, d_\gamma, c_{\beta\gamma}^1)$ in a 1-step bi-directed graph, if $e_{\alpha\beta}^1.d_\beta = e_{\beta\gamma}^1.d_\beta$ and degree(β̂) = 2, we can obtain a 2-step bi-directed edge $e_{\alpha\gamma}^2 = (\alpha, \gamma, d_\alpha, d_\gamma, c_{\alpha\gamma}^2)$ by merging edges $e_{\alpha\beta}^1$ and $e_{\beta\gamma}^1$, with $c_{\alpha\gamma}^2 = c_{\alpha\beta}^1 \circ c_{\beta\gamma}^1$. Using the symbol ⊗ to denote the edge merging operation between two bi-directed edges attached to the same k-molecule with the same direction, the 2-step bi-directed edge can be written as,

$e_{\alpha\beta}^1 \otimes e_{\beta\gamma}^1 = e_{\alpha\gamma}^2$ or $e_{\gamma\beta}^1 \otimes e_{\beta\alpha}^1 = e_{\gamma\alpha}^2$
(5)

The two edges $e_{\alpha\gamma}^2$ and $e_{\gamma\alpha}^2$ in equation (5) are equivalent, indicating that applying the edge merging operation to $e_{\alpha\beta}^1$ and $e_{\beta\gamma}^1$ is the same as applying it to $e_{\gamma\beta}^1$ and $e_{\beta\alpha}^1$. Figure 6 shows an example of the edge merging operation.
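A minimal C++ sketch of the edge merging operation ⊗ is given below, using simplified edge and vertex types of our own; the real implementation works on a distributed graph, but the merge rule shown here (same direction at β̂, degree(β̂) = 2, labels concatenated, step counts added) is the one defined above. Returning nothing plays the role of the zero edge 0 defined next.

```cpp
#include <optional>
#include <string>

// Simplified bi-directed edge: endpoints, directions at each endpoint,
// label (content), and step count. These types are illustrative only.
struct Edge {
    std::string from, to;   // representative k-mers of the two k-molecules
    char dFrom, dTo;        // '+' or '-'
    std::string label;      // c_{alpha beta}
    int steps;              // x in e^x_{alpha beta}
};

// Merge e_ab (alpha -> beta) with e_bg (beta -> gamma); returns the
// (x+y)-step edge alpha -> gamma, or nothing (the zero edge 0) if the
// directions at beta disagree or beta's degree is not 2.
std::optional<Edge> merge(const Edge &eab, const Edge &ebg, int degreeOfBeta) {
    if (eab.to != ebg.from) return std::nullopt;    // must share k-molecule beta
    if (eab.dTo != ebg.dFrom) return std::nullopt;  // same direction at beta
    if (degreeOfBeta != 2) return std::nullopt;     // beta must be an internal vertex
    return Edge{eab.from, ebg.to, eab.dFrom, ebg.dTo,
                eab.label + ebg.label, eab.steps + ebg.steps};
}

int main() {
    Edge e1{"TAG", "AGT", '+', '+', "T", 1};
    Edge e2{"AGT", "GTC", '+', '+', "C", 1};
    auto merged = merge(e1, e2, /*degreeOfBeta=*/2);
    return (merged && merged->label == "TC") ? 0 : 1;  // 2-step edge TAG -> GTC
}
```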

Figure 6
figure 6

An example for edge merging operation.

The zero edge 0 is defined to represent all non-existing bi-directed edges. Note that $0 \otimes e_{\alpha\beta}^x = 0$ and $e_{\alpha\beta}^x \otimes 0 = 0$. A z-step bi-directed edge can be obtained by,

$e_{\alpha\gamma}^z = \begin{cases} e_{\alpha\beta}^x \otimes e_{\beta\gamma}^y, & \text{if } \exists \beta,\ e_{\alpha\beta}^x \ne 0,\ e_{\beta\gamma}^y \ne 0,\ z = x + y,\ e_{\alpha\beta}^x.d_\beta = e_{\beta\gamma}^y.d_\beta,\ degree(\hat{\beta}) = 2 \\ 0, & \text{otherwise} \end{cases}$
(6)

Definition 3: Multi-step Bi-directed Graph (MSG). An MSG derived from a read set S = {s1, s2, . . . , s_h} is written as,

$G_k(S) = \{V_S, E_S\} = \{\bigcup_{1 \le i \le h} V_{s_i}, \bigcup_{1 \le j \le g} (\bigcup_{1 \le i \le h} E_{s_i}^j)\}$
(7)

where g is the length of the reference sequence w, and $E_{s_i}^j = \{e_{\alpha\beta}^j \mid \hat{\alpha}, \hat{\beta} \in S(s_i, k)\}$. An MSG is obtained through edge merging operations.

Given an x-step bi-directed edge $e_{\alpha\beta}^x = (\alpha, \beta, d_\alpha, d_\beta, c_{\alpha\beta}^x)$, if there exists an edge $e_{\gamma\alpha}^y$ or $e_{\beta\gamma}^z$ satisfying $e_{\gamma\alpha}^y \otimes e_{\alpha\beta}^x \ne 0$ or $e_{\alpha\beta}^x \otimes e_{\beta\gamma}^z \ne 0$, then we call $e_{\alpha\beta}^x$ a semi-extended edge, and the corresponding k-molecule α̂ or β̂ a semi-extended k-molecule. If $e_{\alpha\beta}^x$ cannot be extended by any edge, it is called a full-extended edge, and the k-molecules α̂ and β̂ are full-extended k-molecules. In Figures 5 and 6, semi-extended and full-extended k-molecules are plotted in different colors (yellow for semi-extended and blue for full-extended k-molecules), and semi-extended and full-extended edges are drawn with different line styles (dashed lines for semi-extended edges and solid lines for full-extended edges).

Property 1. If the set of full-extended edges in the MSG defined in equation (7) is denoted E_S^*, then the set of labels on all edges in E_S^* can be written as,

$L_S^* = \{c_{\alpha\beta}^x \mid e_{\alpha\beta}^x = (\alpha, \beta, d_\alpha, d_\beta, c_{\alpha\beta}^x) \in E_S^*\}$
(8)

and we have L_S^* = C, where C is the set of contigs. The proof is presented in Appendix 1.

Property 2. The edge merging operation ⊗ over the multi-step bi-directed edge set E_S ∪ {0} is associative, and Q(E_S ∪ {0}, ⊗) is a semi-group. The proof is presented in Appendix 1.

The key property of the 1-step bi-directed graph G_k^1(S) is that each read s corresponds to a path starting from the first k-molecule of s and ending at the last k-molecule. Similarly, each chromosome can also be regarded as a path. However, because of sequencing gaps, read errors, and repeats in the set of reads, the chromosome is broken into pieces, or contigs. Within an MSG, each contig corresponds to one full-extended edge in G_k(S), as stated and proved in Property 1. Property 2 ensures that the edge merging operation ⊗ over the set of multi-step bi-directed edges forms a semi-group, and this connects the standard genome assembly (SGA) problem with edge merging operations in a semi-group. By the associativity law of semi-groups, the final full-extended edges, or contigs, will be the same as long as all edges have been merged, regardless of the merging order; thus these edge merging operations can be computed in parallel. Finally, in order to reconstruct the genome as a large set of contigs, we need to merge all semi-extended edges into full-extended edges in the semi-group Q(E_S ∪ {0}, ⊗).

SWAP computational framework

The lock-computing-unlock mechanism of SWAP was first introduced in our previous work [24], where no implementation details or complexity analysis were given. In this section, we present a mathematical description of the problems that can be solved with SWAP; then a scalable computational framework for the SWAP model and its programming interface are presented. Its complexity and scalability are analyzed in Appendix 3.

Definition 4: small world of operations. A semi-group SG(A, R) is defined on a set A with an associative operation R : A × A → A. R(a_i, a_j) denotes the associative operation on a_i and a_j, where a_i, a_j ∈ A. The elements a_i and a_j, together with the operation R(a_i, a_j), are grouped into a small world of the operation R(a_i, a_j). We denote this small world by [a_i, a_j], and [a_i, a_j] = {R(a_i, a_j), a_i, a_j}.

An activity ACT(A, σ) is defined on a semi-group SG(A, R) as the computational work performed by a graph algorithm, where the operation set σ is a subset of R.

In real applications, an operation corresponds to a basic operation of a given algorithm. For example, for MSG-based genome assembly, an operation can be defined as an edge merging; for topological sorting, re-ordering a pair of vertices can be defined as an operation.

For any two small worlds [a1, a2] and [b1, b2], where a1 ≠ b1, a1 ≠ b2, a2 ≠ b1, and a2 ≠ b2, the corresponding operations R(a1, a2) and R(b1, b2) can be computed independently; thus, there is potential parallelism in computing the activity induced from the semi-group SG(A, R). We use SWAP for such parallel computing. The basic schedule of SWAP is lock-computing-unlock. For an operation R(a, b) in σ, the three steps of SWAP are listed below:

  1. Lock: the lock action is applied to lock R(a, b)'s small world [a, b].

  2. Computing: the computation is performed for operation R(a, b), and the values of a and b are updated accordingly. In MSG, this corresponds to merging two edges.

  3. Unlock: the unlock action is triggered to release operation R(a, b)'s small world [a, b].

In the SWAP computational framework, all operations σ in the activity ACT(A, σ) are distributed among a group of processes. Each process needs to fetch the related elements, for example a and b, to compute operation R(a, b). At the same time, this process also has to cooperate with other processes by sending or updating local variables. Each process conceptually has two threads: a SWAP thread, which performs computing tasks using the three-step schedule of SWAP, and a service thread, which listens and replies to remote processes. In the implementation of our framework, we avoid multi-threading by using nonblocking communication and a finite-state machine in each process to ensure efficiency.
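The following shared-memory analogue is only an illustration of the lock-computing-unlock schedule with binary exponential back-off; the actual framework runs over MPI with nonblocking communication and a finite-state machine rather than threads and mutexes, so the types and names here are assumptions of this sketch.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <mutex>
#include <random>
#include <thread>

// Illustrative element of the set A; each element carries its own lock.
struct Element {
    std::mutex lock;
    double value = 0.0;   // placeholder payload
};

// One SWAP-style schedule for a single operation R(a, b): lock the small
// world [a, b], compute, unlock; on a lock collision, back off and retry.
void swapSchedule(Element &a, Element &b,
                  const std::function<void(Element &, Element &)> &computeR) {
    std::mt19937 rng(std::random_device{}());
    int backoff = 1;                                  // binary exponential back-off
    for (;;) {
        if (a.lock.try_lock()) {                      // 1. Lock the small world [a, b]
            if (b.lock.try_lock()) {
                computeR(a, b);                       // 2. Compute R(a, b), update a and b
                b.lock.unlock();                      // 3. Unlock the small world
                a.lock.unlock();
                return;
            }
            a.lock.unlock();                          // collision: release and back off
        }
        std::uniform_int_distribution<int> wait(0, backoff);
        std::this_thread::sleep_for(std::chrono::microseconds(wait(rng)));
        backoff = std::min(backoff * 2, 1024);
    }
}

int main() {
    Element x, y, z;
    auto add = [](Element &p, Element &q) { p.value += 1.0; q.value += 1.0; };
    // Two operations sharing element y contend for its lock and are serialized;
    // operations with disjoint small worlds would run fully in parallel.
    std::thread t1(swapSchedule, std::ref(x), std::ref(y), add);
    std::thread t2(swapSchedule, std::ref(y), std::ref(z), add);
    t1.join();
    t2.join();
    return 0;
}
```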

The activity ACT(A, σ) on set A with operations in σ can be treated as a graph G(σ, A) with σ as its vertices and A as its edges. An adjacency list is used to store the graph G(σ, A), and a hash function hashFun(x) is used to distribute the set σ into p subsets for p processes, where $\sigma = \bigcup_{i=0}^{p-1} \sigma_i$ and A_i is associated with σ_i. Note that $ACT(A, \sigma) = \bigcup_{i=0}^{p-1} ACT_i(A_i, \sigma_i)$, as illustrated in Figure 7. In Appendix 2, the pseudocode of the SWAP thread and the service thread is given in Algorithm 1 and Algorithm 2, respectively.

Figure 7
figure 7

Illustration of the distribution of an activity ACT(A, σ) on a cluster using the SWAP computational framework. The operation set σ is distributed over p processors, where $\sigma = \bigcup_{i=0}^{p-1} \sigma_i$, $\bigcap_{i=0}^{p-1} \sigma_i = \emptyset$, and $ACT(A, \sigma) = \bigcup_{i=0}^{p-1} ACT_i(A_i, \sigma_i)$.

Algorithm 1 describes the three steps of the SWAP computational framework, and Algorithm 2 running on remote processors cooperates with Algorithm 1 in this schedule. The message functions, internal functions, and user-defined functions used in Algorithm 1 and Algorithm 2 are listed in Table 2; the user-defined functions can be redefined for user-specific computational problems.

Table 2 Description of message functions, internal functions, and user-defined functions used in Algorithm 1 and 2.

Similar to CSMA/CA in the 802.11 protocol [27], Algorithm 1 adopts a random back-off algorithm to avoid lock collisions. A variety of back-off algorithms can be used; without loss of generality, binary exponential back-off [27] is used in the SWAP thread. Note that all collided operations in σ_i share only one binary exponential back-off, so its cost can be ignored as long as the number of operations in σ_i is large.

The complexity and scalability analysis of the SWAP computational framework is presented in Appendix 3. When the number of processes is less than the number of operations in σ, which is true in most cases, the analysis shows that the SWAP computational framework scales up linearly with an increasing number of cores. When the number of processes is larger than the number of operations, the running time is dominated by the number of communication rounds, which is bounded by log(d_max), where d_max is the diameter of the graph G(σ, A).

Implementation of SWAP-Assembler

Based on MSG and the SWAP computational framework, SWAP-Assembler consists of five steps: input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. A complexity analysis of SWAP-Assembler is presented at the end of this section.

Input parallelization

As the data generated by next-generation sequencing technology generally amounts to hundreds of gigabytes, loading these data with one process takes hours to finish [14, 16]. Similar to Ray [15] and YAGA [18], we use input parallelization to speed up the loading process. Given input reads with n nucleotides from a genome of size g, we divide the input file equally into p virtual data blocks, where p is the number of processes. Each process reads the data located in its virtual data block only once. The computational complexity of this step is bounded by O(n/p). For the E. coli dataset of 4.4 GB, SWAP-Assembler loads the data into memory in 4 seconds with 64 cores while YAGA uses 516.5 seconds [18], and for the Yanhuang dataset SWAP-Assembler loads the data in 10 minutes while Ray takes 2 hours and 42 minutes.
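The sketch below (our own illustration, not the SWAP-Assembler I/O code) shows how the byte range of one virtual data block can be computed; a real reader must additionally re-align each block to the next read-record boundary of the FASTA/FASTQ input, which is omitted here.

```cpp
#include <cstdint>
#include <utility>

// Byte range [begin, end) of process `rank`'s virtual data block when an
// input file of `fileSize` bytes is split evenly among `p` processes.
std::pair<std::uint64_t, std::uint64_t>
virtualBlock(std::uint64_t fileSize, int p, int rank) {
    std::uint64_t blockSize = (fileSize + p - 1) / p;   // ceiling division
    std::uint64_t begin = blockSize * rank;
    std::uint64_t end = begin + blockSize;
    if (begin > fileSize) begin = fileSize;
    if (end > fileSize) end = fileSize;
    return {begin, end};
}

int main() {
    // Example: a 4.4 GB file split among 64 processes; rank 3 reads its slice only.
    auto range = virtualBlock(4400000000ULL, 64, 3);
    return range.second > range.first ? 0 : 1;
}
```

Each process would call virtualBlock with its own rank and read only that slice, which gives the O(n/p) loading cost stated above.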

Graph construction

This step constructs the 1-step bi-directed graph G_k^1(S) = {V_S, E_S^1}, where V_S and E_S^1 are the k-molecule set and the 1-step bi-directed edge set. In this step, input sequences are broken into overlapping k-molecules by sliding a window of length k along the input sequence. A k-molecule can have up to eight edges, and each edge corresponds to a possible one-base extension, {A, C, G, T}, in either direction. The adjacent k-molecule can easily be obtained by adding the base extension to the source k-molecule. The generated graph has O(n) k-molecules and O(n) bi-directed edges distributed among p processors. Construction of the 1-step bi-directed graph can be achieved in O(n/p) parallel computing time and O(n/p) parallel communication volume.
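A simplified sketch of this sliding-window construction is shown below; the types and the hash-based owner assignment are our own assumptions, and the assembler additionally canonicalizes k-mers into k-molecules and records the +/- directions and edge frequencies, which this sketch omits.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One emitted 1-step edge between two consecutive k-mers, labeled with the
// one-base extension beta[k-1]. (Illustrative type only.)
struct RawEdge {
    std::string from, to;
    char label;
};

// Slide a window of length k along the read, emit an edge for every pair of
// consecutive k-mers, and place each edge in the outbox of the process that
// owns its source vertex, chosen by a hash function (hashFun in the text).
void buildEdges(const std::string &read, int k, int p,
                std::vector<std::vector<RawEdge>> &outboxPerProcess) {
    std::hash<std::string> hashFun;
    for (std::size_t j = 0; j + k < read.size(); ++j) {
        std::string alpha = read.substr(j, k);
        std::string beta = read.substr(j + 1, k);
        outboxPerProcess[hashFun(alpha) % p].push_back({alpha, beta, beta[k - 1]});
    }
}

int main() {
    std::vector<std::vector<RawEdge>> outbox(4);  // e.g. p = 4 processes
    buildEdges("TAGTCG", 3, 4, outbox);           // emits TAG->AGT, AGT->GTC, GTC->TCG
    return 0;
}
```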

An improvement over our previous work [24] is that the time spent on graph construction is overlapped with the previous step. As CPU computation and network communication can be performed when only part of the data has been loaded in the first step, the two steps can be overlapped and combined into a pipeline. The computation and communication time used in this step is hidden behind the disk I/O time of the previous step.

Graph cleaning

This step cleans erroneous k-molecules, based on the assumption that erroneous k-mers have lower frequency than correct ones [19]. Assuming that the errors are random, we identify k-molecules with low frequency as erroneous and delete them from the vertex set. SWAP-Assembler also removes all low-frequency edges in the 1-step bi-directed graph, as well as k-molecules without any attached edges. The frequency threshold can be set by the user, or our method will calculate it automatically based on the average coverage of k-molecules. In our case, we prefer 3~10% of the average coverage as the threshold, depending on the species.
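A minimal sketch of this cleaning rule is shown below, assuming a simple hash map from representative k-mers to their frequencies (our own illustrative data structure); the 5% fraction is an arbitrary value inside the 3~10% range mentioned above.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Derive a frequency cutoff as a fraction of the average k-molecule coverage,
// then drop every k-molecule whose frequency falls below the cutoff.
void cleanGraph(std::unordered_map<std::string, std::uint32_t> &frequency,
                double fraction = 0.05) {
    if (frequency.empty()) return;
    double total = 0.0;
    for (const auto &kv : frequency) total += kv.second;
    double cutoff = fraction * (total / frequency.size());
    for (auto it = frequency.begin(); it != frequency.end();) {
        if (it->second < cutoff) it = frequency.erase(it);
        else ++it;
    }
}
```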

All operations in this step can be finished in O(n/p) parallel computation time, and about 60~80% of the k-molecules are removed from the graph.

Graph reduction

In order to recover contigs, all semi-extended edges in the MSG need to be merged into full-extended edges. This task can be defined as the edge merging computing activity, denoted ACT(E_S ∪ {0}, σ), where the edge merging operation set σ is,

$\sigma = \{(e_{\beta\alpha}^u, e_{\alpha\gamma}^v) \mid e_{\beta\alpha}^u \otimes e_{\alpha\gamma}^v \ne 0,\ e_{\beta\alpha}^u, e_{\alpha\gamma}^v \in E_S\}$
(9)

in which $e_{\beta\alpha}^u$ denotes a u-step bi-directed edge connecting the two vertices β and α. All semi-extended edges of E_S are eventually merged into full-extended edges.

In order to compute the edge merging operations in σ_i using our SWAP computational framework, the two user-defined functions in Table 2 are described as Algorithm 3 and Algorithm 4 in Appendix 2. For each process, the edge merging step has a computing complexity of O(n/p), a communication volume of O(n log(log(g))/p), and a communication round count of O(log(log(g))). The proposed method has a much lower communication round count of O(log(log(g))) than YAGA's O(log²(n)) [18, 19]. The detailed complexity analysis is provided in Appendix 4.

Contig extension

In order to extend the length of contigs while introducing as few errors as possible, three types of special edges and two types of special vertices are processed in our method.

The first type of special edge is the tip edge, which is connected to a terminal vertex and has a length less than k, where k is the k-mer size. These tip edges are deleted from the graph. The second type is the self-loop edge, whose beginning vertex and terminal vertex are the same; if this vertex has another edge that can be merged with the self-loop edge, they are merged, otherwise the self-loop edge is removed. The last type is the multiple edge, where two vertices are directly connected by two different edges; in this case the edge with lower coverage is removed.
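As a small illustration of the first rule, a hypothetical tip-edge test could look as follows; the types and the degree-based notion of a terminal vertex are assumptions of this sketch, not the assembler's actual data structures.

```cpp
#include <string>

// A tip edge, in the sense used above, is attached to a terminal vertex
// (degree 1 on its far side) and carries a label shorter than k; such
// edges are removed during contig extension.
bool isTipEdge(const std::string &label, int farSideDegree, int k) {
    return farSideDegree == 1 && static_cast<int>(label.size()) < k;
}
```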

In addition, processing two types of special vertices can further improve the quality of contigs. The first is the cross vertex shown in Figure 8-(e), which has more than two edges on both sides. For each cross vertex, we sort all its edges according to their coverage. When the coverage difference between two edges is less than 20%, these two edges are merged, as long as they can be merged, regardless of other edges. The second is the virtual cross vertex shown in Figure 8-(f). We treat edge e0 together with its two end vertices as one virtual vertex A*, where A* has more than two edges on both sides. All its edges are ranked according to their coverage. When the coverage difference between two edges on different end vertices is less than 20%, these two edges are merged with the edge e0 regardless of other edges. By processing these two types of special vertices with this heuristic method, we can partly resolve the repeats satisfying these strict conditions, at the cost of occasionally introducing errors and mismatches into contigs.

Figure 8
figure 8

Two types of special vertices defined in the contig extension step.

The graph reduction step and the contig extension step are iterated a constant number of rounds to extend full-extended edges, or stop when no special edges or special vertices can be found. The number of errors and mismatches introduced into contigs can be controlled by the percentage of special edges and vertices processed in the contig extension stage. In our method, we process all such edges and vertices, aiming at obtaining longer contigs. The computing complexity of the contig extension step is bounded by O(n/p). As the graph shrinks greatly after the graph reduction and contig extension steps, all remaining edges are treated as contigs. The complexity of SWAP-Assembler is dominated by the graph reduction step, which is bounded by O(n/p) parallel computing time, O(n log(log(g))/p) communication volume, and O(log(log(g))) communication rounds. According to the complexity analysis of the graph reduction step in Appendix 4, when the number of processors is less than the length of the longest path d_max, the speedup of SWAP-Assembler can be calculated as follows,

$\text{Speedup} = \frac{n}{\text{RunTime}} = \frac{p}{bL\log(3cq\log(g)) + rS + 1}$
(10)

Equation (10) indicates that, for a given genome with fixed length g, the speedup is proportional to the number of processors, while for a given number of processors p, the speedup is inversely proportional to the logarithm of the logarithm of the genome size g. However, when the number of processors is larger than the length of the longest path d_max, the running time is bounded by the number of communication rounds, as analyzed in Appendix 4. In either situation, we can conclude that the scalability, or the optimal number of cores, increases with larger genomes.

Results

SWAP-Assembler is a highly scalable and efficient genome assembler based on the multi-step bi-directed graph (MSG). In this section, we perform several computational experiments to evaluate the scalability and assembly quality of SWAP-Assembler. In the experiments, TianHe 1A [28] is used as the high performance cluster; 512 computing nodes are allocated, each with 12 cores and 24 GB of memory. By comparing with several state-of-the-art sequential and parallel assemblers, namely Velvet [22], SOAPdenovo [29], PASHA [16], ABySS [14], and Ray [15], we evaluate the scalability and the quality of contigs in terms of N50, error rate, and coverage for SWAP-Assembler.

Experimental data

Five datasets listed in Table 3 are selected for the experiments. The S. aureus, R. sphaeroides, and human chromosome 14 (Hg14) datasets are taken from the GAGE project [30], the Fish dataset is downloaded from Assemblathon 2 [31, 32], and the Yanhuang dataset [33] is provided by BGI [34].

Table 3 Details about the five short read datasets.

Scalability evaluation

The scalability of our method is first evaluated on a shared-memory machine with 32 cores and 1 TB of memory. Five other assemblers, Velvet, SOAPdenovo, PASHA, ABySS, and Ray, are included for comparison. Only the first three small datasets in Table 3 are used in this test due to the memory limitation. The results are presented in Table 4, and the corresponding figures are plotted in Figure 9. According to Table 4, SWAP-Assembler has the lowest running time on all three datasets with 16 and 32 cores, while SOAPdenovo has the lowest time usage with 4 and 8 cores. According to Figure 9, SOAPdenovo, Ray, and SWAP-Assembler can scale to 32 cores, whereas PASHA and ABySS can only scale to 16 cores. Figure 9 also shows that Ray and SWAP-Assembler achieve nearly linear speedup, with SWAP-Assembler being more efficient than Ray.

Table 4 Time usage results of three small datasets on a shared-memory machine with 32 cores.
Figure 9
figure 9

Time usage comparison on a shared-memory machine for three small datasets (time unit: seconds, logarithmic scale). In this test, the k-mer length for all assemblers is set to 31 and the k-mer cutoff threshold is set to 3. For the three datasets, the sequencing data filtered by ALLPATHS-LG is used as the input. The time usage is recorded until the contigs are generated. The horizontal axis is marked with the name of the assembler and the number of cores used.

The time usage of each step of SWAP-Assembler on the shared-memory machine is presented in Table 5 and Figure 10. The input parallelization step is overlapped with graph construction, so we treat these two steps as one in this experiment. According to Table 5, for all three datasets the most time-consuming step is graph reduction, and the fastest steps are graph cleaning and contig extension. Figure 10 shows that input parallelization & graph construction, graph cleaning, and graph reduction achieve nearly linear speedup when the number of cores increases from 4 to 32, whereas the contig extension step does not benefit as much as the other steps.

Table 5 Time usage details of SWAP-Assembler's five steps on processing three small datasets using a shared-memory machine with 32 cores.
Figure 10
figure 10

Time usage details of SWAP-Assembler's five steps on processing three small datasets using a shared-memory machine with 32 cores (time unit: seconds, logarithmic scale). In this test, the k-mer length for SWAP-Assembler is set to 31 and the k-mer cutoff threshold is set to 3. For the three datasets, the sequencing data filtered by ALLPATHS-LG is used as the input. The horizontal axis is marked with the name of the dataset and the number of cores used.

To further evaluate the scalability of SWAP-Assembler from 64 to 4096 cores on TianHe 1A, we compared our method with two parallel assemblers, ABySS and Ray; the results are given in Table 6. According to Table 6, SWAP-Assembler is 119 times and 73 times faster than ABySS and Ray, respectively, with 1024 cores on the S. aureus dataset. On the same dataset, ABySS and Ray cannot gain any speedup beyond 64 and 128 cores, respectively, whereas SWAP-Assembler scales up to 512 cores. For the R. sphaeroides dataset, ABySS, Ray, and SWAP-Assembler can scale up to 128, 256, and 1024 cores, respectively.

Table 6 Scalability evaluation on parallel assemblers 1 (Time unit: seconds).

For the three larger datasets, Table 6 shows that the scalability of SWAP-Assembler is also better than that of the other two methods. On the Hg14 dataset, SWAP-Assembler is 280 times faster than ABySS and 38 times faster than Ray when using 1024 cores. Similar to the results on the R. sphaeroides dataset, the three assemblers still have their scalability turning points at 128, 256, and 1024 cores, respectively. The Fish and Yanhuang datasets cannot be assembled by ABySS or Ray within 12 hours, so their running times are not recorded in Table 6. With 1024 cores, SWAP-Assembler assembles the Fish dataset in 16 minutes, and it takes 26 minutes with 2048 cores to assemble the Yanhuang dataset. The speedup curves of SWAP-Assembler on the five datasets are shown in Figure 11. They show that the speedup on the two small datasets has a turning point at 512 cores, while linear speedup up to 1024 cores is achieved for the other three larger datasets. SWAP-Assembler still benefits from additional cores up to 2048 on the Yanhuang dataset.

Figure 11
figure 11

Linear speedup of SWAP-Assembler on processing five datasets.

Memory footprint is a bottleneck for assembling large genomes, and parallel assemblers offer a solution by using the aggregate memory of many computational nodes. In our case, the Fish and Yanhuang genome assemblies need 1.6 TB and 1.8 TB of memory, respectively. As each server in TianHe 1A has 24 GB of memory, the Fish genome cannot be assembled on a cluster with only 64 servers; the same reasoning applies to the Yanhuang dataset.

SWAP-Assembler has better scalability than Ray and ABySS due to two important improvements. Firstly, the computational interdependence of edge merging operations on a single unanimous path is resolved by MSG. Secondly, the SWAP computational framework triggers the parallel computation of all operations without interference, and the communication latency is hidden by improving the computing throughput. Ray and ABySS cannot merge the k-mers in a single linear chain in parallel, and PASHA can only parallelize the k-mer merging work across different chains, which limits their degree of parallelism.

Assembly quality assessment

This part evaluates the assembly quality of SWAP-Assembler. To be compatible with the comparison results from GAGE, we follow the error correction method of GAGE. ALLPATHS-LG [35] and Quake [36] are used to correct errors for the S. aureus and R. sphaeroides datasets. The corrected reads are used as the input to ABySS, Ray, and SWAP-Assembler. In addition, two sequential genome assemblers, Velvet and SOAPdenovo, are included in this experiment for comparison, and a machine with 1 TB of memory is used. The k-mer size for all assemblers varies between 19 and 31, and the best assembly result among the different k-mer sizes for each assembler is reported in Table 7 and Table 8.

Table 7 Assembly results of the S. aureus and R. sphaeroides datasets.
Table 8 Contig statistics on the assembly results of the S. aureus and R. sphaeroides datasets (unit: percentage %).

Table 7 presents the results for four evaluation metrics: the number of contigs, N50, the number of erroneous contigs, and error-corrected N50. Error-corrected N50 is used to exclude the misleading assessment of a larger N50 by correcting erroneously concatenated contigs: each erroneous contig is broken at every misjoin and at every indel longer than 5 bases. From Table 7, SWAP-Assembler generates 3 and 7 erroneous contigs on the S. aureus and R. sphaeroides datasets, respectively, which is the smallest number among the compared assemblers. The N50 size and error-corrected N50 size of SWAP-Assembler are also longer than those of the two other parallel assemblers, Ray and ABySS. SOAPdenovo has the minimal number of contigs and the largest N50 size for both datasets.

Table 8 summarizes the statistics of the contigs generated for these two datasets. Three metrics from [30] are used to evaluate the quality of the contigs. In this table, an "assembly size" close to the genome size is better. A larger "chaff size" can indicate that the assembler is unable to integrate short repeat structures into larger contigs or does not properly correct erroneous bases. Coverage is measured by the percentage of reference bases aligned to any assembled contig, which is "100% - unaligned ref bases" [30]. SWAP-Assembler and Ray both have the smallest chaff size of 0.13% on the R. sphaeroides dataset, and they show very similar coverage and assembly sizes. On the S. aureus dataset, Ray has a lower chaff size of 0.10% compared with SWAP-Assembler; however, SWAP-Assembler achieves a better assembly size of 99.3% and a larger coverage of 99.2%.

We also analyzed the contig statistics for the three larger datasets; the results are presented in Table 9. Because these datasets do not have a standard reference set and the original script provided by GAGE requires one, we wrote a script to analyze the assembly results using the number of contigs, N50 size, maximum contig length, and the number of bases in the contigs for evaluation. The original data of the three datasets are directly processed by the five assemblers with a fixed k-mer size of 31. According to Table 9, the N50 size of the contigs generated by SWAP-Assembler is the longest for all three datasets. For the Fish and Yanhuang datasets, SWAP-Assembler also performs best in the number of contigs and the maximum contig length. On the Hg14 dataset, whose reads are extracted from the human dataset by mapping to human chromosome 14, SWAP-Assembler ranks second, third, and second in the number of contigs, maximum contig length, and bases in contigs, respectively. SWAP-Assembler has the best N50 size for all datasets because its graph cleaning and contig extension steps handle sequencing errors efficiently; the four other assemblers, without the help of external error correction tools, are affected by the quality of the input reads on the larger datasets.

Table 9 Contig statistics of Hg14, Fish and Yanhuang datasets.

In conclusion, SWAP-Assembler is a highly scalable and efficient genome assembler. The evaluation shows that our assembler can scale up to 2048 cores, which is much better than the other parallel assemblers, and that the quality of the contigs generated by SWAP-Assembler is the best in terms of error rate for the small datasets and N50 size for the two larger datasets.

Conclusion

In this paper, SWAP-Assembler, a fast and efficient genome assembler scaling up to thousands of cores, is presented. Two fundamental improvements are crucial to its scalability. Firstly, MSG is presented as a comprehensive mathematical abstraction for genome assembly; with MSG, the computational interdependence is resolved. Secondly, the SWAP computational framework triggers the parallel computation of all operations without interference. Two additional steps are included to improve the quality of contigs. One is graph cleaning, which adopts the traditional methods of removing k-molecules and edges with low frequency; the other is contig extension, which resolves special edges and some cross nodes with a heuristic method. Results show that SWAP-Assembler can scale up to 2048 cores on the Yanhuang dataset using only 26 minutes, which is the best among the compared parallel assemblers ABySS and Ray. Contig evaluation results confirm that SWAP-Assembler generates good contig N50 sizes while retaining a low error rate. When processing massive datasets without using external error correction tools, SWAP-Assembler is robust to low data quality and generates the longest N50 contig sizes.

For large genome and metagenome datasets of terabyte scale, for example the human gut microbial community sequencing data, highly scalable and efficient assemblers will be essential for data analysis. Our future work will extend our algorithms to massive metagenomic datasets with additional modules.

The program can be downloaded from https://sourceforge.net/projects/swapassembler.

Additional file 1

Appendix. Appendix 1: property proofs of MSG; Appendix 2: algorithms for SWAP-Assembler; Appendix 3: complexity analysis of the SWAP computational framework; Appendix 4: complexity analysis of graph reduction.