Introduction

With the efficient application of next-generation sequencing technologies to genome sequencing, reference genomes of representative and important species across a broad spectrum of organisms have been acquired, are being sequenced, or are being re-sequenced. It is therefore important that tools for assembling re-sequenced genomes from high-throughput data are readily available and specifically tuned to particular data types, such as those from ligase-based or polymerase-based protocols[1]. Most currently available assembly tools, such as Velvet[2], have been designed for de novo genome assembly. Several new tools are now under development for re-sequencing projects. For example, LOCAS is designed for low-coverage assembly of eukaryotic genomes[3]. A commercial toolbox developed for the Roche 454 sequencing platform is designed to assemble both de novo and re-sequencing data. Here, we report a homology-guided method, implemented as a new re-sequencing assembly tool named BIGrat, and test results showing that it improves the output of the commercial tool Newbler. We believe that BIGrat will be widely used and integrated into the pipelines of next-generation sequencing projects.

Findings

The test datasets

Data for assembling rice chloroplast (cp), mitochondrial (mt), and nuclear genomes are all from a genome re-sequencing project for a rice cultivar PA64S (Oryza sativa L.)[4]. Data for bacterial genome assembly are from Acinetobacter baumannii MDR-ZJ06[5].

Program design

BIGrat is based on the mapping result of Newbler and its mapping model. Newbler cannot correctly assemble repeat sequences in the reference genome and produces many small contigs separated by repeat regions (Additional file 1: Figure S1), but the reads in each repeat region can be assembled separately to completion. Therefore, BIGrat separates the repeat regions using a fixed gap size and assembles each repeat region iteratively with its mapped reads (Figure 1). Such an iterative assembly method has been used in IMAGE[6] and LOCAS[3].

Figure 1

The assembly pipeline of Newbler-BIGrat.

Program algorithm

First, we use Newbler to map the raw data to the reference genome; the assembled contigs are written to a file named “454AllContigs.fna” in the mapping result. To keep the large, well-assembled contigs, which contain fewer repeat sequences than the rest, we filter out the contigs smaller than the gap size (for example, 1 kb) but record the coordinates of those contigs as repeats in the reference genome. The mapping result also contains a file named “454PairAlign.txt”, which lists all the mapped reads and their positions in the reference genome. Second, we collect all the reads that belong to each repeat in the reference genome and re-assemble each repeat separately to obtain new contigs. Normally, the new contigs are better than the filtered ones and cover a complete repeat region. Last, we combine the initial well-assembled contigs with the new contigs from the repeats. This is done using the raw reads aligned to each end of those contigs: we find the overlaps at the contig ends and construct consensus sequences as the final contigs.
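The three steps above can be illustrated with a minimal Python sketch. It is not the BIGrat implementation (which is written in Perl): the `Contig` record, the function names, and the in-memory read tuples are hypothetical, the Newbler output files are not parsed, and the final merge uses a simple exact suffix/prefix match in place of the read-based consensus described in the text.

```python
from typing import List, NamedTuple, Optional, Tuple


class Contig(NamedTuple):
    name: str
    ref_start: int  # 1-based start on the reference (hypothetical field)
    ref_end: int    # inclusive end on the reference (hypothetical field)
    seq: str


Region = Tuple[int, int]
MappedRead = Tuple[str, int, int]  # (read name, ref start, ref end)


def split_contigs(contigs: List[Contig],
                  gap_size: int = 1000) -> Tuple[List[Contig], List[Region]]:
    """Step 1: keep contigs of at least gap_size; record the rest as repeats."""
    kept, repeats = [], []
    for contig in contigs:
        if len(contig.seq) >= gap_size:
            kept.append(contig)
        else:
            repeats.append((contig.ref_start, contig.ref_end))
    return kept, repeats


def reads_in_region(mapped_reads: List[MappedRead], region: Region) -> List[str]:
    """Step 2: select reads whose mapped interval overlaps a repeat region."""
    start, end = region
    return [name for name, r_start, r_end in mapped_reads
            if r_start <= end and r_end >= start]


def merge_by_overlap(left: str, right: str,
                     min_overlap: int = 20) -> Optional[str]:
    """Step 3 (simplified): join two contigs on their longest exact
    suffix/prefix overlap; return None if no overlap is long enough."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None
```

In this sketch, the reads returned by `reads_in_region` would be passed to a separate assembly run for each repeat region, and the resulting contigs would then be joined to the kept contigs with `merge_by_overlap`.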

Results and discussion

Program comparison and assessment

To evaluate the performance of BIGrat, we compared it against Newbler with its default parameter settings on four different genomes. In addition, we compared the assembled results with consensus sequences from BWA-SW/SAMtools[7]. The four genomes come from re-sequencing projects carried out at the Beijing Institute of Genomics (BIG), and the assembly results are summarized in Table 1. In the PA64S nuclear genome assembly, BIGrat achieved a better NG50 (19,383 bp for Newbler vs. 28,677 bp for Newbler-BIGrat). BIGrat closed 32.4% of the gaps left by Newbler, with a total length of 8,267,167 bp, and the improvement appears in the contig building (Additional file 2: Figure S2). BIGrat also improved the output of Newbler in the rice organellar genome assemblies. The chloroplast genome has typical large repeats[8], and there are also some large repeats in the mitochondrial genome[4]. To assess accuracy and reliability, we compared the BIGrat assemblies of the rice chloroplast and mitochondrial genomes with the results described in our earlier publications, which were based on data generated with the Sanger method[4, 9]. The results produced by the two methods show excellent consistency and colinearity (Figures 2 and 3). We also tested BIGrat on several bacterial genome projects. For instance, for Acinetobacter baumannii MDR-ZJ06, BIGrat filled 12% more gaps (32,715 bp) than Newbler alone. Because the repeat content of eukaryotic genomes is variable, the effectiveness of BIGrat's sequence assembly differs among genomes, as shown for the four representative genomes.
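For readers unfamiliar with the metric, the following Python sketch shows how NG50 is commonly computed, assuming the usual definition: the length of the shortest contig among the longest contigs that together cover at least half of the reference (or estimated) genome size. The function name and its return value of 0 for assemblies covering less than half of the genome are choices made for this illustration, not part of BIGrat.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Return NG50: walk contigs from longest to shortest and report the
    length at which the running total reaches half of the genome size."""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0  # the contigs cover less than half of the genome
```

Unlike N50, which is normalized by the total assembly length, NG50 is normalized by the genome size, so values from different assemblers run on the same reference can be compared directly.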

Table 1 The performance of Newbler and Newbler-BIGrat in assembling different genomes
Figure 2

Dot matrix alignment of PA64S cp genomes between the assembly based on data from the Sanger method and the assembly based on Newbler-BIGrat and Roche 454 data. The blue and red lines show direct and reverse matches, respectively.

Figure 3

Dot matrix alignment of PA64S mt genomes between the assembly based on data from the Sanger method and the assembly based on Newbler-BIGrat and Roche 454 data. The blue and red lines show direct and reverse matches, respectively.

Program parameter

BIGrat separates repeat regions in the reference sequence, iteratively fills the gaps caused by the repeats, and finally assembles the sequence to completion. The main parameter is the gap size, which determines the total extent of the repeat regions that are reassembled. We tested this parameter from 30 bp to 10,000 bp on PA64S chromosome 1. The results showed that 500 bp is an optimal gap size for BIGrat assembly (Additional file 3: Figure S3). The gap size can also be chosen based on the sequencing read length. Since the read lengths of the pyrosequencing platforms are ~500 bp for Roche 454 and ~200 bp for IonTorrent, most repeats shorter than the read length (500 bp or 200 bp, respectively) may be assembled from the sequencing reads alone. As the gap size grows, BIGrat's running time also increases. For example, the running times are 54 min, 102 min, and 126 min for gap sizes of 30 bp, 500 bp, and 10,000 bp, respectively.

Program performance

We also evaluated BIGrat's performance at different levels of data coverage by randomly sampling reads to obtain coverage from 1x to 20x, using the rice chloroplast and mitochondrial genomes as examples (Figure 4). With Newbler alone, increasing the coverage provided little further improvement to the assembly once coverage reached 10x, whereas BIGrat assembled the genomes completely as coverage increased: the chloroplast and mitochondrial genomes were assembled to completion at 10x and 15x coverage, respectively. These results also provide an initial estimate of the data coverage needed in re-sequencing projects for the two organellar genomes.
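The text does not specify how the reads were sampled. One simple way to do it, shown here as a hedged Python sketch, is to shuffle the reads and keep drawing until the sampled bases reach the target coverage multiplied by the genome size. The function name, the seed parameter, and the use of read lengths alone are assumptions made for this illustration.

```python
import random
from typing import List


def subsample_to_coverage(read_lengths: List[int], genome_size: int,
                          target_coverage: float, seed: int = 0) -> List[int]:
    """Return sorted indices of randomly chosen reads whose total length
    reaches target_coverage * genome_size (or all reads if too few)."""
    rng = random.Random(seed)       # fixed seed makes the sample reproducible
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    needed = target_coverage * genome_size
    chosen, total = [], 0
    for index in order:
        if total >= needed:
            break
        chosen.append(index)
        total += read_lengths[index]
    return sorted(chosen)
```

Running such a function once per target coverage (1x, 2x, and so on up to 20x) and assembling each subset would give one data point per coverage level for a comparison like the one in Figure 4.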

Figure 4

NG50 comparison with different data coverage in the assemblies of rice PA64S chloroplast and mitochondrial genomes based on Newbler and Newbler-BIGrat.

Conclusions

We presented an informatics tool, BIGrat (Additional file 4), that improves genome assemblies for pyrosequencing-based re-sequencing projects, and showed that it works as an add-on tool to Newbler. BIGrat can easily be integrated with Newbler for next-generation sequencing assembly and analysis. Because BIGrat is currently limited to pyrosequencing data and the Newbler software, our next step is to update it to improve assembly results from all sequencing platforms.

Availability and requirements

Project name: BIGrat

Project home page: http://sourceforge.net/projects/bigrat/

Operating system(s): Linux Platform

Programming language: Perl

Other requirements: Newbler (version > 2.3)

License: GNU General Public License

Any restrictions to use by non-academics: -